Konkani Raw Speech Corpus

Catalogue Number: A-020

Dataset Description

156:37:51 hours of speech data (100 gigabytes) | 504 speakers | 72,938 audio segments | 48 kHz | 16-bit WAV
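
The stated audio format (48 kHz, 16-bit WAV) can be checked on any file with Python's standard `wave` module. The following is a minimal sketch and not part of the LDC-IL distribution: the function names `matches_corpus_format` and `duration_seconds` are hypothetical, and nothing is assumed about the corpus file names or directory layout.

```python
import wave

EXPECTED_RATE_HZ = 48_000    # 48 kHz, as stated in the dataset description
EXPECTED_SAMPLE_BYTES = 2    # 16 bit = 2 bytes per sample


def matches_corpus_format(path):
    """Return True if the WAV file at `path` is sampled at 48 kHz with 16-bit samples."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == EXPECTED_RATE_HZ
                and wav.getsampwidth() == EXPECTED_SAMPLE_BYTES)


def duration_seconds(path):
    """Return the duration of the WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Calling `matches_corpus_format` on each downloaded segment is one way to confirm that a copy of the data has not been resampled or converted. The channel count is not checked because the description does not state it.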

Konkani belongs to the Indo-European family of languages and is the official language of Goa. It is, however, spoken widely across four states: Maharashtra, Goa, Karnataka and Kerala. Konkani is the only Indian language written in five different scripts: Devanagari, Roman, Kannada, Malayalam and Persian-Arabic. The LDC-IL speech data was collected in North Goa, South Goa, Karwar (Karnataka) and Sindhudurg (Maharashtra) from speakers of both genders and different age groups.

The LDC-IL Konkani speech dataset comprises several types of data: word lists, sentences, running texts and date formats.


The available Speech Corpus details for Konkani are as follows.

Total of 504 speakers (267 Female and 237 Male)

    • Contemporary Text (News) - 477 Audio Segments - 49:52:09 Hours
    • Created Text - 480 Audio Segments - 22:09:05 Hours
    • Sentence - 12050 Audio Segments - 15:51:11 Hours
    • Date Format - 953 Audio Segments - 01:50:39 Hours
    • Command and Control Words - 14944 Audio Segments - 16:11:02 Hours
    • Person Name - 9588 Audio Segments - 15:55:43 Hours
    • Place Name - 4812 Audio Segments - 05:31:03 Hours
    • Most Frequent Word-Part - 9104 Audio Segments - 07:22:57 Hours
    • Most Frequent Word-Full Set - 10987 Audio Segments - 09:53:28 Hours
    • Phonetically Balanced - 2975 Audio Segments - 02:49:36 Hours
    • Form and Function Word - 4285 Audio Segments - 04:29:03 Hours
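
The per-category figures above can be tallied with a few lines of Python. This is a minimal sketch: the counts and durations are copied from the list, while the names `CATEGORIES`, `to_seconds` and `to_hms` are illustrative and not part of any LDC-IL tooling.

```python
# (category, audio segments, duration as H:MM:SS), copied from the list above
CATEGORIES = [
    ("Contemporary Text (News)", 477, "49:52:09"),
    ("Created Text", 480, "22:09:05"),
    ("Sentence", 12050, "15:51:11"),
    ("Date Format", 953, "01:50:39"),
    ("Command and Control Words", 14944, "16:11:02"),
    ("Person Name", 9588, "15:55:43"),
    ("Place Name", 4812, "05:31:03"),
    ("Most Frequent Word-Part", 9104, "7:22:57"),
    ("Most Frequent Word-Full Set", 10987, "9:53:28"),
    ("Phonetically Balanced", 2975, "02:49:36"),
    ("Form and Function Word", 4285, "04:29:03"),
]


def to_seconds(hms):
    """Convert an 'H:MM:SS' string to a whole number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total):
    """Convert a whole number of seconds back to 'H:MM:SS'."""
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(count for _, count, _ in CATEGORIES)
total_seconds = sum(to_seconds(duration) for _, _, duration in CATEGORIES)

print(f"{total_segments} segments, {to_hms(total_seconds)} hours")
```

Comparing the printed result with the headline figures (72,938 audio segments, 156:37:51 hours) gives a quick consistency check on the listing.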

  

A more detailed explanation of the Konkani Speech Corpus will be available in the Konkani Speech Data Documentation.

For research-based citations, please cite the following:

  • Ramamoorthy, L., Narayan Choudhary, Saurabh Varik & Rashmi Shet Tanawade. 2019. Konkani Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
  • Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Item specifics

  • Authors: Ramamoorthy L., Narayan Choudhary, Saurabh Varik, Bhageshree Khandale, Rashmi S. Shet Tanawade, Yashwant D. Gawas
  • Corpus Type: Raw Corpus
  • Catalogue Number: 1135
  • ISBN: 978-81-7343-234-7
  • Data Source: On Field
  • Duration: 156:37:51
  • Number of Audio Segments: 72938
  • Release Date: 04/04/2019
  • Terms and Conditions: General instructions for use of the resources provided by LDC-IL.
Commercial User
Non-Commercial User
