179:32:52 hours (115 Gigabytes) of speech data | 656 Speakers | 99109 Audio segments | 48 kHz | 16-bit WAV
Kannada is one of the ancient Indian languages and belongs to the Dravidian
family. It has its own script. The language spoken in a region is influenced by
the other languages of that region, the mother tongue of the speaker, and
similar factors. Reading speed, loudness, and frequency also vary with factors
such as age and gender. The Linguistic Data Consortium for Indian Languages
(LDC-IL) identified four regional dialects and collected the speech corpus
through fieldwork. The read-speech data was collected from native speakers of
various age groups, with equal numbers of male and female speakers. The data
includes texts, sentences, date formats, and various word lists.
Details of the available speech corpus are as follows.
- Total Speakers - 656 (328 Female and 328 Male)
- Contemporary Text (News) - 600 Audio Segments - 66:06:09 Hours
- Creative Text - 600 Audio Segments - 33:09:20 Hours
- Sentence - 14887 Audio Segments - 13:58:15 Hours
- Date Format - 1200 Audio Segments - 1:16:22 Hours
- Command and Control Words - 17988 Audio Segments - 12:31:43 Hours
- Person Name - 12009 Audio Segments - 13:04:49 Hours
- Place Noun - 6032 Audio Segments - 4:48:42 Hours
- Most Frequent Word-Part - 18065 Audio Segments - 12:21:24 Hours
- Most Frequent Word-Full Set - 8000 Audio Segments - 6:45:56 Hours
- Phonetically Balanced - 9360 Audio Segments - 6:47:23 Hours
- Form and Function Word - 10368 Audio Segments - 8:42:49 Hours
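The per-category figures above can be cross-checked against the headline totals (99109 audio segments, 179:32:52 hours). The following is a minimal sketch in Python, assuming the figures are transcribed exactly as listed; the names `CATEGORIES`, `to_seconds`, and `to_hms` are illustrative only and are not part of any LDC-IL tooling.

```python
# Per-category figures copied from the list above:
# (label, number of audio segments, duration as "H:MM:SS").
CATEGORIES = [
    ("Contemporary Text (News)", 600, "66:06:09"),
    ("Creative Text", 600, "33:09:20"),
    ("Sentence", 14887, "13:58:15"),
    ("Date Format", 1200, "1:16:22"),
    ("Command and Control Words", 17988, "12:31:43"),
    ("Person Name", 12009, "13:04:49"),
    ("Place Noun", 6032, "4:48:42"),
    ("Most Frequent Word-Part", 18065, "12:21:24"),
    ("Most Frequent Word-Full Set", 8000, "6:45:56"),
    ("Phonetically Balanced", 9360, "6:47:23"),
    ("Form and Function Word", 10368, "8:42:49"),
]


def to_seconds(hms: str) -> int:
    """Convert an "H:MM:SS" string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to an "H:MM:SS" string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(segments for _, segments, _ in CATEGORIES)
total_duration = to_hms(sum(to_seconds(d) for _, _, d in CATEGORIES))

print(total_segments)   # 99109
print(total_duration)   # 179:32:52
```

Both sums agree with the totals stated for the corpus, so the category breakdown accounts for every segment and every second of the reported duration.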
A more detailed description of the Kannada Speech Corpus will be available in the Kannada Speech Data Documentation.
For research citations, please use the following:
- Ramamoorthy, L., Narayan Choudhary, Vijayalaxmi F. Patil, Chetan Suryakant Baji, Malini N. Abhyankar, Rajesha N. & Manasa G. 2019. Kannada Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
- Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: ” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
- Authors: Ramamoorthy L., Narayan Choudhary, Vijayalaxmi F. Patil, Chetan Suryakant Baji, Rajesha N., Manasa G., Sunitha Rajendra, Reshma S., Kavitha L., Malini N. Abhyankar
- Corpus Type: Raw Corpus
- Catalogue Number: 1129
- ISBN: 978-81-7343-228-6
- Data Source: On Field
- Duration: 179:32:52
- Number of Audio Segments: 99109
- Release Date: 04/04/2019
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.