Kannada Raw Speech Corpus
179:32:52 hours (115 GB) | 656 speakers | 99,109 audio segments | 48 kHz | 16-bit WAV

Kannada is one of the ancient Indian languages and belongs to the Dravidian family. It has its own script. The language spoken in a region is influenced by the other languages of that region, by the speaker's mother tongue, and by similar factors. Reading speed, loudness, frequency, and so on also vary with factors such as age and gender. The Linguistic Data Consortium for Indian Languages (LDC-IL) identified four regional dialects and collected the speech corpus through fieldwork. This read-speech data was collected from native speakers across various age groups, with male and female speakers in equal numbers. The data includes texts, sentences, date formats, and several word lists.

The available Speech Corpus details:

Total Speakers: 656 (328 Female and 328 Male)

Domain | Audio Segments | Duration per Domain (h:mm:ss)
Contemporary Text (News) | 600 | 66:06:09
Creative Text | 600 | 33:09:20
Sentence | 14,887 | 13:58:15
Date Format | 1,200 | 1:16:22
Command and Control Words | 17,988 | 12:31:43
Person Name | 12,009 | 13:04:49
Place Name | 6,032 | 4:48:42
Most Frequent Word - Part | 18,065 | 12:21:24
Most Frequent Word - Full Set | 8,000 | 02:08:58
Phonetically Balanced | 9,360 | 02:40:58
Form and Function - Word | 10,368 | 03:14:38

A detailed explanation of the Kannada Speech Corpus will be available in the Kannada Speech Data Documentation.

For research citations, please use the following:

Ramamoorthy, L., Narayan Choudhary, Vijayalaxmi F. Patil, Chetan Suryakant Baji, Malini N. Abhyankar, Rajesha N. & Manasa G. 2019. Kannada Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. "LDC-IL Raw Speech Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
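Since the segments are described as 48 kHz, 16-bit WAV files, a downloaded file can be sanity-checked against that format. The following is a minimal Python sketch using only the standard-library `wave` module. The function names and the file name `segment.wav` are hypothetical, and the corpus's actual file naming and channel count are not stated in this listing.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, bits_per_sample, channels, duration_seconds)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8      # sample width is reported in bytes
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, bits, channels, duration


def matches_corpus_format(path, expected_rate=48000, expected_bits=16):
    """Check a file against the 48 kHz / 16-bit format given in the listing.

    The channel count is not checked because the listing does not state it.
    """
    rate, bits, _channels, _duration = describe_wav(path)
    return rate == expected_rate and bits == expected_bits
```

For example, `matches_corpus_format("segment.wav")` would return `True` for a 48 kHz, 16-bit file and `False` otherwise. The `wave` module reads uncompressed PCM only, which is what this sketch assumes.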
A Gold Standard Kannada Raw Text Corpus
77,63,124 words | 1,772 titles | Data and metadata in XML format | 6 text domains

Kannada is one of the ancient Indian languages and belongs to the Dravidian family. It has its own script. Although Kannada is considered a classical language because of its long literary history, the Kannada text corpus is drawn from contemporary text sources. To keep the corpus balanced, the text was collected by keying in and proofreading extracts from books in various domains, or by crawling news websites. The corpus is encoded in Unicode, and the data and its metadata are in XML format.

The available Text Corpus details:

Domain | Words | Percentage of Total Corpus
Aesthetics | 37,78,723 | 48.68 %
Commerce | 2,07,053 | 2.67 %
Mass Media | 26,81,611 | 34.54 %
Official Document | 5,357 | 0.07 %
Science and Technology | 2,43,166 | 3.13 %
Social Sciences | 8,47,214 | 10.91 %

A detailed explanation of the Kannada Text Corpus will be available in the Kannada Raw Text Corpus Documentation.

For research citations, please use the following:

Ramamoorthy, L., Narayan Choudhary, Vijayalaxmi F. Patil, Chetan Suryakant Baji, Malini N. Abhyankar, Rajesha N. & Manasa G. 2019. A Gold Standard Kannada Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 1-10...
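The word counts above use Indian digit grouping (lakhs and thousands), so it can help to see how the percentage column follows from them. The following is a rough Python sketch, not part of the corpus tooling. The helper names are hypothetical, and the Mass Media count is computed as the stated total minus the other five domains, on the assumption that the six domains add up to the stated total.

```python
def parse_indian_number(text):
    """Convert a comma-grouped figure such as '37,78,723' to an int."""
    return int(text.replace(",", ""))


TOTAL_WORDS = parse_indian_number("77,63,124")

# Word counts as listed for five of the six domains.
listed = {
    "Aesthetics": "37,78,723",
    "Commerce": "2,07,053",
    "Official Document": "5,357",
    "Science and Technology": "2,43,166",
    "Social Sciences": "8,47,214",
}
counts = {domain: parse_indian_number(value) for domain, value in listed.items()}

# Assumption: the six domains sum to the total, so Mass Media is the remainder.
counts["Mass Media"] = TOTAL_WORDS - sum(counts.values())


def share(count, total=TOTAL_WORDS):
    """Percentage of the whole corpus, rounded to two decimals."""
    return round(100 * count / total, 2)


shares = {domain: share(count) for domain, count in counts.items()}

if __name__ == "__main__":
    # Largest domain first; counts are printed with Western digit grouping.
    for domain in sorted(shares, key=shares.get, reverse=True):
        print(f"{domain:<24}{counts[domain]:>12,}{shares[domain]:>8.2f} %")
```

Under that assumption, the Mass Media remainder works out to 26,81,611 words, or 34.54 % of the corpus, and the other five shares reproduce the listed percentages.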