Indian English-Kannada Sentence Aligned Speech Corpus
Dataset Description: 11:17:40 hours | 7.27 GB | 6,166 Audio Segments | 53 speakers

The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Indian English-Kannada variant Sentence Aligned Speech dataset comprises audio files in wav format, each accompanied by a corresponding textual layer in Roman script. The dataset spans a duration of 11:17:40 (hh:mm:ss) and consists of read speech covering continuous text, representative sentences, and date formats. The data is derived from 26 female and 27 male native Kannada speakers across diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Indian English-Kannada variant Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:
1. Rejitha K. S., Vijayalaxmi F. Patil, Rajesha N., Manasa G., Srikanth D., Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Indian English-Kannada variant Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-35-1.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3
Kannada Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Kannada Words: 21,972 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Kannada parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Kannada section includes 21,972 words and 167,366 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.

For any research-based citations, please use the following citations:
1. Dr. Vijayalaxmi F Patil, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Kannada Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN: 978-81-69099-74-5.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0.
Kannada Parts of Speech Annotated Corpus
769767 Tags | 642885 Words | 66113 Sentences

The Linguistic Data Consortium for Indian Languages (LDC-IL) has developed Parts-of-Speech annotated corpora for Scheduled Indian languages. The corpus is annotated with Part-of-Speech (PoS) tags based on the Bureau of Indian Standards (BIS) PoS Tagset. This data is a significant resource for natural language processing and linguistic research. LDC-IL developed annotated text corpora for Kannada. The Kannada PoS annotated corpus is automatically tagged and then verified by linguistic experts to ensure accuracy and consistency.

The Kannada PoS annotated corpus contains 1322728 Part-of-Speech tags.

For any research-based citations, please use the following citations:
1. Dr. Vijayalaxmi F Patil, Chetan Baji, Dr. Narayan Choudhary, Rajesha N. 2026. Kannada Parts of Speech Annotated Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-69175-32-6.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2026. LDC-IL Parts of Speech Annotated Corpus Based on BIS Framework. Central Institute of Indian Languages, Mysore. ISBN: 978-81-69175-60-9.
Kannada Raw Speech Corpus
179:32:52 hours | 115 GB | 656 Speakers | 99109 Audio segments | 48 kHz | 16 bit wav

Kannada is one of the ancient Indian languages and belongs to the Dravidian family. It has its own script. The language in a region is influenced by other languages of the region, the mother tongue of the speaker, and similar factors. Reading speed, loudness, frequency, and so on also differ with factors such as age and gender. The Linguistic Data Consortium for Indian Languages identified four regional dialects and collected the speech corpus through fieldwork. This read data is collected from native speakers of various age groups, with male and female speakers in equal numbers. The data includes texts, sentences, date formats, and different wordlists.

The available Speech Corpus details:

Total Speakers - 656 (328 Female and 328 Male)

Domain                           Audio Segments   Duration (hh:mm:ss)
Contemporary Text (News)         600              66:06:09
Creative Text                    600              33:09:20
Sentence                         14,887           13:58:15
Date Format                      1,200            1:16:22
Command and Control Words        17,988           12:31:43
Person Name                      12,009           13:04:49
Place Name                       6,032            4:48:42
Most Frequent Word - Part        18,065           12:21:24
Most Frequent Word - Full Set    8,000            02:08:58
Phonetically Balanced            9,360            02:40:58
Form and Function - Word         10,368           03:14:38

A detailed explanation of the Kannada Speech Corpus will be available in the Kannada Speech Data Documentation.

For any research-based citations, please use the following citations:
1. Ramamoorthy, L., Narayan Choudhary, Vijayalaxmi F. Patil, Chetan Suryakant Baji, Malini N. Abhyankar, Rajesha N. & Manasa G. 2019. Kannada Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
2. Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. "LDC-IL Raw Speech Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
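The per-domain figures in the Kannada Raw Speech Corpus entry can be checked with a few lines of arithmetic. The sketch below is only an illustration: the row values are copied from the listing, while the names `ROWS`, `to_seconds`, `total_segments`, and `mean_seconds` are hypothetical helpers, not part of any LDC-IL tool. It converts each hh:mm:ss duration to seconds, sums the segment counts, and derives the mean segment length per domain.

```python
# Rows copied from the listing: (domain, audio segments, duration as hh:mm:ss).
ROWS = [
    ("Contemporary Text (News)", 600, "66:06:09"),
    ("Creative Text", 600, "33:09:20"),
    ("Sentence", 14887, "13:58:15"),
    ("Date Format", 1200, "1:16:22"),
    ("Command and Control Words", 17988, "12:31:43"),
    ("Person Name", 12009, "13:04:49"),
    ("Place Name", 6032, "4:48:42"),
    ("Most Frequent Word - Part", 18065, "12:21:24"),
    ("Most Frequent Word - Full Set", 8000, "02:08:58"),
    ("Phonetically Balanced", 9360, "02:40:58"),
    ("Form and Function - Word", 10368, "03:14:38"),
]


def to_seconds(hms: str) -> int:
    """Convert an 'hh:mm:ss' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Total number of audio segments across all domains.
total_segments = sum(count for _, count, _ in ROWS)

# Mean length of one audio segment, in seconds, for each domain.
mean_seconds = {
    domain: to_seconds(duration) / count for domain, count, duration in ROWS
}

print(f"Total segments: {total_segments}")
for domain, mean in mean_seconds.items():
    print(f"{domain:32s} {mean:8.2f} s per segment")
```

The segment counts add up to 99,109, the figure given in the entry's summary line. The long means for the two text domains and the short means for the wordlist domains reflect the difference between passage-length and word-length recordings.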
Kannada Sentence Aligned Speech Corpus
Dataset Description: 107:48:50 hours | 69.4 GB | 65,533 Audio Segments | 600 speakers

The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Kannada Sentence Aligned Speech dataset comprises audio files in wav format, each accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Kannada script. The dataset spans a duration of 107:48:50 (hh:mm:ss) and consists of read speech covering continuous text, representative sentences, and date formats. The data is derived from 300 female and 300 male native Kannada speakers across diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Kannada Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:
1. Vijayalaxmi F. Patil, Chetan Baji, Kavitha Lenin, Reshma S., Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Kannada Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-19-1.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3
A Gold Standard Kannada Raw Text Corpus
77,63,124 words | 1772 Titles | Data and Metadata in XML format | 6 text domains

Kannada is one of the ancient Indian languages and belongs to the Dravidian family. It has its own script. Although Kannada is considered a classical language because of its ancient history in literature, the Kannada text corpus is extracted from contemporary text sources. To keep the corpus balanced, the text is collected by keying in and proofing extracts from books of various domains or crawled from news websites. The available corpus is in the Unicode standard, and the data with metadata is in XML format.

The available Text Corpus details:

Domain                    Words        Percentage of Total Corpus
Aesthetics                37,78,723    48.68 %
Commerce                  2,07,053     2.67 %
Mass Media                2,07,053     34.54 %
Official Document         5,357        0.07 %
Science and Technology    2,43,166     3.13 %
Social Sciences           8,47,214     10.91 %

A detailed explanation of the Kannada Text Corpus will be available in the Kannada Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:
1. Ramamoorthy, L., Narayan Choudhary, Vijayalaxmi F. Patil, Chetan Suryakant Baji, Malini N Abhyankar, Rajesha N. & Manasa G. 2019. A Gold Standard Kannada Raw Text Corpus. Central Institute of Indian Languages, Mysore.
2. Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 1-10.
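The word counts in the Kannada Raw Text Corpus entry use Indian digit grouping (lakh style, for example 77,63,124 for 7,763,124), which can trip up naive parsers that expect groups of three. The sketch below shows one way to read such figures and recompute a domain's share of the corpus. It uses selected rows from the listing, and the names `parse_grouped`, `TOTAL_WORDS`, `ROWS`, and `shares` are hypothetical, chosen only for this illustration.

```python
def parse_grouped(number: str) -> int:
    """Parse a comma-grouped integer, whatever the grouping convention."""
    return int(number.replace(",", ""))


# Corpus total as given in the entry's summary line.
TOTAL_WORDS = parse_grouped("77,63,124")

# Selected rows copied from the listing: domain -> word count as printed.
ROWS = {
    "Aesthetics": "37,78,723",
    "Commerce": "2,07,053",
    "Social Sciences": "8,47,214",
}

# Share of the total corpus per domain, as a percentage to two decimals.
shares = {
    domain: round(100 * parse_grouped(words) / TOTAL_WORDS, 2)
    for domain, words in ROWS.items()
}

for domain, share in shares.items():
    print(f"{domain:18s} {share:6.2f} %")
```

For these three rows the recomputed shares agree with the percentages printed in the listing (48.68, 2.67, and 10.91).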
