Central Institute of Indian Languages
Kom Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Kom Words: 34,842 | 5,332 sentences/phrases in each mother tonguesIndia has 270 mother tongues as per 2011 census. Following the requirements of the NEP-2020, LDC-IL developed parallel corpus in Indian mother tongues. The Kom parallel text corpus connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured based on 159 grammatical categories. The Kom section includes 34,842 words and 193,818 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.3 million).The price indicated corresponds to a single language component. The total payment will be determined based on the number of language components requested by the seeker.For any research-based citations, please use the following citations:1. Prangshu Manjul, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Kom Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69099-30-1.2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...
Konda Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Konda Words: 24,549 | 5,332 sentences/phrases in each mother tonguesIndia has 270 mother tongues as per 2011 census. Following the requirements of the NEP-2020, LDC-IL developed parallel corpus in Indian mother tongues. The Konda parallel text corpus connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured based on 159 grammatical categories. The Konda section includes 24,549 words and 161,416 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.3 million).The price indicated corresponds to a single language component. The total payment will be determined based on the number of language components requested by the seeker.For any research-based citations, please use the following citations:1. Dr. Modugu Kasimbabu, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Konda Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69099-91-2.2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...
Konkani Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Konkani Words: 25,384 | 5,332 sentences/phrases in each mother tonguesIndia has 270 mother tongues as per 2011 census. Following the requirements of the NEP-2020, LDC-IL developed parallel corpus in Indian mother tongues. The Konkani parallel text corpus connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured based on 159 grammatical categories. The Konkani section includes 25,384 words and 144,901 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.3 million).The price indicated corresponds to a single language component. The total payment will be determined based on the number of language components requested by the seeker.For any research-based citations, please use the following citations:1. Mr. Saurabh Varik, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Konkani Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69099-83-7.2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...
Konkani Parts of Speech Annotated Corpus
569408 Tags| 468967 Words | 53161 SentencesThe Linguistic Data Consortium for Indian Languages (LDC-IL) is developed Parts-of-Speech annotated corpus for Scheduled Indian languages. The corpus is annotated with Part-of-Speech (PoS) tags based on the Bureau of Indian Standards (BIS) PoS Tagset. This data is a significant resource for natural language processing and linguistic research. LDC-IL developed annotated text corpora for Konkani. The Konkani PoS annotated corpus is automatically tagged and then verified by linguistic experts to ensure accuracy and consistency.Konkani PoS annotated Corpus contains 569408 Part-of-Speech tags.For any research-based citations, please use the following citations:1. Mr. Saurabh Varik, Dr. Narayan Choudhary. 2026. Konkani Parts of Speech Annotated Corpus. Central Institute of Indian Languages, Mysore. 978-81-69175-91-3.2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2026. LDC-IL Parts of Speech Annotated Corpus Based on BIS Framework. Central Institute of Indian Languages, Mysore. 978-81-69175-60-9...
Konkani Raw Speech Corpus
156:37:51 Hours | 100 GB | 504 Speakers | 72,938 Audio Segments | 48 kHz | 16 bit wav. Konkani belongs to the Indo-European family of languages. Konkani is the official language of Goa. However, the language is spoken widely across four states- Maharashtra, Goa, Karnataka and Kerala. Konkani is the only Indian language written in five different scripts - Devanagari, Roman, Kannada, Malayalam, and Persian-Arabic. The LDC-IL speech data is collected from the regions of North Goa, South Goa, Karwar (Karnataka) and Sindhudurgh (Maharastra) from both genders and different age groups.Approximately 15 to 20 minutes of speech (per speaker) taken from 267 female and 237 male native speakers of different age groups. Each speaker recorded these datasets which are randomly selected from a master dataset. The available Speech Corpus details:Total Speakers 504 (267 Female and 237 Male) Domains Audio Segments Each Domain Duration Contemporary Text (News) 477 49:52:09 Creative Text 480 22:09:05 Sentence 12,050 15:51:11 Date Format 953 01:50:39 Command and Control Words 14,944 16:11:02 Person Name 9,588 15:55:43 Place Name 4,812 05:31:03 Most Frequent Word - Part 16,376 16:03:13 Most Frequent Word - Full Set 5,998 05:55:07 Phonetically Balanced 2,975 02:49:36 Form and Function - Word 4,285 04:29:03 A detailed explanation of the Konkani Speech Corpus will be available in the Konkani Speech Data Documentation. For any research-based citations, please use the following citations: Ramamoorthy, L., Narayan Choudhary, Saurabh Varik & Rashmi Shet Tanawade. 2019. Konkani Raw Speech Corpus. Central Institute of Indian Languages, Mysore.Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
Konkani Sentence Aligned Speech Corpus
Dataset Description: 83:19:42 hours | 53.5 GB | 34,091 Audio Segments | 487 speakersThe annotated speech corpus gives wide range of linguistic information especially useful to analyse phonetics. The LDC-IL Konkani Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. This dataset spans a duration of 83:19:42 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 259 female and 228 male native Konkani speakers, encompassing diverse age groups and regions. A comprehensive explanation of dataset can be found in the Konkani Sentence Aligned Speech Documentation.For any research-based citations, please use the following citations:1. Saurabh Varik, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Konkani Sentence Aligned Speech Corpus Central Institute of Indian Languages, Mysore. 978-81-19411-62-7.2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3..
Konyak Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Konyak Words: 34,232 | 5,332 sentences/phrases in each mother tonguesIndia has 270 mother tongues as per 2011 census. Following the requirements of the NEP-2020, LDC-IL developed parallel corpus in Indian mother tongues. The Konyak parallel text corpus connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured based on 159 grammatical categories. The Konyak section includes 34,232 words and 190,414 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.3 million).The price indicated corresponds to a single language component. The total payment will be determined based on the number of language components requested by the seeker.For any research-based citations, please use the following citations:1. Umesh Chamling Rai, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Konyak Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69099-71-4.2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...
Koya Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Koya Words: 23,035 | 5,332 sentences/phrases in each mother tonguesIndia has 270 mother tongues as per 2011 census. Following the requirements of the NEP-2020, LDC-IL developed parallel corpus in Indian mother tongues. The Koya parallel text corpus connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured based on 159 grammatical categories. The Koya section includes 23,035 words and 162,504 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.3 million).The price indicated corresponds to a single language component. The total payment will be determined based on the number of language components requested by the seeker.For any research-based citations, please use the following citations:1. Dr. Modugu Kasimbabu, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Koya Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69099-44-8.2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...
Kudubi/Kudumbi Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Kudubi/Kudumbi Words: 24,371 | 5,332 sentences/phrases in each mother tonguesIndia has 270 mother tongues as per 2011 census. Following the requirements of the NEP-2020, LDC-IL developed parallel corpus in Indian mother tongues. The Kudubi/Kudumbi parallel text corpus connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured based on 159 grammatical categories. The Kudubi/Kudumbi section includes 24,371 words and 148,209 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.3 million).The price indicated corresponds to a single language component. The total payment will be determined based on the number of language components requested by the seeker.For any research-based citations, please use the following citations:1. Mr. Saurabh Varik, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Kudubi/Kudumbi Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69099-10-3.2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...
Kuki Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Kuki Words: 32704 | 5,332 sentences/phrases in each mother tonguesIndia has 270 mother tongues as per 2011 census. Following the requirements of the NEP-2020, LDC-IL developed parallel corpus in Indian mother tongues. The Kuki parallel text corpus connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured based on 159 grammatical categories. The Kuki section includes 32704 words and 174724 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.3 million).The price indicated corresponds to a single language component. The total payment will be determined based on the number of language components requested by the seeker.For any research-based citations, please use the following citations:1. Amom Nandaraj Meetei, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Kuki Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore 978-81-69099-19-6.2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...
Kurmali Thar Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Kurmali Thar Words: 28,825 | 5,332 sentences/phrases in each mother tonguesIndia has 270 mother tongues as per 2011 census. Following the requirements of the NEP-2020, LDC-IL developed parallel corpus in Indian mother tongues. The Kurmali Thar parallel text corpus connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured based on 159 grammatical categories. The Kurmali Thar section includes 28,825 words and 1,48,514 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.3 million).The price indicated corresponds to a single language component. The total payment will be determined based on the number of language components requested by the seeker.For any research-based citations, please use the following citations:1. Dr. Shahnawaz Alam, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Kurmali Thar Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69099-50-9.2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...
Ladakhi Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Ladakhi Words: 5611 | 5,332 sentences/phrases in each mother tonguesIndia has 270 mother tongues as per 2011 census. Following the requirements of the NEP-2020, LDC-IL developed parallel corpus in Indian mother tongues. The Ladakhi parallel text corpus connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured based on 159 grammatical categories. The Ladakhi section includes 5611 words and 155,372 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.3 million).The price indicated corresponds to a single language component. The total payment will be determined based on the number of language components requested by the seeker.For any research-based citations, please use the following citations:1. Umesh Chamling Rai, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Ladakhi Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69099-52-3.2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...
Lepcha Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Lepcha Words: 35,379 | 5,332 sentences/phrases in each mother tonguesIndia has 270 mother tongues as per 2011 census. Following the requirements of the NEP-2020, LDC-IL developed parallel corpus in Indian mother tongues. The Lepcha parallel text corpus connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured based on 159 grammatical categories. The Lepcha section includes 35,379 words and 182,292 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.3 million).The price indicated corresponds to a single language component. The total payment will be determined based on the number of language components requested by the seeker.For any research-based citations, please use the following citations:1. Umesh Chamling Rai, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Lepcha Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69099-28-8.2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...
Liangmei Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 |Liangmei Words: 30096| 5,332 sentences/phrases in each mother tonguesIndia has 270 mother tongues as per 2011 census. Following the requirements of the NEP-2020, LDC-IL developed parallel corpus in Indian mother tongues. The Liangmei parallel text corpus connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured based on 159 grammatical categories. The Liangmei section includes 30096 words and 168701 characters. Overall, the corpus comprises 44,04,845 words (over 4.4 million tokens) and 2,33,74,289 characters (approximately 23.3 million).The price indicated corresponds to a single language component. The total payment will be determined based on the number of language components requested by the seeker.For any research-based citations, please use the following citations:1. Dr. Kamaraj S, Dr. Rejitha K. S, Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Liangmei Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69099-25-7.2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...
Limbu Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Limbu Words: 23,296 | 5,332 sentences/phrases in each mother tonguesIndia has 270 mother tongues as per 2011 census. Following the requirements of the NEP-2020, LDC-IL developed parallel corpus in Indian mother tongues. The Limbu parallel text corpus connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured based on 159 grammatical categories. The Limbu section includes 23,296 words and 164,517 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.3 million).The price indicated corresponds to a single language component. The total payment will be determined based on the number of language components requested by the seeker.For any research-based citations, please use the following citations:1. Umesh Chamling Rai, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Limbu Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69099-70-7.2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.Total Words: 4,404,845 | Limbu Words: 23,296 | 5,332 sentences/phrases in each mother tonguesIndia has 270 mother tongues as per 2011 census. Following the requirements of the NEP-2020, LDC-IL developed parallel corpus in Indian mother tongues. The Limbu parallel text corpus connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured based on 159 grammatical categories. The Limbu section includes 23,296 words and 164,517 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.3 million).The price indicated corresponds to a single language component. The total payment will be determined based on the number of language components requested by the seeker.For any research-based citations, please use the following citations:1. Umesh Chamling Rai, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Limbu Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore.978-81-69099-70-7.2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...
