Central Institute of Indian Languages
Odia Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Odia Words: 26,141 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. In line with the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Odia parallel text corpus is linked with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured around 159 grammatical categories. The Odia section includes 26,141 words and 147,972 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.
For research citations, please use the following:
1. Sonali Sutradhar, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Odia Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-54-8.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Odia Parts of Speech Annotated Corpus
683,112 Tags | 587,653 Words | 53,288 Sentences
The Linguistic Data Consortium for Indian Languages (LDC-IL) has developed Parts-of-Speech annotated corpora for the Scheduled Indian languages. The corpus is annotated with Part-of-Speech (PoS) tags based on the Bureau of Indian Standards (BIS) PoS Tagset. This data is a significant resource for natural language processing and linguistic research. As part of this work, LDC-IL developed annotated text corpora for Odia. The Odia PoS annotated corpus was tagged automatically and then verified by linguistic experts to ensure accuracy and consistency. It contains 683,112 Part-of-Speech tags.
For research citations, please use the following:
1. Subhashree Mohanty, Dr. Narayan Choudhary, Rajesha N., Manasa G. 2026. Odia Parts of Speech Annotated Corpus. Central Institute of Indian Languages, Mysore. 978-81-69175-88-3.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2026. LDC-IL Parts of Speech Annotated Corpus Based on BIS Framework. Central Institute of Indian Languages, Mysore. 978-81-69175-60-9.
Odia Raw Speech Corpus
138:06:18 hours | 89 GB | 474 Speakers | 73,418 Audio Segments | 48 kHz | 16-bit WAV
Odia is an Indo-Aryan language spoken mainly in the state of Odisha and also in bordering states such as West Bengal, Jharkhand, Chhattisgarh and Andhra Pradesh. It has been accorded Classical Language status by the Govt. of India. The LDC-IL Odia speech data was collected in the Central and Northern parts of Odisha from speakers of both genders and different age groups. The data consists of several types of datasets made up of word lists, sentences, running texts and date formats.
The available Speech Corpus details:
Total Speakers: 474 (239 Female and 235 Male)
Domain | Audio Segments | Duration
Contemporary Text (News) | 449 | 42:49:56
Creative Text | 450 | 19:43:50
Sentence | 11,248 | 8:22:57
Date Format | 900 | 1:27:49
Command and Control Words | 13,499 | 14:18:49
Person Name | 8,998 | 5:01:40
Place Name | 4,496 | 13:22:45
Most Frequent Word - Part | 8,994 | 9:40:04
Most Frequent Word - Full Set | 10,989 | 10:21:04
Phonetically Balanced | 10,438 | 10:05:10
Form and Function - Word | 2,957 | 2:52:14
A detailed explanation of the Odia Speech Corpus will be available in the Odia Raw Speech Data Documentation.
For research citations, please use the following:
Ramamoorthy, L., Narayan Choudhary, Raja Kumar Naik, Pramod Kumar Rout, Kshirod Kumar Das & Santosh Kumar Mohanty. 2021. Odia Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
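The per-domain figures listed for the Odia Raw Speech Corpus can be cross-checked against the headline totals. The following is a minimal Python sketch, not part of the LDC-IL distribution: the names `ODIA_DOMAINS`, `to_seconds` and `to_hms` are hypothetical helpers introduced here for illustration, and the values are copied from the listing above.

```python
# Per-domain figures copied from the Odia Raw Speech Corpus listing.
# Tuple layout (an assumption of this sketch): (domain, audio segments, duration as H:MM:SS).
ODIA_DOMAINS = [
    ("Contemporary Text (News)", 449, "42:49:56"),
    ("Creative Text", 450, "19:43:50"),
    ("Sentence", 11248, "8:22:57"),
    ("Date Format", 900, "1:27:49"),
    ("Command and Control Words", 13499, "14:18:49"),
    ("Person Name", 8998, "5:01:40"),
    ("Place Name", 4496, "13:22:45"),
    ("Most Frequent Word - Part", 8994, "9:40:04"),
    ("Most Frequent Word - Full Set", 10989, "10:21:04"),
    ("Phonetically Balanced", 10438, "10:05:10"),
    ("Form and Function - Word", 2957, "2:52:14"),
]


def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to an H:MM:SS string."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(segments for _, segments, _ in ODIA_DOMAINS)
total_duration = to_hms(sum(to_seconds(d) for _, _, d in ODIA_DOMAINS))

print(total_segments)  # 73418
print(total_duration)  # 138:06:18
```

Both sums agree with the headline figures for this corpus (73,418 audio segments and 138:06:18 hours).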
Pahari Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Pahari Words: 32,270 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. In line with the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Pahari parallel text corpus is linked with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured around 159 grammatical categories. The Pahari section includes 32,270 words and 139,064 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.
For research citations, please use the following:
1. Dr. Satyaendra Kumar Awasthi, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Pahari Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-84-5.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Paite Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Paite Words: 32,648 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. In line with the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Paite parallel text corpus is linked with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured around 159 grammatical categories. The Paite section includes 32,648 words and 152,920 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.
For research citations, please use the following:
1. Amom Nandaraj Meetei, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Paite Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-67-8.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Palmuha Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Palmuha Words: 31,398 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. In line with the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Palmuha parallel text corpus is linked with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured around 159 grammatical categories. The Palmuha section includes 31,398 words and 136,462 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.
For research citations, please use the following:
1. Dr. Vijayalaxmi F Patil, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Palmuha Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-94-4.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Pania Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Pania Words: 21,731 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. In line with the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Pania parallel text corpus is linked with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured around 159 grammatical categories. The Pania section includes 21,731 words and 163,599 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.
For research citations, please use the following:
1. Dr. Saritha S.L., Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Pania Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-19-7.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Pawari/Powari Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Pawari/Powari Words: 31,903 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. In line with the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Pawari/Powari parallel text corpus is linked with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured around 159 grammatical categories. The Pawari/Powari section includes 31,903 words and 140,675 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.
For research citations, please use the following:
1. Dr. Vijayalaxmi F Patil, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Pawari/Powari Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-95-1.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Phom Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Phom Words: 26,936 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. In line with the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Phom parallel text corpus is linked with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured around 159 grammatical categories. The Phom section includes 26,936 words and 159,549 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.
For research citations, please use the following:
1. Amom Nandaraj Meetei, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Phom Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-26-5.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Pnar/Synteng Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Pnar/Synteng Words: 41,968 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. In line with the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Pnar/Synteng parallel text corpus is linked with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured around 159 grammatical categories. The Pnar/Synteng section includes 41,968 words and 173,704 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.
For research citations, please use the following:
1. Amom Nandaraj Meetei, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Pnar/Synteng Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-05-0.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Pochury Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Pochury Words: 31,149 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. In line with the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Pochury parallel text corpus is linked with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured around 159 grammatical categories. The Pochury section includes 31,149 words and 166,904 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.
For research citations, please use the following:
1. Amom Nandaraj Meetei, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Pochury Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-24-1.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Poula Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Poula Words: 30,306 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. In line with the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Poula parallel text corpus is linked with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured around 159 grammatical categories. The Poula section includes 30,306 words and 153,054 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.
For research citations, please use the following:
1. Prangshu Manjul, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Poula Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-57-9.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Punjabi Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Punjabi Words: 33,794 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. In line with the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Punjabi parallel text corpus is linked with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured around 159 grammatical categories. The Punjabi section includes 33,794 words and 155,362 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.
For research citations, please use the following:
1. Dr. Shalinder Singh, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Punjabi Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-89-0.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Punjabi Parts of Speech Annotated Corpus
1,298,034 Tags | 1,150,325 Words | 65,452 Sentences
The Linguistic Data Consortium for Indian Languages (LDC-IL) has developed Parts-of-Speech annotated corpora for the Scheduled Indian languages. The corpus is annotated with Part-of-Speech (PoS) tags based on the Bureau of Indian Standards (BIS) PoS Tagset. This data is a significant resource for natural language processing and linguistic research. As part of this work, LDC-IL developed annotated text corpora for Punjabi. The Punjabi PoS annotated corpus was tagged automatically and then verified by linguistic experts to ensure accuracy and consistency. It contains 1,298,034 Part-of-Speech tags.
For research citations, please use the following:
1. Shalinder Singh, Dr. Narayan Choudhary, Rajesha N., Manasa G. 2026. Punjabi Parts of Speech Annotated Corpus. Central Institute of Indian Languages, Mysore. 978-81-69175-64-7.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2026. LDC-IL Parts of Speech Annotated Corpus Based on BIS Framework. Central Institute of Indian Languages, Mysore. 978-81-69175-60-9.
Punjabi Raw Speech Corpus
101:09:28 Hours | 65.5 GB | 467 Speakers | 76,230 Audio Segments | 48 kHz | 16-bit WAV
Punjabi is an Indo-Aryan language. It is a tonal language with three tones: high-falling, low-rising, and level (neutral). Punjabi is spoken not only in India but also in Pakistan, where it is known as Shahmukhi Punjabi; this description covers only Indian Gurmukhi Punjabi. Punjabi has four dialects, spoken in different sub-regions of Punjab. The LDC-IL Punjabi speech data consists of several types of datasets made up of word lists, sentences, running texts and date formats. Each speaker recorded items selected at random from a master dataset. LDC-IL collected speech data from the Malwa, Doab and Puadh regions.
The available Speech Corpus details:
Total Speakers: 467 (234 Female and 233 Male)
Domain | Audio Segments | Duration
Contemporary Text (News) | 448 | 27:07:41
Creative Text | 446 | 19:29:15
Sentence | 11,168 | 08:58:33
Date Format | 887 | 00:27:53
Command and Control Words | 13,274 | 07:49:16
Person Name | 8,949 | 10:28:40
Place Name | 4,473 | 03:17:02
Most Frequent Word - Part | 8,889 | 05:21:56
Most Frequent Word - Full Set | 3,988 | 02:52:44
Phonetically Balanced | 13,939 | 08:56:04
Form and Function - Word | 9,769 | 06:24:07
A detailed explanation of the Punjabi Speech Corpus will be available in the Punjabi Speech Data Documentation.
For research citations, please use the following:
Ramamoorthy, L., Narayan Choudhary, Poonam Dhillon & Sarbjeet Kaur. 2019. Punjabi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
