Central Institute of Indian Languages
Shina Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Shina Words: 27,916 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Shina parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Shina section includes 27,916 words and 135,135 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested.
For research-based citations, please use the following:
1. Dr. Zargar Adil Ahmad, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Shina Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-70-8.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
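The figures quoted for the Shina component can be turned into a few simple ratios, which makes it easier to compare language components of different sizes. The sketch below is only illustrative: the function name `corpus_ratios` and the dictionary keys are hypothetical, and it assumes the character count is measured over the same text as the word count (the entry does not say whether spaces and punctuation are included).

```python
# Figures quoted in the Shina Parallel Text Corpus entry.
SHINA_WORDS = 27_916
SHINA_CHARS = 135_135
SENTENCES = 5_332          # sentences/phrases per mother tongue
TOTAL_WORDS = 4_404_845    # words across the whole parallel corpus


def corpus_ratios(words, chars, sentences, total_words):
    """Return simple size ratios for one language component (hypothetical helper)."""
    return {
        "words_per_sentence": words / sentences,
        "chars_per_word": chars / words,
        "share_of_total_words_pct": 100 * words / total_words,
    }


ratios = corpus_ratios(SHINA_WORDS, SHINA_CHARS, SENTENCES, TOTAL_WORDS)
for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

The same helper can be applied to any other component by substituting the word and character counts given in its entry.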
Sindhi Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Sindhi Words: 33,053 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Sindhi parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Sindhi section includes 33,053 words and 156,540 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested.
For research-based citations, please use the following:
1. Dr. Mansoor Khan, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Sindhi Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-85-2.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Sirmauri Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Sirmauri Words: 30,358 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Sirmauri parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Sirmauri section includes 30,358 words and 144,588 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested.
For research-based citations, please use the following:
1. Mansoor Khan, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Sirmauri Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-25-8.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Sugali Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Sugali Words: 22,924 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Sugali parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Sugali section includes 22,924 words and 131,433 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested.
For research-based citations, please use the following:
1. Dr. Modugu Kasimbabu, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Sugali Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-23-4.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Surjapuri Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Surjapuri Words: 30,399 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Surjapuri parallel text corpus is connected with Hindi and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Surjapuri section includes 30,399 words and 141,982 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested.
For research-based citations, please use the following:
1. Ankita Tiwari, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Surjapuri Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-27-2.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Talgalo Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Talgalo Words: 33,074 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Talgalo parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Talgalo section includes 33,074 words and 151,095 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested.
For research-based citations, please use the following:
1. Amom Nandaraj Meetei, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Talgalo Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-45-6.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Tamil Parts of Speech Annotated Corpus
2,131,256 Tags | 1,750,935 Words | 172,089 Sentences
The Linguistic Data Consortium for Indian Languages (LDC-IL) has developed Parts-of-Speech annotated corpora for Scheduled Indian languages. The corpus is annotated with Part-of-Speech (PoS) tags based on the Bureau of Indian Standards (BIS) PoS Tagset. This data is a significant resource for natural language processing and linguistic research. The Tamil PoS annotated corpus is automatically tagged and then verified by linguistic experts to ensure accuracy and consistency. It contains 2,131,256 Part-of-Speech tags.
For research-based citations, please use the following:
1. Dr. Amudha R, Dr. Kamaraj S, Dr. Prem Kumar L. R., Dr. Narayan Choudhary. 2026. Tamil Parts of Speech Annotated Corpus. Central Institute of Indian Languages, Mysore. 978-81-69175-98-2.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2026. LDC-IL Parts of Speech Annotated Corpus Based on BIS Framework. Central Institute of Indian Languages, Mysore. 978-81-69175-60-9.
Tamil Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Tamil Words: 22,965 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Tamil parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Tamil section includes 22,965 words and 192,135 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested.
For research-based citations, please use the following:
1. Dr. Amudha R., Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Tamil Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-82-1.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Tamil Raw Speech Corpus
139:11:41 Hours | 86 GB speech data | 452 Speakers | 60,287 Audio Segments | 48 kHz | 16 bit wav
Tamil is one of the longest-surviving classical languages in the world and one of the prominent languages of the Dravidian family. It is widely spoken in the state of Tamil Nadu, the Union Territory of Pondicherry, Sri Lanka, East Asian countries such as Burma, Malaysia, Singapore, Indonesia and Indochina, as well as Fiji, South Africa, British Guiana, and islands such as Mauritius and Madagascar. It has official status in the Indian state of Tamil Nadu and the Union Territory of Pondicherry, and in countries such as Sri Lanka and Singapore. Tamil has its own script. The language is highly agglutinative. It is characterised by phonological simplicity, morphological parity and primitiveness; all of its affixes are separable and significant, and it lacks a nominative case termination and arbitrary words.
The LDC-IL speech data was collected from the regions of Kongu, Kumari, Madurai, Nellai, Salem and Thanjai, from speakers of both genders and different age groups. Each speaker recorded datasets randomly selected from a master dataset.
The available Speech Corpus details:
Total Speakers: 452 (214 Female and 219 Male)
Domain | Audio Segments | Duration (hh:mm:ss)
Contemporary Text (News) | 433 | 57:53:48
Creative Text | 429 | 14:21:31
Sentence | 10,764 | 14:51:03
Date Format | 842 | 01:20:17
Command and Control Words | 12,882 | 12:57:06
Person Name | 8,755 | 03:57:29
Place Name | 4,002 | 10:34:38
Most Frequent Word - Part | 12,813 | 11:14:05
Most Frequent Word - Full Set | 2,000 | 02:26:05
Phonetically Balanced | 3,860 | 04:55:10
Form and Function - Word | 3,507 | 04:40:29
A detailed explanation of the Tamil Speech Corpus will be available in the Tamil Speech Data Documentation.
For research-based citations, please use the following:
1. Ramamoorthy, L., Narayan Choudhary, Thennarasu S, Prem Kumar L R, Amudha R, Prabagaran R, Srikanth D. 2021. Tamil Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
2. Narayan Choudhary, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
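The per-domain figures listed for the Tamil Raw Speech Corpus can be checked against the headline totals by summing the segment counts and the hh:mm:ss durations. The following is a minimal sketch; the helper names `to_seconds` and `to_hms` and the list `TAMIL_DOMAINS` are hypothetical, and the values are copied from the domain breakdown in the entry above.

```python
def to_seconds(hms):
    """Convert an 'hh:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Convert a number of seconds back to 'h:mm:ss' (hours are not capped at 24)."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# (domain, audio segments, duration) as listed in the Tamil Raw Speech Corpus entry.
TAMIL_DOMAINS = [
    ("Contemporary Text (News)", 433, "57:53:48"),
    ("Creative Text", 429, "14:21:31"),
    ("Sentence", 10_764, "14:51:03"),
    ("Date Format", 842, "01:20:17"),
    ("Command and Control Words", 12_882, "12:57:06"),
    ("Person Name", 8_755, "03:57:29"),
    ("Place Name", 4_002, "10:34:38"),
    ("Most Frequent Word - Part", 12_813, "11:14:05"),
    ("Most Frequent Word - Full Set", 2_000, "02:26:05"),
    ("Phonetically Balanced", 3_860, "04:55:10"),
    ("Form and Function - Word", 3_507, "04:40:29"),
]

total_segments = sum(segments for _, segments, _ in TAMIL_DOMAINS)
total_duration = to_hms(sum(to_seconds(d) for _, _, d in TAMIL_DOMAINS))

print(total_segments)   # 60287
print(total_duration)   # 139:11:41
```

Both sums agree with the headline figures of 60,287 audio segments and 139:11:41 hours. The same check can be run on the Telugu Raw Speech Corpus breakdown by swapping in its rows.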
Tamil Sentence Aligned Speech Corpus
Dataset Description: 74:57:59 hours | 46.4 GB | 48,572 Audio Segments | 433 Speakers
The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Tamil Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Tamil script. The dataset spans a duration of 74:57:59 (hh:mm:ss) and consists of read speech with continuous text, representative sentences, and date formats. The data is derived from 214 female and 219 male native Tamil speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Tamil Sentence Aligned Speech Documentation.
For research-based citations, please use the following:
1. Amudha R., Kamaraj S., Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Tamil Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. 978-81-19411-26-9.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3
Tangkhul Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Tangkhul Words: 28,405 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Tangkhul parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Tangkhul section includes 28,405 words and 194,103 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested.
For research-based citations, please use the following:
1. Yumnam Premila Chanu, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. Tangkhul Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-81-4.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Telugu Parallel Text Corpus: Linguistic Features and Structures
Total Words: 4,404,845 | Telugu Words: 22,829 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Telugu parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Telugu section includes 22,829 words and 159,641 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested.
For research-based citations, please use the following:
1. Dr. Modugu Kasimbabu, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Telugu Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-69-2.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Telugu Parts of Speech Annotated Corpus
37,840 Tags | 30,119 Words | 3,992 Sentences
The Linguistic Data Consortium for Indian Languages (LDC-IL) has developed Parts-of-Speech annotated corpora for Scheduled Indian languages. The corpus is annotated with Part-of-Speech (PoS) tags based on the Bureau of Indian Standards (BIS) PoS Tagset. This data is a significant resource for natural language processing and linguistic research. The Telugu PoS annotated corpus is automatically tagged and then verified by linguistic experts to ensure accuracy and consistency. It contains 37,840 Part-of-Speech tags.
For research-based citations, please use the following:
1. Dr. Modugu Kasimbabu, Dr. Narayan Choudhary, Rajesha N., Manasa G. 2026. Telugu Parts of Speech Annotated Corpus. Central Institute of Indian Languages, Mysore. 978-81-69175-20-3.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2026. LDC-IL Parts of Speech Annotated Corpus Based on BIS Framework. Central Institute of Indian Languages, Mysore. 978-81-69175-60-9.
Telugu Raw Speech Corpus
22:43:59 Hours | 15 GB | 80 Speakers | 10,510 Audio Segments | 48 kHz | 16 bit wav
Telugu is the official language of the states of Telangana and Andhra Pradesh. It belongs to the Dravidian language family and has the largest number of speakers among the Dravidian languages. Telugu is agglutinative, and its vocabulary is heavily influenced by Sanskrit. LDC-IL treats Telugu as having three distinct varieties and therefore collected speech data from Telangana, Rayalaseema and Coastal Andhra. The LDC-IL Telugu Speech dataset consists of several types of data: word lists, sentences, running texts and date formats. Each speaker recorded datasets randomly selected from a master dataset. Speech is in .wav format and metadata is in .txt format.
The available Speech Corpus details:
Total Speakers: 80 (24 Female and 56 Male)
Domain | Audio Segments | Duration (h:mm:ss)
Contemporary Text (News) | 77 | 8:28:19
Creative Text | 77 | 7:01:16
Sentence | 1,828 | 1:20:55
Date Format | 142 | 0:13:58
Command and Control Words | 2,170 | 1:43:49
Person Name | 1,438 | 1:09:31
Place Name | 707 | 0:33:24
Most Frequent Word - Part | 2,162 | 1:31:24
Most Frequent Word - Full Set | 1,909 | 0:41:23
A detailed explanation of the Telugu Speech Corpus will be available in the Telugu Speech Data Documentation.
For research-based citations, please use the following:
1. Ramamoorthy, L., Narayan Choudhary & Rajesha N. 2019. Telugu Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
2. Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
Telugu Sentence Aligned Speech Corpus
15:38:53 hours | 10.1 GB | 9,548 Audio Segments | 80 Speakers
The LDC-IL Telugu Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Telugu script. The dataset spans a duration of 15:38:53 (hh:mm:ss) and consists of read speech with continuous text, representative sentences, and date formats. The data is derived from 24 female and 56 male native Telugu speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Telugu Sentence Aligned Speech Documentation.
For research-based citations, please use the following:
1. Dr. Modugu Kasimbabu, Kavitha Lenin, Rajesha N., Manasa G., Stephen Fernandes, Nithin S., Roopashri M. R., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. Telugu Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. 978-93-48633-04-0.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
