Central Institute of Indian Languages

Thado Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Thado Words: 33,759 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Thado parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases, systematically structured around 159 grammatical categories. The Thado section includes 33,759 words and 172,320 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.
For research-based citations, please use the following:
1. Yumnam Premila Chanu, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. Thado Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-10-4.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

The Mother Tongue Parallel Text Corpus of India Vol. I

requests (6)

English and 147 mother tongues of India | 5,332 sentences
The Mother Tongue Parallel Text Corpus of India Vol. I comprises English and 147 mother tongues of India. Each corpus comprises a total of 5,332 sentences, systematically structured around 152 grammatical categories.
The parallel corpus contains the following languages:
1. Assamese, 2. Bengali, 3. Bodo/Boro, 4. Dogri, 5. Gujarati, 6. Hindi, 7. Kannada, 8. Kashmiri, 9. Konkani, 10. Maithili, 11. Malayalam, 12. Manipuri, 13. Marathi, 14. Nepali, 15. Odia, 16. Punjabi, 17. Sanskrit, 18. Santhali, 19. Sindhi, 20. Tamil, 21. Telugu, 22. Urdu, 23. Anal, 24. Angami, 25. Apatani, 26. Are, 27. Awadhi, 28. Bagheli/Baghel Khandi, 29. Bagri, 30. Bagri Rajasthani, 31. Balti, 32. Bhadrawahi, 33. Bharmauri/Gaddi, 34. Bhojpuri, 35. Bilaspuri Kahluri, 36. Brajbhasha, 37. Bundeli/Bundelkhandi, 38. Chakru/Chokri, 39. Chambeali/Chamrali, 40. Chang, 41. Chhattisgarhi, 42. Chirr, 43. Chungli, 44. Churahi, 45. Coorgi/Kodagu, 46. Deori, 47. Dhundhari, 48. Dimasa, 49. Gangte, 50. Garhwali, 51. Garo, 52. Gujari, 53. Gujjari/Gujar/Gojri, 54. Halabi, 55. Handuri, 56. Hara/Harauti, 57. Haryanvi, 58. Hindi Multani, 59. Irula/Irular Mozhi, 60. Kabui, 61. Kangri, 62. Kachchhi, 63. Karbi/Mikir, 64. Khandeshi, 65. Khari Boli, 66. Khasi, 67. Khezha, 68. Khiemnungan, 69. Khortha/Khotta, 70. Kisan, 71. Kodava, 72. Kokbarak, 73. Kolami, 74. Koli, 75. Kom, 76. Konda, 77. Konyak, 78. Koya, 79. Kudubi/Kudumbi, 80. Kuki, 81. Kurmali Thar, 82. Ladakhi, 83. Lepcha, 84. Liangmei, 85. Limbu, 86. Lotha, 87. Lyngngam, 88. Magadhi/Magahi, 89. Malvi, 90. Mao, 91. Mara, 92. Maram, 93. Maring, 94. Mech/Mechhia, 95. Mewari, 96. Mewati, 97. Miri/Mishing, 98. Mishmi, 99. Mizo, 100. Mongsen, 101. Monpa, 102. Mundari, 103. Muwasi, 104. Nawait, 105. Nimadi, 106. Nissi, 107. Nocte, 108. Pahari, 109. Paite, 110. Palmuha, 111. Pania, 112. Paola, 113. Pawari/Powari, 114. Phom, 115. Pnar/Synteng, 116. Pochury, 117. Purkhi, 118. Rai, 119. Rajasthani, 120. Reang, 121. Rengma, 122. Rongmei, 123. Sadan/Sadri, 124. Sambalpuri, 125. Sangtam, 126. Saurashtra/Saurashtri, 127. Sema, 128. Shina, 129. Sirmauri, 130. Sugali, 131. Surjapuri, 132. Talgalo (galo), 133. Tangkhul, 134. Thado/Thadou, 135. Tibetan, 136. Tikhir, 137. Tripuri, 138. Tulu, 139. Vaiphei, 140. Wagdi, 141. Wancho, 142. Yimchungre, 143. Yerava, 144. Yerukala/Yerukula, 145. Zeliang, 146. Zemi, 147. Zou
The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.
For research-based citations, please use the following:
1. Dr. Narayan Kumar Choudhary, Writtik Bhattacharya, Dr. Saritha S.L., Dr. Amudha R., Dr. Sajila S., Dr. Satyaendra Kumar Awasthi, Dr. Rejitha K. S., Amom Nandaraj Meetei, Dr. Vijayalaxmi F Patil, Dr. Shahnawaz Alam, Yumnam Premila Chanu, Saurabh Varik, Dr. Mansoor Khan, Chetan Baji, Sonali Sutradhar, Umesh Chamling Rai, Bhageshree K Khandale, Dr. Zargar Adil Ahmad, Dr. Modugu Kasimbabu, Dr. Kamaraj S., Syeda Mustafiza Tamim, Dr. Shalinder Singh, Hemlata Daimary, Poulami Das, Shivangi Priya, Neha Dixit, Anand Jain, Abhishek Avtans, Akanksha Srivastava, Prangshu Manjul, Ankita Tiwari, Prof. Shailendra Mohan et al. 2025. The Mother Tongue Parallel Text Corpus of India Vol. I. Central Institute of Indian Languages, Mysore. 978-93-48633-08-8.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Tibetan Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Tibetan Words: 5,497 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Tibetan parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases, systematically structured around 159 grammatical categories. The Tibetan section includes 5,497 words and 176,021 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.
For research-based citations, please use the following:
1. Umesh Chamling Rai, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Tibetan Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-68-5.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Tikhir Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Tikhir Words: 29,964 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Tikhir parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases, systematically structured around 159 grammatical categories. The Tikhir section includes 29,964 words and 172,906 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.
For research-based citations, please use the following:
1. Amom Nandaraj Meetei, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Tikhir Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-36-4.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Tripuri Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Tripuri Words: 32,631 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Tripuri parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases, systematically structured around 159 grammatical categories. The Tripuri section includes 32,631 words and 162,746 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.
For research-based citations, please use the following:
1. Amom Nandaraj Meetei, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Tripuri Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-21-0.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Tulu Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Tulu Words: 23,245 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Tulu parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases, systematically structured around 159 grammatical categories. The Tulu section includes 23,245 words and 150,029 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.
For research-based citations, please use the following:
1. Dr. Sajila S., Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Tulu Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-42-5.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Tulu Raw Text Corpus

requests (0)

816,073 Words | 6,190,666 characters | 55 Titles
Tulu is one of the ancient and culturally rich languages of South India, belonging to the Dravidian language family. It has about 18.4 lakh (1.84 million) speakers and is mainly spoken in the coastal region known as Tulu Nadu, covering the Dakshina Kannada and Udupi districts of Karnataka and the Kasargod district of Kerala. It has a rich literary and oral tradition and carries profound cultural and historical significance within Tulu Nadu.
The Tulu Raw Text Corpus is an extensive repository of Tulu textual material. The data has been collected from books. A detailed explanation of the Tulu Raw Text Corpus will be available in the Tulu Raw Text Corpus Documentation.
For research-based citations, please use the following:
1. Dr. Sajila S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Tulu Raw Text Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-69175-97-5.
2. Narayan Choudhary. LDC-IL: The Indian repository of resources for language technology. Lang Resources & Evaluation 55, 855–867 (2021). https://doi.org/10.1007/s10579-020-09523-3
3. Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10...

Urdu Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Urdu Words: 34,329 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Urdu parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases, systematically structured around 159 grammatical categories. The Urdu section includes 34,329 words and 147,375 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.
For research-based citations, please use the following:
1. Dr. Shahnawaz Alam, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Urdu Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-14-2.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Urdu Parts of Speech Annotated Corpus

requests (0)

1,502,213 Tags | 1,374,380 Words | 63,184 Sentences
The Linguistic Data Consortium for Indian Languages (LDC-IL) has developed Parts-of-Speech annotated corpora for the Scheduled Indian languages, including Urdu. The corpus is annotated with Part-of-Speech (PoS) tags based on the Bureau of Indian Standards (BIS) PoS Tagset. This data is a significant resource for natural language processing and linguistic research. The Urdu PoS annotated corpus is automatically tagged and then verified by linguistic experts to ensure accuracy and consistency. It contains 1,502,213 Part-of-Speech tags.
For research-based citations, please use the following:
1. Dr. Mansoor Khan, Dr. Shahnawaz Alam, Bi Bi Mariyam, Dr. Narayan Choudhary. 2026. Urdu Parts of Speech Annotated Corpus. Central Institute of Indian Languages, Mysore. 978-81-69175-51-7.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2026. LDC-IL Parts of Speech Annotated Corpus Based on BIS Framework. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Urdu Raw Speech Corpus

requests (11)

99:18:21 Hours | 64.2 GB | 499 Speakers | 88,708 Audio Segments | 48 kHz | 16 bit wav
Urdu is one of the Modern Indo-Aryan languages of India. It evolved from Shauraseni Apabhramsha and uses the Perso-Arabic script. The language in a region is influenced by the other languages of the region, the mother tongue of the speaker, and similar factors. Reading speed, loudness, frequency and so on also differ with factors such as age and gender. The Linguistic Data Consortium collected the speech corpus through fieldwork. This read-speech data was collected from male and female native speakers of various age groups. It includes texts, sentences, date formats, and different wordlists.
The available Speech Corpus details:
Total Speakers: 499 (252 Female and 247 Male)
Domain | Audio Segments | Duration
Contemporary Text (News) | 431 | 25:35:02
Creative Text | 433 | 19:40:11
Sentence | 10,646 | 8:00:38
Date Format | 846 | 0:43:37
Command and Control Words | 13,580 | 9:21:01
Person Name | 6,577 | 2:55:41
Place Name | 4,273 | 1:09:17
Most Frequent Word - Part | 12,802 | 7:46:28
Most Frequent Word - Full Set | 18,927 | 11:38:30
Phonetically Balanced Vocabulary | 13,646 | 8:13:20
Form and Function Word | 6,547 | 4:14:36
A detailed explanation of the Urdu Speech Corpus will be available in the Urdu Speech Data Documentation.
For research-based citations, please use the following:
1. Ramamoorthy, L., Narayan Choudhary, Mansoor Khan, Shahnawaz Alam, Bi Bi Mariyam & Rushda Idris Khan. 2019. Urdu Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
2. Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. "LDC-IL Raw Speech Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
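The per-domain figures listed for the Urdu Raw Speech Corpus can be cross-checked against the headline totals (88,708 audio segments, 99:18:21 hours). The following is a minimal sketch in Python; the variable and function names are illustrative and are not part of any LDC-IL tooling, and the figures are copied from the listing itself.

```python
# Per-domain figures copied from the listing: (domain, audio segments, duration h:mm:ss).
DOMAINS = [
    ("Contemporary Text (News)", 431, "25:35:02"),
    ("Creative Text", 433, "19:40:11"),
    ("Sentence", 10646, "8:00:38"),
    ("Date Format", 846, "0:43:37"),
    ("Command and Control Words", 13580, "9:21:01"),
    ("Person Name", 6577, "2:55:41"),
    ("Place Name", 4273, "1:09:17"),
    ("Most Frequent Word - Part", 12802, "7:46:28"),
    ("Most Frequent Word - Full Set", 18927, "11:38:30"),
    ("Phonetically Balanced Vocabulary", 13646, "8:13:20"),
    ("Form and Function Word", 6547, "4:14:36"),
]


def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to an h:mm:ss string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(segments for _, segments, _ in DOMAINS)
total_duration = to_hms(sum(to_seconds(duration) for _, _, duration in DOMAINS))

print(total_segments, total_duration)
```

Both computed totals agree with the headline figures in the listing.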

Urdu Sentence Aligned Speech Corpus

requests (4)

Dataset Description: 50:09:56 hours | 32.3 GB | 32,384 Audio Segments | 434 speakers
The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Urdu Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Perso-Arabic script. The dataset spans a duration of 50:09:56 (hh:mm:ss) and consists of read speech with continuous text, representative sentences, and date formats. The data is derived from 214 female and 219 male native Urdu speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Urdu Sentence Aligned Speech Documentation.
For research-based citations, please use the following:
1. Mansoor Khan, Shahnawaz Alam, Bi Bi Mariyam, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Urdu Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. 978-81-19411-87-0.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3.

Vaiphei Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Vaiphei Words: 39,424 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Vaiphei parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases, systematically structured around 159 grammatical categories. The Vaiphei section includes 39,424 words and 171,508 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.
For research-based citations, please use the following:
1. Amom Nandaraj Meetei, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Vaiphei Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-99-9.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Wagdi Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Wagdi Words: 31,752 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Wagdi parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases, systematically structured around 159 grammatical categories. The Wagdi section includes 31,752 words and 130,757 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.
For research-based citations, please use the following:
1. Chetan Baji, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Wagdi Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-13-5.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Wancho Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Wancho Words: 34,611 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Wancho parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases, systematically structured around 159 grammatical categories. The Wancho section includes 34,611 words and 167,197 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.
For research-based citations, please use the following:
1. Syeda Mustafiza Tamim, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Wancho Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-46-3.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Yerava Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Yerava Words: 24,453 | 5,332 sentences/phrases in each mother tongue
India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Yerava parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases, systematically structured around 159 grammatical categories. The Yerava section includes 24,453 words and 153,083 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).
The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.
For research-based citations, please use the following:
1. Dr. Saritha S.L., Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Yerava Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-40-1.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...