Central Institute of Indian Languages

Karbi/Mikir Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Karbi/Mikir Words: 30,400 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Karbi/Mikir parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured according to 159 grammatical categories. The Karbi/Mikir section includes 30,400 words and 183,861 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.

For research-based citations, please use the following:
1. Umesh Chamling Rai, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Karbi/Mikir Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN 978-81-69099-41-7.
2. Rejitha K. S. and Narayan Kumar Choudhary (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN 978-93-48633-33-0...

Kashmiri Parts of Speech Annotated Corpus

requests (0)

103,488 Tags | 92,746 Words | 5,331 Sentences

The Linguistic Data Consortium for Indian Languages (LDC-IL) has developed Parts-of-Speech annotated corpora for Scheduled Indian languages. The corpus is annotated with Part-of-Speech (PoS) tags based on the Bureau of Indian Standards (BIS) PoS Tagset. This data is a significant resource for natural language processing and linguistic research. LDC-IL developed annotated text corpora for Kashmiri. The Kashmiri PoS annotated corpus is automatically tagged and then verified by linguistic experts to ensure accuracy and consistency. The Kashmiri PoS annotated corpus contains 103,488 Part-of-Speech tags.

For research-based citations, please use the following:
1. Dr. Zargar Adil Ahmad, Dr. Narayan Choudhary, Rajesha N., Manasa G. 2026. Kashmiri Parts of Speech Annotated Corpus. Central Institute of Indian Languages, Mysore. ISBN 978-81-69175-77-7.
2. Rejitha K. S. and Narayan Kumar Choudhary (ed.). 2026. LDC-IL Parts of Speech Annotated Corpus Based on BIS Framework. Central Institute of Indian Languages, Mysore. ISBN 978-81-69175-60-9...

Kashmiri Raw Speech Corpus

requests (15)

28:10:07 Hours | 18 GB speech data | 150 Speakers | 16,380 Audio segments | 48 kHz | 16-bit WAV

The Kashmiri language belongs to the Dardic group of the Indo-Aryan family. It is known by the names ‘Kashur’ and ‘Kashmiri’. It is primarily spoken in the Kashmir Valley and the Pir-Panchal range of the Jammu region. Kashmiri has two types of dialects: regional dialects and social dialects. Apart from the Kashmiri spoken in the valley itself, other varieties are spoken outside the valley, and those varieties are considered regional dialects of Kashmiri. These regional dialects are Kishtawari, Poguli and Rambani. Kashmiri also has three social dialects, known by the names Yamraz, Marak and Kamraz.

The LDC-IL speech data was collected in the Kashmir Valley, from Pulwama, Srinagar, and Anantnag. The data was collected from both genders across different age groups. The LDC-IL Kashmiri speech data consists of different types of datasets made up of words, sentences, running texts and date formats. Each speaker recorded datasets that were randomly selected from a master dataset.

The available speech corpus details:

Total Speakers: 150 (78 Female and 72 Male)

Domain | Audio Segments | Duration of Each Domain
Contemporary Text (News) | 147 | 3:56:57
Creative Text | 148 | 12:41:33
Sentence | 3704 | 2:40:24
Date Format | 281 | 0:10:36
Command and Control Words | 4288 | 3:04:32
Person Name | 2065 | 1:53:21
Place Name | 1468 | 1:04:37
Most Frequent Word - Part | 4279 | 2:38:07

A detailed explanation of the Kashmiri Speech Corpus will be available in the Kashmiri Speech Data Documentation.

For research-based citations, please use the following:
1. Narayan Kumar Choudhary, Shahid Mushtaq Bhatt, Rajesha N., Manasa G. 2021. Kashmiri Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
2. Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
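
The per-domain figures in the Kashmiri Raw Speech Corpus entry above can be cross-checked against the headline totals (16,380 audio segments, 28:10:07 hours). The following is a minimal sketch of that arithmetic, not part of any LDC-IL tooling; the names `DOMAINS`, `parse_duration` and `format_duration` are illustrative, and the values are simply copied from the listing.

```python
from datetime import timedelta

# Per-domain figures copied from the listing: (domain, audio segments, "H:MM:SS").
DOMAINS = [
    ("Contemporary Text (News)", 147, "3:56:57"),
    ("Creative Text", 148, "12:41:33"),
    ("Sentence", 3704, "2:40:24"),
    ("Date Format", 281, "0:10:36"),
    ("Command and Control Words", 4288, "3:04:32"),
    ("Person Name", 2065, "1:53:21"),
    ("Place Name", 1468, "1:04:37"),
    ("Most Frequent Word - Part", 4279, "2:38:07"),
]


def parse_duration(text: str) -> timedelta:
    """Convert an "H:MM:SS" string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_duration(delta: timedelta) -> str:
    """Render a timedelta as "H:MM:SS", letting hours exceed 24."""
    total_seconds = int(delta.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(segments for _, segments, _ in DOMAINS)
total_duration = sum(
    (parse_duration(duration) for _, _, duration in DOMAINS), timedelta()
)

print(total_segments)                   # 16380
print(format_duration(total_duration))  # 28:10:07
```

Both sums agree with the totals stated in the entry's first line, which suggests the per-domain rows are complete as listed.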

Kashmiri Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Kashmiri Words: 30,820 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Kashmiri parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured according to 159 grammatical categories. The Kashmiri section includes 30,820 words and 163,222 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.

For research-based citations, please use the following:
1. Dr. Zargar Adil Ahmad, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Kashmiri Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN 978-81-69099-14-1.
2. Rejitha K. S. and Narayan Kumar Choudhary (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN 978-93-48633-33-0...

Khandeshi Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Khandeshi Words: 27,088 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Khandeshi parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured according to 159 grammatical categories. The Khandeshi section includes 27,088 words and 153,178 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.

For research-based citations, please use the following:
1. Mr. Saurabh Varik, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Khandeshi Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN 978-81-69099-20-2.
2. Rejitha K. S. and Narayan Kumar Choudhary (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN 978-93-48633-33-0...

Khari Boli Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Khari Boli Words: 31,931 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Khari Boli parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured according to 159 grammatical categories. The Khari Boli section includes 31,931 words and 142,150 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.

For research-based citations, please use the following:
1. Dr. Satyaendra Kumar Awasthi, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Khari Boli Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN 978-81-69099-12-7.
2. Rejitha K. S. and Narayan Kumar Choudhary (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN 978-93-48633-33-0...

Khasi Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Khasi Words: 41,762 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Khasi parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured according to 159 grammatical categories. The Khasi section includes 41,762 words and 184,587 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.

For research-based citations, please use the following:
1. Yumnam Premila Chanu, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. Khasi Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN 978-81-69099-90-5.
2. Rejitha K. S. and Narayan Kumar Choudhary (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN 978-93-48633-33-0...

Khezha Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Khezha Words: 29,611 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Khezha parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured according to 159 grammatical categories. The Khezha section includes 29,611 words and 150,008 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.

For research-based citations, please use the following:
1. Dr. Kamaraj S., Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Khezha Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN 978-81-69099-03-5.
2. Rejitha K. S. and Narayan Kumar Choudhary (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN 978-93-48633-33-0...

Khiemnungan Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Khiemnungan Words: 30,359 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Khiemnungan parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured according to 159 grammatical categories. The Khiemnungan section includes 30,359 words and 157,110 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.

For research-based citations, please use the following:
1. Dr. Kamaraj S., Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Khiemnungan Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN 978-81-69099-84-4.
2. Rejitha K. S. and Narayan Kumar Choudhary (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN 978-93-48633-33-0...

Khortha/Khotta Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Khortha/Khotta Words: 31,131 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Khortha/Khotta parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured according to 159 grammatical categories. The Khortha/Khotta section includes 31,131 words and 141,319 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.

For research-based citations, please use the following:
1. Dr. Satyaendra Kumar Awasthi, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Khortha/Khotta Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN 978-81-69099-22-6.
2. Rejitha K. S. and Narayan Kumar Choudhary (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN 978-93-48633-33-0...

Kisan Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Kisan Words: 26,035 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Kisan parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured according to 159 grammatical categories. The Kisan section includes 26,035 words and 179,192 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.

For research-based citations, please use the following:
1. Sonali Sutradhar, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Kisan Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN 978-81-69099-21-9.
2. Rejitha K. S. and Narayan Kumar Choudhary (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN 978-93-48633-33-0...

Kodava Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Kodava Words: 25,044 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Kodava parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured according to 159 grammatical categories. The Kodava section includes 25,044 words and 166,544 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.

For research-based citations, please use the following:
1. Dr. Sajila S., Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Kodava Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN 978-81-69099-62-2.
2. Rejitha K. S. and Narayan Kumar Choudhary (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN 978-93-48633-33-0...

Kokbarak Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Kokbarak Words: 27,187 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Kokbarak parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured according to 159 grammatical categories. The Kokbarak section includes 27,187 words and 175,194 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.

For research-based citations, please use the following:
1. Umesh Chamling Rai, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Kokbarak Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN 978-81-69099-48-6.
2. Rejitha K. S. and Narayan Kumar Choudhary (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN 978-93-48633-33-0...

Kolami Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Kolami Words: 23,108 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Kolami parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured according to 159 grammatical categories. The Kolami section includes 23,108 words and 153,502 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.

For research-based citations, please use the following:
1. Dr. Modugu Kasimbabu, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Kolami Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN 978-81-69099-64-6.
2. Rejitha K. S. and Narayan Kumar Choudhary (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN 978-93-48633-33-0...

Koli Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Koli Words: 27,388 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Koli parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured according to 159 grammatical categories. The Koli section includes 27,388 words and 147,969 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested by the seeker.

For research-based citations, please use the following:
1. Bhageshree K. Khandale, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Koli Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN 978-81-69099-67-7.
2. Rejitha K. S. and Narayan Kumar Choudhary (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN 978-93-48633-33-0...