Central Institute of Indian Languages

Lotha Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Lotha Words: 31,952 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Lotha parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured according to 159 grammatical categories. The Lotha section includes 31,952 words and 171,180 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.

For research-based citations, please use the following:
1. Dr. Kamaraj S., Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Lotha Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69099-58-5.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Lyngngam Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Lyngngam Words: 36,398 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Lyngngam parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured according to 159 grammatical categories. The Lyngngam section includes 36,398 words and 172,650 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.

For research-based citations, please use the following:
1. Amom Nandaraj Meetei, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Lyngngam Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69099-81-3.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Magadhi/Magahi Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Magadhi/Magahi Words: 32,616 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Magadhi/Magahi parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured according to 159 grammatical categories. The Magadhi/Magahi section includes 32,616 words and 149,260 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.

For research-based citations, please use the following:
1. Sonali Sutradhar, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Magadhi/Magahi Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69099-38-7.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Maithili Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Maithili Words: 30,959 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Maithili parallel text corpus is connected with Hindi and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured according to 159 grammatical categories. The Maithili section includes 30,959 words and 141,773 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.

For research-based citations, please use the following:
1. Ankita Tiwari, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Maithili Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69099-09-7.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Maithili Parts of Speech Annotated Corpus

requests (0)

427,438 Tags | 371,983 Words | 25,431 Sentences

The Linguistic Data Consortium for Indian Languages (LDC-IL) has developed Parts-of-Speech annotated corpora for the Scheduled Indian languages. The corpus is annotated with Part-of-Speech (PoS) tags based on the Bureau of Indian Standards (BIS) PoS Tagset. This data is a significant resource for natural language processing and linguistic research. The Maithili PoS annotated corpus was tagged automatically and then verified by linguistic experts to ensure accuracy and consistency. It contains 427,438 Part-of-Speech tags.

For research-based citations, please use the following:
1. Dinesh Mishra, Dr. Narayan Choudhary, Rajesha N., Manasa G. 2026. Maithili Parts of Speech Annotated Corpus. Central Institute of Indian Languages, Mysore. 978-81-69175-29-6.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2026. LDC-IL Parts of Speech Annotated Corpus Based on BIS Framework. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Maithili Raw Speech Corpus

requests (20)

78:45:33 Hours | 49.2 GB | 306 Speakers | 45,198 Audio Segments | 48 kHz | 16-bit WAV

Maithili is an Indo-Aryan language, a direct descendant of Sanskrit, spoken in the states of Bihar and Jharkhand and in parts of Nepal. It is one of the scheduled languages of India. The LDC-IL speech data was collected from the Sotipura, Bajjika and Thethi geographic dialects, from speakers of both genders and different age groups.

The available Speech Corpus details:
Total Speakers: 306 (150 Female and 156 Male)

Domain | Audio Segments | Duration
Contemporary Text (News) | 291 | 22:33:41
Creative Text | 294 | 15:34:55
Sentence | 7,451 | 07:08:48
Date Format | 585 | 00:31:41
Command and Control Words | 8,924 | 07:07:34
Person Name | 5,917 | 07:49:33
Place Name | 2,952 | 02:47:49
Most Frequent Word - Part | 8,699 | 06:56:24
Most Frequent Word - Full Set | 5,996 | 04:58:30
Phonetically Balanced Words | 3,040 | 02:26:27
Form and Function Words | 1,049 | 00:50:11

A detailed explanation of the Maithili Speech Corpus will be available in the Maithili Speech Data Documentation.

For research-based citations, please use the following:
1. Ramamoorthy, L., Narayan Choudhary, Arun Kumar Singh, Dinesh Mishra & Atuleshwar Jha. 2019. Maithili Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
2. Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
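
The per-domain figures in the Maithili Raw Speech Corpus entry can be cross-checked against the headline totals (45,198 audio segments, 78:45:33 hours). The following is a minimal sketch of that arithmetic; the row values are copied from the listing, while the function names (`parse_hms`, `format_hms`, `summarize`) are hypothetical helpers and not part of any LDC-IL tooling.

```python
from datetime import timedelta

# (domain, audio segments, duration) rows copied from the listing above.
MAITHILI_DOMAINS = [
    ("Contemporary Text (News)", 291, "22:33:41"),
    ("Creative Text", 294, "15:34:55"),
    ("Sentence", 7451, "07:08:48"),
    ("Date Format", 585, "00:31:41"),
    ("Command and Control Words", 8924, "07:07:34"),
    ("Person Name", 5917, "07:49:33"),
    ("Place Name", 2952, "02:47:49"),
    ("Most Frequent Word - Part", 8699, "06:56:24"),
    ("Most Frequent Word - Full Set", 5996, "04:58:30"),
    ("Phonetically Balanced Words", 3040, "02:26:27"),
    ("Form and Function Words", 1049, "00:50:11"),
]


def parse_hms(text: str) -> timedelta:
    """Parse an hh:mm:ss string; the hour field may exceed 24."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_hms(delta: timedelta) -> str:
    """Format a timedelta as hh:mm:ss without rolling hours over into days."""
    total = int(delta.total_seconds())
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def summarize(rows):
    """Return (total segments, total duration as hh:mm:ss) for the given rows."""
    total_segments = sum(segments for _, segments, _ in rows)
    total_duration = sum((parse_hms(d) for _, _, d in rows), timedelta())
    return total_segments, format_hms(total_duration)


if __name__ == "__main__":
    segments, duration = summarize(MAITHILI_DOMAINS)
    print(f"segments={segments:,} duration={duration}")
```

Summing with `timedelta` avoids manual carry handling between seconds, minutes and hours, and the same helper works for any of the other speech corpus tables in this listing.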

Maithili Raw Speech Corpus Vol. II

requests (3)

109:09:50 hours | 206 Audio Segments | 122 Speakers

The LDC-IL Maithili Raw Speech dataset Vol. II comprises audio files in WAV format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. The dataset spans a duration of 109:09:50 (hh:mm:ss) and consists of read speech with continuous text and spontaneous speech, along with its transcription in Devanagari. The data is derived from 49 female and 73 male native Maithili speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Maithili Raw Speech Documentation.

For research-based citations, please use the following:
1. Shantanu Kumar, Ankita Tiwari, Rajesha N., Manasa G., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. Maithili Raw Speech Corpus Vol. II. Central Institute of Indian Languages, Mysore. 978-93-48633-37-8.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Maithili Sentence Aligned Speech Corpus

requests (10)

Dataset Description: 41:54:30 hours | 26 GB | 21,412 Audio Segments | 300 speakers

The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Maithili Sentence Aligned Speech dataset comprises audio files in WAV format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. The dataset spans a duration of 41:54:30 (hh:mm:ss) and consists of read speech with continuous text, representative sentences, and date formats. The data is derived from 147 female and 153 male native Maithili speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Maithili Sentence Aligned Speech Documentation.

For research-based citations, please use the following:
1. Shantanu Kumar, Dinesh Mishra, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Maithili Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. 978-81-19411-96-2.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3..

Maithili Sentence Aligned Speech Corpus (Tirhuta Script)

requests (2)

41:54:30 hours | 26 GB | 21,412 Audio Segments | 300 speakers

The LDC-IL Maithili Sentence Aligned Speech Corpus (Tirhuta Script) dataset comprises audio files in WAV format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Tirhuta script. The dataset spans a duration of 41:54:30 (hh:mm:ss) and consists of read speech with continuous text, representative sentences, and date formats. The data is derived from 147 female and 153 male native Maithili speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the LDC-IL Maithili Sentence Aligned Speech Corpus (Tirhuta Script) Documentation.

For research-based citations, please use the following:
1. Dinesh Mishra, Shantanu Kumar, Dr. Narayan Kumar Choudhary, Rajesha N., Prof. Shailendra Mohan. Maithili Sentence Aligned Speech Corpus (Tirhuta Script). Central Institute of Indian Languages, Mysore. 978-93-48633-51-4.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0..

Maithili Text to Speech Corpus

requests (2)

30:59:20 hours | 19.56 GB | 32,260 Audio Segments | 2 Speakers

The LDC-IL Maithili Text to Speech dataset comprises audio files in WAV format, accompanied by a corresponding textual layer in Devanagari script. The dataset spans a duration of 32:42:20 (hh:mm:ss) and consists of read speech recorded in a studio setup. The data is derived from one female and one male native Maithili speaker. A comprehensive explanation of the dataset can be found in the Maithili Text to Speech Documentation.

For research-based citations, please use the following:
1. Shantanu Kumar, Dinesh Mishra, Saurabh Varik, Stephen Fernandes, Nithin S., Roopashri M. R., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. Maithili Text to Speech Corpus. Central Institute of Indian Languages, Mysore. 978-93-48633-36-1.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Malayalam Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Malayalam Words: 20,955 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Malayalam parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured according to 159 grammatical categories. The Malayalam section includes 20,955 words and 168,051 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.

For research-based citations, please use the following:
1. Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Malayalam Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69099-65-3.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Malayalam Parts of Speech Annotated Corpus

requests (0)

1,322,728 Tags | 1,089,199 Words | 149,781 Sentences

The Linguistic Data Consortium for Indian Languages (LDC-IL) has developed Parts-of-Speech annotated corpora for the Scheduled Indian languages. The corpus is annotated with Part-of-Speech (PoS) tags based on the Bureau of Indian Standards (BIS) PoS Tagset. This data is a significant resource for natural language processing and linguistic research. The Malayalam PoS annotated corpus was tagged automatically and then verified by linguistic experts to ensure accuracy and consistency. It contains 1,322,728 Part-of-Speech tags.

For research-based citations, please use the following:
1. Dr. Rejitha K. S., Dr. Saritha S. L., Dr. Sajila S., Dr. Narayan Choudhary. 2026. Malayalam Parts of Speech Annotated Corpus. Central Institute of Indian Languages, Mysore. 978-81-69175-34-0.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2026. LDC-IL Parts of Speech Annotated Corpus Based on BIS Framework. Central Institute of Indian Languages, Mysore. 978-81-69175-60-9...

Malayalam Raw Speech Corpus

requests (21)

164:01:02 Hours | 105 GB | 458 Speakers | 43,670 Audio Segments | 48 kHz | 16-bit WAV

Malayalam is the official language of Kerala and the Laccadive Islands. It belongs to the Dravidian language family. Because the language of the Travancore, Cochin and Malabar regions, which together formed Kerala, has been shaped by different internal and external factors, LDC-IL treats Malayalam as having three distinct varieties and therefore collected speech data from Thiruvananthapuram, Ernakulam and Kozhikode. LDC-IL has 164 hours of Malayalam speech data. The LDC-IL Malayalam Speech data set consists of several types of datasets: word lists, sentences, running texts and date formats. Approximately 15 minutes of speech per speaker was recorded from 231 female and 227 male native speakers of different age groups. Each speaker recorded items randomly selected from a master dataset.

The available Speech Corpus details:
Total Speakers: 458 (231 Female and 227 Male)

Domain | Audio Segments | Duration
Contemporary Text (News) | 449 | 71:29:21
Creative Text | 449 | 54:41:20
Sentence | 7,452 | 06:56:46
Date Format | 598 | 00:53:45
Command and Control Words | 8,923 | 07:09:37
Person Name | 5,819 | 05:26:33
Place Name | 2,906 | 02:28:24
Most Frequent Word - Part | 8,763 | 06:51:31
Most Frequent Word - Full Set | 1,979 | 02:08:58
Phonetically Balanced | 3,096 | 02:40:09
Form and Function - Word | 3,236 | 03:14:38

A detailed explanation of the Malayalam Speech Corpus will be available in the Malayalam Speech Data Documentation.

For research-based citations, please use the following:
1. Ramamoorthy, L., Narayan Choudhary, Saritha S.L., Rejitha K.S., Sajila S. & Midhun P. G. 2019. Malayalam Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
2. Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
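
The header figures of the Malayalam Raw Speech Corpus entry (164:01:02 hours, 48 kHz, 16 bit) are enough for a rough estimate of uncompressed PCM audio size, which can then be compared with the listed 105 GB. The sketch below is illustrative only: the listing does not state the channel count, so it is a parameter here; WAV header overhead is ignored; and the function names are hypothetical.

```python
def hms_to_seconds(text: str) -> int:
    """Convert an hh:mm:ss string (hours may exceed 24) to whole seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def pcm_bytes(duration_s: int, sample_rate_hz: int = 48_000,
              bit_depth: int = 16, channels: int = 1) -> int:
    """Raw PCM payload size in bytes: seconds x samples/s x bytes/sample x channels."""
    return duration_s * sample_rate_hz * (bit_depth // 8) * channels


if __name__ == "__main__":
    duration_s = hms_to_seconds("164:01:02")  # duration from the listing header
    # The channel count is an assumption; both common layouts are shown.
    for channels in (1, 2):
        size = pcm_bytes(duration_s, channels=channels)
        print(f"channels={channels}: {size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
```

Printing both decimal gigabytes and binary gibibytes is deliberate, since the listing does not say which convention its "GB" figure follows.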

Malayalam Sentence Aligned Speech Corpus

requests (8)

Dataset Description: 123:29:55 hours | 79.6 GB | 89,269 Audio Segments | 451 speakers

The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Malayalam Sentence Aligned Speech dataset comprises audio files in WAV format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Malayalam script. The dataset spans a duration of 123:29:55 (hh:mm:ss) and consists of read speech with continuous text, representative sentences, and date formats. The data is derived from 229 female and 222 male native Malayalam speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Malayalam Sentence Aligned Speech Documentation.

For research-based citations, please use the following:
1. Rejitha K. S., Sajila S., Saritha S.L., Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Malayalam Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. 978-81-19411-58-0.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3..

Malvi Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Malvi Words: 32,454 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Malvi parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured according to 159 grammatical categories. The Malvi section includes 32,454 words and 142,183 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.

For research-based citations, please use the following:
1. Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Malvi Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69099-07-3.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...