Central Institute of Indian Languages

Manipuri Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Manipuri Words: 23,963 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Manipuri parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Manipuri section includes 23,963 words and 146,608 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested.

For any research-based citations, please use the following:
1. Yumnam Premila Chanu, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. Manipuri Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN: 978-81-69099-34-9.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0.

Manipuri Parts of Speech Annotated Corpus

requests (0)

543,469 Tags | 449,451 Words | 44,619 Sentences

The Linguistic Data Consortium for Indian Languages (LDC-IL) has developed Parts-of-Speech annotated corpora for Scheduled Indian languages. The corpus is annotated with Part-of-Speech (PoS) tags based on the Bureau of Indian Standards (BIS) PoS Tagset. This data is a significant resource for natural language processing and linguistic research. The Manipuri PoS annotated corpus, developed by LDC-IL, is automatically tagged and then verified by linguistic experts to ensure accuracy and consistency. It contains 543,469 Part-of-Speech tags.

For any research-based citations, please use the following:
1. Amom Nandaraj Meetei, Yumnam Premila Chanu, Dr. Narayan Choudhary, Rajesha N. 2026. Manipuri Parts of Speech Annotated Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-69175-43-2.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2026. LDC-IL Parts of Speech Annotated Corpus Based on BIS Framework. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0.

Manipuri Raw Speech Corpus

requests (13)

156:28:32 hours | 100 GB | 620 Speakers | 66,231 Audio segments | 48 kHz | 16 bit wav

Manipuri is the administrative language of Manipur. The LDC-IL speech data for Manipuri aims to capture the distinctive characteristics of speech shared by the different regional dialects of Manipur. To that end, linguistic features that identify regional tones and intonations, phonemic distributions, and the varied pronunciations of both regional and non-regional vocabulary items, such as person names and place names, are represented in the dataset according to standard parameters.

For the read speech corpus, the subset read by each speaker is randomly generated from the entire dataset, so that each random set is read by one speaker. A limited number of full sets are read in their entirety by selected speakers in each age group. The data was collected through fieldwork from three regional dialects: Imphal, Kakching, and Awang Sekmai. The age groups selected for fieldwork are '16 to 20', '21 to 50', and 'above 50 years'. An equal amount of data was collected from male and female speakers in each age group.

The available speech corpus details:

Total Speakers: 620 (310 female and 310 male)

Domain | Audio Segments | Duration
Contemporary Text (News) | 530 | 59:47:22
Creative Text | 588 | 53:59:03
Sentence | 10,979 | 10:01:41
Date Format | 866 | 01:12:04
Command and Control Words | 13,129 | 08:00:02
Person Name | 8,789 | 07:14:04
Place Name | 4,394 | 02:46:29
Most Frequent Word - Part | 13,167 | 06:48:50
Most Frequent Word - Full Set | 6,992 | 02:48:42
Phonetically Balanced | 4,518 | 02:25:53
Form and Function - Word | 2,279 | 01:23:50

A detailed explanation of the Manipuri speech corpus will be available in the Manipuri Raw Speech Corpus Documentation.

For any research-based citations, please use the following:
1. Ramamoorthy, L., Narayan Choudhary, Amom Nandaraj Meetei, Yumnam Premila Chanu & Longjam Anand Singh. 2019. Manipuri Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
2. Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
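The Manipuri Raw Speech Corpus entry lists its audio as 48 kHz, 16 bit wav. As a minimal sketch of how a recipient might confirm that a delivered file matches that description, the following uses Python's standard `wave` module. The function name `matches_listed_format` and any file paths passed to it are hypothetical; the listing does not describe the corpus directory layout or channel count, so neither is assumed here.

```python
import sys
import wave


def matches_listed_format(path, rate_hz=48_000, sample_bytes=2):
    """Return True if the WAV file at `path` has the listed sample rate
    (48 kHz) and sample width (16 bit = 2 bytes per sample)."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == rate_hz
                and wav.getsampwidth() == sample_bytes)


if __name__ == "__main__":
    # Paths are supplied by the user on the command line; none are
    # taken from the corpus documentation.
    for wav_path in sys.argv[1:]:
        status = "ok" if matches_listed_format(wav_path) else "MISMATCH"
        print(f"{wav_path}: {status}")
```

The check reads only the WAV header, so it stays fast even on long recordings.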

Manipuri Sentence Aligned Speech Corpus (Bengali Script)

requests (1)

116:34:24 hours | 75.9 GB | 60,819 Audio Segments | 589 speakers

The LDC-IL Manipuri Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing orthographically normalized annotation in Bengali script. The dataset spans a duration of 116:34:24 (hh:mm:ss) and consists of read speech covering continuous text, representative sentences, and date formats. The data is derived from 295 female and 294 male native Manipuri speakers across diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Manipuri Sentence Aligned Speech Documentation.

For any research-based citations, please use the following:
Amom Nandaraj Meetei, Yumnam Premila Chanu, Rajesha N., Manasa G., Stephen Fernandes, Nithin S., Roopashri M.R., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. Manipuri Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-68-2.

Manipuri Sentence Aligned Speech Corpus (Meetei Mayek)

requests (1)

116:34:24 hours | 75.9 GB | 60,819 Audio Segments | 589 speakers

The LDC-IL Manipuri Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing orthographically normalized annotation in Meetei Mayek. The dataset spans a duration of 116:34:24 (hh:mm:ss) and consists of read speech covering continuous text, representative sentences, and date formats. The data is derived from 295 female and 294 male native Manipuri speakers across diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Manipuri Sentence Aligned Speech Documentation.

For any research-based citations, please use the following:
Amom Nandaraj Meetei, Yumnam Premila Chanu, Rajesha N., Manasa G., Stephen Fernandes, Nithin S., Roopashri M.R., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. Manipuri Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-96-5.

Mao Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Mao Words: 28,802 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Mao parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Mao section includes 28,802 words and 163,112 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested.

For any research-based citations, please use the following:
1. Amom Nandaraj Meetei, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Mao Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN: 978-81-69099-51-6.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0.

Mara Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Mara Words: 35,083 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Mara parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Mara section includes 35,083 words and 171,595 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested.

For any research-based citations, please use the following:
1. Amom Nandaraj Meetei, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Mara Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN: 978-81-69099-97-4.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0.

Maram Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Maram Words: 31,139 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Maram parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Maram section includes 31,139 words and 181,367 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested.

For any research-based citations, please use the following:
1. Amom Nandaraj Meetei, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Maram Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN: 978-81-69099-32-5.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0.

Marathi Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Marathi Words: 26,327 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Marathi parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Marathi section includes 26,327 words and 148,721 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested.

For any research-based citations, please use the following:
1. Bhageshree K. Khandale, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Marathi Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN: 978-81-69099-82-0.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0.

Marathi Raw Speech Corpus

requests (23)

89:17:25 hours | 58 GB speech data | 307 Speakers | 58,544 Audio segments | 48 kHz | 16 bit wav

Marathi is an Indo-Aryan language that has been prevalent since the 9th century. Standard Marathi (Puneri) is the official language of the State of Maharashtra and is based on the dialects used by academics and the print media. Marathi is believed to be influenced by Sanskrit. It is written in the Devanagari script, and its phoneme inventory is similar to that of many other Indo-Aryan languages. The LDC-IL speech data was collected from the regions of Marathwada, Puneri, Vidarbha, and Goa, from both genders and different age groups. Each speaker recorded datasets randomly selected from a master dataset.

The available speech corpus details:

Total Speakers: 307 (156 female and 151 male)

Domain | Audio Segments | Duration
Contemporary Text (News) | 302 | 22:26:06
Creative Text | 302 | 13:37:34
Sentence | 7,555 | 6:49:58
Date Format | 604 | 0:39:57
Command and Control Words | 9,068 | 7:50:10
Person Name | 6,058 | 7:44:56
Place Name | 3,037 | 2:49:32
Most Frequent Word - Part | 9,104 | 7:22:57
Most Frequent Word - Full Set | 10,987 | 9:53:28
Phonetically Balanced | 4,609 | 4:10:47
Form and Function - Word | 6,918 | 5:52:00

A detailed explanation of the Marathi speech corpus will be available in the Marathi Speech Data Documentation.

For any research-based citations, please use the following:
1. Ramamoorthy, L., Narayan Choudhary, Gajanan R Apine & Apurva P Betkekar. 2019. Marathi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
2. Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
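The per-domain figures in the Marathi Raw Speech Corpus entry can be cross-checked against its headline totals (58,544 audio segments, 89:17:25 hours). The sketch below is only an illustration of that arithmetic: the values are copied from the listing, while the helper names `to_seconds` and `to_hms` are introduced here and are not part of any LDC-IL tooling.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' duration string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Convert a number of seconds back to an 'h:mm:ss' string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# (domain, audio segments, duration) as given in the listing.
MARATHI_DOMAINS = [
    ("Contemporary Text (News)", 302, "22:26:06"),
    ("Creative Text", 302, "13:37:34"),
    ("Sentence", 7555, "6:49:58"),
    ("Date Format", 604, "0:39:57"),
    ("Command and Control Words", 9068, "7:50:10"),
    ("Person Name", 6058, "7:44:56"),
    ("Place Name", 3037, "2:49:32"),
    ("Most Frequent Word - Part", 9104, "7:22:57"),
    ("Most Frequent Word - Full Set", 10987, "9:53:28"),
    ("Phonetically Balanced", 4609, "4:10:47"),
    ("Form and Function - Word", 6918, "5:52:00"),
]

total_segments = sum(count for _, count, _ in MARATHI_DOMAINS)
total_duration = to_hms(sum(to_seconds(d) for _, _, d in MARATHI_DOMAINS))

print(total_segments, total_duration)  # prints: 58544 89:17:25
```

Both sums agree with the headline figures, which suggests the per-domain breakdown is complete for this corpus.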

Marathi Sentence Aligned Speech Corpus

requests (9)

Dataset Description: 41:34:04 hours | 26.7 GB | 23,234 Audio Segments | 302 speakers

The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Marathi Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. The dataset spans a duration of 41:34:04 (hh:mm:ss) and consists of read speech covering continuous text, representative sentences, and date formats. The data is derived from 153 female and 149 male native Marathi speakers across diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Marathi Sentence Aligned Speech Documentation.

For any research-based citations, please use the following:
1. Bhageshree K Khandale, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Marathi Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-92-4.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3

Maring Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Maring Words: 25,073 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Maring parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Maring section includes 25,073 words and 160,556 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested.

For any research-based citations, please use the following:
1. Amom Nandaraj Meetei, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Maring Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN: 978-81-69099-45-5.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0.

Mech/Mechhia Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Mech/Mechhia Words: 24,707 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Mech/Mechhia parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Mech/Mechhia section includes 24,707 words and 163,754 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested.

For any research-based citations, please use the following:
1. Hemlata Daimary, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Mech/Mechhia Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN: 978-81-69099-39-4.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0.

Mewari Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Mewari Words: 31,252 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Mewari parallel text corpus is connected with Hindi and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Mewari section includes 31,252 words and 139,194 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested.

For any research-based citations, please use the following:
1. Ankita Tiwari, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Mewari Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN: 978-81-69099-55-4.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0.

Mewati Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Mewati Words: 31,268 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Mewati parallel text corpus is connected with Hindi and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Mewati section includes 31,268 words and 144,110 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment will be determined by the number of language components requested.

For any research-based citations, please use the following:
1. Ankita Tiwari, Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Mewati Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. ISBN: 978-81-69099-61-5.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0.