Central Institute of Indian Languages

Yerukala/Yerukula Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Yerukala/Yerukula Words: 22,375 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of the NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Yerukala/Yerukula parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Yerukala/Yerukula section includes 22,375 words and 148,657 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.

For research-based citations, please use the following:
1. Dr. Amudha R., Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Yerukala/Yerukula Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-65-4.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Yimchungre Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Yimchungre Words: 28,514 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of the NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Yimchungre parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Yimchungre section includes 28,514 words and 183,258 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.

For research-based citations, please use the following:
1. Dr. Kamaraj S, Dr. Rejitha K. S, Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Yimchungre Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-35-7.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Zeliang Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Zeliang Words: 33,883 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of the NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Zeliang parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Zeliang section includes 33,883 words and 160,654 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.

For research-based citations, please use the following:
1. Dr. Kamaraj S, Dr. Rejitha K. S, Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Zeliang Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-74-6.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Zemi Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Zemi Words: 36,531 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of the NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Zemi parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Zemi section includes 36,531 words and 170,892 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.

For research-based citations, please use the following:
1. Amom Nandaraj Meetei, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Zemi Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-86-9.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

Zou Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Zou Words: 35,477 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of the NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Zou parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Zou section includes 35,477 words and 173,541 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.

For research-based citations, please use the following:
1. Amom Nandaraj Meetei, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Zou Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69175-30-2.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...

A Gold Standard Maithili Raw Text Corpus

requests (28)

53,16,552 Words | 499 Titles | XML format | 5 domains

Maithili is an Indo-Aryan language, a direct descendant of Sanskrit. It is spoken in the Indian states of Bihar and Jharkhand and in parts of Nepal, and it is one of the scheduled languages of India.

The LDC-IL Maithili Text Corpus was developed with attention to factors such as text quality, representativeness, retrievable format, corpus size, and authenticity. For collecting text, LDC-IL adopts a standard category list of domains and a predefined set of criteria. Maithili texts can be broadly classified as literary and non-literary. A large amount of literary text is available in Maithili, but scientific text is scarce, so LDC-IL attempts to develop a balanced Maithili text corpus. Data was collected from books, magazines, and newspapers, verified as true to the original texts, and then warehoused.

The Maithili Text Corpus is encoded in a machine-readable form and stored in a standard format. The main encoding used is Unicode, and the data is stored in XML format with embedded metadata. The corpus was created from contemporary text by typing and crawling.

The available Text Corpus details:

Domain | Words | Percentage of Total Corpus
Aesthetics | 38,97,264 | 73.30 %
Commerce | 50,975 | 0.96 %
Mass Media | 12,53,090 | 23.57 %
Science and Technology | 3,136 | 0.06 %
Social Sciences | 1,12,087 | 2.11 %

A detailed explanation of the Maithili Raw Text Corpus will be available in the Maithili Text Corpus Documentation.

For research-based citations, please use the following:
Ramamoorthy, L., Narayan Choudhary, Arun Kumar Singh & Dinesh Mishra. 2019. A Gold Standard Maithili Raw Text Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10...
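The word counts in the Maithili domain table use Indian digit grouping (for example 38,97,264 rather than 3,897,264). The following is a minimal Python sketch, not part of the LDC-IL distribution, showing how those figures can be parsed and how the percentage column follows from them. The names `MAITHILI_DOMAINS`, `parse_grouped`, and `domain_shares` are hypothetical and introduced only for illustration; the numbers are copied from the table.

```python
# Word counts per domain, copied from the Maithili table with their
# Indian (lakh-style) digit grouping left intact.
MAITHILI_DOMAINS = {
    "Aesthetics": "38,97,264",
    "Commerce": "50,975",
    "Mass Media": "12,53,090",
    "Science and Technology": "3,136",
    "Social Sciences": "1,12,087",
}


def parse_grouped(number: str) -> int:
    """Parse a comma-grouped integer; the grouping style does not matter."""
    return int(number.replace(",", ""))


def domain_shares(domains: dict) -> dict:
    """Return each domain's share of the total word count, in percent."""
    counts = {name: parse_grouped(value) for name, value in domains.items()}
    total = sum(counts.values())
    return {name: round(100 * count / total, 2) for name, count in counts.items()}


if __name__ == "__main__":
    total = sum(parse_grouped(value) for value in MAITHILI_DOMAINS.values())
    # Western grouping is used here only for display.
    print(f"Total words: {total:,}")
    for name, share in domain_shares(MAITHILI_DOMAINS).items():
        print(f"{name}: {share:.2f} %")
```

Summing the five domains this way gives 5,316,552 words, which matches the 53,16,552 total in the entry's summary line.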

Multilingual Raw Speech Corpus

requests (20)

97:43:54 Hours | 62.2 GB speech data | 1,916 Speakers | 1,916 Audio segments | 48 kHz | 16 bit wav

The LDC-IL Multi-Lingual Raw Speech Corpus dataset is extracted from the raw speech corpora published by LDC-IL in various Indian languages. It is built for applications that require samples from multiple languages, such as language identifier modules; for exploring cross-linguistic variation and diatopic comparison to determine what generalizations are possible about the types of variable features; and for building multilingual phoneme sets and models.

The Multi-Lingual speech dataset is sampled from the 'Creative Text-T2' content type, which is extracted mainly from literary sources. The creative text of the LDC-IL Speech dataset consists of essays or short stories. One of these essays or short stories, selected randomly from a data set, is assigned to a speaker to read aloud. The same story may be read by multiple speakers.

The available Speech Corpus details:

Total Speakers: 1,916 (958 Female and 958 Male)

Assamese | 2:33:40 | 68 | 1.64 | 2:34:33 | 64 | 1.65 | 5:08:13 | 132 | 3.30
Bengali | 2:38:34 | 56 | 1.59 | 2:47:32 | 61 | 1.69 | 5:26:06 | 117 | 3.29
Bodo | 2:30:39 | 42 | 1.61 | 2:41:04 | 40 | 1.72 | 5:11:43 | 82 | 3.34
Dogri | 1:16:44 | 30 | 0.84 | 1:35:00 | 31 | 1.01 | 2:51:44 | 61 | 1.84
Gujarati | 2:32:10 | 45 | 1.63 | 2:30:40 | 42 | 1.61 | 5:02:50 | 87 | 3.25
Hindi | 2:37:28 | 44 | 1.66 | 2:30:18 | 44 | 1.57 | 5:07:46 | 88 | 3.23
Kannada | 2:37:06 | 45 | 1.68 | 2:32:50 | 48 | 1.63 | 5:09:56 | 93 | 3.32
Kashmiri | 2:32:26 | 30 | 1.63 | 2:39:46 | 29 | 1.71 | 5:12:12 | 59 | 3.34
Konkani | 2:50:24 | 62 | 1.82 | 2:41:25 | 62 | 1.74 | 5:31:49 | 124 | 3.57
Maithili | 2:46:28 | 54 | 1.71 | 2:53:31 | 50 | 2.00 | ...

Indian English Raw Speech Corpus - Bengali Variant

requests (8)

25:47:11 Hours | 15.5 GB | 53 Speakers | 16,044 Audio Segments | 48 kHz | 16 bit wav

The English language developed from Anglo-Saxon, which was the prominent language of Britain in the Middle Ages, and colonists carried it to every corner of the world. English is the most visible legacy of the British in India: India was under the British Raj for almost two centuries, and English is part of the education system. Most Indian states use their own regional languages and have no common language in which to communicate, so English is used for inter-state communication.

LDC-IL has 25 hours of Indian English - Bengali Variant speech data. The LDC-IL Indian English Speech data set consists of several types of datasets made up of word lists, sentences, texts, and date formats. Approximately 15 minutes of speech per speaker was taken from 27 female and 26 male Bengali mother-tongue speakers of different age groups. Each speaker recorded datasets that were randomly selected from a master dataset.

The available Speech Corpus details:

Total Speakers: 53 (27 Female and 26 Male)

Domain | Audio Segments | Duration
Contemporary Text (News) | 52 | 6:03:15
Creative Text | 52 | 2:41:17
Sentence | 1300 | 1:29:35
Date Format | 104 | 0:08:56
Command and Control Words | 2882 | 3:09:13
Person Name | 1040 | 0:33:56
Place Name | 519 | 1:30:22
Most Frequent Word - Part | 1442 | 1:22:38
Most Frequent Word - Full Set | 5985 | 6:01:44
Phonetically Balanced | 1782 | 1:52:21
Form and Function - Word | 886 | 0:53:54

A detailed explanation of the Indian English Raw Speech Corpus - Bengali Variant will be available in the Indian English Raw Speech Corpus - Bengali Variant Documentation.

For research-based citations, please use the following:
Ramamoorthy L., Narayan Kumar Choudhary, Arundhati Sengupta, Rejitha KS, Rajesha N. & Manasa G. 2021. Indian English Raw Speech Corpus - Bengali Variant. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. "LDC-IL Raw Speech Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
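The per-domain figures in the Bengali Variant table can be checked against the headline totals (25:47:11 hours, 16,044 audio segments). The following is a minimal Python sketch of that check, assuming the durations are in H:MM:SS form. The names `BENGALI_VARIANT`, `parse_duration`, `format_duration`, and `totals` are hypothetical and introduced only for illustration; the rows are copied from the table.

```python
from datetime import timedelta

# (domain, audio segments, duration) rows copied from the Bengali Variant table.
BENGALI_VARIANT = [
    ("Contemporary Text (News)", 52, "6:03:15"),
    ("Creative Text", 52, "2:41:17"),
    ("Sentence", 1300, "1:29:35"),
    ("Date Format", 104, "0:08:56"),
    ("Command and Control Words", 2882, "3:09:13"),
    ("Person Name", 1040, "0:33:56"),
    ("Place Name", 519, "1:30:22"),
    ("Most Frequent Word - Part", 1442, "1:22:38"),
    ("Most Frequent Word - Full Set", 5985, "6:01:44"),
    ("Phonetically Balanced", 1782, "1:52:21"),
    ("Form and Function - Word", 886, "0:53:54"),
]


def parse_duration(text: str) -> timedelta:
    """Convert an H:MM:SS string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_duration(value: timedelta) -> str:
    """Format a timedelta as H:MM:SS, letting hours exceed 24."""
    hours, remainder = divmod(int(value.total_seconds()), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def totals(rows):
    """Return (total audio segments, total duration as H:MM:SS)."""
    segments = sum(count for _, count, _ in rows)
    duration = sum((parse_duration(d) for _, _, d in rows), timedelta())
    return segments, format_duration(duration)


if __name__ == "__main__":
    segments, duration = totals(BENGALI_VARIANT)
    print(f"{segments} segments, {duration} hours")
```

Formatting is done by hand because the default string form of a `timedelta` switches to days once the total passes 24 hours, which would not match the H:MM:SS style used in this listing.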

A Gold Standard Kannada Raw Text Corpus

requests (34)

77,63,124 Words | 1,772 Titles | Data and Metadata in XML format | 6 text domains

Kannada is one of the ancient Indian languages and belongs to the Dravidian family. It has its own script. Although Kannada is considered a classical language because of its long literary history, the Kannada text corpus is drawn from contemporary text sources. To keep the corpus balanced, the text was collected by keying in and proofreading extracts from books of various domains or by crawling news websites. The corpus is encoded in Unicode, and the data with its metadata is in XML format.

The available Text Corpus details:

Domain | Words | Percentage of Total Corpus
Aesthetics | 37,78,723 | 48.68 %
Commerce | 2,07,053 | 2.67 %
Mass Media | 26,81,611 | 34.54 %
Official Document | 5,357 | 0.07 %
Science and Technology | 2,43,166 | 3.13 %
Social Sciences | 8,47,214 | 10.91 %

A detailed explanation of the Kannada Text Corpus will be available in the Kannada Raw Text Corpus Documentation.

For research-based citations, please use the following:
Ramamoorthy, L., Narayan Choudhary, Vijayalaxmi F. Patil, Chetan Suryakant Baji, Malini N Abhyankar, Rajesha N. & Manasa G. 2019. A Gold Standard Kannada Raw Text Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10...

Indian English Raw Speech Corpus - Kannada Variant

requests (8)

23:43:04 Hours | 15.3 GB | 56 Speakers | 14,455 Audio Segments | 48 kHz | 16 bit wav

The English language developed from Anglo-Saxon, which was the prominent language of Britain in the Middle Ages, and colonists carried it to every corner of the world. English is the most visible legacy of the British in India: India was under the British Raj for almost two centuries, and English is part of the education system. Most Indian states use their own regional languages and have no common language in which to communicate, so English is used for inter-state communication.

LDC-IL has 23 hours of Indian English - Kannada Variant speech data. The LDC-IL Indian English Speech data set consists of several types of datasets made up of word lists, sentences, texts, and date formats. Approximately 15 minutes of speech per speaker was taken from 29 female and 27 male Kannada mother-tongue speakers of different age groups. Each speaker recorded datasets that were randomly selected from a master dataset.

The available Speech Corpus details:

Total Speakers: 56 (29 Female and 27 Male)

Domain | Audio Segments | Duration
Contemporary Text (News) | 52 | 7:19:31
Creative Text | 58 | 3:57:15
Sentence | 1522 | 1:54:10
Date Format | 106 | 0:04:32
Command and Control Words | 2543 | 1:55:43
Person Name | 2040 | 0:39:43
Place Name | 762 | 2:38:49
Most Frequent Word - Part | 1563 | 1:09:10
Most Frequent Word - Full Set | 3999 | 2:49:55
Phonetically Balanced | 1194 | 0:49:21
Form and Function - Word | 616 | 0:24:55

A detailed explanation of the Indian English Raw Speech Corpus - Kannada Variant will be available in the Indian English Raw Speech Corpus - Kannada Variant Documentation.

For research-based citations, please use the following:
Ramamoorthy L., Narayan Kumar Choudhary, Bharatha Raju A., Rejitha KS, Rajesha N. & Manasa G. 2021. Indian English Raw Speech Corpus - Kannada Variant. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. "LDC-IL Raw Speech Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...

A Gold Standard Tamil Raw Text Corpus

requests (21)

1,09,31,902 Words | 1,963 Titles | XML format | 6 text domains

Tamil is one of the longest-surviving classical languages in the world and belongs to the Dravidian language family. It is spoken in Tamil Nadu and Sri Lanka; in countries such as Burma, Malaysia, Singapore, Indonesia, Indochina, and Fiji; in South Africa and British Guiana; and on islands such as Mauritius and Madagascar. It has official status in the Indian state of Tamil Nadu and the Indian Union Territory of Pondicherry, and in some foreign countries such as Sri Lanka and Singapore.

The Tamil Text Corpus is encoded in a machine-readable form and stored in a standard format. The main encoding used is Unicode, and the data is stored in XML format with embedded metadata. The corpus was created from contemporary text by typing and crawling.

The Linguistic Data Consortium for Indian Languages (LDC-IL) Tamil Text Corpus was developed with attention to factors such as text quality, representativeness, retrievable format, corpus size, and authenticity. For collecting text, LDC-IL adopts a standard category list of domains and a predefined set of criteria. Tamil texts can be broadly classified as literary and non-literary. A large amount of literary text is available in Tamil, but scientific text is scarce, so LDC-IL attempts to develop a balanced Tamil text corpus. Data was collected from books, magazines, and newspapers, verified as true to the original texts, and then warehoused.

The available Text Corpus details are as follows:

Domain | Words | Percentage of Total Corpus
Aesthetics | 55,95,316 | 51.18 %
Commerce | 83,148 | 0.76 %
Mass Media | 21,00,226 | 19.21 %
Official Document | 12,768 | 0.12 %
Science and Technology | 8,86,532 | 8.11 %
Social Sciences | 22,53,912 | 20.62 %

A detailed explanation of the Tamil Raw Text Corpus will be available in the Tamil Text Corpus Documentation.

For research-based citations, please use the following:
Ramamoorthy, L., Narayan Choudhary, G. Palanirajan, S. Thennarasu, Prem Kumar L. R, Amudha R., Prabagaran R., Vijayan N. & M. Ramesh Kumar. 2019. A Gold Standard Tamil Raw Text Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10...

Angami Parallel Text Corpus: Linguistic Features and Structures

requests (0)

Total Words: 4,404,845 | Angami Words: 28,527 | 5,332 sentences/phrases in each mother tongue

India has 270 mother tongues as per the 2011 census. Following the requirements of the NEP-2020, LDC-IL developed a parallel corpus in Indian mother tongues. The Angami parallel text corpus is connected with English and 146 mother tongues of India. It contains 5,332 sentences/phrases systematically structured on the basis of 159 grammatical categories. The Angami section includes 28,527 words and 150,397 characters. Overall, the corpus comprises 4,404,845 words (over 4.4 million tokens) and 23,374,289 characters (approximately 23.4 million).

The price indicated corresponds to a single language component. The total payment is determined by the number of language components requested.

For research-based citations, please use the following:
1. Yumnam Premila Chanu, Dr. Rejitha K. S., Dr. Narayan Choudhary, Prof. Shailendra Mohan. 2026. Angami Parallel Text Corpus: Linguistic Features and Structures. Central Institute of Indian Languages, Mysore. 978-81-69099-40-0.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...