Central Institute of Indian Languages
Punjabi Raw Speech Corpus
101:09:28 Hours | 65.5 GB | 467 Speakers | 76,230 Audio Segments | 48 kHz | 16 bit wav.
Punjabi is an Indo-Aryan language. It is a tonal language with three tones: high-falling, low-rising, and level (neutral). Punjabi is spoken not only in India but also in Pakistan, where it is written in the Shahmukhi script; this corpus covers only Indian Punjabi written in Gurmukhi. The Punjabi language has four different dialects, spoken in different sub-regions of Punjab. LDC-IL collected speech data from the Malwa, Doab and Puadh regions. The LDC-IL Punjabi speech dataset consists of word lists, sentences, running texts and date formats. Each speaker recorded items randomly selected from a master dataset.
The available Speech Corpus details:
Total Speakers: 467 (234 Female and 233 Male)
Domain | Audio Segments | Duration
Contemporary Text (News) | 448 | 27:07:41
Creative Text | 446 | 19:29:15
Sentence | 11,168 | 08:58:33
Date Format | 887 | 00:27:53
Command and Control Words | 13,274 | 07:49:16
Person Name | 8,949 | 10:28:40
Place Name | 4,473 | 03:17:02
Most Frequent Word - Part | 8,889 | 05:21:56
Most Frequent Word - Full Set | 3,988 | 02:52:44
Phonetically Balanced | 13,939 | 08:56:04
Form and Function - Word | 9,769 | 06:24:07
A detailed explanation of the Punjabi Speech Corpus will be available in the Punjabi Speech Data Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Poonam Dhillon & Sarbjeet Kaur. 2019. Punjabi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
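The raw speech corpora in this listing are described as 48 kHz, 16 bit wav. As a minimal sketch of how a downloaded file could be checked against that description, the snippet below reads a wav header with Python's standard `wave` module. The helper names (`describe_wav`, `matches_corpus_format`, `make_silent_wav`) are hypothetical and not part of any LDC-IL tooling, and the silent test file is generated in memory because no corpus file paths are given here.

```python
import io
import wave


def describe_wav(source):
    """Return (sample_rate_hz, bits_per_sample, channels, duration_s) of a PCM wav."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, bits, channels, duration


def matches_corpus_format(source, rate_hz=48_000, bits=16):
    """True if the file has the sample rate and bit depth stated in the listing."""
    actual_rate, actual_bits, _, _ = describe_wav(source)
    return actual_rate == rate_hz and actual_bits == bits


def make_silent_wav(seconds=1, rate_hz=48_000):
    """Build a mono 16-bit silent wav in memory (illustration only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16 bit
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * rate_hz * seconds)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(describe_wav(make_silent_wav()))
    print(matches_corpus_format(make_silent_wav(rate_hz=16_000)))
```

`describe_wav` also accepts a file path, so the same check could be looped over a directory of segments once the data is on disk.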
Tamil Raw Speech Corpus
139:11:41 Hours | 86 GB speech data | 452 Speakers | 60,287 Audio segments | 48 kHz | 16 bit wav.
Tamil is one of the longest-surviving classical languages in the world and one of the prominent languages of the Dravidian family. It is widely spoken in the state of Tamil Nadu, the Union Territory of Pondicherry and Sri Lanka, in countries such as Burma, Malaysia, Singapore, Indonesia, Indochina and Fiji, in South Africa and British Guiana, and on islands such as Mauritius and Madagascar. It has official status in the Indian state of Tamil Nadu and the Indian Union Territory of Pondicherry, as well as in Sri Lanka and Singapore. Tamil has its own script. The language is highly agglutinative in nature. It is characterised by phonological simplicity and by morphological parity and primitiveness. All affixes are separable and significant, and the language has no nominative case termination and no arbitrary words. The LDC-IL speech data was collected in the regions of Kongu, Kumari, Madurai, Nellai, Salem and Thanjai, from both genders and different age groups. Each speaker recorded items randomly selected from a master dataset.
The available Speech Corpus details:
Total Speakers: 452 (214 Female and 219 Male)
Domain | Audio Segments | Duration
Contemporary Text (News) | 433 | 57:53:48
Creative Text | 429 | 14:21:31
Sentence | 10,764 | 14:51:03
Date Format | 842 | 01:20:17
Command and Control Words | 12,882 | 12:57:06
Person Name | 8,755 | 03:57:29
Place Name | 4,002 | 10:34:38
Most Frequent Word - Part | 12,813 | 11:14:05
Most Frequent Word - Full Set | 2,000 | 02:26:05
Phonetically Balanced | 3,860 | 04:55:10
Form and Function - Word | 3,507 | 04:40:29
A detailed explanation of the Tamil Speech Corpus will be available in the Tamil Speech Data Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Thennarasu S, Prem Kumar L R, Amudha R, Prabagaran R, Srikanth D. 2021. Tamil Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Narayan Choudhary, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
Tamil Sentence Aligned Speech Corpus
Dataset Description: 74:57:59 hours | 46.4 GB | 48,572 Audio Segments | 433 speakers
The annotated speech corpus gives a wide range of linguistic information, especially useful for phonetic analysis. The LDC-IL Tamil Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Tamil script. This dataset spans a duration of 74:57:59 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 214 female and 219 male native Tamil speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Tamil Sentence Aligned Speech Documentation.
For any research-based citations, please use the following citations:
1. Amudha R., Kamaraj S., Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Tamil Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-26-9.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3
Telugu Raw Speech Corpus
22:43:59 Hours | 15 GB | 80 Speakers | 10,510 Audio Segments | 48 kHz | 16 bit wav.
Telugu is the official language of the states of Telangana and Andhra Pradesh. It belongs to the Dravidian language family and, among the Dravidian languages, is spoken by the largest population. Telugu is agglutinative in nature, and its vocabulary is heavily influenced by Sanskrit. LDC-IL treats Telugu as having three distinct varieties and therefore collected speech data from Telangana, Rayalaseema and Coastal Andhra. The LDC-IL Telugu speech dataset consists of word lists, sentences, running texts and date formats. Each speaker recorded items randomly selected from a master dataset. Speech is in .wav format and metadata is in .txt format.
The available Speech Corpus details:
Total Speakers: 80 (24 Female and 56 Male)
Domain | Audio Segments | Duration
Contemporary Text (News) | 77 | 8:28:19
Creative Text | 77 | 7:01:16
Sentence | 1,828 | 1:20:55
Date Format | 142 | 0:13:58
Command and Control Words | 2,170 | 1:43:49
Person Name | 1,438 | 1:09:31
Place Name | 707 | 0:33:24
Most Frequent Word - Part | 2,162 | 1:31:24
Most Frequent Word - Full Set | 1,909 | 0:41:23
A detailed explanation of the Telugu Speech Corpus will be available in the Telugu Speech Data Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary & Rajesha N. 2019. Telugu Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
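The per-domain figures for the Telugu corpus can be cross-checked against the headline totals (22:43:59 hours, 10,510 audio segments). The following is a small sketch of that arithmetic; the function and variable names are illustrative only, and the rows are transcribed by hand from the listing above.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Convert a number of seconds back into 'h:mm:ss'."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# (domain, audio segments, duration) as given for the Telugu Raw Speech Corpus
TELUGU_DOMAINS = [
    ("Contemporary Text (News)", 77, "8:28:19"),
    ("Creative Text", 77, "7:01:16"),
    ("Sentence", 1828, "1:20:55"),
    ("Date Format", 142, "0:13:58"),
    ("Command and Control Words", 2170, "1:43:49"),
    ("Person Name", 1438, "1:09:31"),
    ("Place Name", 707, "0:33:24"),
    ("Most Frequent Word - Part", 2162, "1:31:24"),
    ("Most Frequent Word - Full Set", 1909, "0:41:23"),
]

total_segments = sum(segments for _, segments, _ in TELUGU_DOMAINS)
total_duration = to_hms(sum(to_seconds(d) for _, _, d in TELUGU_DOMAINS))

if __name__ == "__main__":
    print(total_segments, total_duration)  # 10510 22:43:59
```

The same two helpers can be pointed at any of the other speech corpus tables in this listing to see whether the domain rows add up to the stated totals.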
Urdu Raw Speech Corpus
99:18:21 Hours | 64.2 GB | 499 Speakers | 88,708 Audio Segments | 48 kHz | 16 bit wav.
Urdu is one of the Modern Indo-Aryan languages of India. It evolved from Shaurseni Apabhramsha and uses the Perso-Arabic script. The language as spoken in a region is influenced by the other languages of that region, the mother tongue of the speaker, and similar factors. Reading speed, loudness, frequency and so on also vary with factors such as age and gender. The Linguistic Data Consortium for Indian Languages collected the speech corpus through fieldwork. This read speech was collected from male and female native speakers of various age groups. The data includes texts, sentences, date formats, and different word lists.
The available Speech Corpus details:
Total Speakers: 499 (252 Female and 247 Male)
Domain | Audio Segments | Duration
Contemporary Text (News) | 431 | 25:35:02
Creative Text | 433 | 19:40:11
Sentence | 10,646 | 8:00:38
Date Format | 846 | 0:43:37
Command and Control Words | 13,580 | 9:21:01
Person Name | 6,577 | 2:55:41
Place Name | 4,273 | 1:09:17
Most Frequent Word - Part | 12,802 | 7:46:28
Most Frequent Word - Full Set | 18,927 | 11:38:30
Phonetically Balanced Vocabulary | 13,646 | 8:13:20
Form and Function Word | 6,547 | 4:14:36
A detailed explanation of the Urdu Speech Corpus will be available in the Urdu Speech Data Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Mansoor Khan, Shahnawaz Alam, Bi Bi Mariyam & Rushda Idris Khan. 2019. Urdu Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
Urdu Sentence Aligned Speech Corpus
Dataset Description: 50:09:56 hours | 32.3 GB | 32,384 Audio Segments | 434 speakers
The annotated speech corpus gives a wide range of linguistic information, especially useful for phonetic analysis. The LDC-IL Urdu Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Perso-Arabic script. This dataset spans a duration of 50:09:56 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 214 female and 219 male native Urdu speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Urdu Sentence Aligned Speech Documentation.
For any research-based citations, please use the following citations:
1. Mansoor Khan, Shahnawaz Alam, Bi Bi Mariyam, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Urdu Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-87-0.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3
A Gold Standard Maithili Raw Text Corpus
53,16,552 Words | 499 Titles | XML format | 5 domains
Maithili is an Indo-Aryan language, a direct descendant of Sanskrit, spoken in the states of Bihar and Jharkhand and in part of Nepal. It is one of the scheduled languages of India. The LDC-IL Maithili Text Corpus was developed with attention to factors such as quality of the text, representativeness, retrievable format, size of corpus and authenticity. For collecting text corpora, LDC-IL adopts a standard category list of domains and a prior set of criteria. The Maithili texts can be broadly classified as literary and non-literary. A large body of literary text is available in Maithili but scientific texts are scarce, so LDC-IL attempts to develop a balanced Maithili text corpus. Data has been collected from books, magazines and newspapers, verified to be true to the original texts, and then warehoused.
The Maithili Text Corpus is encoded in a machine-readable form and stored in a standard format: the encoding is Unicode and the storage format is XML. The data is embedded with metadata information. The corpus has been created from contemporary text, either typed in or crawled.
The available Text Corpus details:
Domain | Words | Percentage of Total Corpus
Aesthetics | 38,97,264 | 73.30 %
Commerce | 50,975 | 00.96 %
Mass Media | 12,53,090 | 23.57 %
Science and Technology | 3,136 | 00.06 %
Social Sciences | 1,12,087 | 02.11 %
A detailed explanation of the Maithili Raw Text Corpus will be available in the Maithili Text Corpus Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Arun Kumar Singh & Dinesh Mishra. 2019. A Gold Standard Maithili Raw Text Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10...
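Word counts in the text corpus listings use Indian digit grouping (for example 53,16,552 for 5,316,552), and each domain's share is its word count divided by the corpus total. The sketch below works through that arithmetic for the Maithili domain figures; the helper names are illustrative, and the counts are read from the domain table as Aesthetics 38,97,264, Commerce 50,975, Mass Media 12,53,090, Science and Technology 3,136 and Social Sciences 1,12,087.

```python
def parse_indian_number(text):
    """Parse a number written with Indian digit grouping, e.g. '53,16,552'."""
    return int(text.replace(",", ""))


def format_indian_number(value):
    """Format a non-negative integer with Indian grouping (last 3 digits, then pairs)."""
    digits = str(value)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    if head:
        groups.insert(0, head)
    return ",".join(groups + [tail])


# Domain word counts for the Maithili Raw Text Corpus, as listed
MAITHILI_DOMAINS = {
    "Aesthetics": "38,97,264",
    "Commerce": "50,975",
    "Mass Media": "12,53,090",
    "Science and Technology": "3,136",
    "Social Sciences": "1,12,087",
}

counts = {name: parse_indian_number(text) for name, text in MAITHILI_DOMAINS.items()}
total_words = sum(counts.values())
# Percentage share of each domain, rounded to two decimals as in the listing
shares = {name: round(100 * n / total_words, 2) for name, n in counts.items()}

if __name__ == "__main__":
    print(format_indian_number(total_words))
    for name, share in shares.items():
        print(f"{name}: {share:.2f} %")
```

Summing the five domains gives the headline figure of 53,16,552 words, and the computed shares agree with the listed percentages.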
Multilingual Raw Speech Corpus
97:43:54 Hours | 62.2 GB speech data | 1916 Speakers | 1,916 Audio segments | 48 kHz | 16 bit wav.
The LDC-IL Multi-Lingual Raw Speech Corpus dataset is extracted from the raw speech corpora published by LDC-IL in various Indian languages. The dataset is built for applications such as language-identification modules that require samples from multiple languages, for exploring cross-linguistic variation and diatopic comparison to determine what generalizations are possible about the types of variable features, and for building multilingual phoneme sets and models. The multilingual sample is taken from the content type ‘Creative Text-T2’, which is extracted mainly from literary sources. The creative text of the LDC-IL speech dataset comprises essays or short stories. One of these, selected randomly from a dataset, is assigned to a speaker to read aloud. The same story may be read by multiple speakers.
The available Speech Corpus details:
Total Speakers: 1916 (958 Female and 958 Male)
Assamese | 2:33:40 | 68 | 1.64 | 2:34:33 | 64 | 1.65 | 5:08:13 | 132 | 3.30
Bengali | 2:38:34 | 56 | 1.59 | 2:47:32 | 61 | 1.69 | 5:26:06 | 117 | 3.29
Bodo | 2:30:39 | 42 | 1.61 | 2:41:04 | 40 | 1.72 | 5:11:43 | 82 | 3.34
Dogri | 1:16:44 | 30 | 0.84 | 1:35:00 | 31 | 1.01 | 2:51:44 | 61 | 1.84
Gujarati | 2:32:10 | 45 | 1.63 | 2:30:40 | 42 | 1.61 | 5:02:50 | 87 | 3.25
Hindi | 2:37:28 | 44 | 1.66 | 2:30:18 | 44 | 1.57 | 5:07:46 | 88 | 3.23
Kannada | 2:37:06 | 45 | 1.68 | 2:32:50 | 48 | 1.63 | 5:09:56 | 93 | 3.32
Kashmiri | 2:32:26 | 30 | 1.63 | 2:39:46 | 29 | 1.71 | 5:12:12 | 59 | 3.34
Konkani | 2:50:24 | 62 | 1.82 | 2:41:25 | 62 | 1.74 | 5:31:49 | 124 | 3.57
Maithili | 2:46:28 | 54 | 1.71 | 2:53:31 | 50 | 2.00 | ...
Indian English Raw Speech Corpus - Bengali Variant
25:47:11 Hours | 15.5 GB | 53 Speakers | 16,044 Audio Segments | 48 kHz | 16 bit wav.
English grew out of Anglo-Saxon, the prominent language of Britain in the Middle Ages, and was carried to every corner of the world by colonists. It is the most visible legacy of the British in India: India was under the British Raj for almost two centuries, and English is part of the education system. Most Indian states use their own regional languages and have no common language between them, so English is used for inter-state communication.
LDC-IL has about 25 hours of Indian English - Bengali Variant speech data. The LDC-IL Indian English speech dataset consists of word lists, sentences, texts and date formats. Approximately 15 minutes of speech per speaker was recorded from 27 female and 26 male speakers of different age groups whose mother tongue is Bengali. Each speaker recorded items randomly selected from a master dataset.
The available Speech Corpus details:
Total Speakers: 53 (27 Female and 26 Male)
Domain | Audio Segments | Duration
Contemporary Text (News) | 52 | 6:03:15
Creative Text | 52 | 2:41:17
Sentence | 1300 | 1:29:35
Date Format | 104 | 0:08:56
Command and Control Words | 2882 | 3:09:13
Person Name | 1040 | 0:33:56
Place Name | 519 | 1:30:22
Most Frequent Word - Part | 1442 | 1:22:38
Most Frequent Word - Full Set | 5985 | 6:01:44
Phonetically Balanced | 1782 | 1:52:21
Form and Function - Word | 886 | 0:53:54
A detailed explanation of the Indian English Raw Speech Corpus - Bengali Variant will be available in the Indian English Raw Speech Corpus - Bengali Variant Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy L., Narayan Kumar Choudhary, Arundhati Sengupta, Rejitha KS, Rajesha N., Manasa G. 2021. Indian English Raw Speech Corpus - Bengali Variant. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
A Gold Standard Kannada Raw Text Corpus
77,63,124 words | 1772 Titles | Data and Metadata in XML format | 6 text domains
Kannada is an ancient Indian language of the Dravidian family. It has its own script. Although Kannada is considered a classical language because of its long literary history, the Kannada text corpus is extracted from contemporary text sources. To keep the corpus balanced, the text was collected by keying in and proofing extracts from books of various domains, or crawled from news websites. The corpus is encoded in Unicode, and the data with its metadata is in XML format.
The available Text Corpus details:
Domain | Words | Percentage of Total Corpus
Aesthetics | 37,78,723 | 48.68 %
Commerce | 2,07,053 | 2.67 %
Mass Media | 26,81,611 | 34.54 %
Official Document | 5,357 | 0.07 %
Science and Technology | 2,43,166 | 3.13 %
Social Sciences | 8,47,214 | 10.91 %
A detailed explanation of the Kannada Text Corpus will be available in the Kannada Raw Text Corpus Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Vijayalaxmi F. Patil, Chetan Suryakant Baji, Malini N Abhyankar, Rajesha N. & Manasa G. 2019. A Gold Standard Kannada Raw Text Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10...
Indian English Raw Speech Corpus - Kannada Variant
23:43:04 Hours | 15.3 GB | 56 Speakers | 14,455 Audio Segments | 48 kHz | 16 bit wav.
English grew out of Anglo-Saxon, the prominent language of Britain in the Middle Ages, and was carried to every corner of the world by colonists. It is the most visible legacy of the British in India: India was under the British Raj for almost two centuries, and English is part of the education system. Most Indian states use their own regional languages and have no common language between them, so English is used for inter-state communication.
LDC-IL has about 23 hours of Indian English - Kannada Variant speech data. The LDC-IL Indian English speech dataset consists of word lists, sentences, texts and date formats. Approximately 15 minutes of speech per speaker was recorded from 29 female and 27 male speakers of different age groups whose mother tongue is Kannada. Each speaker recorded items randomly selected from a master dataset.
The available Speech Corpus details:
Total Speakers: 56 (29 Female and 27 Male)
Domain | Audio Segments | Duration
Contemporary Text (News) | 52 | 7:19:31
Creative Text | 58 | 3:57:15
Sentence | 1522 | 1:54:10
Date Format | 106 | 0:04:32
Command and Control Words | 2543 | 1:55:43
Person Name | 2040 | 0:39:43
Place Name | 762 | 2:38:49
Most Frequent Word - Part | 1563 | 1:09:10
Most Frequent Word - Full Set | 3999 | 2:49:55
Phonetically Balanced | 1194 | 0:49:21
Form and Function - Word | 616 | 0:24:55
A detailed explanation of the Indian English Raw Speech Corpus - Kannada Variant will be available in the Indian English Raw Speech Corpus - Kannada Variant Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy L., Narayan Kumar Choudhary, Bharatha Raju A., Rejitha KS, Rajesha N., Manasa G. 2021. Indian English Raw Speech Corpus - Kannada Variant. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
A Gold Standard Tamil Raw Text Corpus
1,09,31,902 Words | 1,963 Titles | XML format | 6 text domains
Tamil is one of the longest-surviving classical languages in the world. It is a Dravidian language spoken in Tamil Nadu and Sri Lanka, in countries such as Burma, Malaysia, Singapore, Indonesia, Indochina and Fiji, in South Africa and British Guiana, and on islands such as Mauritius and Madagascar. It has official status in the Indian state of Tamil Nadu and the Indian Union Territory of Pondicherry, as well as in Sri Lanka and Singapore.
The Linguistic Data Consortium for Indian Languages (LDC-IL) Tamil Text Corpus was developed with attention to factors such as quality of the text, representativeness, retrievable format, size of corpus and authenticity. For collecting text corpora, LDC-IL adopts a standard category list of domains and a prior set of criteria. The Tamil texts can be broadly classified as literary and non-literary. A large body of literary text is available in Tamil but scientific texts are scarce, so LDC-IL attempts to develop a balanced Tamil text corpus. Data has been collected from books, magazines and newspapers, verified to be true to the original texts, and then warehoused.
The Tamil Text Corpus is encoded in a machine-readable form and stored in a standard format: the encoding is Unicode and the storage format is XML. The data is embedded with metadata information. The corpus has been created from contemporary text, either typed in or crawled.
The available Text Corpus details are as follows:
Domain | Words | Percentage of Total Corpus
Aesthetics | 55,95,316 | 51.18 %
Commerce | 83,148 | 00.76 %
Mass Media | 21,00,226 | 19.21 %
Official Document | 12,768 | 0.12 %
Science and Technology | 8,86,532 | 8.11 %
Social Sciences | 22,53,912 | 20.62 %
A detailed explanation of the Tamil Raw Text Corpus will be available in the Tamil Text Corpus Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, G. Palanirajan, S. Thennarasu, Prem Kumar L. R, Amudha R., Prabagaran R., Vijayan N. & M. Ramesh Kumar. 2019. A Gold Standard Tamil Raw Text Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10...