Central Institute of Indian Languages
Maithili Text to Speech Corpus
30:59:20 hours | 19.56 GB | 32,260 Audio Segments | 2 Speakers
The LDC-IL Maithili Text to Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer in Devanagari script. This dataset spans a duration of 32:42:20 (hh:mm:ss), consisting of read speech recorded in a studio setup. The data is derived from one female and one male native Maithili speaker. A comprehensive explanation of the dataset can be found in the Maithili Text to Speech Documentation.
For any research-based citations, please use the following citations:
Shantanu Kumar, Dinesh Mishra, Saurabh Varik, Stephen Fernandes, Nithin S., Roopashri M. R., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. Maithili Text to Speech Corpus. Central Institute of Indian Languages, Mysore. 978-93-48633-36-1.
Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Malayalam Raw Speech Corpus
164:01:02 Hours | 105 GB | 458 Speakers | 43,670 Audio Segments | 48 kHz | 16 bit wav
Malayalam is the official language of Kerala and the Laccadive Islands. It belongs to the Dravidian language family. In view of how Kerala was formed, and because the language of the Travancore, Cochin, and Malabar regions is influenced by different internal and external factors, LDC-IL considers Malayalam to have three distinct varieties and therefore collected speech data from Thiruvananthapuram, Ernakulam, and Kozhikode. LDC-IL has 164 hours of Malayalam speech data. The LDC-IL Malayalam Speech dataset consists of different types of datasets made up of word lists, sentences, running texts and date formats. Approximately 15 minutes of speech per speaker was taken from 231 female and 227 male native speakers of different age groups. Each speaker recorded datasets that were randomly selected from a master dataset.
The available Speech Corpus details:
Total Speakers: 458 (231 Female and 227 Male)
Domain | Audio Segments | Duration
Contemporary Text (News) | 449 | 71:29:21
Creative Text | 449 | 54:41:20
Sentence | 7,452 | 06:56:46
Date Format | 598 | 00:53:45
Command and Control Words | 8,923 | 07:09:37
Person Name | 5,819 | 05:26:33
Place Name | 2,906 | 02:28:24
Most Frequent Word - Part | 8,763 | 06:51:31
Most Frequent Word - Full Set | 1,979 | 02:08:58
Phonetically Balanced | 3,096 | 02:40:09
Form and Function - Word | 3,236 | 03:14:38
A detailed explanation of the Malayalam Speech Corpus will be available in the Malayalam Speech Data Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Saritha S.L., Rejitha K.S., Sajila S. & Midhun P. G. 2019. Malayalam Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
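The per-domain figures for the Malayalam Raw Speech Corpus can be cross-checked against the headline totals with a few lines of arithmetic. The sketch below is only an illustration: the row values are copied from the listing, while the helper names `to_seconds` and `to_hms` are hypothetical and not part of any LDC-IL tooling.

```python
# Per-domain rows copied from the Malayalam Raw Speech Corpus listing:
# (domain, audio segments, duration as hh:mm:ss)
ROWS = [
    ("Contemporary Text (News)", 449, "71:29:21"),
    ("Creative Text", 449, "54:41:20"),
    ("Sentence", 7452, "06:56:46"),
    ("Date Format", 598, "00:53:45"),
    ("Command and Control Words", 8923, "07:09:37"),
    ("Person Name", 5819, "05:26:33"),
    ("Place Name", 2906, "02:28:24"),
    ("Most Frequent Word - Part", 8763, "06:51:31"),
    ("Most Frequent Word - Full Set", 1979, "02:08:58"),
    ("Phonetically Balanced", 3096, "02:40:09"),
    ("Form and Function - Word", 3236, "03:14:38"),
]


def to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert seconds back to an hh:mm:ss string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(count for _, count, _ in ROWS)
total_duration = to_hms(sum(to_seconds(duration) for _, _, duration in ROWS))

print(total_segments)  # 43670
print(total_duration)  # 164:01:02
```

Both sums agree with the headline figures of 43670 audio segments and 164:01:02 hours. The same check can be run on the other raw speech corpus tables by swapping in their rows.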
Malayalam Sentence Aligned Speech Corpus
Dataset Description: 123:29:55 hours | 79.6 GB | 89,269 Audio Segments | 451 speakers
The annotated speech corpus gives a wide range of linguistic information, especially useful for phonetic analysis. The LDC-IL Malayalam Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Malayalam script. This dataset spans a duration of 123:29:55 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 229 female and 222 male native Malayalam speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Malayalam Sentence Aligned Speech Documentation.
For any research-based citations, please use the following citations:
1. Rejitha K. S., Sajila S., Saritha S.L., Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Malayalam Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. 978-81-19411-58-0.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3
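For a rough sense of scale, the headline figures of the Malayalam Sentence Aligned Speech Corpus can be turned into average values. This is a minimal sketch using only the numbers listed above; the variable names are hypothetical, and the averages say nothing about how segment lengths are actually distributed.

```python
def to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Headline figures from the listing.
total_seconds = to_seconds("123:29:55")
audio_segments = 89_269
speakers = 451

# Simple averages; real segments and speakers will vary around these.
mean_segment_seconds = total_seconds / audio_segments
mean_minutes_per_speaker = total_seconds / speakers / 60

print(f"mean segment length: {mean_segment_seconds:.2f} s")
print(f"mean speech per speaker: {mean_minutes_per_speaker:.1f} min")
```

On these figures the average aligned segment is about five seconds long, which may help when estimating batch sizes or padding for sentence-level models.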
Manipuri Raw Speech Corpus
156:28:32 hours | 100 GB | 620 Speakers | 66,231 Audio Segments | 48 kHz | 16 bit wav
Manipuri is the administrative language of Manipur. The LDC-IL Speech Data for Manipuri aims to capture the distinctive characteristics of speech shared by the different regional dialects of Manipur. To that end, the dataset is designed, following standard parameters, to cover linguistic features such as regional tones and intonations, phonemic distributions, and varying pronunciations of both regional and non-regional vocabulary items such as person names and place names. For this read speech corpus, the subset read by each speaker is randomly generated from the entire dataset, so each speaker reads one random set. A limited number of Full Sets are read in their entirety by selected speakers in each age group. The data was collected through fieldwork from three regional dialects: Imphal, Kakching, and Awang Sekmai. The age groups selected for fieldwork are ‘16 to 20’, ‘21 to 50’, and ‘above 50 years’. Equal amounts of male and female data were collected from each age group.
The available Speech Corpus details:
Total Speakers: 620 (310 Female and 310 Male)
Domain | Audio Segments | Duration
Contemporary Text (News) | 530 | 59:47:22
Creative Text | 588 | 53:59:03
Sentence | 10,979 | 10:01:41
Date Format | 866 | 01:12:04
Command and Control Words | 13,129 | 08:00:02
Person Name | 8,789 | 07:14:04
Place Name | 4,394 | 02:46:29
Most Frequent Word - Part | 13,167 | 06:48:50
Most Frequent Word - Full Set | 6,992 | 02:48:42
Phonetically Balanced | 4,518 | 02:25:53
Form and Function - Word | 2,279 | 01:23:50
A detailed explanation of the Manipuri Speech Corpus will be available in the Manipuri Raw Speech Corpus Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Amom Nandaraj Meetei, Yumnam Premila Chanu & Longjam Anand Singh. 2019. Manipuri Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
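The raw speech corpora are listed as 48 kHz, 16 bit wav, so the storage needed for a given duration can be estimated before downloading. The sketch below assumes uncompressed PCM and ignores the small per-file WAV header. The listing does not state the channel count, so `channels` is an assumption and both one and two channels are printed; `pcm_size_bytes` is a hypothetical helper name.

```python
def to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def pcm_size_bytes(duration_hms: str, sample_rate_hz: int = 48_000,
                   bits_per_sample: int = 16, channels: int = 1) -> int:
    """Uncompressed PCM payload size in bytes (WAV headers not included)."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return to_seconds(duration_hms) * bytes_per_second


duration = "156:28:32"  # headline duration of the Manipuri Raw Speech Corpus

# The channel count is NOT given in the listing; both cases are shown.
for channels in (1, 2):
    size = pcm_size_bytes(duration, channels=channels)
    print(f"{channels} channel(s): {size / 10**9:.1f} GB (decimal), "
          f"{size / 2**30:.1f} GiB (binary)")
```

Comparing the printed estimates with the listed size (100 GB for this corpus) gives a quick plausibility check, keeping in mind that the listing does not say whether its GB figure is decimal or binary.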
Manipuri Sentence Aligned Speech Corpus (Bengali Script)
The LDC-IL Manipuri Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing orthographically normalized annotation in Bengali script. This dataset spans a duration of 123:29:55 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 295 female and 294 male native Manipuri speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Manipuri Sentence Aligned Speech Documentation.
For any research-based citations, please use the following citation:
Amom Nandaraj Meetei, Yumnam Premila Chanu, Rajesha N., Manasa G., Stephen Fernandes, Nithin S., Roopashri M. R., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. Manipuri Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. 978-93-48633-68-2.
Manipuri Sentence Aligned Speech Corpus (Meetei Mayek)
The LDC-IL Manipuri Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing orthographically normalized annotation in Meetei Mayek. This dataset spans a duration of 123:29:55 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 295 female and 294 male native Manipuri speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Manipuri Sentence Aligned Speech Documentation.
For any research-based citations, please use the following citation:
Amom Nandaraj Meetei, Yumnam Premila Chanu, Rajesha N., Manasa G., Stephen Fernandes, Nithin S., Roopashri M. R., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. Manipuri Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. 978-93-48633-96-5.
Marathi Raw Speech Corpus
89:17:25 Hours | 58 GB speech data | 307 Speakers | 58,544 Audio Segments | 48 kHz | 16 bit wav
Marathi is an Indo-Aryan language that was already prevalent in the 9th century. Standard Marathi (Puneri) is the official language of the State of Maharashtra and is based on the dialects used by academics and the print media. Marathi is believed to be influenced by Sanskrit. It is written in the Devanagari script, and its phoneme inventory is similar to that of many other Indo-Aryan languages. The LDC-IL speech data was collected from the regions of Marathwada, Pune, Vidarbha, and Goa, from both genders and different age groups. Each speaker recorded datasets that were randomly selected from a master dataset.
The available Speech Corpus details:
Total Speakers: 307 (156 Female and 151 Male)
Domain | Audio Segments | Duration
Contemporary Text (News) | 302 | 22:26:06
Creative Text | 302 | 13:37:34
Sentence | 7,555 | 6:49:58
Date Format | 604 | 0:39:57
Command and Control Words | 9,068 | 7:50:10
Person Name | 6,058 | 7:44:56
Place Name | 3,037 | 2:49:32
Most Frequent Word - Part | 9,104 | 7:22:57
Most Frequent Word - Full Set | 10,987 | 9:53:28
Phonetically Balanced | 4,609 | 4:10:47
Form and Function - Word | 6,918 | 5:52:00
A detailed explanation of the Marathi Speech Corpus will be available in the Marathi Speech Data Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Gajanan R Apine & Apurva P Betkekar. 2019. Marathi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
Marathi Sentence Aligned Speech Corpus
Dataset Description: 41:34:04 hours | 26.7 GB | 23,234 Audio Segments | 302 speakers
The annotated speech corpus gives a wide range of linguistic information, especially useful for phonetic analysis. The LDC-IL Marathi Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. This dataset spans a duration of 41:34:04 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 153 female and 149 male native Marathi speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Marathi Sentence Aligned Speech Documentation.
For any research-based citations, please use the following citations:
1. Bhageshree K Khandale, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Marathi Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. 978-81-19411-92-4.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3
Nepali Raw Speech Corpus
87:14:44 Hours | 56.5 GB | 350 Speakers | 48,975 Audio Segments | 48 kHz | 16 bit wav
Nepali belongs to the Indo-Aryan language family. It is the official language of Nepal, has official status in the Indian states of West Bengal and Sikkim, and is spoken in the states of Uttarakhand, Assam, Arunachal Pradesh, Manipur, Mizoram and Bihar, as well as in other countries such as Myanmar and Bhutan. It is written in the Devanagari script. The LDC-IL Nepali speech data was collected from the regions of Darjeeling, Assam and Dehradun, from both genders and different age groups. The LDC-IL Nepali Speech dataset consists of different types of datasets made up of word lists, sentences, running texts and date formats.
The available Speech Corpus details:
Total Speakers: 350 (187 Female and 163 Male)
Domain | Audio Segments | Duration
Contemporary Text (News) | 343 | 14:33:19
Creative Text | 341 | 19:46:34
Sentence | 8,583 | 13:45:34
Date Format | 1,029 | 00:57:20
Command and Control Words | 10,308 | 08:44:19
Person Name | 6,878 | 09:15:04
Place Name | 3,398 | 03:20:06
Most Frequent Word - Part | 10,292 | 08:51:06
Most Frequent Word - Full Set | 2,994 | 03:41:39
Phonetically Balanced | 3,321 | 03:00:08
Form and Function - Word | 1,488 | 01:19:35
A detailed explanation of the Nepali Speech Corpus will be available in the Nepali Speech Data Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Samar Sinha, Jeena Rai, Umesh Chamling Rai & Rupesh Rai. 2019. Nepali Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
Nepali Sentence Aligned Speech Corpus
Dataset Description: 43:04:23 hours | 27.7 GB | 21,481 Audio Segments | 346 speakers
The annotated speech corpus gives a wide range of linguistic information, especially useful for phonetic analysis. The LDC-IL Nepali Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. This dataset spans a duration of 43:04:23 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 187 female and 159 male native Nepali speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Nepali Sentence Aligned Speech Documentation.
For any research-based citations, please use the following citations:
1. Umesh Chamling Rai, Rupesh Rai, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Nepali Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. 978-81-19411-98-6.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3
Odia Raw Speech Corpus
138:06:18 hours | 89 GB | 474 Speakers | 73,418 Audio Segments | 48 kHz | 16 bit wav
Odia is an Indo-Aryan language spoken mainly in the state of Odisha and also in some bordering states such as West Bengal, Jharkhand, Chhattisgarh and Andhra Pradesh. It has been accorded Classical Language status by the Govt. of India. The LDC-IL Odia speech data was collected from the Central and Northern parts of Odisha, from both genders and different age groups. The data consists of different types of datasets made up of word lists, sentences, running texts and date formats.
The available Speech Corpus details:
Total Speakers: 474 (239 Female and 235 Male)
Domain | Audio Segments | Duration
Contemporary Text (News) | 449 | 42:49:56
Creative Text | 450 | 19:43:50
Sentence | 11,248 | 8:22:57
Date Format | 900 | 1:27:49
Command and Control Words | 13,499 | 14:18:49
Person Name | 8,998 | 5:01:40
Place Name | 4,496 | 13:22:45
Most Frequent Word - Part | 8,994 | 9:40:04
Most Frequent Word - Full Set | 10,989 | 10:21:04
Phonetically Balanced | 10,438 | 10:05:10
Form and Function - Word | 2,957 | 2:52:14
A detailed explanation of the Odia Speech Corpus will be available in the Odia Raw Speech Data Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Raja Kumar Naik, Pramod Kumar Rout, Kshirod Kumar Das & Santosh Kumar Mohanty. 2021. Odia Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
Punjabi Raw Speech Corpus
101:09:28 Hours | 65.5 GB | 467 Speakers | 76,230 Audio Segments | 48 kHz | 16 bit wav
Punjabi is an Indo-Aryan language. It is a tonal language with three tones: high-falling, low-rising, and level (neutral). Punjabi is spoken not only in India but also in Pakistan, where it is known as Shahmukhi Punjabi; this corpus covers only Indian (Gurmukhi) Punjabi. The Punjabi language has four different dialects, spoken in different sub-regions of Punjab. The LDC-IL Punjabi Speech dataset consists of different types of datasets made up of word lists, sentences, running texts and date formats. Each speaker recorded datasets that were randomly selected from a master dataset. LDC-IL collected speech data from the Malwa, Doab and Puadh regions.
The available Speech Corpus details:
Total Speakers: 467 (234 Female and 233 Male)
Domain | Audio Segments | Duration
Contemporary Text (News) | 448 | 27:07:41
Creative Text | 446 | 19:29:15
Sentence | 11,168 | 08:58:33
Date Format | 887 | 00:27:53
Command and Control Words | 13,274 | 07:49:16
Person Name | 8,949 | 10:28:40
Place Name | 4,473 | 03:17:02
Most Frequent Word - Part | 8,889 | 05:21:56
Most Frequent Word - Full Set | 3,988 | 02:52:44
Phonetically Balanced | 13,939 | 08:56:04
Form and Function - Word | 9,769 | 06:24:07
A detailed explanation of the Punjabi Speech Corpus will be available in the Punjabi Speech Data Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Poonam Dhillon & Sarbjeet Kaur. 2019. Punjabi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
Punjabi Sentence Aligned Speech Corpus
52:24:51 hours | 34.8 GB | 31,338 Audio Segments | 449 Speakers
The LDC-IL Punjabi Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Gurmukhi script. This dataset spans a duration of 52:24:51 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. A comprehensive explanation of the dataset can be found in the Punjabi Sentence Aligned Speech Documentation.
For any research-based citations, please use the following citations:
Dr. Shalinder Singh, Rajesha N., Manasa G., Stephen Fernandes, Nithin S., Roopashri M. R., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. Punjabi Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. 978-93-48633-69-9.
Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.
Tamil Raw Speech Corpus
139:11:41 Hours | 86 GB speech data | 452 Speakers | 60,287 Audio Segments | 48 kHz | 16 bit wav
Tamil is one of the longest-surviving classical languages in the world and one of the prominent languages of the Dravidian language family. It is widely spoken in the state of Tamil Nadu, the Union Territory of Pondicherry and Sri Lanka, in countries such as Burma, Malaysia, Singapore, Indonesia, Indochina, Fiji, South Africa and British Guiana, and on islands such as Mauritius and Madagascar. It has official status in the Indian state of Tamil Nadu and the Indian Union Territory of Pondicherry, and is an official language in some other countries, such as Sri Lanka and Singapore. Tamil has its own script. The language is highly agglutinative in nature. It is characterized by phonological simplicity, morphological parity and primitiveness. All of its affixes are separable and significant, and it lacks a nominative case termination and arbitrary words. The LDC-IL speech data was collected from the regions of Kongu, Kumari, Madurai, Nellai, Salem and Thanjai, from both genders and different age groups. Each speaker recorded datasets that were randomly selected from a master dataset.
The available Speech Corpus details:
Total Speakers: 452 (214 Female and 219 Male)
Domain | Audio Segments | Duration
Contemporary Text (News) | 433 | 57:53:48
Creative Text | 429 | 14:21:31
Sentence | 10,764 | 14:51:03
Date Format | 842 | 01:20:17
Command and Control Words | 12,882 | 12:57:06
Person Name | 8,755 | 03:57:29
Place Name | 4,002 | 10:34:38
Most Frequent Word - Part | 12,813 | 11:14:05
Most Frequent Word - Full Set | 2,000 | 02:26:05
Phonetically Balanced | 3,860 | 04:55:10
Form and Function - Word | 3,507 | 04:40:29
A detailed explanation of the Tamil Speech Corpus will be available in the Tamil Speech Data Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Thennarasu S, Prem Kumar L R, Amudha R, Prabagaran R, Srikanth D. 2021. Tamil Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
Tamil Sentence Aligned Speech Corpus
Dataset Description: 74:57:59 hours | 46.4 GB | 48,572 Audio Segments | 433 speakers
The annotated speech corpus gives a wide range of linguistic information, especially useful for phonetic analysis. The LDC-IL Tamil Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Tamil script. This dataset spans a duration of 74:57:59 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 214 female and 219 male native Tamil speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Tamil Sentence Aligned Speech Documentation.
For any research-based citations, please use the following citations:
1. Amudha R., Kamaraj S., Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Tamil Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. 978-81-19411-26-9.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3