Central Institute of Indian Languages

Gujarati Raw Speech Corpus

requests (16)

57:17:08 Hours | 37 GB | 204 Speakers | 25,712 Audio Segments | 48 kHz | 16 bit wav

Gujarati is one of the major literary languages of India and is the official language of the state of Gujarat and of the union territories of Daman and Diu and Dadra and Nagar Haveli. For convenience, LDC-IL treats Gujarati as having four dialects: South Gujarat, Central Gujarat, North Gujarat and Saurashtra. LDC-IL has 57:17:08 hours of Gujarati raw speech data. The LDC-IL Gujarati Raw Speech dataset consists of different types of datasets made up of word lists, sentences, texts and date formats. Approximately 15 minutes of speech per speaker was taken from 96 female and 108 male native Gujarati speakers of different age groups. Each speaker recorded datasets randomly selected from a master dataset.

The available Speech Corpus details:
Total Speakers: 204 (96 Female and 108 Male)
Domain | Audio Segments | Duration (hh:mm:ss)
Contemporary Text (News) | 204 | 15:21:28
Creative Text | 202 | 11:34:29
Sentence | 5081 | 5:48:32
Date | 404 | 0:41:39
Command and Control Words | 6006 | 7:17:22
Person Name | 4079 | 6:36:02
Place Name | 2041 | 2:33:20
Most Frequent Word - Part | 4236 | 5:18:47
Most Frequent Word - Full Set | 2000 | 1:13:39
Phonetically Balanced | 1378 | 0:51:50

A detailed explanation of the Gujarati Raw Speech Corpus will be available in the Gujarati Raw Speech Documentation.

For any research-based citations, please use the following citations:
Ramamoorthy L., Narayan Kumar Choudhary, Mona Parakh, Hiren Gadhavi R, Solanki Mahesh kumar R, Rejitha K. S., Rajesha N., Manasa G. 2021. Gujarati Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
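As a quick consistency check, the per-domain durations of the Gujarati Raw Speech Corpus can be summed and compared with the headline total of 57:17:08. The sketch below is illustrative only: the helper names `to_seconds` and `to_hms` are hypothetical, and the duration values are one reading of the listing's figures, assuming every duration has the form h:mm:ss.

```python
# Per-domain durations for the Gujarati Raw Speech Corpus, as read from the
# listing above (assumed to be in h:mm:ss form).
GUJARATI_DOMAIN_DURATIONS = {
    "Contemporary Text (News)": "15:21:28",
    "Creative Text": "11:34:29",
    "Sentence": "5:48:32",
    "Date": "0:41:39",
    "Command and Control Words": "7:17:22",
    "Person Name": "6:36:02",
    "Place Name": "2:33:20",
    "Most Frequent Word - Part": "5:18:47",
    "Most Frequent Word - Full Set": "1:13:39",
    "Phonetically Balanced": "0:51:50",
}


def to_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to an 'h:mm:ss' string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total = sum(to_seconds(d) for d in GUJARATI_DOMAIN_DURATIONS.values())
print(to_hms(total))  # 57:17:08
```

With this reading the domain durations add up to the stated corpus total, which is also what determines where the run-together segment counts and durations split.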

Gujarati Raw Speech Corpus(Mono Recordings)

requests (14)

64:44:02 Hours | 7.1 GB | 233 Speakers | 26,223 Audio Segments | 16 kHz | 16 bit wav

Gujarati is one of the major literary languages of India and is the official language of the state of Gujarat and of the union territories of Daman and Diu and Dadra and Nagar Haveli. For convenience, LDC-IL treats Gujarati as having four dialects: South Gujarat, Central Gujarat, North Gujarat and Saurashtra. LDC-IL has 64:44:02 hours of Gujarati raw speech data as mono recordings. The LDC-IL Gujarati Raw Speech dataset consists of different types of datasets made up of word lists, sentences, texts and date formats. Approximately 15 minutes of speech per speaker was taken from 124 female and 109 male native Gujarati speakers of different age groups. Each speaker recorded datasets randomly selected from a master dataset.

The available Speech Corpus details:
Total Speakers: 233 (124 Female and 109 Male)
Domain | Audio Segments | Duration (hh:mm:ss)
Contemporary Text (News) | 233 | 12:52:46
Creative Text | 232 | 13:30:15
Sentence | 5824 | 7:12:17
Date Format | 466 | 0:59:31
Command and Control Words | 6985 | 9:43:07
Person Name | 4644 | 8:34:44
Place Name | 2322 | 3:17:06
Phonetically Balanced | 4131 | 6:28:15
Form and Function - Word | 1386 | 2:06:01

A detailed explanation of the Gujarati Raw Speech Corpus (Mono Recordings) will be available in the Gujarati Raw Speech (Mono Recordings) Documentation.

For any research-based citations, please use the following citations:
Ramamoorthy L., Narayan Kumar Choudhary, Mona Parakh, Rejitha K. S., Rajesha N., Manasa G. 2021. Gujarati Raw Speech Corpus (Mono Recordings). Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
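The headline figures (duration, sample rate, bit depth) are enough for a rough estimate of how much uncompressed PCM audio a corpus holds. The sketch below is an illustration, not part of the LDC-IL documentation: the function name `pcm_bytes` is hypothetical, one channel is assumed because the corpus is described as mono recordings, and WAV header overhead is ignored. The listing does not say whether "GB" means 10^9 bytes or 2^30 bytes, so both are printed.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def pcm_bytes(duration_s: int, sample_rate_hz: int, bit_depth: int, channels: int = 1) -> int:
    """Size of raw PCM audio: samples per second x bytes per sample x channels x seconds."""
    bytes_per_sample = bit_depth // 8
    return duration_s * sample_rate_hz * bytes_per_sample * channels


# Gujarati Raw Speech Corpus (Mono Recordings): 64:44:02 at 16 kHz, 16 bit.
# channels=1 is an assumption based on the corpus title.
duration = to_seconds("64:44:02")
size = pcm_bytes(duration, sample_rate_hz=16_000, bit_depth=16, channels=1)

print(size)                      # 7457344000
print(round(size / 10**9, 2))    # size in decimal gigabytes
print(round(size / 2**30, 2))    # size in binary gibibytes
```

The estimate lands in the same range as the 7.1 GB quoted in the listing; any difference could come from file headers, rounding, or how the site measures size.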

Hindi Raw Speech Corpus

requests (40)

121:00:06 Hours | 76.6 GB | 488 Speakers | 70,686 Audio Segments | 48 kHz | 16 bit wav

Hindi is a major Indo-Aryan language, a descendant of Sanskrit, spoken in central and northern India in the states of Bihar, Chhattisgarh, Delhi, Haryana, Himachal Pradesh, Jharkhand, Madhya Pradesh, Rajasthan, Uttarakhand, and Uttar Pradesh. The LDC-IL speech data was collected from the Awadhi, Bhojpuri, Magahi and Khariboli belts, from both genders and different age groups. The LDC-IL Hindi speech data runs to 121:00:06 hours. The LDC-IL Hindi Speech dataset consists of different types of datasets made up of word lists, sentences, running texts and date formats.

The available Speech Corpus details:
Total Speakers: 488 (234 Female and 254 Male)
Domain | Audio Segments | Duration (hh:mm:ss)
Contemporary Text (News) | 457 | 37:22:29
Creative Text | 463 | 29:24:08
Sentence | 10173 | 8:41:17
Date Format | 764 | 0:46:56
Command and Control Words | 12284 | 8:34:51
Person Name | 8171 | 9:55:25
Place Name | 4085 | 3:14:44
Most Frequent Word - Part | 12315 | 8:09:10
Most Frequent Word - Full Set | 6994 | 4:30:14
Phonetically Balanced | 11986 | 8:23:43
Form and Function - Word | 2994 | 1:57:09

A detailed explanation of the Hindi Speech Corpus will be available in the Hindi Speech Data Documentation.

For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Jitendra Kumar Singh, Richa, Anjali Sinha, Dheeraj Kumar Mishra, Arimardan Kumar Tripathi & Satyaendra Kumar Awasthi. 2019. Hindi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Hindi Sentence Aligned Speech Corpus

requests (10)

Dataset Description: 72:34:52 hours | 45.9 GB | 42,275 Audio Segments | 473 speakers

The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Hindi Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. This dataset spans a duration of 72:34:52 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 225 female and 248 male native Hindi speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Hindi Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:
1. Satyaendra Kumar Awasthi, Ankita Tiwari, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Hindi Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-28-3.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3

Indian English-Bengali Sentence Aligned Speech Corpus

requests (4)

Dataset Description: 09:21:08 hours | 5.53 GB | 5,676 Audio Segments | 52 speakers

The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Indian English-Bengali variant Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer in Roman script. This dataset spans a duration of 09:21:08 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 26 female and 26 male native Bengali speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Indian English-Bengali variant Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:
1. Rejitha K. S., Poulami Das, Rajesha N., Manasa G., Srikanth D., Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Indian English-Bengali variant Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-43-6.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3

Indian English-Kannada Sentence Aligned Speech Corpus

requests (5)

Dataset Description: 11:17:40 hours | 7.27 GB | 6,166 Audio Segments | 53 speakers

The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Indian English-Kannada variant Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer in Roman script. This dataset spans a duration of 11:17:40 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 26 female and 27 male native Kannada speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Indian English-Kannada variant Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:
1. Rejitha K. S., Vijayalaxmi F. Patil, Rajesha N., Manasa G., Srikanth D., Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Indian English-Kannada variant Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-35-1.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3

Kannada Raw Speech Corpus

requests (30)

179:32:52 Hours | 115 GB | 656 Speakers | 99,109 Audio Segments | 48 kHz | 16 bit wav

Kannada is one of the ancient Indian languages belonging to the Dravidian family. It has its own script. The language as spoken in a region is influenced by the other languages of that region, the mother tongue of the speaker, and similar factors. Reading speed, loudness, frequency and so on also differ with factors such as age and gender. LDC-IL identified four regional dialects and collected the speech corpus through fieldwork. This read speech data was collected from male and female native speakers of various age groups, in equal numbers. The data includes texts, sentences, date formats, and different word lists.

The available Speech Corpus details:
Total Speakers: 656 (328 Female and 328 Male)
Domain | Audio Segments | Duration (hh:mm:ss)
Contemporary Text (News) | 600 | 66:06:09
Creative Text | 600 | 33:09:20
Sentence | 14,887 | 13:58:15
Date Format | 1,200 | 1:16:22
Command and Control Words | 17,988 | 12:31:43
Person Name | 12,009 | 13:04:49
Place Name | 6,032 | 4:48:42
Most Frequent Word - Part | 18,065 | 12:21:24
Most Frequent Word - Full Set | 8,000 | 02:08:58
Phonetically Balanced | 9,360 | 02:40:58
Form and Function - Word | 10,368 | 03:14:38

A detailed explanation of the Kannada Speech Corpus will be available in the Kannada Speech Data Documentation.

For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Vijayalaxmi F. Patil, Chetan Suryakant Baji, Malini N. Abhyankar, Rajesha N. & Manasa G. 2019. Kannada Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Kannada Sentence Aligned Speech Corpus

requests (8)

Dataset Description: 107:48:50 hours | 69.4 GB | 65,533 Audio Segments | 600 speakers

The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Kannada Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Kannada script. This dataset spans a duration of 107:48:50 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 300 female and 300 male native Kannada speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Kannada Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:
1. Vijayalaxmi F. Patil, Chetan Baji, Kavitha Lenin, Reshma S., Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Kannada Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-19-1.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3

Kashmiri Raw Speech Corpus

requests (17)

28:10:07 Hours | 18 GB | 150 Speakers | 16,380 Audio Segments | 48 kHz | 16 bit wav

Kashmiri belongs to the Dardic group of the Indo-Aryan family. It is known by the names ‘Kashur’ and ‘Kashmiri’. It is primarily spoken in the Kashmir valley and the Pir-Panchal range of the Jammu region. Kashmiri has two types of dialects: regional dialects and social dialects. Apart from the Kashmiri spoken in the valley itself, other varieties are spoken outside the valley, and these are considered the regional dialects of Kashmiri: Kishtawari, Poguli and Rambani. Kashmiri also has three social dialects, known by the names Yamraz, Marak and Kamraz. The LDC-IL speech data was collected in the Kashmir valley, from Pulwama, Srinagar, and Anantnag, and from both genders and different age groups. The LDC-IL Kashmiri Speech data consists of different types of datasets made up of words, sentences, running texts and date formats. Each speaker recorded datasets randomly selected from a master dataset.

The available Speech Corpus details:
Total Speakers: 150 (78 Female and 72 Male)
Domain | Audio Segments | Duration (hh:mm:ss)
Contemporary Text (News) | 147 | 3:56:57
Creative Text | 148 | 12:41:33
Sentence | 3704 | 2:40:24
Date Format | 281 | 0:10:36
Command and Control Words | 4288 | 3:04:32
Person Name | 2065 | 1:53:21
Place Name | 1468 | 1:04:37
Most Frequent Word - Part | 4279 | 2:38:07

A detailed explanation of the Kashmiri Speech Corpus will be available in the Kashmiri Speech Data Documentation.

For any research-based citations, please use the following citations:
Narayan Kumar Choudhary, Shahid Mushtaq Bhatt, Rajesha N., Manasa G. 2021. Kashmiri Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Konkani Raw Speech Corpus

requests (19)

156:37:51 Hours | 100 GB | 504 Speakers | 72,938 Audio Segments | 48 kHz | 16 bit wav

Konkani belongs to the Indo-European family of languages and is the official language of Goa. The language is also spoken widely across four states: Maharashtra, Goa, Karnataka and Kerala. Konkani is the only Indian language written in five different scripts: Devanagari, Roman, Kannada, Malayalam, and Persian-Arabic. The LDC-IL speech data was collected from the regions of North Goa, South Goa, Karwar (Karnataka) and Sindhudurg (Maharashtra), from both genders and different age groups. Approximately 15 to 20 minutes of speech per speaker was taken from 267 female and 237 male native speakers of different age groups. Each speaker recorded datasets randomly selected from a master dataset.

The available Speech Corpus details:
Total Speakers: 504 (267 Female and 237 Male)
Domain | Audio Segments | Duration (hh:mm:ss)
Contemporary Text (News) | 477 | 49:52:09
Creative Text | 480 | 22:09:05
Sentence | 12,050 | 15:51:11
Date Format | 953 | 01:50:39
Command and Control Words | 14,944 | 16:11:02
Person Name | 9,588 | 15:55:43
Place Name | 4,812 | 05:31:03
Most Frequent Word - Part | 16,376 | 16:03:13
Most Frequent Word - Full Set | 5,998 | 05:55:07
Phonetically Balanced | 2,975 | 02:49:36
Form and Function - Word | 4,285 | 04:29:03

A detailed explanation of the Konkani Speech Corpus will be available in the Konkani Speech Data Documentation.

For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Saurabh Varik & Rashmi Shet Tanawade. 2019. Konkani Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
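The "15 to 20 minutes of speech per speaker" figure for the Konkani Raw Speech Corpus can be cross-checked against the headline totals by dividing the corpus duration by the number of speakers. This is a rough sketch with a hypothetical helper name (`minutes_per_speaker`); it yields only a mean, and says nothing about how evenly the recordings are spread across speakers.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def minutes_per_speaker(total_duration: str, speakers: int) -> float:
    """Mean recorded minutes per speaker, from headline corpus figures."""
    return to_seconds(total_duration) / speakers / 60


# Konkani Raw Speech Corpus: 156:37:51 hours from 504 speakers.
konkani_avg = minutes_per_speaker("156:37:51", 504)
print(f"{konkani_avg:.1f} minutes per speaker")
```

For the Konkani figures the mean falls inside the stated 15 to 20 minute range.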

Konkani Sentence Aligned Speech Corpus

requests (13)

Dataset Description: 83:19:42 hours | 53.5 GB | 34,091 Audio Segments | 487 speakers

The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Konkani Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. This dataset spans a duration of 83:19:42 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 259 female and 228 male native Konkani speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Konkani Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:
1. Saurabh Varik, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Konkani Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-62-7.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3

Maithili Raw Speech Corpus

requests (20)

78:45:33 Hours | 49.2 GB | 306 Speakers | 45,198 Audio Segments | 48 kHz | 16 bit wav

Maithili is an Indo-Aryan language, a direct descendant of Sanskrit, spoken in the states of Bihar and Jharkhand and in part of Nepal. It is one of the scheduled languages of India. The LDC-IL speech data was collected from the geographic dialects Sotipura, Bajjika and Thethi, from both genders and different age groups.

The available Speech Corpus details:
Total Speakers: 306 (150 Female and 156 Male)
Domain | Audio Segments | Duration (hh:mm:ss)
Contemporary Text (News) | 291 | 22:33:41
Creative Text | 294 | 15:34:55
Sentence | 7,451 | 07:08:48
Date Format | 585 | 00:31:41
Command and Control Words | 8,924 | 07:07:34
Person Name | 5,917 | 07:49:33
Place Name | 2,952 | 02:47:49
Most Frequent Word - Part | 8,699 | 06:56:24
Most Frequent Word - Full Set | 5,996 | 04:58:30
Phonetically Balanced Words | 3,040 | 02:26:27
Form and Function Words | 1,049 | 00:50:11

A detailed explanation of the Maithili Speech Corpus will be available in the Maithili Speech Data Documentation.

For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Arun Kumar Singh, Dinesh Mishra & Atuleshwar Jha. 2019. Maithili Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Maithili Raw Speech Corpus Vol. II

requests (3)

109:09:50 hours | 206 Audio Segments | 122 Speakers

The LDC-IL Maithili Raw Speech dataset Vol. II comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. This dataset spans a duration of 109:09:50 (hh:mm:ss), consisting of read speech with continuous text and spontaneous speech, along with its transcription in Devanagari. The data is derived from 49 female and 73 male native Maithili speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Maithili Raw Speech Documentation.

For any research-based citations, please use the following citations:
Shantanu Kumar, Ankita Tiwari, Rajesha N., Manasa G., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. Maithili Raw Speech Corpus Vol. II. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-37-8.
Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0.

Maithili Sentence Aligned Speech Corpus

requests (10)

Dataset Description: 41:54:30 hours | 26 GB | 21,412 Audio Segments | 300 speakers

The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Maithili Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. This dataset spans a duration of 41:54:30 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 147 female and 153 male native Maithili speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Maithili Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:
1. Shantanu Kumar, Dinesh Mishra, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Maithili Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-96-2.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3

Maithili Sentence Aligned Speech Corpus (Tirhuta Script)

requests (2)

41:54:30 hours | 26 GB | 21,412 Audio Segments | 300 speakers

The LDC-IL Maithili Sentence Aligned Speech Corpus (Tirhuta Script) dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Tirhuta script. This dataset spans a duration of 41:54:30 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 147 female and 153 male native Maithili speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the LDC-IL Maithili Sentence Aligned Speech Corpus (Tirhuta Script) Documentation.

For any research-based citations, please use the following citations:
Dinesh Mishra, Shantanu Kumar, Dr. Narayan Kumar Choudhary, Rajesha N., Prof. Shailendra Mohan. Maithili Sentence Aligned Speech Corpus (Tirhuta Script). Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-51-4.
Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0.