Raw Speech Corpus

Assamese Raw Speech Corpus

requests (17)

54:21:12 Hours | 32.5 GB | 304 Speakers | 37,570 Audio Segments | 48 kHz | 16 bit wav

Assamese is the official language of Assam. It is widely spoken in the state of Assam and in some parts of Arunachal Pradesh and Nagaland. According to the 2011 census, Assamese is spoken by 15 million speakers. As a widely spoken language, Assamese has several dialectal variations. The regional dialects can be broadly divided into two parts: the Eastern Group and the Western Group. LDC-IL divided the Assamese-speaking areas into four regions (Xiboxagoria, Central Assam, Kamrupi and Goalparia) and collected speech data from speakers in each region. The LDC-IL Assamese Speech data set consists of different types of datasets that are made up of word lists, sentences, running texts and date formats.

The available Speech Corpus details:

Total Speakers: 304 (154 Female and 150 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 304 | 17:23:25
Creative Text | 304 | 11:44:37
Sentence | 7593 | 5:55:29
Date Format | 599 | 0:33:59
Command and Control Words | 9118 | 4:56:49
Person Name | 6081 | 5:38:07
Place Name | 3044 | 1:58:33
Phonetically Balanced-W | 46567 | 3:41:45
Form and Function-Word-W | 5396 | 02:28:28

A detailed explanation of the Assamese Speech Corpus will be available in the Assamese Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy L., Narayan Kumar Choudhary, Atreyee Sharma, Jahnobi Kalita, Samhita Bharadwaj, Plabita Bora, Priyanshee Adhyapak, Mustafiza Tamim, Rajesha N., Manasa G. 2021. Assamese Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Bengali Raw Speech Corpus

requests (19)

128:46:59 Hours | 81.2 GB | 476 Speakers | 73,470 Audio Segments | 48 kHz | 16 bit wav

Bengali is the official language of West Bengal and Tripura. It belongs to the Indo-Aryan language family. Bengali is influenced by Sanskrit. Greater use of Bengali has contributed to the growth of the language in terms of vocabulary and the number of styles and registers. Bengali is spoken over the whole of West Bengal, Tripura and Bangladesh and in some parts of Bihar, Odisha and Assam. Bengali refugees who settled in Andaman after 1950 have also carried the language there.

LDC-IL Bengali Speech data is collected from the regions of Standard Colloquial (Central Bengal) and Barendri (North Bengal). The LDC-IL Bengali Speech data set consists of different types of datasets that are made up of word lists, sentences, running texts and date formats.

The available Speech Corpus details:

Total Speakers: 476 (236 Female and 240 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 450 | 35:05:07
Creative Text | 448 | 20:16:13
Sentence | 11,239 | 16:05:22
Date Format | 414 | 0:26:48
Command and Control Words | 13,477 | 14:00:24
Person Name | 9,012 | 4:56:22
Place Name | 4,498 | 1:45:35
Most Frequent Word - Part | 13,525 | 13:33:14
Most Frequent Word - Full Set | 5,978 | 6:47:05
Phonetically Balanced | 9,489 | 10:23:08
Form and Function - Word | 4,940 | 5:27:41

A detailed explanation of the Bengali Speech Corpus will be available in the Bengali Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Sonali Sutradhar, Priyanka Biswas, Arundhati Sengupta, Sankarshan Dutta & Priyanka Das. 2019. Bengali Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Bodo Raw Speech Corpus

requests (14)

176:53:28 Hours | 113 GB | 456 Speakers | 77,443 Audio Segments | 48 kHz | 16 bit wav

Bodo, one of the scheduled languages of India, is one of the tonal languages of the world. There are two clearly distinguishable tones in Bodo, known as Low and High. The language belongs to the Tibeto-Burman linguistic family. It is the language of the Bodos, one of the major tribes of the Indian state of Assam.

The LDC-IL Bodo speech data is collected from the Chirang, Baksa, Sonitpur, Udalguri, Kamrup, Barpeta and Kokrajhar districts of Assam, covering the Bwrdwnari, Eastern and Standard dialects. The data is collected from both genders and different age groups.

The available Speech Corpus details:

Total Speakers: 456 (220 Female and 236 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 411 | 53:47:56
Creative Text | 413 | 26:47:07
Sentence | 10,257 | 09:16:54
Date Format | 938 | 01:58:08
Command and Control Words | 12,348 | 14:19:32
Person Name | 8,222 | 14:49:44
Place Name | 4,115 | 05:17:14
Most Frequent Word - Part | 12,397 | 14:34:05
Most Frequent Word - Full Set | 6,994 | 04:30:14
Phonetically Balanced | 15,999 | 20:07:33
Form and Function - Word | 6,383 | 08:28:25

A detailed explanation of the Bodo Speech Corpus will be available in the Bodo Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Bridul Basumatary & Farson Daimary. 2019. Bodo Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Chhattisgarhi Raw Speech Corpus

requests (8)

Dataset Description: 138:09:27 Hours | 88.9 GB | 140 Speakers | 359 Audio Segments | 48 kHz | 16 bit wav

LDC-IL has taken a positive step in its approach towards the mother tongues spoken in India, which reflects a greater effort to support and promote linguistic variety in the nation. The collection of Chhattisgarhi speech data is a major part of this approach. This step towards developing language technology for Indian mother tongues will contribute to their overall enrichment and empowerment.

The Chhattisgarhi raw speech corpus is made up of recordings of native Chhattisgarhi speakers from various parts of the state of Chhattisgarh, and it represents a wide range of Chhattisgarhi varieties as they are spoken in various locations by diverse speakers. Speakers from various age groups read prompted extracts from literary and news texts. Spontaneous speech has also been collected.

A detailed explanation of the Chhattisgarhi Raw Speech Corpus will be available in the Chhattisgarhi Raw Speech Data Documentation.

For any research-based citations, please use the following citations:

1. Satyaendra Kumar Awasthi, Ankita Tiwari, Narayan Kumar Choudhary. 2023. Chhattisgarhi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

2. Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3.

Dogri Raw Speech Corpus

requests (16)

17:10:26 Hours | 11 GB speech data | 61 Speakers | 12,036 Audio segments | 48 kHz | 16 bit wav

Dogri, the language of the Dogras, belongs to the Indo-Aryan group and is the first major language of the multilingual region, i.e. Jammu, of the state of Jammu & Kashmir. It derives its name from ‘Duggar’, the ancient title of this region. Dogri is a morphologically rich language with a predominant Subject-Object-Verb (SOV) word order and, like many Indian languages, the flexibility to rearrange constituents.

Dogri had its own script, “Dogare Akkhar” or “Dogare”, based on the Takri script, which is closely related to the Sharada script employed by the Kashmiri language. This script was the official script during the reign of Maharaja Ranbir Singh (1857-1885 AD). After independence, the state government constituted a committee on 29th October, 1953, headed by Sh. Girdhari Lal Dogra. The committee presented a report, and accordingly the state government decided to adopt Devanagari as well as the Persian script for Dogri; this was incorporated in the State Constitution in 1957.

The LDC-IL speech data is collected from Jammu, from both genders and different age groups. The LDC-IL Dogri Speech data set consists of different types of datasets that are made up of words, sentences, running texts and date formats. Each speaker recorded datasets that were randomly selected from a master dataset.

The available Speech Corpus details:

Total Speakers: 61 (30 Female and 31 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 60 | 4:27:51
Creative Text | 61 | 2:51:42
Sentence | 1527 | 1:24:48
Date Format | 122 | 0:14:07
Command and Control Words | 1830 | 1:24:31
Person Name | 1222 | 1:23:41
Place Name | 609 | 0:29:10
Most Frequent Word - Part | 1831 | 1:18:06
Most Frequent Word - Full Set | 2000 | 1:16:27
Phonetically Balanced | 2050 | 1:50:38
Form and Function - Word | 724 | 0:29:25

A detailed explanation of the Dogri Speech Corpus will be available in the Dogri Raw Speech Documentation.

For any research-based citations, please use the following citations:

Narayan Kumar Choudhary, Sunil Kumar Choudhary, Rajesha N., Manasa G. 2021. Dogri Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
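The per-domain figures listed for the Dogri corpus can be cross-checked against its headline totals (12,036 audio segments, 17:10:26 hours). The following is a minimal Python sketch of that check; the names `DOGRI_DOMAINS`, `parse_duration` and `format_duration` are illustrative only and are not part of any LDC-IL tooling.

```python
from datetime import timedelta

# Per-domain figures copied from the Dogri listing:
# (domain, audio segments, duration as h:mm:ss).
DOGRI_DOMAINS = [
    ("Contemporary Text (News)", 60, "4:27:51"),
    ("Creative Text", 61, "2:51:42"),
    ("Sentence", 1527, "1:24:48"),
    ("Date Format", 122, "0:14:07"),
    ("Command and Control Words", 1830, "1:24:31"),
    ("Person Name", 1222, "1:23:41"),
    ("Place Name", 609, "0:29:10"),
    ("Most Frequent Word - Part", 1831, "1:18:06"),
    ("Most Frequent Word - Full Set", 2000, "1:16:27"),
    ("Phonetically Balanced", 2050, "1:50:38"),
    ("Form and Function - Word", 724, "0:29:25"),
]


def parse_duration(text: str) -> timedelta:
    """Parse an h:mm:ss string (hours may exceed 24) into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_duration(delta: timedelta) -> str:
    """Format a timedelta back into h:mm:ss without rolling hours into days."""
    total = int(delta.total_seconds())
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(segments for _, segments, _ in DOGRI_DOMAINS)
total_duration = sum(
    (parse_duration(duration) for _, _, duration in DOGRI_DOMAINS),
    timedelta(),
)

print(total_segments)                   # 12036
print(format_duration(total_duration))  # 17:10:26
```

Both sums agree with the Dogri summary line. The same check can be pointed at any other entry by swapping in that entry's rows.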

Gujarati Raw Speech Corpus

requests (16)

57:17:08 Hours | 37 GB | 204 Speakers | 25,712 Audio Segments | 48 kHz | 16 bit wav

Gujarati is one of the major literary languages of India. It is the official language of the state of Gujarat and of the union territories of Daman and Diu and Dadra and Nagar Haveli. For convenience, LDC-IL treats Gujarati as having four dialects, namely South Gujarat, Central Gujarat, North Gujarat and Saurashtra.

LDC-IL has 57:17:08 hours of Gujarati raw speech data. The LDC-IL Gujarati Raw Speech data set consists of different types of datasets that are made up of word lists, sentences, texts and date formats. Approximately 15 minutes of speech per speaker was taken from 96 female and 108 male Gujarati mother-tongue speakers of different age groups. Each speaker recorded datasets that were randomly selected from a master dataset.

The available Speech Corpus details:

Total Speakers: 204 (96 Female and 108 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 204 | 15:21:28
Creative Text | 202 | 11:34:29
Sentence | 5081 | 5:48:32
Date | 404 | 0:41:39
Command and Control Words | 6006 | 7:17:22
Person Name | 4079 | 6:36:02
Place Name | 2041 | 2:33:20
Most Frequent Word - Part | 4236 | 5:18:47
Most Frequent Word - Full Set | 2000 | 1:13:39
Phonetically Balanced | 1378 | 0:51:50

A detailed explanation of the Gujarati Raw Speech Corpus will be available in the Gujarati Raw Speech Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy L., Narayan Kumar Choudhary, Mona Parakh, Hiren Gadhavi R, Solanki Mahesh kumar R, Rejitha K. S., Rajesha N., Manasa G. 2021. Gujarati Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Gujarati Raw Speech Corpus (Mono Recordings)

requests (14)

64:44:02 Hours | 7.1 GB | 233 Speakers | 26,223 Audio Segments | 16 kHz | 16 bit wav

Gujarati is one of the major literary languages of India. It is the official language of the state of Gujarat and of the union territories of Daman and Diu and Dadra and Nagar Haveli. For convenience, LDC-IL treats Gujarati as having four dialects, namely South Gujarat, Central Gujarat, North Gujarat and Saurashtra.

LDC-IL has 64:44:02 hours of Gujarati raw speech data as mono recordings. The LDC-IL Gujarati Raw Speech data set consists of different types of datasets that are made up of word lists, sentences, texts and date formats. Approximately 15 minutes of speech per speaker was taken from 124 female and 109 male Gujarati mother-tongue speakers of different age groups. Each speaker recorded datasets that were randomly selected from a master dataset.

The available Speech Corpus details:

Total Speakers: 233 (124 Female and 109 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 233 | 12:52:46
Creative Text | 232 | 13:30:15
Sentence | 5824 | 7:12:17
Date Format | 466 | 0:59:31
Command and Control Words | 6985 | 9:43:07
Person Name | 4644 | 8:34:44
Place Name | 2322 | 3:17:06
Phonetically Balanced | 4131 | 6:28:15
Form and Function - Word | 1386 | 2:06:01

A detailed explanation of the Gujarati Raw Speech Corpus (Mono Recordings) will be available in the Gujarati Raw Speech (Mono Recordings) Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy L., Narayan Kumar Choudhary, Mona Parakh, Rejitha KS, Rajesha N., Manasa G. 2021. Gujarati Raw Speech Corpus (Mono Recordings). Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
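As a rough plausibility check, the size of the mono corpus can be estimated from its duration, sample rate and bit depth. The sketch below assumes uncompressed PCM with a single channel (taken from "Mono Recordings" in the title) and ignores WAV header overhead; `pcm_size_bytes` is a hypothetical helper name, not an LDC-IL utility.

```python
def pcm_size_bytes(duration: str, sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Estimate the raw PCM payload size for an h:mm:ss duration."""
    hours, minutes, seconds = (int(part) for part in duration.split(":"))
    total_seconds = hours * 3600 + minutes * 60 + seconds
    bytes_per_sample = bit_depth // 8
    return total_seconds * sample_rate_hz * bytes_per_sample * channels


# Figures from the Gujarati (Mono Recordings) summary line; channels=1 is an assumption.
size = pcm_size_bytes("64:44:02", sample_rate_hz=16_000, bit_depth=16, channels=1)

print(f"{size / 1e9:.2f} GB (decimal)")    # 7.46 GB (decimal)
print(f"{size / 2**30:.2f} GiB (binary)")  # 6.95 GiB (binary)
```

The listed 7.1 GB lies between the decimal and binary readings of this estimate, so the published figures are mutually consistent to within rounding and unit convention.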

Hindi Raw Speech Corpus

requests (39)

121:00:06 Hours | 76.6 GB | 488 Speakers | 70,686 Audio Segments | 48 kHz | 16 bit wav

Hindi is a major Indo-Aryan language, a descendant of Sanskrit, spoken in central and northern India, in the states of Bihar, Chhattisgarh, Delhi, Haryana, Himachal Pradesh, Jharkhand, Madhya Pradesh, Rajasthan, Uttarakhand and Uttar Pradesh. The LDC-IL speech data is collected from the Awadhi, Bhojpuri, Magahi and Khariboli belts, from both genders and different age groups.

LDC-IL has 121:00:06 hours of Hindi speech data. The LDC-IL Hindi Speech data set consists of different types of datasets that are made up of word lists, sentences, running texts and date formats.

The available Speech Corpus details:

Total Speakers: 488 (234 Female and 254 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 457 | 37:22:29
Creative Text | 463 | 29:24:08
Sentence | 10173 | 8:41:17
Date Format | 764 | 0:46:56
Command and Control Words | 12284 | 8:34:51
Person Name | 8171 | 9:55:25
Place Name | 4085 | 3:14:44
Most Frequent Word - Part | 12315 | 8:09:10
Most Frequent Word - Full Set | 6994 | 4:30:14
Phonetically Balanced | 11986 | 8:23:43
Form and Function - Word | 2994 | 1:57:09

A detailed explanation of the Hindi Speech Corpus will be available in the Hindi Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Jitendra Kumar Singh, Richa, Anjali Sinha, Dheeraj Kumar Mishra, Arimardan Kumar Tripathi & Satyaendra Kumar Awasthi. 2019. Hindi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Kannada Raw Speech Corpus

requests (29)

179:32:52 Hours | 115 GB | 656 Speakers | 99,109 Audio Segments | 48 kHz | 16 bit wav

Kannada is one of the ancient Indian languages and belongs to the Dravidian family. It has its own script. The language spoken in a region is influenced by the other languages of the region, the mother tongue of the speaker and similar factors. Reading speed, loudness, frequency and so on also differ depending on factors such as age and gender. The Linguistic Data Consortium for Indian Languages (LDC-IL) identified four regional dialects and collected the speech corpus through fieldwork. This read speech data is collected from male and female native speakers, in equal numbers, across various age groups. The data includes texts, sentences, date formats and different word lists.

The available Speech Corpus details:

Total Speakers: 656 (328 Female and 328 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 600 | 66:06:09
Creative Text | 600 | 33:09:20
Sentence | 14,887 | 13:58:15
Date Format | 1,200 | 1:16:22
Command and Control Words | 17,988 | 12:31:43
Person Name | 12,009 | 13:04:49
Place Name | 6,032 | 4:48:42
Most Frequent Word - Part | 18,065 | 12:21:24
Most Frequent Word - Full Set | 8,000 | 02:08:58
Phonetically Balanced | 9,360 | 02:40:58
Form and Function - Word | 10,368 | 03:14:38

A detailed explanation of the Kannada Speech Corpus will be available in the Kannada Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Vijayalaxmi F. Patil, Chetan Suryakant Baji, Malini N. Abhyankar, Rajesha N. & Manasa G. 2019. Kannada Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Kashmiri Raw Speech Corpus

requests (17)

28:10:07 Hours | 18 GB speech data | 150 Speakers | 16,380 Audio segments | 48 kHz | 16 bit wav

The Kashmiri language belongs to the Dardic group of the Indo-Aryan family. It is known by the names ‘Kashur’ and ‘Kashmiri’. It is primarily spoken in the Kashmir valley and the Pir-Panchal range of the Jammu region. Kashmiri has two types of dialects: regional dialects and social dialects. Apart from the Kashmiri spoken in the valley itself, other varieties are spoken outside the valley, and these are considered regional dialects of Kashmiri. The regional dialects are Kishtawari, Poguli and Rambani. Kashmiri also has three social dialects, known by the names Yamraz, Marak and Kamraz.

The LDC-IL speech data is collected from Pulwama, Srinagar and Anantnag in the Kashmir Valley, from both genders and different age groups. The LDC-IL Kashmiri Speech data consists of different types of datasets that are made up of words, sentences, running texts and date formats. Each speaker recorded datasets that were randomly selected from a master dataset.

The available Speech Corpus details:

Total Speakers: 150 (78 Female and 72 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 147 | 3:56:57
Creative Text | 148 | 12:41:33
Sentence | 3704 | 2:40:24
Date Format | 281 | 0:10:36
Command and Control Words | 4288 | 3:04:32
Person Name | 2065 | 1:53:21
Place Name | 1468 | 1:04:37
Most Frequent Word - Part | 4279 | 2:38:07

A detailed explanation of the Kashmiri Speech Corpus will be available in the Kashmiri Speech Data Documentation.

For any research-based citations, please use the following citations:

Narayan Kumar Choudhary, Shahid Mushtaq Bhatt, Rajesha N., Manasa G. 2021. Kashmiri Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Konkani Raw Speech Corpus

requests (18)

156:37:51 Hours | 100 GB | 504 Speakers | 72,938 Audio Segments | 48 kHz | 16 bit wav

Konkani belongs to the Indo-European family of languages. Konkani is the official language of Goa. However, the language is spoken widely across four states: Maharashtra, Goa, Karnataka and Kerala. Konkani is the only Indian language written in five different scripts: Devanagari, Roman, Kannada, Malayalam and Persian-Arabic.

The LDC-IL speech data is collected from the regions of North Goa, South Goa, Karwar (Karnataka) and Sindhudurg (Maharashtra), from both genders and different age groups. Approximately 15 to 20 minutes of speech per speaker was taken from 267 female and 237 male native speakers of different age groups. Each speaker recorded datasets that were randomly selected from a master dataset.

The available Speech Corpus details:

Total Speakers: 504 (267 Female and 237 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 477 | 49:52:09
Creative Text | 480 | 22:09:05
Sentence | 12,050 | 15:51:11
Date Format | 953 | 01:50:39
Command and Control Words | 14,944 | 16:11:02
Person Name | 9,588 | 15:55:43
Place Name | 4,812 | 05:31:03
Most Frequent Word - Part | 16,376 | 16:03:13
Most Frequent Word - Full Set | 5,998 | 05:55:07
Phonetically Balanced | 2,975 | 02:49:36
Form and Function - Word | 4,285 | 04:29:03

A detailed explanation of the Konkani Speech Corpus will be available in the Konkani Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Saurabh Varik & Rashmi Shet Tanawade. 2019. Konkani Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Maithili Raw Speech Corpus

requests (20)

78:45:33 Hours | 49.2 GB | 306 Speakers | 45,198 Audio Segments | 48 kHz | 16 bit wav

Maithili is an Indo-Aryan language, a direct descendant of Sanskrit, which is spoken in the states of Bihar and Jharkhand and in part of Nepal. It is one of the scheduled languages of India. The LDC-IL speech data is collected from the Sotipura, Bajjika and Thethi geographic dialects, from both genders and different age groups.

The available Speech Corpus details:

Total Speakers: 306 (150 Female and 156 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 291 | 22:33:41
Creative Text | 294 | 15:34:55
Sentence | 7,451 | 07:08:48
Date Format | 585 | 00:31:41
Command and Control Words | 8,924 | 07:07:34
Person Name | 5,917 | 07:49:33
Place Name | 2,952 | 02:47:49
Most Frequent Word - Part | 8,699 | 06:56:24
Most Frequent Word - Full Set | 5,996 | 04:58:30
Phonetically Balanced Words | 3,040 | 02:26:27
Form and Function Words | 1,049 | 00:50:11

A detailed explanation of the Maithili Speech Corpus will be available in the Maithili Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Arun Kumar Singh, Dinesh Mishra & Atuleshwar Jha. 2019. Maithili Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Maithili Raw Speech Corpus Vol. II

requests (3)

109:09:50 Hours | 206 Audio Segments | 122 Speakers

The LDC-IL Maithili Raw Speech dataset Vol. II comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. The dataset spans a duration of 109:09:50 (hh:mm:ss) and consists of read speech with continuous text, and spontaneous speech along with its transcription in Devanagari. The data is derived from 49 female and 73 male native Maithili speakers, encompassing diverse age groups and regions.

A comprehensive explanation of the dataset can be found in the Maithili Raw Speech Documentation.

For any research-based citations, please use the following citations:

Shantanu Kumar, Ankita Tiwari, Rajesha N., Manasa G., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. Maithili Raw Speech Corpus Vol. II. Central Institute of Indian Languages, Mysore. 978-93-48633-37-8.

Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.

Malayalam Raw Speech Corpus

requests (21)

164:01:02 Hours | 105 GB | 458 Speakers | 43,670 Audio Segments | 48 kHz | 16 bit wav

Malayalam is the official language of Kerala and the Laccadive Islands. It belongs to the Dravidian language family. Because the language of the Travancore, Cochin and Malabar regions, which came together to form Kerala, has been shaped by different internal and external factors, LDC-IL treats Malayalam as having three distinct varieties and collected speech data from Thiruvananthapuram, Ernakulam and Kozhikode.

LDC-IL has 164 hours of Malayalam speech data. The LDC-IL Malayalam Speech data set consists of different types of datasets that are made up of word lists, sentences, running texts and date formats. Approximately 15 minutes of speech per speaker was taken from 231 female and 227 male native speakers of different age groups. Each speaker recorded datasets that were randomly selected from a master dataset.

The available Speech Corpus details:

Total Speakers: 458 (231 Female and 227 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 449 | 71:29:21
Creative Text | 449 | 54:41:20
Sentence | 7,452 | 06:56:46
Date Format | 598 | 00:53:45
Command and Control Words | 8,923 | 07:09:37
Person Name | 5,819 | 05:26:33
Place Name | 2,906 | 02:28:24
Most Frequent Word - Part | 8,763 | 06:51:31
Most Frequent Word - Full Set | 1,979 | 02:08:58
Phonetically Balanced | 3,096 | 02:40:09
Form and Function - Word | 3,236 | 03:14:38

A detailed explanation of the Malayalam Speech Corpus will be available in the Malayalam Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Saritha S.L., Rejitha K.S., Sajila S. & Midhun P. G. 2019. Malayalam Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Manipuri Raw Speech Corpus

requests (13)

156:28:32 Hours | 100 GB | 620 Speakers | 66,231 Audio Segments | 48 kHz | 16 bit wav

Manipuri is the administrative language of Manipur. The LDC-IL Speech Data for Manipuri aims to capture the distinctive speech characteristics shared by the different regional dialects of Manipur. To do so, linguistic features that identify regional tones and intonations, phonemic distributions, and the various pronunciations reflected in both regional and non-regional vocabulary items, such as person names and place names, have been incorporated according to a standard parameter of the dataset.

For this read speech corpus, the specific subset read by each speaker is randomly generated from the entire dataset, so that each random set is read by one speaker. A limited number of full sets are read in their entirety by selected speakers in each age group. The data is collected through fieldwork from three regional dialects, namely Imphal, Kakching and Awang Sekmai. The age groups selected for fieldwork are ‘16 to 20’, ‘21 to 50’ and ‘above 50 years’. An equal number of male and female speakers is recorded in each age group.

The available Speech Corpus details:

Total Speakers: 620 (310 Female and 310 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 530 | 59:47:22
Creative Text | 588 | 53:59:03
Sentence | 10,979 | 10:01:41
Date Format | 866 | 01:12:04
Command and Control Words | 13,129 | 08:00:02
Person Name | 8,789 | 07:14:04
Place Name | 4,394 | 02:46:29
Most Frequent Word - Part | 13,167 | 06:48:50
Most Frequent Word - Full Set | 6,992 | 02:48:42
Phonetically Balanced | 4,518 | 02:25:53
Form and Function - Word | 2,279 | 01:23:50

A detailed explanation of the Manipuri Speech Corpus will be available in the Manipuri Raw Speech Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Amom Nandaraj Meetei, Yumnam Premila Chanu & Longjam Anand Singh. 2019. Manipuri Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.