Central Institute of Indian Languages
A Gold Standard Telugu Raw Text Corpus
30,10,993 Words | 859 Titles | XML format | 6 Domains
Telugu is a highly agglutinative and morphologically rich language. Patterns of actual language use in natural texts provide evidence of the traits of a language. The Government of India set up the Linguistic Data Consortium for Indian Languages (LDC-IL) to support those working in the field of language development. The LDC-IL Telugu Text Corpus was developed with attention to factors such as text quality, representativeness, retrievable format, corpus size, and authenticity. For collecting text, LDC-IL adopts a standard category list of domains and a predefined set of criteria. Telugu texts can be broadly classified as literary and non-literary. Literary texts are abundant in Telugu while scientific texts are scarce, so LDC-IL attempts to develop a balanced Telugu text corpus. Data has been collected from books, magazines, and newspapers, verified as true to the original texts, and then stored.
The Telugu Text Corpus is encoded in machine-readable form and stored in a standard format: the text is encoded in Unicode and stored as XML, with metadata embedded in the data. The corpus has been created from contemporary text, both typed and crawled.
The available Text Corpus details:
Domains | Words | Percentage of Total Corpus
Aesthetics | 1,687,968 | 56.06 %
Commerce | 45,130 | 1.50 %
Mass Media | 14,656 | 0.49 %
Official Documents | 6,708 | 0.22 %
Science and Technology | 415,102 | 13.79 %
Social Sciences | 841,429 | 27.95 %
A detailed explanation of the Telugu Raw Text Corpus will be available in the Telugu Text Corpus Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Thirupal C Reddy & Gangaraju H. 2019. A Gold Standard Telugu Raw Text Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.
A Gold Standard Urdu Raw Text Corpus
5161927 Words | 739 Titles | XML format | 5 Domains
Urdu is one of the prominent languages of the Indian subcontinent. It belongs to the Indo-Aryan family, is influenced by Arabic and Persian, and is written in the Perso-Arabic script. Regionally, Urdu is used mostly in the northern, north-western, and eastern parts of India and in Pakistan, and it is also understood and occasionally spoken in the rest of India. Urdu arose in the 10th century A.D. due to occupation relations, ethnic exchanges, relocations, and military expeditions. Urdu in India developed in close contact with Persian, which was the language of administration and education during the period of Muslim rule. The LDC-IL Urdu Text Corpus was developed with attention to factors such as text quality, representativeness, retrievable format, corpus size, and authenticity. For collecting text, LDC-IL adopts a standard category list of domains and a predefined set of criteria. Urdu texts can be broadly classified as literary and non-literary. Literary texts are abundant in Urdu while scientific texts are scarce, so LDC-IL attempts to develop a balanced Urdu text corpus. Data has been collected from books, magazines, and newspapers and verified as true to the original text.
The available Text Corpus details:
Domains | Words | Percentage of Total Corpus
Aesthetics | 2616382 | 50.69 %
Commerce | 28601 | 0.55 %
Mass Media | 843477 | 16.34 %
Science and Technology | 348082 | 6.74 %
Social Sciences | 1325385 | 25.68 %
A detailed explanation of the Urdu Text Corpus will be available in the Urdu Text Corpus documentation.
For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Mansoor Khan, Shahnawaz Alam, Bi Bi Mariyam & Rushda Idris Khan. 2019. A Gold Standard Urdu Raw Text Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.
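
The "Percentage of Total Corpus" figures in the text corpus entries can be recomputed from the per-domain word counts. The following is a minimal sketch using the Urdu counts listed above; the names `urdu_domain_words` and `domain_shares` are illustrative and not part of any LDC-IL tooling.

```python
# Per-domain word counts, copied from the Urdu text corpus entry.
urdu_domain_words = {
    "Aesthetics": 2616382,
    "Commerce": 28601,
    "Mass Media": 843477,
    "Science and Technology": 348082,
    "Social Sciences": 1325385,
}


def domain_shares(counts):
    """Return the corpus total and each domain's share in percent (2 decimals)."""
    total = sum(counts.values())
    shares = {name: round(100 * n / total, 2) for name, n in counts.items()}
    return total, shares


total_words, shares = domain_shares(urdu_domain_words)
print(total_words)
for name, pct in shares.items():
    print(f"{name}: {pct:.2f} %")
```

The computed total equals the headline word count of the entry, and the rounded shares agree with the listed percentages.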
Assamese Raw Speech Corpus
54:21:12 Hours | 32.5 GB | 304 Speakers | 37,570 Audio Segments | 48 kHz | 16 bit wav
Assamese is the official language of Assam. It is widely spoken in Assam and in some parts of Arunachal Pradesh and Nagaland. According to the 2011 census, Assamese is spoken by 15 million speakers. As a widely spoken language, Assamese shows several dialectal variations. The regional dialects can be broadly divided into two groups, the Eastern Group and the Western Group. LDC-IL divided the Assamese-speaking areas into four regions (Xiboxagoria, Central Assam, Kamrupi, and Goalparia) and has collected speech data from speakers in each region. The LDC-IL Assamese Speech data set consists of different types of datasets made up of word lists, sentences, running texts, and date formats.
The available Speech Corpus details:
Total Speakers 304 (154 Female and 150 Male)
Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 304 | 17:23:25
Creative Text | 304 | 11:44:37
Sentence | 7,593 | 5:55:29
Date Format | 599 | 0:33:59
Command and Control Words | 9,118 | 4:56:49
Person Name | 6,081 | 5:38:07
Place Name | 3,044 | 1:58:33
Phonetically Balanced-W4 | 6,567 | 3:41:45
Form and Function-Word-W5 | 3,960 | 2:28:28
A detailed explanation of the Assamese Speech Corpus will be available in the Assamese Speech Data Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy L., Narayan Kumar Choudhary, Atreyee Sharma, Jahnobi Kalita, Samhita Bharadwaj, Plabita Bora, Priyanshee Adhyapak, Mustafiza Tamim, Rajesha N., Manasa G. 2021. Assamese Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
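
The headline duration and segment count of a speech corpus entry can be cross-checked against its per-domain rows. The sketch below does this for the Assamese figures, with the rows as read from the entry above; the helper names `to_seconds` and `to_hms` are illustrative assumptions, not part of any LDC-IL tool.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Format a number of seconds as 'h:mm:ss'."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# (domain, audio segments, duration) rows from the Assamese Raw Speech entry.
assamese_rows = [
    ("Contemporary Text (News)", 304, "17:23:25"),
    ("Creative Text", 304, "11:44:37"),
    ("Sentence", 7593, "5:55:29"),
    ("Date Format", 599, "0:33:59"),
    ("Command and Control Words", 9118, "4:56:49"),
    ("Person Name", 6081, "5:38:07"),
    ("Place Name", 3044, "1:58:33"),
    ("Phonetically Balanced-W4", 6567, "3:41:45"),
    ("Form and Function-Word-W5", 3960, "2:28:28"),
]

total_segments = sum(segments for _, segments, _ in assamese_rows)
total_duration = to_hms(sum(to_seconds(d) for _, _, d in assamese_rows))
print(total_segments, total_duration)
```

Both totals agree with the headline line of the entry (37,570 audio segments and 54:21:12 hours). The same check can be applied to the other speech corpus tables.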
Assamese Sentence Aligned Speech Corpus
Dataset Description: 30:18:16 hours | 19.5 GB | 21,716 Audio Segments | 304 speakers
The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Assamese Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Assamese script. This dataset spans a duration of 30:18:16 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 154 female and 150 male native Assamese speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Assamese Sentence Aligned Speech Documentation.
For any research-based citations, please use the following citations:
1. Syeda Mustafiza Tamim, Priyanshe Adhyapak, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Assamese Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-53-5.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3
Bengali Raw Speech Corpus
128:46:59 Hours | 81.2 GB | 476 Speakers | 73,470 Audio Segments | 48 kHz | 16 bit wav
Bengali is the official language of West Bengal and Tripura. It belongs to the Indo-Aryan language family and is influenced by Sanskrit. Greater use of Bengali has contributed to the growth of the language in terms of vocabulary and the number of styles and registers. Bengali is spoken over the whole of West Bengal, Tripura, and Bangladesh and in some parts of Bihar, Odisha, and Assam. Bengali refugees who settled in Andaman after 1950 have also carried the language there. LDC-IL Bengali Speech data is collected from the regions of Standard Colloquial (Central Bengal) and Barendri (North Bengal). The LDC-IL Bengali Speech data set consists of different types of datasets made up of word lists, sentences, running texts, and date formats.
The available Speech Corpus details:
Total Speakers 476 (236 Female and 240 Male)
Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 450 | 35:05:07
Creative Text | 448 | 20:16:13
Sentence | 11,239 | 16:05:22
Date Format | 414 | 0:26:48
Command and Control Words | 13,477 | 14:00:24
Person Name | 9,012 | 4:56:22
Place Name | 4,498 | 1:45:35
Most Frequent Word - Part | 13,525 | 13:33:14
Most Frequent Word - Full Set | 5,978 | 6:47:05
Phonetically Balanced | 9,489 | 10:23:08
Form and Function - Word | 4,940 | 5:27:41
A detailed explanation of the Bengali Speech Corpus will be available in the Bengali Speech Data Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Sonali Sutradhar, Priyanka Biswas, Arundhati Sengupta, Sankarshan Dutta & Priyanka Das. 2019. Bengali Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
Bengali Sentence Aligned Speech Corpus
Dataset Description: 69:10:03 hours | 43.3 GB | 40,240 Audio Segments | 450 speakers
The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Bengali Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Bengali script. This dataset spans a duration of 69:10:03 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 223 female and 227 male native Bengali speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Bengali Sentence Aligned Speech Documentation.
For any research-based citations, please use the following citations:
1. Sonali Sutradhar, Poulami Das, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Bengali Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-48-1.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3
Bodo Raw Speech Corpus
176:53:28 Hours | 113 GB | 456 Speakers | 77443 Audio Segments | 48 kHz | 16 bit wav
Bodo, one of the scheduled languages of India, is one of the tonal languages of the world. It has two clearly distinguishable tones, known as Low and High. The language belongs to the Tibeto-Burman linguistic family. It is the language of the Bodos, one of the major tribes of the Indian state of Assam. The LDC-IL Bodo speech data is collected from the Chirang, Baksa, Sonitpur, Udalguri, Kamrup, Barpeta, and Kokrajhar districts of Assam, covering the Bwrdwnari, Eastern, and Standard dialects. The data is collected from both genders and different age groups.
The available Speech Corpus details:
Total Speakers 456 (220 Female and 236 Male)
Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 411 | 53:47:56
Creative Text | 413 | 26:47:07
Sentence | 10,257 | 09:16:54
Date Format | 938 | 01:58:08
Command and Control Words | 12,348 | 14:19:32
Person Name | 8,222 | 14:49:44
Place Name | 4,115 | 05:17:14
Most Frequent Word - Part | 12,397 | 14:34:05
Most Frequent Word - Full Set | 6,994 | 04:30:14
Phonetically Balanced | 15,999 | 20:07:33
Form and Function - Word | 6,383 | 08:28:25
A detailed explanation of the Bodo Speech Corpus will be available in the Bodo Speech Data Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Bridul Basumatary & Farson Daimary. 2019. Bodo Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
Chhattisgarhi Raw Speech Corpus
Dataset Description: 138:09:27 Hours | 88.9 GB | 140 Speakers | 359 Audio Segments | 48 kHz | 16 bit wav
LDC-IL has taken a positive step towards the mother tongues spoken in India, reflecting a greater effort to support and promote linguistic variety in the nation. The collection of Chhattisgarhi speech data is a major part of this effort. Developing language technology for Indian mother tongues will contribute to their overall enrichment and empowerment.
The Chhattisgarhi raw speech corpus is made up of recordings of native Chhattisgarhi speakers from various parts of the state of Chhattisgarh, and it represents a wide range of Chhattisgarhi varieties as spoken in different locations by diverse speakers. Speakers from various age groups read prompt text extracts from literary and news texts. In addition, spontaneous speech has been collected.
A detailed explanation of the Chhattisgarhi Raw Speech Corpus will be available in the Chhattisgarhi Raw Speech Data Documentation.
For any research-based citations, please use the following citations:
1. Satyaendra Kumar Awasthi, Ankita Tiwari, Narayan Kumar Choudhary. 2023. Chhattisgarhi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
2. Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3
Dogri Raw Speech Corpus
17:10:26 Hours | 11 GB | 61 Speakers | 12,036 Audio Segments | 48 kHz | 16 bit wav
Dogri, the language of the Dogras, belongs to the Indo-Aryan group and is the first major language of the multilingual Jammu region of the Jammu & Kashmir state. It derives its name from ‘Duggar’, the ancient title of this region. Dogri is a morphologically rich language with a predominant Subject-Object-Verb (SOV) word order and, like many Indian languages, the flexibility to rearrange constituents. Dogri had its own script, “Dogare Akkhar” or “Dogare”, based on the Takri script, which is closely related to the Sharada script employed by the Kashmiri language. This script was the official script during the regime of Maharaja Ranbir Singh (1857-1885 AD). After independence, the state government constituted a committee on 29th October, 1953, headed by Sh. Girdhari Lal Dogra. The committee presented a report, and the state government accordingly decided to adopt the Devanagari as well as the Persian script for Dogri; this was incorporated in the State Constitution in 1957. The LDC-IL speech data is collected from Jammu, from both genders and different age groups. The LDC-IL Dogri Speech data set consists of different types of datasets made up of words, sentences, running texts, and date formats. Each speaker recorded datasets randomly selected from a master dataset.
The available Speech Corpus details:
Total Speakers 61 (30 Female and 31 Male)
Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 60 | 4:27:51
Creative Text | 61 | 2:51:42
Sentence | 1527 | 1:24:48
Date Format | 122 | 0:14:07
Command and Control Words | 1830 | 1:24:31
Person Name | 1222 | 1:23:41
Place Name | 609 | 0:29:10
Most Frequent Word - Part | 1831 | 1:18:06
Most Frequent Word - Full Set | 2000 | 1:16:27
Phonetically Balanced | 2050 | 1:50:38
Form and Function - Word | 724 | 0:29:25
A detailed explanation of the Dogri Speech Corpus will be available in the Dogri Raw Speech Documentation.
For any research-based citations, please use the following citations:
Narayan Kumar Choudhary, Sunil Kumar Choudhary, Rajesha N., Manasa G. 2021. Dogri Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
Gujarati Raw Speech Corpus
57:17:08 Hours | 37 GB | 204 Speakers | 25,712 Audio Segments | 48 kHz | 16 bit wav
Gujarati is one of the major literary languages of India and is the official language of Gujarat state and of the union territories of Daman and Diu and Dadra and Nagar Haveli. For convenience, LDC-IL treats Gujarati as having four dialects: South Gujarat, Central Gujarat, North Gujarat, and Saurashtra. LDC-IL has 57:17:08 hours of Gujarati raw speech data. The LDC-IL Gujarati Raw Speech data set consists of different types of datasets made up of word lists, sentences, texts, and date formats. Approximately 15 minutes of speech per speaker was recorded from 96 female and 108 male mother-tongue speakers of Gujarati from different age groups. Each speaker recorded datasets randomly selected from a master dataset.
The available Speech Corpus details:
Total Speakers 204 (96 Female and 108 Male)
Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 204 | 15:21:28
Creative Text | 202 | 11:34:29
Sentence | 5081 | 5:48:32
Date | 404 | 0:41:39
Command and Control Words | 6006 | 7:17:22
Person Name | 4079 | 6:36:02
Place Name | 2041 | 2:33:20
Most Frequent Word - Part | 4236 | 5:18:47
Most Frequent Word - Full Set | 2000 | 1:13:39
Phonetically Balanced | 1378 | 0:51:50
A detailed explanation of the Gujarati Raw Speech Corpus will be available in the Gujarati Raw Speech Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy L., Narayan Kumar Choudhary, Mona Parakh, Hiren Gadhavi R, Solanki Mahesh kumar R, Rejitha K. S., Rajesha N., Manasa G. 2021. Gujarati Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
Gujarati Raw Speech Corpus(Mono Recordings)
64:44:02 Hours | 7.1 GB | 233 Speakers | 26,223 Audio Segments | 16 kHz | 16 bit wav
Gujarati is one of the major literary languages of India and is the official language of Gujarat state and of the union territories of Daman and Diu and Dadra and Nagar Haveli. For convenience, LDC-IL treats Gujarati as having four dialects: South Gujarat, Central Gujarat, North Gujarat, and Saurashtra. LDC-IL has 64:44:02 hours of Gujarati raw speech data as mono recordings. The LDC-IL Gujarati Raw Speech data set consists of different types of datasets made up of word lists, sentences, texts, and date formats. Approximately 15 minutes of speech per speaker was recorded from 124 female and 109 male mother-tongue speakers of Gujarati from different age groups. Each speaker recorded datasets randomly selected from a master dataset.
The available Speech Corpus details:
Total Speakers 233 (124 Female and 109 Male)
Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 233 | 12:52:46
Creative Text | 232 | 13:30:15
Sentence | 5824 | 7:12:17
Date Format | 466 | 0:59:31
Command and Control Words | 6985 | 9:43:07
Person Name | 4644 | 8:34:44
Place Name | 2322 | 3:17:06
Phonetically Balanced | 4131 | 6:28:15
Form and Function - Word | 1386 | 2:06:01
A detailed explanation of the Gujarati Raw Speech Corpus (Mono Recordings) will be available in the Gujarati Raw Speech (Mono Recordings) Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy L., Narayan Kumar Choudhary, Mona Parakh, Rejitha KS, Rajesha N., Manasa G. 2021. Gujarati Raw Speech Corpus (Mono Recordings). Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
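
The audio format stated for this entry (16 kHz, 16 bit wav, mono) can be confirmed from the header of a WAV file. The following is a minimal sketch using Python's standard `wave` module. The function names and the demo file `demo_segment.wav` are hypothetical, and the demo file is synthetic silence written only so the example runs on its own; it is not corpus data.

```python
import os
import tempfile
import wave


def wav_format(path):
    """Read a WAV header and return channel count, sample rate, bit depth, duration."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate_hz": rate,
            "bits_per_sample": wf.getsampwidth() * 8,  # sample width is in bytes
            "duration_s": wf.getnframes() / rate,
        }


def is_mono_16k_16bit(path):
    """True if the file is single-channel, 16 kHz, 16-bit PCM."""
    info = wav_format(path)
    return (info["channels"], info["sample_rate_hz"], info["bits_per_sample"]) == (1, 16000, 16)


# Synthetic stand-in for a corpus segment: one second of 16-bit mono silence.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_segment.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000)

demo_info = wav_format(demo_path)
print(demo_info)
print(is_mono_16k_16bit(demo_path))
```

For the 48 kHz corpora listed elsewhere on this page, the same `wav_format` helper could be compared against a sample rate of 48000 instead.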
Hindi Raw Speech Corpus
121:00:06 Hours | 76.6 GB | 488 Speakers | 70686 Audio Segments | 48 kHz | 16 bit wav
Hindi is a major Indo-Aryan language, a descendant of Sanskrit, spoken in central and northern India in the states of Bihar, Chhattisgarh, Delhi, Haryana, Himachal Pradesh, Jharkhand, Madhya Pradesh, Rajasthan, Uttarakhand, and Uttar Pradesh. The LDC-IL speech data is collected from the Awadhi, Bhojpuri, Magahi, and Khariboli belts, from both genders and different age groups. The LDC-IL Hindi speech data runs to 121:00:06 hours. The LDC-IL Hindi Speech data set consists of different types of datasets made up of word lists, sentences, running texts, and date formats.
The available Speech Corpus details:
Total Speakers 488 (234 Female and 254 Male)
Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 457 | 37:22:29
Creative Text | 463 | 29:24:08
Sentence | 10173 | 8:41:17
Date Format | 764 | 0:46:56
Command and Control Words | 12284 | 8:34:51
Person Name | 8171 | 9:55:25
Place Name | 4085 | 3:14:44
Most Frequent Word - Part | 12315 | 8:09:10
Most Frequent Word - Full Set | 6994 | 4:30:14
Phonetically Balanced | 11986 | 8:23:43
Form and Function - Word | 2994 | 1:57:09
A detailed explanation of the Hindi Speech Corpus will be available in the Hindi Speech Data Documentation.
For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Jitendra Kumar Singh, Richa, Anjali Sinha, Dheeraj Kumar Mishra, Arimardan Kumar Tripathi & Satyaendra Kumar Awasthi. 2019. Hindi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
Hindi Sentence Aligned Speech Corpus
Dataset Description: 72:34:52 hours | 45.9 GB | 42,275 Audio Segments | 473 speakers
The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Hindi Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. This dataset spans a duration of 72:34:52 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 225 female and 248 male native Hindi speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Hindi Sentence Aligned Speech Documentation.
For any research-based citations, please use the following citations:
1. Satyaendra Kumar Awasthi, Ankita Tiwari, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Hindi Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-28-3.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3
Indian English-Bengali Sentence Aligned Speech Corpus
Dataset Description: 09:21:08 hours | 5.53 GB | 5,676 Audio Segments | 52 speakers
The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Indian English-Bengali variant Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer in Roman script. This dataset spans a duration of 09:21:08 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 26 female and 26 male native Bengali speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Indian English-Bengali variant Sentence Aligned Speech Documentation.
For any research-based citations, please use the following citations:
1. Rejitha K. S., Poulami Das, Rajesha N., Manasa G., Srikanth D., Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Indian English-Bengali variant Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-43-6.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3
Indian English-Kannada Sentence Aligned Speech Corpus
Dataset Description: 11:17:40 hours | 7.27 GB | 6,166 Audio Segments | 53 speakers
The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Indian English-Kannada variant Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer in Roman script. This dataset spans a duration of 11:17:40 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 26 female and 27 male native Kannada speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Indian English-Kannada variant Sentence Aligned Speech Documentation.
For any research-based citations, please use the following citations:
1. Rejitha K. S., Vijayalaxmi F. Patil, Rajesha N., Manasa G., Srikanth D., Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Indian English-Kannada variant Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-35-1.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3