Search - Tag - Indian


Indian English-Bengali Sentence Aligned Speech Corpus

requests (4)

Dataset Description: 09:21:08 hours | 5.53 GB | 5,676 Audio Segments | 52 speakers

The annotated speech corpus gives a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Indian English-Bengali variant Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer in Roman script. The dataset spans a duration of 09:21:08 (hh:mm:ss) and consists of read speech covering continuous text, representative sentences, and date formats. The data is derived from 26 female and 26 male native Bengali speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Indian English-Bengali variant Sentence Aligned Speech Documentation.

For research-based citations, please use the following:

1. Rejitha K. S., Poulami Das, Rajesha N., Manasa G., Srikanth D., Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Indian English-Bengali variant Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-43-6.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3


Indian English-Kannada Sentence Aligned Speech Corpus

requests (5)

Dataset Description: 11:17:40 hours | 7.27 GB | 6,166 Audio Segments | 53 speakers

The annotated speech corpus gives a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Indian English-Kannada variant Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer in Roman script. The dataset spans a duration of 11:17:40 (hh:mm:ss) and consists of read speech covering continuous text, representative sentences, and date formats. The data is derived from 26 female and 27 male native Kannada speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Indian English-Kannada variant Sentence Aligned Speech Documentation.

For research-based citations, please use the following:

1. Rejitha K. S., Vijayalaxmi F. Patil, Rajesha N., Manasa G., Srikanth D., Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Indian English-Kannada variant Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-35-1.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3


Indian English Raw Speech Corpus - Bengali Variant

requests (8)

25:47:11 Hours | 15.5 GB | 53 Speakers | 16,044 Audio Segments | 48 kHz | 16 bit wav

The English language derives from Anglo-Saxon, which was the prominent language of Britain in the Middle Ages. Colonists carried it to every corner of the world. English remains the most visible legacy of the British in India: India was under the British Raj for almost two centuries, and English is part of the education system. Most states in India use their regional languages and do not share a common language, so English is used for inter-state communication.

LDC-IL has 25 hours of Indian English - Bengali Variant speech data. The LDC-IL Indian English Speech dataset consists of several types of data: word lists, sentences, texts, and date formats. Approximately 15 minutes of speech per speaker was taken from 27 female and 26 male Bengali mother-tongue speakers of different age groups. Each speaker recorded items randomly selected from a master dataset.

The available Speech Corpus details:

Total Speakers: 53 (27 Female and 26 Male)

Domain | Audio Segments | Duration (h:mm:ss)
Contemporary Text (News) | 52 | 6:03:15
Creative Text | 52 | 2:41:17
Sentence | 1300 | 1:29:35
Date Format | 104 | 0:08:56
Command and Control Words | 2882 | 3:09:13
Person Name | 1040 | 0:33:56
Place Name | 519 | 1:30:22
Most Frequent Word - Part | 1442 | 1:22:38
Most Frequent Word - Full Set | 5985 | 6:01:44
Phonetically Balanced | 1782 | 1:52:21
Form and Function - Word | 886 | 0:53:54

A detailed explanation of the Indian English Raw Speech Corpus - Bengali Variant will be available in the Indian English Raw Speech Corpus - Bengali Variant Documentation.

For research-based citations, please use the following:

1. Ramamoorthy L., Narayan Kumar Choudhary, Arundhati Sengupta, Rejitha K. S., Rajesha N., Manasa G. 2021. Indian English Raw Speech Corpus - Bengali Variant. Central Institute of Indian Languages, Mysore.
2. Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
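
The per-domain figures for the Bengali Variant raw speech corpus can be cross-checked against its headline totals (16,044 audio segments, 25:47:11 hours). The following is a minimal Python sketch of that arithmetic; the values are transcribed from the listing, and the names DOMAINS, to_seconds and to_hms are illustrative only and not part of any LDC-IL tooling.

```python
# Per-domain figures transcribed from the listing:
# (domain, audio segments, duration as h:mm:ss).
DOMAINS = [
    ("Contemporary Text (News)", 52, "6:03:15"),
    ("Creative Text", 52, "2:41:17"),
    ("Sentence", 1300, "1:29:35"),
    ("Date Format", 104, "0:08:56"),
    ("Command and Control Words", 2882, "3:09:13"),
    ("Person Name", 1040, "0:33:56"),
    ("Place Name", 519, "1:30:22"),
    ("Most Frequent Word - Part", 1442, "1:22:38"),
    ("Most Frequent Word - Full Set", 5985, "6:01:44"),
    ("Phonetically Balanced", 1782, "1:52:21"),
    ("Form and Function - Word", 886, "0:53:54"),
]


def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to an h:mm:ss string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(count for _, count, _ in DOMAINS)
total_duration = to_hms(sum(to_seconds(d) for _, _, d in DOMAINS))

print(total_segments)  # 16044
print(total_duration)  # 25:47:11
```

Both sums agree with the totals stated at the top of the entry, which suggests the per-domain rows are complete.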