Bengali is the official language of West Bengal and Tripura. It belongs to the Indo-Aryan language family.
The LDC-IL Bengali Speech Data set consists of several types of word lists along with sentence lists, running text and date formats. Approximately 15 minutes of speech per speaker was recorded from 223 female and 227 male native speakers across different age groups. Each speaker recorded a set of items randomly selected from a master dataset. In addition to these random sets, the database contains some full sets, in which a speaker has uttered a complete set of words.
- Total number of speakers: 450 (random set) and 28 (full set)
- Total audio segments: 73470
- Total duration: 128:46:59 (hh:mm:ss)
- Total volume: 81.2 gigabytes of WAV files and metadata text files
- Age groups: 16 to 20, 21 to 50, 51 and above
- Recording mode: WAV, 16-bit
- Sampling frequency: 48.0 kHz
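The format figures listed above can be checked against an individual file by reading its WAV header. The following is a minimal sketch using Python's standard `wave` module, assuming the corpus files are plain PCM WAV; the file name `demo_segment.wav` and the helper names are hypothetical, and the demo writes a synthetic tone so that it runs without any corpus files.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path

EXPECTED_RATE_HZ = 48_000   # sampling frequency stated in the listing
EXPECTED_SAMPLE_WIDTH = 2   # 16 bits = 2 bytes per sample


def check_wav_format(path):
    """Return (ok, details) for one WAV file, based on its header."""
    with wave.open(str(path), "rb") as wav:
        details = {
            "sample_rate_hz": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }
    ok = (
        details["sample_rate_hz"] == EXPECTED_RATE_HZ
        and details["sample_width_bytes"] == EXPECTED_SAMPLE_WIDTH
    )
    return ok, details


def write_demo_wav(path, seconds=0.5):
    """Write a synthetic mono 440 Hz tone (not corpus data) for the demo."""
    n_frames = int(EXPECTED_RATE_HZ * seconds)
    samples = (
        int(12000 * math.sin(2 * math.pi * 440 * i / EXPECTED_RATE_HZ))
        for i in range(n_frames)
    )
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(EXPECTED_SAMPLE_WIDTH)
        wav.setframerate(EXPECTED_RATE_HZ)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "demo_segment.wav"  # hypothetical file name
        write_demo_wav(demo)
        ok, details = check_wav_format(demo)
        print(ok, details)
```

To check real data, `check_wav_format` would be pointed at a file from the corpus instead of the synthetic tone. The number of channels is reported but not checked, since the listing does not state it.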
Bengali is influenced by Sanskrit. Greater use of Bengali has contributed to the growth of the language in terms of vocabulary and the number of styles and registers. Bengali is spoken throughout West Bengal, Tripura and Bangladesh, and in some parts of Bihar, Odisha and Assam. Bengali refugees who settled in the Andaman Islands after 1950 have also carried the language there.
The LDC-IL Bengali speech data was collected from the Standard Colloquial (Central Bengal) and Barendri (North Bengal) regions.
The LDC-IL Bengali Speech data set comprises several types of datasets made up of word lists, sentences, running texts and date formats.
A much more detailed explanation of the Bengali Speech Corpus will be available in the Bengali Speech Data Documentation.
For research-based work, please use the following citations:
- Ramamoorthy, L., Narayan Choudhary, Sonali Sutradhar, Priyanka Biswas, Arundhati Sengupta, Sankarshan Dutta & Priyanka Das. 2019. Bengali Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
- Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: ” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
- Authors: Ramamoorthy L., Narayan Choudhary, Sonali Sutradhar, Priyanka Biswas, Arundhati Sengupta, Sankarshan Dutta, Priyanka Das
- Corpus Type: Raw Corpus
- Catalogue Number: 1107
- ISBN: 978-81-7343-206-4
- Data Source: On Field
- Duration: 128:46:59
- # of Audio Segments: 73399
- Release Date: 04/04/2019
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.
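
The duration and segment count in the catalogue entry above give a rough idea of how long a typical audio segment is. The sketch below converts the `hh:mm:ss` duration to seconds and divides by the segment count. The helper name `hms_to_seconds` is illustrative, and the count used is the 73399 from the catalogue entry (the summary list gives 73470), so the mean should be read as approximate.

```python
def hms_to_seconds(hms):
    """Convert an 'hh:mm:ss' string (hours may exceed 24) to total seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


TOTAL_DURATION = "128:46:59"  # Duration field from the catalogue entry
AUDIO_SEGMENTS = 73399        # "# of Audio Segments" field

total_seconds = hms_to_seconds(TOTAL_DURATION)
mean_segment_s = total_seconds / AUDIO_SEGMENTS  # approximate mean length

if __name__ == "__main__":
    print(f"total: {total_seconds} s ({total_seconds / 3600:.2f} h)")
    print(f"mean segment length: {mean_segment_s:.2f} s")
```

With these figures the mean comes out at a little over six seconds per segment, which fits short prompts such as word-list items and single sentences.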