Bengali Raw Speech Corpus
Overview
128:46:59 Hours | 81.2 GB | 476 Speakers | 73,470 Audio Segments | 48 kHz | 16 bit wav.
Dataset Description
Bengali is the official language of West Bengal and Tripura. It belongs to the Indo-Aryan language family and is influenced by Sanskrit. Wider use of Bengali has contributed to the growth of its vocabulary and of its range of styles and registers. Bengali is spoken throughout West Bengal, Tripura and Bangladesh, and in parts of Bihar, Odisha and Assam. Bengali refugees who settled in the Andaman Islands after 1950 also carried the language there.
The LDC-IL Bengali speech data was collected from the Standard Colloquial (Central Bengal) and Barendri (North Bengal) regions. The data set comprises several types of material: word lists, sentences, running texts and date formats.
The available Speech Corpus details:
Total Speakers 476 (236 Female and 240 Male)
Domains | Audio Segments | Duration (hh:mm:ss)
Contemporary Text (News) | 450 | 35:05:07
Creative Text | 448 | 20:16:13
Sentence | 11,239 | 16:05:22
Date Format | 414 | 0:26:48
Command and Control Words | 13,477 | 14:00:24
Person Name | 9,012 | 4:56:22
Place Name | 4,498 | 1:45:35
Most Frequent Word - Part | 13,525 | 13:33:14
Most Frequent Word - Full Set | 5,978 | 6:47:05
Phonetically Balanced | 9,489 | 10:23:08
Form and Function - Word | 4,940 | 5:27:41
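The per-domain figures listed above can be cross-checked against the stated corpus totals (73,470 audio segments and 128:46:59 hours). The following is a minimal sketch of that arithmetic; the names DOMAINS, to_seconds and to_hms are illustrative helpers introduced here and are not part of any LDC-IL tooling.

```python
# Per-domain figures copied from the listing above: (domain, segments, duration).
DOMAINS = [
    ("Contemporary Text (News)", 450, "35:05:07"),
    ("Creative Text", 448, "20:16:13"),
    ("Sentence", 11239, "16:05:22"),
    ("Date Format", 414, "0:26:48"),
    ("Command and Control Words", 13477, "14:00:24"),
    ("Person Name", 9012, "4:56:22"),
    ("Place Name", 4498, "1:45:35"),
    ("Most Frequent Word - Part", 13525, "13:33:14"),
    ("Most Frequent Word - Full Set", 5978, "6:47:05"),
    ("Phonetically Balanced", 9489, "10:23:08"),
    ("Form and Function - Word", 4940, "5:27:41"),
]


def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total: int) -> str:
    """Convert a number of seconds back to an h:mm:ss string."""
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(segments for _, segments, _ in DOMAINS)
total_seconds = sum(to_seconds(duration) for _, _, duration in DOMAINS)

print(total_segments)         # 73470
print(to_hms(total_seconds))  # 128:46:59
```

Both sums agree with the totals quoted in the summary line and under Item specifics.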
A detailed explanation of the Bengali Speech Corpus will be available in the Bengali Speech Data Documentation.
For research citations, please cite the following:
- Ramamoorthy, L., Narayan Choudhary, Sonali Sutradhar, Priyanka Biswas, Arundhati Sengupta, Sankarshan Dutta & Priyanka Das. 2019. Bengali Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
- Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
Item specifics
- Authors: Ramamoorthy L., Narayan Choudhary, Sonali Sutradhar, Priyanka Biswas, Arundhati Sengupta, Sankarshan Dutta, Priyanka Das
- Corpus Type: Raw Corpus
- Catalogue Number: 1107
- ISBN: 978-81-7343-206-4
- Data Source: On Field
- Duration: 128:46:59
- Number of Audio Segments: 73,470
- Release Date: 04-Apr-2019
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.