Bengali Raw Speech Corpus

Catalogue Number: 1107

Overview

128:46:59 Hours | 81.2 GB | 476 Speakers | 73,470 Audio Segments | 48 kHz | 16-bit WAV

Dataset Description

Bengali is the official language of West Bengal and Tripura. It belongs to the Indo-Aryan language family and is influenced by Sanskrit. Greater use of Bengali has contributed to the growth of the language in terms of vocabulary and the number of styles and registers. Bengali is spoken throughout West Bengal, Tripura and Bangladesh, and in some parts of Bihar, Odisha and Assam. Bengali refugees who settled in Andaman after 1950 have also carried the language there.

The LDC-IL Bengali speech data was collected from the regions of Standard Colloquial (Central Bengal) and Barendri (North Bengal). The data set comprises several types of material: word lists, sentences, running texts and date formats.


The available Speech Corpus details:

Total Speakers: 476 (236 female and 240 male)


Domain                        | Audio Segments | Duration per Domain
Contemporary Text (News)      | 450            | 35:05:07
Creative Text                 | 448            | 20:16:13
Sentence                      | 11,239         | 16:05:22
Date Format                   | 414            | 0:26:48
Command and Control Words     | 13,477         | 14:00:24
Person Name                   | 9,012          | 4:56:22
Place Name                    | 4,498          | 1:45:35
Most Frequent Word - Part     | 13,525         | 13:33:14
Most Frequent Word - Full Set | 5,978          | 6:47:05
Phonetically Balanced         | 9,489          | 10:23:08
Form and Function - Word      | 4,940          | 5:27:41
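
The per-domain figures above can be cross-checked against the headline totals (73,470 audio segments, 128:46:59 hours). The following is a minimal sketch of that arithmetic, not part of the LDC-IL distribution. The names `DOMAINS`, `to_seconds` and `to_hms` are illustrative, and the values are copied by hand from this listing.

```python
# Per-domain figures copied from the listing: (domain, audio segments, duration h:mm:ss).
DOMAINS = [
    ("Contemporary Text (News)", 450, "35:05:07"),
    ("Creative Text", 448, "20:16:13"),
    ("Sentence", 11239, "16:05:22"),
    ("Date Format", 414, "0:26:48"),
    ("Command and Control Words", 13477, "14:00:24"),
    ("Person Name", 9012, "4:56:22"),
    ("Place Name", 4498, "1:45:35"),
    ("Most Frequent Word - Part", 13525, "13:33:14"),
    ("Most Frequent Word - Full Set", 5978, "6:47:05"),
    ("Phonetically Balanced", 9489, "10:23:08"),
    ("Form and Function - Word", 4940, "5:27:41"),
]


def to_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to 'h:mm:ss' (hours may exceed 24)."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(count for _, count, _ in DOMAINS)
total_duration = to_hms(sum(to_seconds(duration) for _, _, duration in DOMAINS))

print(total_segments, total_duration)  # prints: 73470 128:46:59
```

Both sums agree with the totals stated in the overview, so the per-domain breakdown accounts for the whole corpus.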


A detailed explanation of the Bengali Speech Corpus will be available in the Bengali Speech Data Documentation. 

For research citations, please use the following references:

  • Ramamoorthy, L., Narayan Choudhary, Sonali Sutradhar, Priyanka Biswas, Arundhati Sengupta, Sankarshan Dutta & Priyanka Das. 2019. Bengali Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
  • Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Item specifics

  • Authors: Ramamoorthy L., Narayan Choudhary, Sonali Sutradhar, Priyanka Biswas, Arundhati Sengupta, Sankarshan Dutta, Priyanka Das
  • Corpus Type: Raw Corpus
  • Catalogue Number: 1107
  • ISBN: 978-81-7343-206-4
  • Data Source: On Field
  • Duration: 128:46:59
  • # of Audio Segments: 73,470
  • Release Date: 04/04/2019
  • Terms and Conditions: General instructions for use of the resources provided by LDC-IL (Commercial User, Non-Commercial User).