Bengali Raw Speech Corpus

Catalogue Number: 1107
Stock: In Stock

Overview

128:46:59 hours | 81.2 GB | 476 speakers | 73,470 audio segments | 48 kHz | 16-bit WAV

Dataset Description

Konkani belongs to the Indo-European family of languages and is the official language of Goa. It is spoken widely across four states: Maharashtra, Goa, Karnataka and Kerala. Konkani is the only Indian language written in five different scripts: Devanagari, Roman, Kannada, Malayalam, and Persian-Arabic.

The LDC-IL speech data was collected in North Goa, South Goa, Karwar (Karnataka) and Sindhudurg (Maharashtra) from speakers of both genders and different age groups. Approximately 15 to 20 minutes of speech per speaker was recorded from 267 female and 237 male native speakers. Each speaker recorded datasets that were randomly selected from a master dataset.


The available Speech Corpus details:

Total speakers: 476 (236 female and 240 male)


Domain                          | Audio Segments (Each Domain) | Duration
Contemporary Text (News)        | 450                          | 35:05:07
Creative Text                   | 448                          | 20:16:13
Sentence                        | 11,239                       | 16:05:22
Date Format                     | 414                          | 0:26:48
Command and Control Words       | 13,477                       | 14:00:24
Person Name                     | 9,012                        | 4:56:22
Place Name                      | 4,498                        | 1:45:35
Most Frequent Word - Part       | 13,525                       | 13:33:14
Most Frequent Word - Full Set   | 5,978                        | 6:47:05
Phonetically Balanced           | 9,489                        | 10:23:08
Form and Function - Word        | 4,940                        | 5:27:41
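
The per-domain figures can be cross-checked against the corpus totals with a short script. The following is a minimal sketch, not part of the LDC-IL distribution: the names DOMAINS, to_seconds and to_hms are illustrative, and the numbers are copied by hand from the domain listing above.

```python
# Per-domain figures copied from the domain listing:
# (domain, audio segments, duration as H:MM:SS).
DOMAINS = [
    ("Contemporary Text (News)", 450, "35:05:07"),
    ("Creative Text", 448, "20:16:13"),
    ("Sentence", 11239, "16:05:22"),
    ("Date Format", 414, "0:26:48"),
    ("Command and Control Words", 13477, "14:00:24"),
    ("Person Name", 9012, "4:56:22"),
    ("Place Name", 4498, "1:45:35"),
    ("Most Frequent Word - Part", 13525, "13:33:14"),
    ("Most Frequent Word - Full Set", 5978, "6:47:05"),
    ("Phonetically Balanced", 9489, "10:23:08"),
    ("Form and Function - Word", 4940, "5:27:41"),
]


def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to an H:MM:SS string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(segments for _, segments, _ in DOMAINS)
total_duration = to_hms(sum(to_seconds(d) for _, _, d in DOMAINS))

print(total_segments)   # 73470
print(total_duration)   # 128:46:59
```

Both sums agree with the totals stated in the overview line (73,470 audio segments, 128:46:59 hours).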


A detailed explanation of the Bengali Speech Corpus will be available in the Bengali Speech Data Documentation. 

For research citations, please use the following:

  • Ramamoorthy, L., Narayan Choudhary, Sonali Sutradhar, Priyanka Biswas, Arundhati Sengupta, Sankarshan Dutta & Priyanka Das. 2019. Bengali Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
  • Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Item specifics

  • Authors: Ramamoorthy L., Narayan Choudhary, Sonali Sutradhar, Priyanka Biswas, Arundhati Sengupta, Sankarshan Dutta, Priyanka Das
  • Corpus Type: Raw Corpus
  • Catalogue Number: 1107
  • ISBN: 978-81-7343-206-4
  • Data Source: On Field
  • Duration: 128:46:59
  • # of Audio Segments: 73399
  • Release Date: 04/04/2019
  • Terms and Conditions: General instructions for use of the resources provided by LDC-IL (Commercial User, Non-Commercial User).
