Hindi Raw Speech Corpus

Catalogue Number: 1122

Dataset Description

Hindi is a major Indo-Aryan language descended from Sanskrit and is spoken in central and northern India.

The LDC-IL Hindi speech data set contains about 118 hours of speech. It is made up of several types of recorded material: word lists, sentences, running texts and date formats.

Approximately 15 minutes of speech per speaker were recorded from 234 female and 255 male native speakers of different age groups. The material each speaker recorded was randomly selected from a master dataset.

Corpus details:

  • a total of 489 speakers (234 female and 255 male)
  • 73695 audio segments
  • 78.6 gigabytes of WAV files and metadata text files
  • 118:40:03 (hours:minutes:seconds) of speech data
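
The figures in the corpus details can be cross-checked against the "approximately 15 minutes per speaker" statement with a little arithmetic. The following is a minimal sketch, not part of the LDC-IL distribution: the constant and function names are hypothetical, and it assumes that 118:40:03 is to be read as hours:minutes:seconds.

```python
from datetime import timedelta

# Figures quoted in the corpus details.
FEMALE_SPEAKERS = 234
MALE_SPEAKERS = 255
AUDIO_SEGMENTS = 73695
TOTAL_DURATION = "118:40:03"  # assumed to mean hours:minutes:seconds


def parse_hms(text: str) -> timedelta:
    """Convert an 'H:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def corpus_summary() -> dict:
    """Derive per-speaker and per-segment averages from the quoted totals."""
    speakers = FEMALE_SPEAKERS + MALE_SPEAKERS
    total_seconds = parse_hms(TOTAL_DURATION).total_seconds()
    return {
        "speakers": speakers,
        "total_seconds": total_seconds,
        "minutes_per_speaker": total_seconds / speakers / 60,
        "segments_per_speaker": AUDIO_SEGMENTS / speakers,
        "seconds_per_segment": total_seconds / AUDIO_SEGMENTS,
    }


if __name__ == "__main__":
    for name, value in corpus_summary().items():
        print(f"{name}: {value:.2f}")
```

Under that reading, the totals work out to roughly 14.6 minutes per speaker, which agrees with the stated figure of approximately 15 minutes, and to about 151 segments per speaker with an average segment length of about 5.8 seconds.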

A much more detailed explanation of the Hindi Speech Corpus will be available in the Hindi Speech Data Documentation.

Overview

Hindi is a major Indo-Aryan language descended from Sanskrit. It is spoken in central and northern India, in the states of Bihar, Chhattisgarh, Delhi, Haryana, Himachal Pradesh, Jharkhand, Madhya Pradesh, Rajasthan, Uttarakhand and Uttar Pradesh.

The LDC-IL speech data was collected in the Awadhi, Bhojpuri and Khariboli belts, from speakers of both genders and of different age groups.

For research citations, please use the following references:

  • Ramamoorthy, L., Narayan Choudhary, Jitendra Kumar Singh, Richa, Anjali Sinha, Dheeraj Kumar Mishra, Arimardan Kumar Tripathi & Satyaendra Kumar Awasthi. 2019. Hindi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
  • Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Item specifics

  • Authors: Ramamoorthy L., Narayan Choudhary, Satyaendra Awasthi, Jitendra Kumar Singh, Richa, Anjali Sinha, Dheeraj Kumar Mishra, Arimardan Kumar Tripathi, Aditi Debsharma
  • Corpus Type: Raw Corpus
  • Catalogue Number: 1122
  • ISBN: 978-81-7343-221-7
  • Data Source: On Field
  • Duration: 118:40:03
  • Number of Audio Segments: 73695
  • Release Date: 04/04/2019
  • Terms and Conditions: General instructions for use of the resources provided by LDC-IL, with separate terms for commercial and non-commercial users.
