Dataset Description
121:00:06 Hours | 76.6 GB | 488 Speakers | 70686 Audio Segments | 48 kHz | 16 bit wav.
Hindi is a major Indo-Aryan language descended from Sanskrit. It is spoken in central and northern India, in the states of Bihar, Chhattisgarh, Delhi, Haryana, Himachal Pradesh, Jharkhand, Madhya Pradesh, Rajasthan, Uttarakhand, and Uttar Pradesh. The LDC-IL speech data was collected in the Awadhi, Bhojpuri, Magahi and Khariboli belts from speakers of both genders and different age groups. The LDC-IL Hindi speech data totals 121:00:06 hours. The dataset comprises several types of material: word lists, sentences, running texts and date formats.
The available Speech Corpus details:
Total Speakers 488 (234 Female and 254 Male)
Domain | Audio Segments | Duration per Domain
Contemporary Text (News) | 457 | 37:22:29
Creative Text | 463 | 29:24:08
Sentence | 10173 | 8:41:17
Date Format | 764 | 0:46:56
Command and Control Words | 12284 | 8:34:51
Person Name | 8171 | 9:55:25
Place Name | 4085 | 3:14:44
Most Frequent Word - Part | 12315 | 8:09:10
Most Frequent Word - Full Set | 6994 | 4:30:14
Phonetically Balanced | 11986 | 8:23:43
Form and Function - Word | 2994 | 1:57:09
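As a quick consistency check, the per-domain figures listed above can be summed and compared with the headline totals (70686 audio segments, 121:00:06 hours). The sketch below is illustrative only: the values are copied from this listing, and the names `DOMAINS`, `to_seconds` and `to_hms` are hypothetical helpers, not part of any LDC-IL tooling.

```python
# Per-domain figures copied from this listing: (domain, audio segments, duration h:mm:ss).
DOMAINS = [
    ("Contemporary Text (News)", 457, "37:22:29"),
    ("Creative Text", 463, "29:24:08"),
    ("Sentence", 10173, "8:41:17"),
    ("Date Format", 764, "0:46:56"),
    ("Command and Control Words", 12284, "8:34:51"),
    ("Person Name", 8171, "9:55:25"),
    ("Place Name", 4085, "3:14:44"),
    ("Most Frequent Word - Part", 12315, "8:09:10"),
    ("Most Frequent Word - Full Set", 6994, "4:30:14"),
    ("Phonetically Balanced", 11986, "8:23:43"),
    ("Form and Function - Word", 2994, "1:57:09"),
]


def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string (hours may exceed 24) to whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert whole seconds back to an h:mm:ss string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(segments for _, segments, _ in DOMAINS)
total_duration = to_hms(sum(to_seconds(duration) for _, _, duration in DOMAINS))

print(total_segments)  # 70686
print(total_duration)  # 121:00:06
```

Both sums agree with the totals stated for the corpus.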
A detailed explanation of the Hindi Speech Corpus will be available in the Hindi Speech Data Documentation.
For research use, please cite the following:
- Ramamoorthy, L., Narayan Choudhary, Jitendra Kumar Singh, Richa, Anjali Sinha, Dheeraj Kumar Mishra, Arimardan Kumar Tripathi & Satyaendra Kumar Awasthi. 2019. Hindi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
- Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
Item specifics
- Authors: Ramamoorthy L., Narayan Choudhary, Satyaendra Awasthi, Jitendra Kumar Singh, Richa, Anjali Sinha, Dheeraj Kumar Mishra, Arimardan Kumar Tripathi, Aditi Debsharma
- Corpus Type: Raw Corpus
- Catalogue Number: 1122
- ISBN: 978-81-7343-221-7
- Data Source: On Field
- Duration: 121:00:06
- # of Audio Segments: 70686
- Release Date: 04-Apr-2019
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.