99:18:21 hours, 64.2 gigabytes of speech data | 499 speakers | 88,708 audio segments | 48 kHz | 16-bit WAV
Urdu is one of the Modern Indo-Aryan languages of India. It evolved from Shauraseni
Apabhramsha and is written in the Perso-Arabic script. The language spoken in a region is
influenced by the other languages of that region, the speaker's mother tongue, etc.
Reading speed, loudness, frequency, etc. also differ with factors such as age and
gender. The Linguistic Data Consortium for Indian Languages (LDC-IL) collected the
speech corpus through fieldwork. This read speech was collected from male and female
native speakers of various age groups. The data includes texts, sentences, date
formats, and different wordlists.
The available speech corpus details are as follows:
- Total Speakers - 499 (252 Female and 247 Male)
- News - 431 Audio Segments - 25:35:02 Hours
- Creative Text - 433 Audio Segments - 19:40:11 Hours
- Sentence - 10646 Audio Segments - 8:00:38 Hours
- Date - 846 Audio Segments - 0:43:37 Hours
- Command and Control Words - 13580 Audio Segments - 9:21:01 Hours
- Person Name - 6577 Audio Segments - 2:55:41 Hours
- Place Name - 4273 Audio Segments - 1:09:17 Hours
- Most Frequent Word (Part) - 12802 Audio Segments - 7:46:28 Hours
- Most Frequent Word (Full) - 18927 Audio Segments - 11:38:30 Hours
- Phonetically Balanced Vocabulary - 13646 Audio Segments - 8:13:20 Hours
- Form and Function Word - 6547 Audio Segments - 4:14:36 Hours
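The per-category figures above can be cross-checked against the corpus totals (88,708 audio segments, 99:18:21 hours). The following is a minimal sketch of that arithmetic; the names `CATEGORIES`, `to_seconds`, and `to_hms` are illustrative assumptions and are not part of any LDC-IL tooling.

```python
# Per-category figures copied from the list above:
# (number of audio segments, duration as "H:MM:SS").
CATEGORIES = {
    "News": (431, "25:35:02"),
    "Creative Text": (433, "19:40:11"),
    "Sentence": (10646, "8:00:38"),
    "Date": (846, "0:43:37"),
    "Command and Control Words": (13580, "9:21:01"),
    "Person Name": (6577, "2:55:41"),
    "Place Name": (4273, "1:09:17"),
    "Most Frequent Word (Part)": (12802, "7:46:28"),
    "Most Frequent Word (Full)": (18927, "11:38:30"),
    "Phonetically Balanced Vocabulary": (13646, "8:13:20"),
    "Form and Function Word": (6547, "4:14:36"),
}


def to_seconds(hms: str) -> int:
    """Convert an "H:MM:SS" string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to "H:MM:SS"."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(count for count, _ in CATEGORIES.values())
total_duration = to_hms(sum(to_seconds(d) for _, d in CATEGORIES.values()))

print(total_segments)  # 88708
print(total_duration)  # 99:18:21
```

Both sums agree with the totals stated for the corpus, so the category list accounts for every segment in the release.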
A more detailed description of the Urdu Speech Corpus will be available in the Urdu Speech Data Documentation.
For research-based citations, please cite the following:
- Ramamoorthy, L., Narayan Choudhary, Mansoor Khan, Shahnawaz Alam, Bi Bi Mariyam & Rushda Idris Khan. 2019. Urdu Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
- Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: ” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
- Authors: Ramamoorthy L., Narayan Choudhary, Mansoor Khan, Shahnawaz Alam, Bi Bi Mariyam, Rushda Idris Khan
- Corpus Type: Raw Corpus
- Catalogue Number: 1177
- ISBN: 978-81-7343-276-7
- Data Source: On Field
- Duration: 99:18:21
- Number of Audio Segments: 88708
- Release Date: 04/04/2019
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.