Dataset Description
28:10:07 Hours | 18 GB speech data | 150 Speakers | 16,380 Audio segments | 48 kHz | 16 bit wav.
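As a rough cross-check of the figures above, the uncompressed PCM payload implied by the stated duration, sample rate and bit depth can be estimated. This is a minimal sketch: the channel count is not stated in the description, so both mono and stereo are computed as assumptions, and WAV header overhead is ignored. The helper names are illustrative only.

```python
SAMPLE_RATE_HZ = 48_000   # stated: 48 kHz
BYTES_PER_SAMPLE = 2      # stated: 16 bit


def duration_seconds(text: str) -> int:
    """Convert an H:MM:SS string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def pcm_bytes(seconds: int, channels: int) -> int:
    """Raw PCM payload size in bytes, excluding WAV headers."""
    return seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * channels


total_seconds = duration_seconds("28:10:07")

# The channel count is an assumption; the description does not give it.
for channels in (1, 2):
    size = pcm_bytes(total_seconds, channels)
    print(f"{channels} channel(s): {size} bytes, {size / 2**30:.2f} GiB")
```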
Kashmiri belongs to the Dardic group of the Indo-Aryan family. It is known by the names ‘Kashur’ and ‘Kashmiri’. It is primarily spoken in the Kashmir valley and the Pir-Panchal range of the Jammu region. Kashmiri has two types of dialects: regional dialects and social dialects. Apart from the Kashmiri spoken in the valley itself, there are other varieties spoken outside the valley, and these are considered regional dialects of Kashmiri. The regional dialects are Kishtawari, Poguli and Rambani. Kashmiri also has three social dialects, known by the names Yamraz, Marak and Kamraz.
The LDC-IL speech data was collected in the Kashmir valley, from Pulwama, Srinagar, and Anantnag. It was collected from speakers of both genders across different age groups. The LDC-IL Kashmiri speech data consists of different types of datasets made up of words, sentences, running texts and date formats. Each speaker recorded datasets that were randomly selected from a master dataset.
Available speech corpus details:
Total Speakers: 150 (78 Female and 72 Male)
Domain | Audio Segments | Duration (H:MM:SS)
Contemporary Text (News) | 147 | 3:56:57
Creative Text | 148 | 12:41:33
Sentence | 3704 | 2:40:24
Date Format | 281 | 0:10:36
Command and Control Words | 4288 | 3:04:32
Person Name | 2065 | 1:53:21
Place Name | 1468 | 1:04:37
Most Frequent Word - Part | 4279 | 2:38:07
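The per-domain figures in the table above can be checked against the corpus totals stated in the description (16,380 audio segments and 28:10:07 hours). This is a minimal sketch using only the values from the table; the helper names are illustrative and not part of any LDC-IL tooling.

```python
from datetime import timedelta

# Rows copied from the domain table: (domain, audio segments, duration H:MM:SS).
ROWS = [
    ("Contemporary Text (News)", 147, "3:56:57"),
    ("Creative Text", 148, "12:41:33"),
    ("Sentence", 3704, "2:40:24"),
    ("Date Format", 281, "0:10:36"),
    ("Command and Control Words", 4288, "3:04:32"),
    ("Person Name", 2065, "1:53:21"),
    ("Place Name", 1468, "1:04:37"),
    ("Most Frequent Word - Part", 4279, "2:38:07"),
]


def parse_duration(text: str) -> timedelta:
    """Parse an H:MM:SS string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_duration(delta: timedelta) -> str:
    """Format a timedelta as H:MM:SS, with hours allowed to exceed 24."""
    total = int(delta.total_seconds())
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(count for _, count, _ in ROWS)
total_duration = sum((parse_duration(d) for _, _, d in ROWS), timedelta())

print(total_segments)
print(format_duration(total_duration))
```

Both sums agree with the totals given for the corpus, so the table accounts for every segment in the release.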
A detailed explanation of the Kashmiri Speech Corpus will be available in the Kashmiri Speech Data Documentation.
For research citations, please use the following:
- Narayan Kumar Choudhary, Shahid Mushtaq Bhatt, Rajesha N., Manasa G., 2021. Kashmiri Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
- Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
Item specifics
- Authors: Narayan Kumar Choudhary, Shahid Mushtaq Bhatt, Rajesha N., Manasa G.
- Corpus Type: Raw Corpus
- Catalogue Number: 1280
- ISBN: 978-81-948885-2-9
- Data Source: On Field
- Duration: 28:10:07
- Number of Audio Segments: 16380
- Release Date: 15-Jun-2021
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.