Punjabi Raw Speech Corpus
Overview
Punjabi is an Indo-Aryan language. It is tonal, with three tones: high-falling, low-rising, and level (neutral).
101:09:28 hours of Punjabi speech data | 76,240 audio segments | 467 speakers | 65.5 GB | 48 kHz | 16-bit WAV
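The audio format quoted above (48 kHz, 16-bit WAV) can be checked on any individual segment with Python's standard `wave` module. The following is a minimal sketch, not part of the LDC-IL distribution: the function name `inspect_wav` and the file path passed on the command line are hypothetical, and the channel count is only reported, not checked, because the page does not state it.

```python
import sys
import wave

# Values taken from the corpus summary line: 48 kHz, 16 bit.
EXPECTED_RATE_HZ = 48_000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits = 2 bytes per sample


def inspect_wav(path):
    """Read a WAV header and compare it with the quoted corpus format."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {
        "rate_hz": rate,
        "sample_width_bytes": width,
        "channels": channels,  # reported only; not stated on the page
        "duration_s": frames / rate,
        "matches_spec": rate == EXPECTED_RATE_HZ
        and width == EXPECTED_SAMPLE_WIDTH_BYTES,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python inspect_wav.py segment.wav
    print(inspect_wav(sys.argv[1]))
```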
The LDC-IL Punjabi speech data totals about 101 hours. The dataset is made up of several types of material: word lists, sentences, running texts and date formats.
The recordings were made by 234 female and 233 male native speakers of different age groups. Each speaker recorded items randomly selected from a master dataset.
- Total of 467 speakers (234 female and 233 male)
- Contemporary Text (News) - 448 Audio Segments - 27:07:41 hours
- Created Text - 446 Audio Segments - 19:29:15 hours
- Date - 887 Audio Segments - 00:27:53 hours
- Sentence - 11,168 Audio Segments - 08:58:33 hours
- Command and Control Words - 13,274 Audio Segments - 07:49:16 hours
- Place Name - 4,473 Audio Segments - 03:17:02 hours
- Person Names - 8,949 Audio Segments - 10:28:40 hours
- Most Frequent Word-Part - 8,889 Audio Segments - 05:21:56 hours
- Most Frequent Word-Full - 3,988 Audio Segments - 02:52:44 hours
- Phonetically Balanced Vocabulary - 13,939 Audio Segments - 08:56:04 hours
- Form and Function Word - 9,779 Audio Segments - 06:24:07 hours
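The per-category figures in the list above can be totalled with a few lines of Python. This is an illustrative sketch only: the numbers are copied from the list, while the names `BREAKDOWN`, `to_seconds` and `to_hms` are introduced here and are not part of any LDC-IL tooling.

```python
# Per-category figures copied from the list above:
# (category, audio segments, duration as "HH:MM:SS").
BREAKDOWN = [
    ("Contemporary Text (News)", 448, "27:07:41"),
    ("Created Text", 446, "19:29:15"),
    ("Date", 887, "00:27:53"),
    ("Sentence", 11_168, "08:58:33"),
    ("Command and Control Words", 13_274, "07:49:16"),
    ("Place Name", 4_473, "03:17:02"),
    ("Person Names", 8_949, "10:28:40"),
    ("Most Frequent Word-Part", 8_889, "05:21:56"),
    ("Most Frequent Word-Full", 3_988, "02:52:44"),
    ("Phonetically Balanced Vocabulary", 13_939, "08:56:04"),
    ("Form and Function Word", 9_779, "06:24:07"),
]


def to_seconds(hms):
    """Convert an "HH:MM:SS" string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Format a number of seconds as "H:MM:SS"."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(count for _, count, _ in BREAKDOWN)
total_seconds = sum(to_seconds(duration) for _, _, duration in BREAKDOWN)

print(total_segments)         # 76240
print(to_hms(total_seconds))  # 101:13:11
```

As listed, the segment counts sum to the 76,240 quoted for the corpus, while the per-category durations sum to 101:13:11, slightly above the quoted total of 101:09:28; the page does not explain the difference.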
Punjabi is not spoken only in India; it is also a language of Pakistan, where it is known as Shahmukhi Punjabi. This corpus covers only Indian (Gurmukhi) Punjabi. The language has four dialects, spoken in different sub-regions of Punjab.
LDC-IL collected speech data from the Malwa, Doab and Puadh regions.
A more detailed description of the Punjabi Speech Corpus will be available in the Punjabi Speech Data Documentation.
For research citations, please use the following:
- Ramamoorthy, L., Narayan Choudhary, Poonam Dhillon & Sarbjeet Kaur. 2019. Punjabi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
- Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: ” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
- Authors: Ramamoorthy L., Narayan Choudhary, Poonam Dhillon, Sarbjeet Kaur
- Corpus Type: Raw Corpus
- Catalogue Number: 1165
- ISBN: 978-81-7343-264-4
- Data Source: On Field
- Duration: 101:09:28
- # of Audio Segments: 76,240
- Release Date: 04/04/2019
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.