Search - Tag - Punjabi


A Gold Standard Punjabi Raw Text Corpus

requests (17)

1,01,25,770 Words | 2,470 Titles | XML format | 5 domains

Punjabi is the principal and administrative language of Punjab. It is an Indo-Aryan language, spoken not only in Punjab, India, but also in Lehnda Punjab in Pakistan. The language is written in two scripts: Gurmukhi in Eastern Punjab (India) and Shahmukhi, a variant of the Perso-Arabic script, in Lehnda Punjab (Pakistan). The LDC-IL Punjabi text corpus is collected in the Gurmukhi script and reflects contemporary usage.

The Punjabi Text Corpus is encoded in machine-readable form and stored in a standard format. The major encoding used is Unicode, and the data is stored in XML format with embedded metadata. The corpus was created from contemporary text, both typed and crawled. Its size is 1,01,25,770 words drawn from 2,470 different titles. The five major domains are Aesthetics, Science & Technology, Social Science, Commerce and Mass Media.

The available Text Corpus details:

Domain | Words | Percentage of Total Corpus
Aesthetics | 41,90,199 | 41.38 %
Commerce | 56,205 | 00.56 %
Mass Media | 42,74,922 | 42.22 %
Science and Technology | 3,84,078 | 03.79 %
Social Sciences | 12,20,366 | 12.05 %

A detailed explanation of the Punjabi Text Corpus will be available in the Punjabi Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Poonam Dhillon, Sarbjeet Kaur & Sandeep Singh. 2019. A Gold Standard Punjabi Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10...
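The word counts above use Indian digit grouping (lakh and crore), which some tools do not parse out of the box. The following is a minimal sketch, not part of the LDC-IL distribution, that converts the listed figures to integers and recomputes each domain's share; the helper name `parse_indian_number` is hypothetical.

```python
def parse_indian_number(text: str) -> int:
    """Convert a lakh/crore-grouped numeral such as '1,01,25,770' to an int."""
    return int(text.replace(",", "").replace(" ", ""))


# Per-domain word counts exactly as given in the listing.
domain_words = {
    "Aesthetics": "41,90,199",
    "Commerce": "56,205",
    "Mass Media": "42,74,922",
    "Science and Technology": "3,84,078",
    "Social Sciences": "12,20,366",
}

counts = {name: parse_indian_number(value) for name, value in domain_words.items()}
total = sum(counts.values())

# Share of each domain, as a percentage rounded to two decimals.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {counts[name]} words, {share:.2f} %")
print(f"Total: {total} words")
```

The per-domain counts sum to 10125770, which matches the stated corpus size of 1,01,25,770 words.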


Punjabi Raw Speech Corpus

requests (14)

101:09:28 Hours | 65.5 GB | 467 Speakers | 76,230 Audio Segments | 48 kHz | 16 bit wav

Punjabi is an Indo-Aryan language. It is a tonal language with three tones: high-falling, low-rising, and level (neutral). Punjabi is spoken not only in India but also in Pakistan, where it is called Shahmukhi Punjabi; this corpus covers only Indian Gurmukhi Punjabi. The Punjabi language has four different dialects, spoken in the different sub-regions of Punjab.

The LDC-IL Punjabi Speech data set consists of several types of datasets: word lists, sentences, running texts and date formats. Each speaker recorded items randomly selected from a master dataset. LDC-IL collected speech data from the Malwa, Doab and Puadh regions.

The available Speech Corpus details:

Total Speakers: 467 (234 Female and 233 Male)

Domain | Audio Segments | Duration
Contemporary Text (News) | 448 | 27:07:41
Creative Text | 446 | 19:29:15
Sentence | 11,168 | 08:58:33
Date Format | 887 | 00:27:53
Command and Control Words | 13,274 | 07:49:16
Person Name | 8,949 | 10:28:40
Place Name | 4,473 | 03:17:02
Most Frequent Word - Part | 8,889 | 05:21:56
Most Frequent Word - Full Set | 3,988 | 02:52:44
Phonetically Balanced | 13,939 | 08:56:04
Form and Function - Word | 9,769 | 06:24:07

A detailed explanation of the Punjabi Speech Corpus will be available in the Punjabi Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Poonam Dhillon & Sarbjeet Kaur. 2019. Punjabi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. "LDC-IL Raw Speech Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
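The listing gives the audio format as 48 kHz, 16 bit wav. A minimal sketch of how a recipient might confirm those parameters from a file header, using Python's standard `wave` module, is shown below. The function name `describe_wav` is hypothetical, and because no corpus file is assumed to be at hand, the example builds a one-second silent stand-in in memory; the mono channel count is an assumption, since the listing does not state it.

```python
import io
import wave


def describe_wav(source) -> dict:
    """Read sample rate, bit depth, channels and duration from a WAV header."""
    with wave.open(source, "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "bit_depth": wav.getsampwidth() * 8,  # bytes per sample -> bits
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


# Stand-in for a corpus file: one second of 16-bit silence at 48 kHz.
# Mono is an assumption; the listing does not give the channel count.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)        # 2 bytes = 16 bit
    wav.setframerate(48000)    # 48 kHz
    wav.writeframes(b"\x00\x00" * 48000)
buffer.seek(0)

info = describe_wav(buffer)
print(info)
```

With a real download, a file path can be passed to `describe_wav` in place of the in-memory buffer.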


Punjabi Sentence Aligned Speech Corpus

requests (1)

52:24:51 hours | 34.8 GB | 31,338 Audio Segments | 449 Speakers

The LDC-IL Punjabi Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Gurmukhi script. The dataset spans a duration of 52:24:51 (hh:mm:ss) and consists of read speech with continuous text, representative sentences, and date formats. A comprehensive explanation of the dataset can be found in the Punjabi Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:

Dr. Shalinder Singh, Rajesha N., Manasa G., Stephen Fernandes, Nithin S., Roopashri M. R., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. Punjabi Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. 978-93-48633-69-9.

Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...
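Durations in these listings are given as hh:mm:ss with hour counts above 24, so they do not fit a clock-time parser. As a minimal sketch (the helper name `hms_to_seconds` is hypothetical), the stated duration can be converted to seconds and divided by the stated segment count to estimate the mean segment length:

```python
def hms_to_seconds(text: str) -> int:
    """Convert an 'hh:mm:ss' duration (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in text.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Figures taken from the listing for the sentence aligned speech corpus.
total_seconds = hms_to_seconds("52:24:51")
segments = 31_338

mean_segment_seconds = total_seconds / segments
print(f"{total_seconds} s over {segments} segments")
print(f"mean segment length: {mean_segment_seconds:.2f} s")
```

On these figures the average works out to roughly six seconds per segment, which is in line with sentence-level read speech.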