A Gold Standard Punjabi Raw Text Corpus

1,01,25,770 Words | 2,470 Titles | XML format | 5 domains

Punjabi is the principal and administrative language of Punjab. It is an Indo-Aryan language spoken not only in Punjab, India, but also in Lehnda Punjab in Pakistan. The language is written in two scripts: Gurmukhi in Eastern (Indian) Punjab and Shahmukhi, a variant of the Perso-Arabic script, in Lehnda Punjab (Pakistan). The LDC-IL Punjabi Text Corpus is collected in the Gurmukhi script and reflects contemporary usage.

The Punjabi Text Corpus is encoded in a machine-readable form and stored in a standard format: the text is encoded in Unicode, stored as XML, and embedded with metadata. The corpus was created from contemporary text, partly typed and partly crawled. It contains 1,01,25,770 words drawn from 2,470 different titles. The five major domains are Aesthetics, Science and Technology, Social Sciences, Commerce and Mass Media.

The available Text Corpus details:

| Domain                 | Words     | Percentage of Total Corpus |
|------------------------|-----------|----------------------------|
| Aesthetics             | 41,90,199 | 41.38 %                    |
| Commerce               | 56,205    | 00.56 %                    |
| Mass Media             | 42,74,922 | 42.22 %                    |
| Science and Technology | 3,84,078  | 03.79 %                    |
| Social Sciences        | 12,20,366 | 12.05 %                    |

A detailed explanation of the Punjabi Text Corpus will be available in the Punjabi Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Poonam Dhillon, Sarbjeet Kaur & Sandeep Singh. 2019. A Gold Standard Punjabi Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 1-10.
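The word counts above use Indian digit grouping (for example, 1,01,25,770 is 10,125,770). As a minimal sketch, the per-domain percentages can be re-derived from the counts as follows; the names `DOMAIN_WORDS` and `parse_indian_number` are illustrative and not part of any LDC-IL tooling.

```python
# Word counts per domain, copied from the text corpus details above
# (written with Indian digit grouping).
DOMAIN_WORDS = {
    "Aesthetics": "41,90,199",
    "Commerce": "56,205",
    "Mass Media": "42,74,922",
    "Science and Technology": "3,84,078",
    "Social Sciences": "12,20,366",
}


def parse_indian_number(text: str) -> int:
    """Convert a string such as '1,01,25,770' to an int by dropping separators."""
    return int(text.replace(",", "").replace(" ", ""))


counts = {domain: parse_indian_number(words) for domain, words in DOMAIN_WORDS.items()}
total_words = sum(counts.values())

# Share of each domain in the whole corpus, rounded to two decimals.
shares = {domain: round(100 * n / total_words, 2) for domain, n in counts.items()}

for domain, share in shares.items():
    print(f"{domain:<24}{counts[domain]:>12,}{share:>8.2f} %")
print(f"{'Total':<24}{total_words:>12,}")
```

The five domain counts add up to the headline figure of 1,01,25,770 words, and the rounded shares agree with the percentages listed for each domain.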


Punjabi Raw Speech Corpus

101:09:28 Hours | 65.5 GB | 467 Speakers | 76,230 Audio Segments | 48 kHz | 16-bit WAV

Punjabi is an Indo-Aryan language. It is a tonal language with three tones: high-falling, low-rising, and level (neutral). Punjabi is spoken not only in India but also in Pakistan, where it is called Shahmukhi Punjabi; this corpus covers only Indian Gurmukhi Punjabi. The Punjabi language has four different dialects, spoken in different sub-regions of Punjab. LDC-IL collected speech data from the Malwa, Doab and Puadh regions.

The LDC-IL Punjabi Speech data set consists of several types of datasets: word lists, sentences, running texts and date formats. Each speaker recorded items that were randomly selected from a master dataset.

The available Speech Corpus details:

Total speakers: 467 (234 female and 233 male)

| Domain                        | Audio Segments | Duration (hh:mm:ss) |
|-------------------------------|----------------|---------------------|
| Contemporary Text (News)      | 448            | 27:07:41            |
| Creative Text                 | 446            | 19:29:15            |
| Sentence                      | 11,168         | 08:58:33            |
| Date Format                   | 887            | 00:27:53            |
| Command and Control Words     | 13,274         | 07:49:16            |
| Person Name                   | 8,949          | 10:28:40            |
| Place Name                    | 4,473          | 03:17:02            |
| Most Frequent Word - Part     | 8,889          | 05:21:56            |
| Most Frequent Word - Full Set | 3,988          | 02:52:44            |
| Phonetically Balanced         | 13,939         | 08:56:04            |
| Form and Function - Word      | 9,769          | 06:24:07            |

A detailed explanation of the Punjabi Speech Corpus will be available in the Punjabi Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Poonam Dhillon & Sarbjeet Kaur. 2019. Punjabi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. "LDC-IL Raw Speech Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
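The segment counts and durations above also give a rough idea of how long a typical recording is in each domain. The following is a small sketch that converts each hh:mm:ss duration to seconds and divides by the number of audio segments; `SPEECH_DOMAINS` and `to_seconds` are hypothetical names used only for this illustration.

```python
# (audio segments, duration as hh:mm:ss) per domain, copied from the
# speech corpus details above.
SPEECH_DOMAINS = {
    "Contemporary Text (News)": (448, "27:07:41"),
    "Creative Text": (446, "19:29:15"),
    "Sentence": (11168, "08:58:33"),
    "Date Format": (887, "00:27:53"),
    "Command and Control Words": (13274, "07:49:16"),
    "Person Name": (8949, "10:28:40"),
    "Place Name": (4473, "03:17:02"),
    "Most Frequent Word - Part": (8889, "05:21:56"),
    "Most Frequent Word - Full Set": (3988, "02:52:44"),
    "Phonetically Balanced": (13939, "08:56:04"),
    "Form and Function - Word": (9769, "06:24:07"),
}


def to_seconds(hms: str) -> int:
    """Convert an 'hh:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


total_segments = sum(segments for segments, _ in SPEECH_DOMAINS.values())

# Average length of one audio segment in each domain, in seconds.
mean_seconds = {
    domain: to_seconds(duration) / segments
    for domain, (segments, duration) in SPEECH_DOMAINS.items()
}

for domain, mean in mean_seconds.items():
    print(f"{domain:<32}{mean:>10.2f} s per segment")
print(f"{'Total segments':<32}{total_segments:>10,}")
```

The per-domain segment counts add up to the 76,230 audio segments given in the summary line. The running-text domains (news and creative text) have few but long segments, while the word-list domains consist of many short ones.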

