A Gold Standard Dogri Raw Text Corpus

8,01,771 Words | 183 Titles | XML format | 05 Text Domains

Dogri is an Indo-Aryan language spoken by about five million people in India and Pakistan, chiefly in the Jammu region of Jammu and Kashmir and in Himachal Pradesh, and also in northern Punjab and other parts of Jammu and Kashmir. Dogri was originally written in the Dogri script, which is very close to the Takri script. The language is now more commonly written in Devanagari in India, and in the Nastaʿliq form of Perso-Arabic in Pakistan and Pakistani-administered Kashmir. Dogri has several varieties, all with greater than 80% lexical similarity (within Jammu and Kashmir). Before it gained language status, the Census of India classified Dogri as one of the many varieties of Punjabi, such as Majhi or Doabi.

The Dogri text corpus was collected from various libraries in Jammu and Kashmir, mostly in Jammu. Most of the text was taken from the library of the Department of Dogri, Jammu University; the Jammu University Library; the J&K Academy of Arts, Culture and Languages; and Dogri Sansatha-Jammu.

The available Text Corpus details:

Domain | Words | Percentage of Total Corpus
Aesthetics | 5,94,609 | 74.16 %
Commerce | 1,350 | 0.17 %
Mass Media | 1,56,756 | 19.55 %
Science and Technology | 2,730 | 0.34 %
Social Sciences | 46,326 | 5.78 %

A detailed explanation of the Dogri Text Corpus will be available in the Dogri Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary & Sunil Kumar. 2019. A Gold Standard Dogri Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10...
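
The word counts in the table above use Indian digit grouping (8,01,771 is 801,771), which generic number parsers may not expect. As a minimal sketch, the Python below strips the separators and recomputes each domain's share from the listed figures. The names `DOMAIN_WORDS`, `parse_grouped` and `domain_shares` are illustrative assumptions and are not part of any LDC-IL tooling.

```python
# Word counts per domain, copied from the listing (Indian digit grouping).
DOMAIN_WORDS = {
    "Aesthetics": "5,94,609",
    "Commerce": "1,350",
    "Mass Media": "1,56,756",
    "Science and Technology": "2,730",
    "Social Sciences": "46,326",
}


def parse_grouped(number: str) -> int:
    """Convert a comma-grouped count such as '8,01,771' to an int."""
    return int(number.replace(",", ""))


def domain_shares(counts: dict[str, str]) -> dict[str, float]:
    """Return each domain's percentage of the total, rounded to 2 places."""
    words = {name: parse_grouped(value) for name, value in counts.items()}
    total = sum(words.values())
    return {name: round(100 * n / total, 2) for name, n in words.items()}


total_words = sum(parse_grouped(v) for v in DOMAIN_WORDS.values())
shares = domain_shares(DOMAIN_WORDS)
```

With the figures as listed, `total_words` comes to 801771, which agrees with the headline count of 8,01,771 words, and the recomputed shares agree with the percentages in the listing.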


Dogri Raw Speech Corpus

17:10:26 Hours | 11 GB speech data | 61 Speakers | 12,036 Audio segments | 48 kHz | 16 bit wav

Dogri, the language of the Dogras, belongs to the Indo-Aryan group and is the first major language of the multilingual Jammu region of the state of Jammu & Kashmir. It derives its name from ‘Duggar’, the ancient name of this region. Dogri is a morphologically rich language with a predominant Subject-Object-Verb (SOV) word order and, like many Indian languages, it allows the constituents to be rearranged. Dogri had its own script, “Dogare Akkhar” or “Dogare”, based on the Takri script, which is closely related to the Sharada script used for the Kashmiri language. This script was the official script during the reign of Maharaja Ranbir Singh (1857-1885 AD). After independence, the state government constituted a committee on 29th October, 1953, headed by Sh. Girdhari Lal Dogra. The committee presented a report, and the state government accordingly decided to adopt Devanagari as well as the Persian script for Dogri; this was incorporated in the State Constitution in 1957.

The LDC-IL speech data was collected in Jammu from speakers of both genders and of different age groups. The LDC-IL Dogri Speech data set consists of several types of datasets, made up of words, sentences, running texts and date formats. Each speaker recorded items from these datasets, selected at random from a master dataset.

The available Speech Corpus details:

Total Speakers: 61 (30 Female and 31 Male)

Domain | Audio Segments | Duration (h:mm:ss)
Contemporary Text (News) | 60 | 4:27:51
Creative Text | 61 | 2:51:42
Sentence | 1527 | 1:24:48
Date Format | 122 | 0:14:07
Command and Control Words | 1830 | 1:24:31
Person Name | 1222 | 1:23:41
Place Name | 609 | 0:29:10
Most Frequent Word - Part | 1831 | 1:18:06
Most Frequent Word - Full Set | 2000 | 1:16:27
Phonetically Balanced | 2050 | 1:50:38
Form and Function - Word | 724 | 0:29:25

A detailed explanation of the Dogri Speech Corpus will be available in the Dogri Raw Speech Documentation.

For any research-based citations, please use the following citations:

Narayan Kumar Choudhary, Sunil Kumar Choudhary, Rajesha N., Manasa G. 2021. Dogri Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
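
The per-domain segment counts and durations listed above can be summed to cross-check the headline figures. The Python below is a minimal sketch of that arithmetic. The rows are copied from the listing, and the names `SPEECH_DOMAINS`, `to_seconds` and `to_hms` are illustrative assumptions rather than part of the corpus distribution.

```python
# (domain, audio segments, duration as h:mm:ss), copied from the listing.
SPEECH_DOMAINS = [
    ("Contemporary Text (News)", 60, "4:27:51"),
    ("Creative Text", 61, "2:51:42"),
    ("Sentence", 1527, "1:24:48"),
    ("Date Format", 122, "0:14:07"),
    ("Command and Control Words", 1830, "1:24:31"),
    ("Person Name", 1222, "1:23:41"),
    ("Place Name", 609, "0:29:10"),
    ("Most Frequent Word - Part", 1831, "1:18:06"),
    ("Most Frequent Word - Full Set", 2000, "1:16:27"),
    ("Phonetically Balanced", 2050, "1:50:38"),
    ("Form and Function - Word", 724, "0:29:25"),
]


def to_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Format a number of seconds as 'h:mm:ss'."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(segments for _, segments, _ in SPEECH_DOMAINS)
total_duration = to_hms(sum(to_seconds(d) for _, _, d in SPEECH_DOMAINS))
```

With the rows as listed, the totals come to 12036 segments and 17:10:26, which agrees with the headline figures for the corpus.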

