A Gold Standard Dogri Raw Text Corpus
Overview
The Dogri Text Corpus is encoded in machine-readable form and stored in a standard format: the text is encoded in Unicode and stored as XML, with metadata embedded in the data. The corpus was created from contemporary text, partly typed and partly crawled. The LDC-IL Dogri Text Corpus contains 8,02,709 words drawn from 183 different titles. The six major domains are Aesthetics, Commerce, Official Documents, Social Sciences, Mass Media and Science & Technology.
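As a rough illustration of working with Unicode text stored in XML with embedded metadata, the following Python sketch reads one document, pulls out a metadata field and counts whitespace-separated words. The element names (`document`, `metadata`, `domain`, `body`, `p`), the sample content and the helper functions are assumptions made for this example; the actual schema is described in the corpus documentation and may differ.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical sample file. The element names and the Devanagari sentence are
# illustrative only and are NOT taken from the LDC-IL corpus or its schema.
SAMPLE = """<document>
  <metadata>
    <title>नमूना</title>
    <domain>Aesthetics</domain>
  </metadata>
  <body>
    <p>डोगरी भाशा जम्मू च बोली जंदी ऐ।</p>
  </body>
</document>
"""


def read_domain(xml_text: str) -> str:
    """Return the text of the (assumed) metadata/domain element, or ''."""
    root = ET.fromstring(xml_text)
    node = root.find("metadata/domain")
    return node.text.strip() if node is not None and node.text else ""


def count_words(xml_text: str, body_tag: str = "body") -> int:
    """Count whitespace-separated tokens inside the (assumed) body element."""
    root = ET.fromstring(xml_text)
    body = root.find(body_tag)
    if body is None:
        return 0
    # Normalise to NFC so equivalent Unicode sequences compare consistently.
    text = unicodedata.normalize("NFC", " ".join(body.itertext()))
    count = 0
    for raw in text.split():
        # Drop punctuation such as the danda; combining vowel signs are kept
        # because their Unicode categories are marks (Mn/Mc), not punctuation.
        token = "".join(
            ch for ch in raw if not unicodedata.category(ch).startswith("P")
        )
        if token:
            count += 1
    return count


if __name__ == "__main__":
    print(read_domain(SAMPLE), count_words(SAMPLE))
```

Whitespace splitting is only one possible definition of a word, so counts produced this way need not match the figures reported for the corpus.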
Dogri is an Indo-Aryan language spoken by about five million people in India and Pakistan, particularly in the Jammu region of Jammu and Kashmir and in Himachal Pradesh, and also in northern Punjab and other parts of Jammu and Kashmir. Dogri was originally written in the Dogri script, which is very close to the Takri script. The language is now more commonly written in Devanagari in India, and in the Nastaʿliq form of Perso-Arabic in Pakistan and Pakistani-administered Kashmir.
The Dogri text corpus was collected from various libraries in Jammu and Kashmir, mostly in Jammu. The greater part of the text was taken from the library of the Department of Dogri, Jammu University; the Jammu University Library; the J&K Academy of Arts, Culture and Languages; and Dogri Sansatha-Jammu.
The Dogri text corpus can be broadly classified into literary and non-literary texts. A large amount of literary text is available in Dogri, but scientific texts are scarce, so LDC-IL attempts to develop a balanced Dogri text corpus. Data has been collected from books, magazines and newspapers, and has been verified to be true to the original texts.
A more detailed explanation of the Dogri Text Corpus will be available in the Dogri Raw Text Corpus Documentation.
For research-based citations, please cite the following:
- Ramamoorthy, L., Narayan Choudhary & Sunil Kumar. 2019. A Gold Standard Dogri Raw Text Corpus. Central Institute of Indian Languages, Mysore.
- Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.
- Authors: Ramamoorthy L., Narayan Choudhary, Sunil Kumar
- Corpus Type: Raw Corpus
- Catalogue Number: 1114
- ISBN: 978-81-7343-213-2
- Data Source: Typed+Cleaned
- Character Count: 4130584
- Word Count: 801771
- Release Date: 04/04/2019
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.