A Gold Standard Dogri Raw Text Corpus

Catalogue Number: 1114

Dataset Description

The Dogri Text Corpus is encoded in machine-readable form and stored in a standard format: the text is encoded in Unicode and stored as XML, with metadata embedded in the data. The corpus was created from contemporary text, gathered by typing and by crawling. The LDC-IL Dogri Text Corpus contains 8,02,709 words drawn from 183 different titles. The six major domains are Aesthetics, Commerce, Official Documents, Social Sciences, Mass Media, and Science & Technology.
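As a rough illustration of working with Unicode text stored in XML with embedded metadata, the sketch below parses a small sample and counts words and characters. The element names (`doc`, `meta`, `domain`, `body`, `p`), the `summarize` helper and the sample sentences are all hypothetical placeholders; the actual LDC-IL schema is described in the corpus documentation and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample file content. The element names and the text are
# placeholders for illustration, not taken from the LDC-IL corpus.
SAMPLE = """<doc>
  <meta>
    <title>नमूना</title>
    <domain>Aesthetics</domain>
  </meta>
  <body>
    <p>डोगरी इक भाशा ऐ।</p>
    <p>एह् इक नमूना ऐ।</p>
  </body>
</doc>"""


def summarize(xml_text):
    """Return the domain plus simple word and character counts."""
    root = ET.fromstring(xml_text)
    # Read one metadata field; an empty string is returned if it is absent.
    domain = root.findtext("meta/domain", default="")
    # Collect the text of every paragraph element.
    paragraphs = [p.text or "" for p in root.iter("p")]
    # Whitespace tokenisation: punctuation such as the danda stays
    # attached to the preceding word.
    words = [w for para in paragraphs for w in para.split()]
    # len() counts Unicode code points, so Devanagari vowel signs and
    # the virama are counted as separate characters.
    characters = sum(len(para) for para in paragraphs)
    return {"domain": domain, "words": len(words), "characters": characters}


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Counts produced this way depend on the tokenisation rule and on whether spaces and punctuation are included, so they will not necessarily match the figures published for the corpus.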

Overview:

Dogri is an Indo-Aryan language spoken by about five million people in India and Pakistan, particularly in the Jammu region of Jammu and Kashmir and in Himachal Pradesh, as well as in northern Punjab and other parts of Jammu and Kashmir. Dogri was originally written in the Dogri script, which is very close to the Takri script. The language is now more commonly written in Devanagari in India, and in the Nastaʿliq form of Perso-Arabic in Pakistan and Pakistani-administered Kashmir.

The Dogri text corpus was collected from various libraries in Jammu and Kashmir, mostly in Jammu. The greater part of the text was taken from the library of the Department of Dogri, Jammu University; the Jammu University Library; the J&K Academy of Arts, Culture and Languages; and Dogri Sansatha, Jammu.

The Dogri text corpus can be broadly classified into literary and non-literary texts. A large amount of literary text is available in Dogri, but scientific texts are scarcer, so LDC-IL attempts to develop a balanced Dogri text corpus. Data has been collected from books, magazines and newspapers, and has been verified to be true to the original texts.

A more detailed explanation of the Dogri Text Corpus will be available in the Dogri Raw Text Corpus Documentation.

For any research based citations, please use the following citations:

Item specifics

  • Authors: Ramamoorthy L., Narayan Choudhary, Sunil Kumar
  • Corpus Type: Raw Corpus
  • Catalogue Number: 1114
  • ISBN: 978-81-7343-213-2
  • Data Source: Typed+Cleaned
  • Character Count: 4130584
  • Word Count: 801771
  • Release Date: 04/04/2019
  • Terms and Conditions: General instructions for use of the resources provided by LDC-IL (Commercial User, Non-Commercial User).
