A Gold Standard Dogri Raw Text Corpus

Owner Central Institute of Indian Languages

Catalogue Number: 1114

Stock In Stock

Overview

8,01,771 Words | 183 Titles

Tags: Dogri Raw Text Corpus

Dataset Description

8,01,771 Words | 183 Titles | XML format | 5 Text Domains

Dogri is an Indo-Aryan language spoken by about five million people in India and Pakistan, chiefly in the Jammu region of Jammu and Kashmir and in Himachal Pradesh, and also in northern Punjab and other parts of Jammu and Kashmir. Dogri was originally written in the Dogri script, which is very close to the Takri script. The language is now more commonly written in Devanagari in India, and in the Nastaʿliq form of Perso-Arabic in Pakistan and Pakistani-administered Kashmir. Dogri has several varieties, all with greater than 80% lexical similarity (within Jammu and Kashmir). Before it gained language status, the Census of India classified Dogri as one of the many varieties of Punjabi, such as Majhi or Doabi.

The Dogri text corpus was collected from various libraries in Jammu and Kashmir, mostly in Jammu. Most of the text was taken from the library of the Department of Dogri, Jammu University; the Jammu University Library; the J&K Academy of Arts, Culture and Languages; and Dogri Sansatha-Jammu.

Details of the available text corpus:

Domains	Words	Percentage of Total Corpus
Aesthetics	5,94,609	74.16 %
Commerce	1,350	0.17 %
Mass Media	1,56,756	19.55 %
Science and Technology	2,730	0.34 %
Social Sciences	46,326	5.78 %
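
The percentages in the table can be re-derived from the word counts. The snippet below is a minimal sketch for checking that arithmetic, not part of the corpus tooling: the names `domain_words`, `parse_indian_number`, and `domain_shares` are hypothetical and introduced only for illustration, and the counts are copied by hand from the table above.

```python
# Word counts per domain, copied from the table above.
domain_words = {
    "Aesthetics": 594_609,
    "Commerce": 1_350,
    "Mass Media": 156_756,
    "Science and Technology": 2_730,
    "Social Sciences": 46_326,
}


def parse_indian_number(text: str) -> int:
    """Parse a number written with Indian digit grouping, e.g. '8,01,771'."""
    return int(text.replace(",", ""))


def domain_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each domain's share of the total, as a percentage rounded to 2 places."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


total_words = sum(domain_words.values())
shares = domain_shares(domain_words)

if __name__ == "__main__":
    # The domain counts should add up to the headline word count.
    print(total_words == parse_indian_number("8,01,771"))
    for name, pct in shares.items():
        print(f"{name}\t{domain_words[name]}\t{pct:.2f} %")
```

Run as a script, this prints whether the five domain counts sum to the headline figure of 8,01,771 words, followed by one recomputed row per domain for comparison with the table.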

A detailed explanation of the Dogri Text Corpus will be available in the Dogri Raw Text Corpus Documentation.

For research use, please cite the corpus as follows:

Ramamoorthy, L., Narayan Choudhary & Sunil Kumar. 2019. A Gold Standard Dogri Raw Text Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.

Item specifics

Authors Ramamoorthy L., Narayan Choudhary, Sunil Kumar
Corpus Type Raw Corpus
Catalogue Number 1114
ISBN 978-81-7343-213-2
Data Source Typed+Cleaned
Character Count 4130584
Word Count 801771
Release Date 04-Apr-2019
Terms and Conditions General instructions for use of the resources provided by LDC-IL.
