A Gold Standard Kashmiri Raw Text Corpus

Owner: Central Institute of Indian Languages

Catalogue Number: 1131

Stock: In Stock

Overview

4,66,054 Words | 108 Titles | XML format | 2 domains

Tags: Kashmiri Raw Text Corpus

Dataset Description

4,66,054 Words | 108 Titles | XML format | 2 domains

Kashmiri is one of the 22 scheduled languages of India, listed in the Eighth Schedule of the Constitution of India. It belongs to the Dardic group of the Indo-Aryan language family. Like other Indo-Aryan languages, Kashmiri comprises many dialects. Kashmiri was traditionally written in the Sharada script after the 8th century A.D. Over time, however, the Devanagari and Perso-Arabic scripts were adapted for writing the language.

Kashmiri text can be broadly classified into two types: literary and non-literary. LDC-IL tried to cover all the categories in its standard list. Some categories, such as Novel, Short Stories, Criticism and Literature, have a large number of books, while others, such as Epic, Letters, Administration, Botany, Physics, Chemistry, Zoology and Legislature, have very few.

The Kashmiri text has been typed in Unicode, using the InScript keyboard, and stored in XML files. Metadata is provided along with the data. The corpus has been developed from available contemporary text drawn from books, newspapers, and magazines. The Kashmiri Text Corpus in LDC-IL comprises 4,66,054 words (2646948 characters). The two domains represented are Aesthetics and Social Sciences.
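Because the files are Unicode XML, a rough word count can be taken without knowing the element names. The sketch below is only an illustration under stated assumptions: the directory name `kashmiri_corpus` is hypothetical, the corpus schema is not described here, and whitespace splitting is a stand-in for whatever tokenization LDC-IL used. It also counts text inside metadata elements, so its result need not match the published word count.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def count_words(root: ET.Element) -> int:
    """Count whitespace-separated tokens in all text under `root`.

    Schema-agnostic: every text node is included, metadata too.
    """
    return len(" ".join(root.itertext()).split())


def count_words_in_dir(corpus_dir: str) -> int:
    """Sum the token counts of every *.xml file in `corpus_dir`.

    `corpus_dir` is a hypothetical local path, not a documented layout.
    """
    return sum(
        count_words(ET.parse(path).getroot())
        for path in sorted(Path(corpus_dir).glob("*.xml"))
    )


# Tiny made-up document (placeholder element names and Latin tokens).
sample = "<doc><meta>title one</meta><text>alpha beta gamma</text></doc>"
sample_count = count_words(ET.fromstring(sample))
print(sample_count)
```

To exclude metadata, the function would need to be restricted to the element that holds the running text, which depends on the schema given in the corpus documentation.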

The available Text Corpus details:

Domains | Words | Percentage of Total Corpus
Aesthetics | 4,00,474 | 85.93 %
Social Sciences | 65,580 | 14.07 %
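The domain shares follow directly from the word counts. As a quick check, this minimal sketch recomputes them from the two figures listed for the domains; the variable and function names are illustrative only.

```python
# Word counts per domain, as listed for this corpus.
domain_words = {"Aesthetics": 400_474, "Social Sciences": 65_580}


def domain_shares(counts: dict) -> dict:
    """Return each domain's share of the total, in percent (2 decimals)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


total_words = sum(domain_words.values())  # 466054, the stated corpus size
shares = domain_shares(domain_words)

for name, pct in shares.items():
    print(f"{name}: {pct:.2f} %")
# Aesthetics: 85.93 %
# Social Sciences: 14.07 %
```

The two counts sum to 4,66,054, the stated corpus size, and the shares sum to 100 %.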

A detailed explanation of the Kashmiri Text Corpus will be available in the Kashmiri Raw Text Corpus Documentation.

When citing this corpus in research, please use the following references:

Ramamoorthy, L., Narayan Choudhary & Shahid Mushtaq Bhat. 2019. A Gold Standard Kashmiri Raw Text Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.

Item specifics

Authors: Ramamoorthy L., Narayan Choudhary, Shahid Mushtaq Bhat
Corpus Type: Raw Corpus
Catalogue Number: 1131
ISBN: 978-81-7343-230-9
Data Source: Typed+Cleaned
Character Count: 2646948
Word Count: 466054
Release Date: 04-Apr-2019
Terms and Conditions: General instructions for use of the resources provided by LDC-IL.
