A Gold Standard Kashmiri Raw Text Corpus

Catalogue Number: 1131

466,054 Words | 108 Titles | XML format | 2 domains

Dataset Description

Kashmiri is one of the 22 scheduled languages of India and is part of the Eighth Schedule in the constitution of Jammu and Kashmir. It belongs to the Dardic group of the Indo-Aryan language family. Like other Indo-Aryan languages, Kashmiri comprises many dialects. It was traditionally written in the Sharda script after the 8th century A.D.; over time, however, the Devanagari and Perso-Arabic scripts were adapted for writing it. Kashmiri text can be broadly classified into two types: literary and non-literary. LDC-IL tried to cover all the categories in the standard list. Some categories, such as Novel, Short Stories, Criticism and Literature, have a large number of books, while others, such as Epic, Letters, Administration, Botany, Physics, Chemistry, Zoology and Legislature, have very few.

The Kashmiri text has been typed in Unicode, using the InScript keyboard, and stored in XML files. Metadata is provided along with the data. The corpus has been developed from available contemporary text. The Kashmiri Text Corpus in LDC-IL comprises 466,054 words and 2,646,948 characters, drawn from books, newspapers, and magazines. The two major domains represented are Aesthetics and Social Sciences.
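
Since the corpus is distributed as Unicode XML files with metadata, word and character counts like those above could be reproduced along the following lines. This is only a minimal sketch: the element names (`doc`, `meta`, `body`, `p`) and the placeholder sample text are hypothetical and are not taken from the LDC-IL schema, and counting whitespace-separated tokens is an assumption about how words were counted.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample file content. The tag names are illustrative only;
# the real LDC-IL XML layout is described in the corpus documentation.
SAMPLE = """<doc>
  <meta><title>Sample</title></meta>
  <body><p>alpha beta</p><p>gamma</p></body>
</doc>"""


def count_words_and_chars(xml_text, body_tag="body"):
    """Return (word_count, char_count) for the text under `body_tag`.

    Words are whitespace-separated tokens; characters are counted
    without whitespace. Both conventions are assumptions.
    """
    root = ET.fromstring(xml_text)
    body = root.find(body_tag)
    if body is None:
        # Fall back to the whole document if the assumed tag is absent.
        body = root
    # Collect every text node below the chosen element.
    text = " ".join(t.strip() for t in body.itertext() if t.strip())
    words = text.split()
    chars = sum(len(w) for w in words)
    return len(words), chars


if __name__ == "__main__":
    print(count_words_and_chars(SAMPLE))
```

For a file on disk, `ET.parse(path).getroot()` could be used in place of `ET.fromstring`. Metadata is kept out of the count here by restricting the traversal to the assumed `body` element.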

The available Text Corpus details:

Domain | Percentage of Total
Aesthetics | 85.93 %
Social Sciences | 14.7 %

A detailed explanation of the Kashmiri Text Corpus will be available in the Kashmiri Raw Text Corpus Documentation.

For research-based citations, please use the following:

Item specifics

  • Authors: Ramamoorthy L., Narayan Choudhary, Shahid Mushtaq Bhat
  • Corpus Type: Raw Corpus
  • Catalogue Number: 1131
  • ISBN: 978-81-7343-230-9
  • Data Source: Typed+Cleaned
  • Character Count: 2646948
  • Word Count: 466054
  • Release Date: 04-Apr-2019
  • Terms and Conditions: General instructions for use of the resources provided by LDC-IL.
Commercial User
Non-Commercial User
LDC-IL Raw Text Corpora: An Overview
LDC-IL Raw Speech Corpora: An Overview
