Kashmiri is one of the 22 scheduled languages of India, listed in the Eighth
Schedule of the Constitution of India.
The text has been typed in Unicode in XML files using the InScript keyboard.
Metadata has also been provided along with the data. The corpus has
been developed from available contemporary text. The Kashmiri Text Corpus in
LDC-IL comprises 466,054 words and 2,646,948 characters, drawn
from books, newspapers and magazines. The two major domains represented are
Aesthetics and Social Sciences.
Kashmiri is one of the 22 scheduled languages of India, listed in the Eighth Schedule of the Constitution of India. It belongs to the Dardic group of the Indo-Aryan language family. Like other Indo-Aryan languages, Kashmiri comprises many dialects. Kashmiri was traditionally written in the Sharada script after the 8th century A.D. Over time, however, the Devanagari and Perso-Arabic scripts were adapted to write it. Kashmiri text can be broadly classified into two types: literary and non-literary. LDC-IL tried to cover all the categories in its standard list. Some categories, such as Novel, Short Stories, Criticism, and Literature, have a large number of books, while others, such as Epic, Letters, Administration, Botany, Physics, Chemistry, Zoology and Legislature, have very few.
A more detailed explanation of the Kashmiri Text Corpus will be available in the Kashmiri Raw Text Corpus Documentation.
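Because the corpus is distributed as Unicode text in XML files, word and character totals like those reported here can be approximated with a few lines of Python. The following is only a minimal sketch: the element names in the sample are invented, and the counting rules (every text node counts as corpus text, words are whitespace-separated tokens, characters are Unicode code points excluding whitespace) are assumptions, not the method LDC-IL used, so the results need not match the published figures.

```python
import xml.etree.ElementTree as ET


def count_words_and_chars(xml_string):
    """Return (word_count, character_count) for the text in one XML document.

    Assumptions (not taken from the LDC-IL documentation):
    - every text node counts as corpus text, whatever the element names are;
    - words are whitespace-separated tokens;
    - characters are Unicode code points, with whitespace excluded.
    """
    root = ET.fromstring(xml_string)
    # itertext() yields the text of the root and all nested elements in order.
    text = " ".join(root.itertext())
    words = text.split()
    return len(words), sum(len(word) for word in words)


# Hypothetical document; the element names are invented for illustration.
sample = "<doc><title>placeholder title</title><body>placeholder body text</body></doc>"
word_count, char_count = count_words_and_chars(sample)
print(word_count, char_count)
```

Note that `len()` on a Python string counts code points, so in Perso-Arabic or Devanagari text each combining vowel sign is counted as its own character. If metadata is stored as text nodes in the same file, it would have to be excluded before counting.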
For research-based citations, please cite the following:
- Ramamoorthy, L., Narayan Choudhary & Shahid Mushtaq Bhat. 2019. A Gold Standard Kashmiri Raw Text Corpus. Central Institute of Indian Languages, Mysore.
- Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.
- Authors: Ramamoorthy L., Narayan Choudhary, Shahid Mushtaq Bhat
- Corpus Type: Raw Corpus
- Catalogue Number: 1131
- ISBN: 978-81-7343-230-9
- Data Source: Typed+Cleaned
- Character Count: 2646948
- Word Count: 466054
- Release Date: 04/04/2019
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.