A Gold Standard Kashmiri Raw Text Corpus

requests (15)

4,66,054 Words | 108 Titles | XML format | 2 domains

Kashmiri is one of the 22 scheduled languages of India, listed in the Eighth Schedule of the Constitution of India. It belongs to the Dardic group of the Indo-Aryan language family. Like other Indo-Aryan languages, Kashmiri comprises many dialects. It was traditionally written in the Sharda script after the 8th century A.D.; over time, the Devanagari and Perso-Arabic scripts were adopted for writing it.

Kashmiri text can be broadly classified into two types: literary and non-literary. LDC-IL tried to cover all the categories in the standard list. Some categories, such as Novel, Short Stories, Criticism and Literature, have a large number of books, while others, such as Epic, Letters, Administration, Botany, Physics, Chemistry, Zoology and Legislature, have very few.

The text was typed in Unicode using the InScript keyboard and stored in XML files. Metadata is provided along with the data. The corpus was developed from available contemporary text. The Kashmiri Text Corpus in LDC-IL comprises 466,054 words (2,646,948 characters) drawn from books, newspapers, and magazines. The two major domains represented are Aesthetics and Social Sciences.

The available Text Corpus details:

Domains | Words | Percentage of Total Corpus
Aesthetics | 4,00,474 | 85.93 %
Social Sciences | 65,580 | 14.07 %

A detailed explanation of the Kashmiri Text Corpus will be available in the Kashmiri Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary & Shahid Mushtaq Bhat. 2019. A Gold Standard Kashmiri Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10...

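The domain shares listed for the text corpus above can be recomputed from the per-domain word counts. The following is a minimal sketch in Python; the names `domain_words` and `domain_shares` are illustrative assumptions and not part of any LDC-IL tooling, and the counts are simply copied from the corpus details.

```python
# Word counts per domain, copied from the text corpus details above.
# The dictionary and function names are hypothetical, chosen for illustration.
domain_words = {"Aesthetics": 400_474, "Social Sciences": 65_580}


def domain_shares(counts):
    """Return each domain's percentage of the total, rounded to 2 decimals."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


total_words = sum(domain_words.values())
shares = domain_shares(domain_words)

print(total_words)  # 466054
for name, pct in shares.items():
    print(f"{name}: {pct:.2f} %")
```

Run as written, this reports a total of 466,054 words, with Aesthetics at 85.93 % and Social Sciences at 14.07 %.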

Kashmiri Raw Speech Corpus

requests (9)

28:10:07 Hours | 18 GB speech data | 150 Speakers | 16,380 Audio segments | 48 kHz | 16 bit wav

Kashmiri belongs to the Dardic group of the Indo-Aryan family. It is known by the names ‘Kashur’ and ‘Kashmiri’. It is primarily spoken in the Kashmir Valley and the Pir Panjal range of the Jammu region. Kashmiri has two types of dialects: regional and social. Besides the Kashmiri spoken in the valley itself, varieties spoken outside the valley are considered regional dialects of the language. These regional dialects are Kishtawari, Poguli and Rambani. Kashmiri also has three social dialects, known as Yamraz, Maraz and Kamraz.

The LDC-IL speech data was collected in the Kashmir Valley, from Pulwama, Srinagar, and Anantnag. It was collected from speakers of both genders across different age groups. The LDC-IL Kashmiri speech data consists of several types of datasets made up of words, sentences, running texts and date formats. Each speaker recorded datasets randomly selected from a master dataset.

The available Speech Corpus details:

Total Speakers: 150 (78 Female and 72 Male)

Domains | Audio Segments | Duration of Each Domain
Contemporary Text (News) | 147 | 3:56:57
Creative Text | 148 | 12:41:33
Sentence | 3704 | 2:40:24
Date Format | 281 | 0:10:36
Command and Control Words | 4288 | 3:04:32
Person Name | 2065 | 1:53:21
Place Name | 1468 | 1:04:37
Most Frequent Word - Part | 4279 | 2:38:07

A detailed explanation of the Kashmiri Speech Corpus will be available in the Kashmiri Speech Data Documentation.

For any research-based citations, please use the following citations:

Narayan Kumar Choudhary, Shahid Mushtaq Bhatt, Rajesha N. & Manasa G. 2021. Kashmiri Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
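
The per-domain figures listed for the speech corpus can be cross-checked against the headline totals (16,380 audio segments, 28:10:07 hours). The following is a minimal sketch in Python using only the standard library; the `rows` list and the helper names `parse_hms` and `format_hms` are illustrative assumptions, with the values copied from the corpus details above.

```python
from datetime import timedelta

# (domain, audio segments, duration as H:MM:SS), copied from the corpus details.
rows = [
    ("Contemporary Text (News)", 147, "3:56:57"),
    ("Creative Text", 148, "12:41:33"),
    ("Sentence", 3704, "2:40:24"),
    ("Date Format", 281, "0:10:36"),
    ("Command and Control Words", 4288, "3:04:32"),
    ("Person Name", 2065, "1:53:21"),
    ("Place Name", 1468, "1:04:37"),
    ("Most Frequent Word - Part", 4279, "2:38:07"),
]


def parse_hms(text):
    """Convert an 'H:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_hms(delta):
    """Format a timedelta as 'H:MM:SS' without rolling hours into days."""
    total_seconds = int(delta.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(segments for _, segments, _ in rows)
total_duration = sum((parse_hms(d) for _, _, d in rows), timedelta())

print(total_segments, format_hms(total_duration))  # 16380 28:10:07
```

Both sums agree with the totals stated in the corpus summary line.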