A Gold Standard Nepali Raw Text Corpus

requests (5)

70,57,524 Words | 1,347 Titles | XML format | 6 domains

Nepali is one of the official languages of the states of West Bengal and Sikkim. It is one of the 22 scheduled languages of India. It is spoken in most of the North-Eastern states of India and in other states such as Delhi, Uttaranchal, Uttar Pradesh, Bihar and Jharkhand. Nepali is also an official language of Nepal. About a quarter of the population of Bhutan speaks Nepali.

The Nepali Text Corpus is encoded in machine-readable form and stored in a standard format. The main encoding used is Unicode, and the data is stored in XML format with embedded metadata. The corpus was created from contemporary text, both typed and crawled.

The available Text Corpus details:

Domains | Words | Percentage of Total Corpus
Aesthetics | 40,72,977 | 57.71 %
Commerce | 30,354 | 0.43 %
Mass Media | 22,71,064 | 32.18 %
Official Documents | 2,426 | 0.03 %
Science and Technology | 80,306 | 1.14 %
Social Sciences | 6,00,397 | 8.51 %

A detailed explanation of the Nepali Raw Text Corpus will be available in the Nepali Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Samar Sinha, Jeena Rai, Umesh Chamling Rai & Rupesh Rai. 2019. A Gold Standard Nepali Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10...
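The domain figures in the listing above use Indian digit grouping (for example 40,72,977), which is easy to misread when loading the numbers into a script. The following is a minimal sketch, not part of the corpus distribution, that parses the listed counts and recomputes the total and the per-domain shares. The names `domain_words` and `parse_count` are illustrative assumptions, and the counts are copied from the listing.

```python
# Word counts per domain, copied from the listing (Indian digit grouping).
domain_words = {
    "Aesthetics": "40,72,977",
    "Commerce": "30,354",
    "Mass Media": "22,71,064",
    "Official Documents": "2,426",
    "Science and Technology": "80,306",
    "Social Sciences": "6,00,397",
}


def parse_count(text):
    """Convert a comma-grouped count such as '40,72,977' to an int."""
    return int(text.replace(",", ""))


counts = {domain: parse_count(words) for domain, words in domain_words.items()}
total = sum(counts.values())

# Share of each domain in the whole corpus, as a percentage with two decimals.
shares = {domain: round(100 * n / total, 2) for domain, n in counts.items()}

print(total)
for domain, share in shares.items():
    print(f"{domain}: {share:.2f} %")
```

The recomputed total and shares agree with the headline word count and the percentage column of the listing.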

Nepali Raw Speech Corpus

requests (3)

87:14:44 Hours | 56.5GB | 350 Speakers | 48975 Audio Segments | 48 kHz | 16 bit wav

Nepali belongs to the Indo-Aryan language family. Nepali is the official language of Nepal and of the Indian states of West Bengal and Sikkim. It is spoken in the states of Uttaranchal, Assam, Arunachal Pradesh, Manipur, Mizoram and Bihar, as well as in other countries such as Myanmar and Bhutan. It is written in the Devanagari script.

The LDC-IL Nepali speech data was collected from the regions of Darjeeling, Assam and Dehradun, from speakers of both genders and different age groups. The data set consists of several types of material: word lists, sentences, running texts and date formats.

The available Speech Corpus details:

Total Speakers: 350 (187 Female and 163 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 343 | 14:33:19
Creative Text | 341 | 19:46:34
Sentence | 8,583 | 13:45:34
Date Format | 1,029 | 00:57:20
Command and Control Words | 10,308 | 08:44:19
Person Name | 6,878 | 09:15:04
Place Name | 3,398 | 03:20:06
Most Frequent Word - Part | 10,292 | 08:51:06
Most Frequent Word - Full Set | 2,994 | 03:41:39
Phonetically Balanced | 3,321 | 03:00:08
Form and Function - Word | 1,488 | 01:19:35

A detailed explanation of the Nepali Speech Corpus will be available in the Nepali Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Samar Sinha, Jeena Rai, Umesh Chamling Rai & Rupesh Rai. 2019. Nepali Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
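The per-domain durations in the listing above are given as HH:MM:SS strings, so checking them against the headline totals takes a small conversion step. The sketch below is illustrative only and is not part of the corpus distribution; the names `domains`, `to_seconds` and `to_hms` are assumptions, and the figures are copied from the listing.

```python
# (domain, audio segments, duration) rows copied from the listing.
domains = [
    ("Contemporary Text (News)", 343, "14:33:19"),
    ("Creative Text", 341, "19:46:34"),
    ("Sentence", 8583, "13:45:34"),
    ("Date Format", 1029, "00:57:20"),
    ("Command and Control Words", 10308, "08:44:19"),
    ("Person Name", 6878, "09:15:04"),
    ("Place Name", 3398, "03:20:06"),
    ("Most Frequent Word - Part", 10292, "08:51:06"),
    ("Most Frequent Word - Full Set", 2994, "03:41:39"),
    ("Phonetically Balanced", 3321, "03:00:08"),
    ("Form and Function - Word", 1488, "01:19:35"),
]


def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Format a number of seconds as 'HH:MM:SS' (hours may exceed 24)."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


total_segments = sum(segments for _, segments, _ in domains)
total_duration = to_hms(sum(to_seconds(duration) for _, _, duration in domains))

print(total_segments, total_duration)
```

Both sums agree with the headline figures of the listing (48975 audio segments and 87:14:44 hours).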
