Search - Tag - Nepali


A Gold Standard Nepali Raw Text Corpus

requests (12)

70,57,524 Words | 1,347 Titles | XML format | 6 domains

Nepali is one of the official languages of the states of West Bengal and Sikkim. It is one of the 22 scheduled languages of India. It is spoken in most of the North-Eastern states of India and in other states such as Delhi, Uttaranchal, Uttar Pradesh, Bihar and Jharkhand. Nepali is also an official language of Nepal. About a quarter of the population of Bhutan speaks Nepali.

The Nepali Text Corpus is encoded in a machine-readable form and stored in a standard format. The text is encoded in Unicode, stored in XML format and embedded with metadata information. The corpus has been created from contemporary text, both typed and crawled.

The available Text Corpus details:

Domains | Words | Percentage of Total Corpus
Aesthetics | 40,72,977 | 57.71 %
Commerce | 30,354 | 0.43 %
Mass Media | 22,71,064 | 32.18 %
Official Documents | 2,426 | 0.03 %
Science and Technology | 80,306 | 1.14 %
Social Sciences | 6,00,397 | 8.51 %

A detailed explanation of the Nepali Raw Text Corpus will be available in the Nepali Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Samar Sinha, Jeena Rai, Umesh Chamling Rai & Rupesh Rai. 2019. A Gold Standard Nepali Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10...
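The percentage column in the text corpus details follows directly from the per-domain word counts. The short Python sketch below recomputes those shares from the figures listed above; the variable names are illustrative only and are not part of any LDC-IL tooling.

```python
# Word counts per domain, copied from the listing above
# (Indian digit grouping such as 40,72,977 written as plain integers).
domain_words = {
    "Aesthetics": 4072977,
    "Commerce": 30354,
    "Mass Media": 2271064,
    "Official Documents": 2426,
    "Science and Technology": 80306,
    "Social Sciences": 600397,
}

# Total corpus size is the sum of the six domains.
total_words = sum(domain_words.values())

# Share of each domain as a percentage of the whole corpus, to two decimals.
domain_share = {
    name: round(100 * count / total_words, 2)
    for name, count in domain_words.items()
}

for name, share in domain_share.items():
    print(f"{name}: {domain_words[name]} words, {share:.2f} %")
print(f"Total: {total_words} words")
```

The computed total equals the 70,57,524 words quoted in the summary line, and the rounded shares agree with the percentages in the listing.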


Nepali Raw Speech Corpus

requests (10)

87:14:44 Hours | 56.5 GB | 350 Speakers | 48,975 Audio Segments | 48 kHz | 16-bit wav

Nepali belongs to the Indo-Aryan language family. It is an official language of Nepal and of the Indian states of West Bengal and Sikkim, and is spoken in the states of Uttaranchal, Assam, Arunachal Pradesh, Manipur, Mizoram and Bihar, as well as in other countries such as Myanmar and Bhutan. It is written in the Devanagari script.

The LDC-IL Nepali speech data was collected in the regions of Darjeeling, Assam and Dehradun, from speakers of both genders and different age groups. The LDC-IL Nepali Speech data set consists of different types of datasets made up of word lists, sentences, running texts and date formats.

The available Speech Corpus details:

Total Speakers: 350 (187 Female and 163 Male)

Domains | Audio Segments | Duration (hh:mm:ss)
Contemporary Text (News) | 343 | 14:33:19
Creative Text | 341 | 19:46:34
Sentence | 8,583 | 13:45:34
Date Format | 1,029 | 00:57:20
Command and Control Words | 10,308 | 08:44:19
Person Name | 6,878 | 09:15:04
Place Name | 3,398 | 03:20:06
Most Frequent Word - Part | 10,292 | 08:51:06
Most Frequent Word - Full Set | 2,994 | 03:41:39
Phonetically Balanced | 3,321 | 03:00:08
Form and Function - Word | 1,488 | 01:19:35

A detailed explanation of the Nepali Speech Corpus will be available in the Nepali Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Samar Sinha, Jeena Rai, Umesh Chamling Rai & Rupesh Rai. 2019. Nepali Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
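The headline figures of the speech corpus (48,975 audio segments and 87:14:44 hours) can be cross-checked against the per-domain rows listed above. The following is a minimal Python sketch of that check; the helper names `to_seconds` and `to_hms` are hypothetical and chosen only for this illustration.

```python
def to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to an hh:mm:ss string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# (audio segments, duration) per domain, copied from the listing above.
speech_domains = {
    "Contemporary Text (News)": (343, "14:33:19"),
    "Creative Text": (341, "19:46:34"),
    "Sentence": (8583, "13:45:34"),
    "Date Format": (1029, "00:57:20"),
    "Command and Control Words": (10308, "08:44:19"),
    "Person Name": (6878, "09:15:04"),
    "Place Name": (3398, "03:20:06"),
    "Most Frequent Word - Part": (10292, "08:51:06"),
    "Most Frequent Word - Full Set": (2994, "03:41:39"),
    "Phonetically Balanced": (3321, "03:00:08"),
    "Form and Function - Word": (1488, "01:19:35"),
}

# Sum segments and durations over all eleven domains.
total_segments = sum(segments for segments, _ in speech_domains.values())
total_duration = to_hms(
    sum(to_seconds(duration) for _, duration in speech_domains.values())
)

print(f"Total segments: {total_segments}")
print(f"Total duration: {total_duration}")
```

Both totals agree with the summary line of the listing, so the per-domain rows account for the full corpus.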


Nepali Sentence Aligned Speech Corpus

requests (5)

Dataset Description: 43:04:23 hours | 27.7 GB | 21,481 Audio Segments | 346 speakers

The annotated speech corpus gives a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Nepali Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. This dataset spans a duration of 43:04:23 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 187 female and 159 male native Nepali speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Nepali Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:

1. Umesh Chamling Rai, Rupesh Rai, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Nepali Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-98-6.

2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.

3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3