Dataset Description
70,57,524 Words | 1,347 Titles | XML format | 6 domains
Nepali is one of the official languages of the states of West Bengal and Sikkim, and one of the 22 scheduled languages of India. It is spoken in most of the North-Eastern states of India and in other states such as Delhi, Uttaranchal, Uttar Pradesh, Bihar and Jharkhand. Nepali is also an official language of Nepal, and about a quarter of the population of Bhutan speaks it. The Nepali Text Corpus is encoded in a machine-readable form and stored in a standard format: the text is encoded in Unicode and stored as XML, with metadata embedded in the data. The corpus was created from contemporary text, collected by typing and by crawling.
The available Text Corpus details:
Domains | Words | Percentage of Total Corpus
Aesthetics | 40,72,977 | 57.71 %
Commerce | 30,354 | 0.43 %
Mass Media | 22,71,064 | 32.18 %
Official Documents | 2,426 | 0.03 %
Science and Technology | 80,306 | 1.14 %
Social Sciences | 6,00,397 | 8.51 %
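The domain figures above can be cross-checked with a few lines of Python. This is a minimal sketch, not part of the LDC-IL distribution: the names `DOMAIN_WORDS`, `parse_count` and `domain_shares` are illustrative assumptions, and the counts are copied from the table with their Indian digit grouping (for example, 40,72,977 is 4,072,977).

```python
# Word counts per domain, copied from the table as strings
# so that the original Indian digit grouping is preserved.
DOMAIN_WORDS = {
    "Aesthetics": "40,72,977",
    "Commerce": "30,354",
    "Mass Media": "22,71,064",
    "Official Documents": "2,426",
    "Science and Technology": "80,306",
    "Social Sciences": "6,00,397",
}


def parse_count(text):
    """Convert a comma-grouped figure such as '40,72,977' to an int."""
    return int(text.replace(",", ""))


def domain_shares(counts):
    """Return (total words, {domain: percentage of total, rounded to 2 places})."""
    words = {name: parse_count(value) for name, value in counts.items()}
    total = sum(words.values())
    shares = {name: round(100 * n / total, 2) for name, n in words.items()}
    return total, shares


total, shares = domain_shares(DOMAIN_WORDS)
print(total)
for name, pct in shares.items():
    print(f"{name}: {pct:.2f} %")
```

The computed total should equal the word count listed for the corpus (7057524), and each rounded share should match the percentage column.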
For research citations, please use the following:
- Ramamoorthy, L., Narayan Choudhary, Samar Sinha, Jeena Rai, Umesh Chamling Rai & Rupesh Rai. 2019. A Gold Standard Nepali Text Corpus. Central Institute of Indian Languages, Mysore.
- Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.
Item specifics
- Authors: Ramamoorthy L., Narayan Choudhary, Umesh Chamling Rai, Rupesh Rai, Samar Sinha, Jeena Rai
- Corpus Type: Raw Corpus
- Catalogue Number: 1154
- ISBN: 978-81-7343-253-8
- Data Source: Typed+Cleaned
- Character Count: 46879154
- Word Count: 7057524
- Release Date: 04-Apr-2019
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.