A Gold Standard Nepali Raw Text Corpus

Owner Central Institute of Indian Languages

Catalogue Number: 1154

Stock In Stock

Tags: Nepali Raw Text Corpus

Dataset Description

70,57,524 Words | 1,347 Titles | XML format | 6 domains

Nepali is one of the official languages of the states of West Bengal and Sikkim, and one of the 22 scheduled languages of India. It is spoken in most of the North-Eastern states of India and in other states such as Delhi, Uttaranchal, Uttar Pradesh, Bihar and Jharkhand. Nepali is also an official language of Nepal, and about a quarter of the population of Bhutan speaks it.

The Nepali Text Corpus is encoded in machine-readable form and stored in a standard format: the text is encoded in Unicode, stored as XML, and embedded with metadata. The corpus was built from contemporary text, both typed and crawled.
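As a rough illustration of working with Unicode text stored in XML alongside metadata, the sketch below parses a small made-up record and counts whitespace-separated words. The element names (document, metadata, domain, body, p) and the sample sentence are assumptions for illustration only; the actual corpus schema is described in the Nepali Text Corpus Documentation and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record. The real LDC-IL element names may differ.
sample_xml = (
    "<document>"
    "<metadata><title>नमुना</title><domain>Mass Media</domain></metadata>"
    "<body><p>नेपाली भाषा नेपालको राष्ट्रभाषा हो।</p></body>"
    "</document>"
)

def read_record(xml_string, body_tag="body", domain_path="metadata/domain"):
    """Return (domain, words) for one record.

    body_tag and domain_path are assumed names, not the documented schema.
    """
    root = ET.fromstring(xml_string)
    domain = root.findtext(domain_path)
    body = root.find(body_tag)
    # itertext() walks every nested text node, so no inner tag names are needed.
    text = " ".join(body.itertext()) if body is not None else ""
    # Whitespace tokenisation is a simplification; the corpus word counts
    # may have been produced with a different definition of "word".
    return domain, text.split()

domain, words = read_record(sample_xml)
print(domain, len(words))
```

Counting by whitespace keeps the sketch independent of any tokeniser, but punctuation such as the danda stays attached to the preceding word.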

The available Text Corpus details:

Domains	Words	Percentage of Total Corpus
Aesthetics	40,72,977	57.71 %
Commerce	30,354	0.43 %
Mass Media	22,71,064	32.18 %
Official Documents	2,426	0.03 %
Science and Technology	80,306	1.14 %
Social Sciences	6,00,397	8.51 %
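The percentages in the table can be checked against the word counts with a few lines of Python. This is only a verification sketch: the helper names are illustrative, and the counts are copied from the table with the Indian digit grouping (for example 70,57,524) removed.

```python
def parse_indian_number(text):
    """Convert a count written with Indian digit grouping, e.g. '70,57,524', to an int."""
    return int(text.replace(",", ""))

# Word counts per domain, as listed in the table above.
domain_words = {
    "Aesthetics": parse_indian_number("40,72,977"),
    "Commerce": parse_indian_number("30,354"),
    "Mass Media": parse_indian_number("22,71,064"),
    "Official Documents": parse_indian_number("2,426"),
    "Science and Technology": parse_indian_number("80,306"),
    "Social Sciences": parse_indian_number("6,00,397"),
}

total_words = sum(domain_words.values())

# Share of each domain, rounded to two decimals as in the table.
shares = {
    domain: round(100 * words / total_words, 2)
    for domain, words in domain_words.items()
}

for domain, pct in shares.items():
    print(f"{domain}\t{pct:.2f} %")
print(f"Total\t{total_words}")
```

The domain counts sum to the stated total of 70,57,524 words, and each rounded share matches the percentage column.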

A detailed explanation of the Nepali Raw Text Corpus will be available in the Nepali Text Corpus Documentation.

For research-based citations, please cite the following:

Ramamoorthy, L., Narayan Choudhary, Samar Sinha, Jeena Rai, Umesh Chamling Rai & Rupesh Rai. 2019. A Gold Standard Nepali Text Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.

Item specifics

Authors Ramamoorthy L., Narayan Choudhary, Umesh Chamling Rai, Rupesh Rai, Samar Sinha, Jeena Rai
Corpus Type Raw Corpus
Catalogue Number 1154
ISBN 978-81-7343-253-8
Data Source Typed+Cleaned
Character Count 46879154
Word Count 7057524
Release Date 04-Apr-2019
Terms and Conditions General instructions for use of the resources provided by LDC-IL.
