A Gold Standard Nepali Raw Text Corpus

Catalogue Number: 1154

Dataset Description

70,57,524 Words | 1,347 Titles | XML format | 6 domains

Nepali is one of the official languages of the states of West Bengal and Sikkim, and one of the 22 scheduled languages of India. It is spoken in most of the north-eastern states of India and in other states such as Delhi, Uttaranchal, Uttar Pradesh, Bihar and Jharkhand. Nepali is also an official language of Nepal, and about a quarter of the population of Bhutan speaks it. The Nepali Text Corpus is encoded in machine-readable form and stored in a standard format: the text is encoded in Unicode, stored as XML, and embedded with metadata. The corpus was created from contemporary text that was either typed or crawled.

The available Text Corpus details:

Domains                | Words     | Percentage of Total Corpus
Aesthetics             | 40,72,977 | 57.71 %
Commerce               | 30,354    | 0.43 %
Mass Media             | 22,71,064 | 32.18 %
Official Documents     | 2,426     | 0.03 %
Science and Technology | 80,306    | 1.14 %
Social Sciences        | 6,00,397  | 8.51 %
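
The word counts in this listing use Indian digit grouping (for example, 70,57,524 is 7,057,524). As a minimal sketch, the following Python snippet shows how the per-domain percentages can be recomputed from the counts listed above. The function name parse_indian_number and the variable names are illustrative assumptions and are not part of the corpus distribution.

```python
def parse_indian_number(text: str) -> int:
    """Convert a digit string with Indian-style grouping (e.g. '70,57,524') to an int."""
    return int(text.replace(",", ""))


# Word counts per domain, copied as printed in the listing above.
domain_words = {
    "Aesthetics": "40,72,977",
    "Commerce": "30,354",
    "Mass Media": "22,71,064",
    "Official Documents": "2,426",
    "Science and Technology": "80,306",
    "Social Sciences": "6,00,397",
}

counts = {domain: parse_indian_number(words) for domain, words in domain_words.items()}
total = sum(counts.values())

# Share of each domain in the whole corpus, rounded to two decimals.
shares = {domain: round(100 * n / total, 2) for domain, n in counts.items()}

for domain, n in counts.items():
    print(f"{domain:<24}{n:>10,}{shares[domain]:>8.2f} %")
print(f"{'Total':<24}{total:>10,}")
```

The six domain counts sum to the corpus total of 70,57,524 words stated in the listing.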


A detailed explanation of the Nepali Raw Text Corpus will be available in the Nepali Text Corpus Documentation. 

For research-based citations, please cite the following:

  • Ramamoorthy, L., Narayan Choudhary, Samar Sinha, Jeena Rai, Umesh Chamling Rai & Rupesh Rai. 2019. A Gold Standard Nepali Text Corpus. Central Institute of Indian Languages, Mysore.
  • Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.

Item specifics

  • Authors: Ramamoorthy L., Narayan Choudhary, Umesh Chamling Rai, Rupesh Rai, Samar Sinha, Jeena Rai
  • Corpus Type: Raw Corpus
  • Catalogue Number: 1154
  • ISBN: 978-81-7343-253-8
  • Data Source: Typed+Cleaned
  • Character Count: 46879154
  • Word Count: 7057524
  • Release Date: 04-Apr-2019
  • Terms and Conditions: General instructions for use of the resources provided by LDC-IL.
Commercial User
Non-Commercial User
LDC-IL Raw Text Corpora: An Overview
LDC-IL Raw Speech Corpora: An Overview