A Gold Standard Assamese Raw Text Corpus


Catalogue Number: 1272

Dataset Description

1,01,27,030 Words | 1,084 Titles | XML format | 6 domains

Assamese, or Oxomiya, is the language spoken by the natives of the state of Assam in Northeast India, and it is the official language of Assam. It is also spoken in parts of Arunachal Pradesh, Nagaland, and other Northeast Indian states. Small pockets of Assamese speakers can be found in Bhutan and Bangladesh as well.

The LDC-IL Assamese Text Corpus was developed with attention to factors such as text quality, representativeness, retrievable format, corpus size, and authenticity. To collect text, LDC-IL follows a standard category list of domains and a predefined set of criteria. The Assamese texts can be broadly classified as literary and non-literary. A large amount of literary text is available in Assamese, but scientific text is scarcer, so LDC-IL aims to build a balanced Assamese text corpus. The data was collected from books, magazines, and newspapers, verified to be true to the original texts, and then warehoused. The corpus is encoded in machine-readable form and stored in a standard format: the text is encoded in Unicode and stored as XML, with metadata accompanying the data. The corpus was created from contemporary text, both typed and crawled.
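As a rough illustration of working with Unicode text stored in XML alongside metadata, the sketch below counts whitespace-separated words per domain. The element names (`corpus`, `doc`, `meta`, `title`, `domain`, `text`) and the sample content are assumptions made for this example only; they are not taken from the LDC-IL schema, which is described in the corpus documentation.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample data: the element names below are illustrative
# assumptions, not the actual LDC-IL file layout.
SAMPLE = """<corpus>
  <doc>
    <meta><title>Sample A</title><domain>Mass Media</domain></meta>
    <text>অসমীয়া ভাষা অসমৰ ভাষা</text>
  </doc>
  <doc>
    <meta><title>Sample B</title><domain>Social Sciences</domain></meta>
    <text>এটা উদাহৰণ</text>
  </doc>
</corpus>"""


def words_per_domain(xml_text: str) -> Counter:
    """Count whitespace-separated tokens in each document, grouped by domain."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for doc in root.iter("doc"):
        # Fall back to placeholder values if a field is missing.
        domain = doc.findtext("meta/domain", default="unknown")
        body = doc.findtext("text", default="")
        counts[domain] += len(body.split())
    return counts


if __name__ == "__main__":
    for domain, n in words_per_domain(SAMPLE).items():
        print(domain, n)
```

Splitting on whitespace is only a simple approximation of a word count; the counting method used for the published figures may differ.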


The available Text Corpus details:



Domain | Word Count
Mass Media
Official Document
Science and Technology
Social Sciences

A detailed explanation of the Assamese Text Corpus will be available in the Assamese Raw Text Corpus Documentation.

For research use, please cite the following:

  • Ramamoorthy L., Narayan Kumar Choudhary, Atreyee Sharma, Jahnobi Kalita, Samhita Bharadwaj, Taznin Hussain, Priyanshe Adhyapak, Syeda Mustafiza Tamim, Rajesha N., Manasa G. 2021. A Gold Standard Assamese Raw Text Corpus. Central Institute of Indian Languages, Mysore.
  • Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.

Item specifics

  • Authors: Ramamoorthy L., Narayan Kumar Choudhary, Atreyee Sharma, Ashmrita Gogol, Jahnobi Kalita, Samhita Bharadwaj, Taznin Hussain, Priyanshee Adhyapak, Mustafiza Tamim, Rajesha N., Manasa G.
  • Corpus Type: Raw Corpus
  • Catalogue Number: 1272
  • ISBN: 978-81-948885-4-3
  • Data Source: Typed+Cleaned
  • Character Count: 63950126
  • Word Count: 10127030
  • Release Date: 15-Jun-2021
  • Terms and Conditions: General instructions for use of the resources provided by LDC-IL.
Commercial User
Non-Commercial User
LDC-IL Raw Text Corpora: An Overview
LDC-IL Raw Speech Corpora: An Overview
