A Gold Standard Assamese Raw Text Corpus

Owner Central Institute of Indian Languages

Catalogue Number: 1272

Tags: Assamese Raw Text Corpus

Dataset Description

1,01,27,030 Words | 1,084 Titles | XML format | 6 domains

Assamese, or Oxomiya, is the language spoken by the natives of the state of Assam in Northeast India, and it is also the official language of Assam. It is spoken in some parts of Arunachal Pradesh, Nagaland, and other Northeast Indian states, and small pockets of Assamese speakers can also be found in Bhutan and Bangladesh. The LDC-IL Assamese Text Corpus was developed with attention to factors such as the quality of the text, representativeness, retrievable format, corpus size, and authenticity. For collecting text corpora, LDC-IL adopts a standard category list of domains and a predefined set of criteria. The Assamese texts can be broadly classified as literary and non-literary. A large amount of literary text is available in Assamese, but scientific texts are scarcer, so LDC-IL attempts to develop a balanced Assamese text corpus. Data has been collected from books, magazines, and newspapers, verified to be true to the original texts, and then warehoused. The corpus is encoded in machine-readable form and stored in a standard format: the text is encoded in Unicode and stored as XML, and the data is accompanied by metadata. The corpus has been created from contemporary text by typing and crawling.

The available Text Corpus details:

Domain	Domain Word Count	Percentage
Aesthetics	5233452	51.68%
Commerce	66924	0.66%
Mass Media	3354996	33.13%
Official Document	1298	0.01%
Science and Technology	372790	3.68%
Social Sciences	1097570	10.84%
Total	10127030	100.00%
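
The percentages in the table can be rechecked from the word counts alone. The following is a minimal sketch, not part of the LDC-IL distribution: the dictionary simply restates the table, and the helper name `domain_shares` is a hypothetical one chosen for illustration.

```python
# Domain word counts copied from the table above.
domain_words = {
    "Aesthetics": 5233452,
    "Commerce": 66924,
    "Mass Media": 3354996,
    "Official Document": 1298,
    "Science and Technology": 372790,
    "Social Sciences": 1097570,
}


def domain_shares(counts):
    """Return each domain's share of the total word count, in percent (2 decimals)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


total = sum(domain_words.values())
shares = domain_shares(domain_words)

# Print the table in the same tab-separated layout as the catalogue page.
for name, pct in shares.items():
    print(f"{name}\t{domain_words[name]}\t{pct:.2f}%")
print(f"Total\t{total}")
```

The per-domain counts add up to the stated total of 10127030 words, and the rounded shares match the Percentage column.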

A detailed explanation of the Assamese Text Corpus will be available in the Assamese Raw Text Corpus Documentation.

For research citations, please use the following:

Ramamoorthy L., Narayan Kumar Choudhary, Atreyee Sharma, Jahnobi Kalita, Samhita Bharadwaj, Taznin Hussain, Priyanshe Adhyapak, Syeda Mustafiza Tamim, Rajesha N., Manasa G. 2021. A Gold Standard Assamese Raw Text Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.

Item specifics

Authors Ramamoorthy L., Narayan Kumar Choudhary, Atreyee Sharma, Ashmrita Gogol, Jahnobi Kalita, Samhita Bharadwaj, Taznin Hussain, Priyanshee Adhyapak, Mustafiza Tamim, Rajesha N., Manasa G.
Corpus Type Raw Corpus
Catalogue Number 1272
ISBN 978-81-948885-4-3
Data Source Typed+Cleaned
Character Count 63950126
Word Count 10127030
Release Date 15-Jun-2021
Terms and Conditions General instructions for use of the resources provided by LDC-IL.