Dataset Description
1,01,27,030 Words | 1,084 Titles | XML format | 6 domains
Assamese, or Oxomiya, is the language spoken by the natives of the state of Assam in Northeast India, and it is also the official language of Assam. It is spoken in some parts of Arunachal Pradesh, Nagaland and other Northeast Indian states, and small pockets of Assamese speakers can also be found in Bhutan and Bangladesh.
The LDC-IL Assamese Text Corpus has been developed with attention to factors such as quality of the text, representativeness, retrievable format, size of corpus and authenticity. For collecting text corpora, LDC-IL adopts a standard category list of domains and a predefined set of criteria. The Assamese texts can be broadly classified as literary and non-literary. A large amount of literary text is available in Assamese, but scientific texts are scarcer, so LDC-IL attempts to develop a balanced Assamese text corpus. Data has been collected from books, magazines and newspapers, verified to be true to the original texts, and then warehoused. The corpus is encoded in machine-readable form and stored in a standard format: the main encoding used is Unicode, and the files are stored in XML. The data is stored together with its metadata. The corpus has been created from contemporary text, both typed and crawled.
Available Text Corpus details:
Domain | Domain Word Count | Percentage
Aesthetics | 5233452 | 51.68%
Commerce | 66924 | 0.66%
Mass Media | 3354996 | 33.13%
Official Document | 1298 | 0.01%
Science and Technology | 372790 | 3.68%
Social Sciences | 1097570 | 10.84%
Total | 10127030 | 100.00%
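The percentage column can be reproduced from the word counts alone. The following is a minimal sketch, not part of the LDC-IL distribution: the counts are copied from the table, while the names `DOMAIN_WORD_COUNTS` and `domain_shares` are hypothetical and introduced only for illustration.

```python
# Domain word counts copied from the table above.
DOMAIN_WORD_COUNTS = {
    "Aesthetics": 5233452,
    "Commerce": 66924,
    "Mass Media": 3354996,
    "Official Document": 1298,
    "Science and Technology": 372790,
    "Social Sciences": 1097570,
}


def domain_shares(counts):
    """Return each domain's share of the total word count,
    as a percentage rounded to two decimal places."""
    total = sum(counts.values())
    return {domain: round(100 * n / total, 2) for domain, n in counts.items()}


# The total should match the corpus word count stated on this page.
total_words = sum(DOMAIN_WORD_COUNTS.values())
shares = domain_shares(DOMAIN_WORD_COUNTS)

if __name__ == "__main__":
    print(f"Total | {total_words}")
    for domain, pct in shares.items():
        print(f"{domain} | {DOMAIN_WORD_COUNTS[domain]} | {pct:.2f}%")
```

The per-domain counts sum to the stated total of 10127030 words. The shares also show how strongly the corpus leans towards the Aesthetics and Mass Media domains, which may matter when sampling from it.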
A detailed explanation of the Assamese Text Corpus
will be available in the Assamese Raw Text Corpus Documentation.
For research citations, please use the following:
- Ramamoorthy L., Narayan Kumar Choudhary, Atreyee Sharma, Jahnobi Kalita, Samhita Bharadwaj, Taznin Hussain, Priyanshe Adhyapak, Syeda Mustafiza Tamim, Rajesha N., Manasa G. 2021. A Gold Standard Assamese Raw Text Corpus. Central Institute of Indian Languages, Mysore.
- Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.
Item specifics
- Authors: Ramamoorthy L., Narayan Kumar Choudhary, Atreyee Sharma, Ashmrita Gogol, Jahnobi Kalita, Samhita Bharadwaj, Taznin Hussain, Priyanshee Adhyapak, Mustafiza Tamim, Rajesha N., Manasa G.
- Corpus Type: Raw Corpus
- Catalogue Number: 1272
- ISBN: 978-81-948885-4-3
- Data Source: Typed+Cleaned
- Character Count: 63950126
- Word Count: 10127030
- Release Date: 15-Jun-2021
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.