Search - Tag - Assamese

A Gold Standard Assamese Raw Text Corpus

1,01,27,030 Words | 1,084 Titles | XML format | 6 domains

Assamese, or Oxomiya, is the language spoken by the natives of the state of Assam in Northeast India, and it is the official language of Assam. It is also spoken in parts of Arunachal Pradesh, Nagaland, and other Northeast Indian states, and small pockets of Assamese speakers are found in Bhutan and Bangladesh.

The LDC-IL Assamese Text Corpus was developed with attention to factors such as text quality, representativeness, retrievable format, corpus size, and authenticity. For collecting text, LDC-IL follows a standard category list of domains and a predefined set of criteria. The corpus can be broadly classified into literary and non-literary texts. A large amount of literary text is available in Assamese, but scientific text is scarcer, so LDC-IL attempts to develop a balanced Assamese text corpus. Data was collected from books, magazines, and newspapers, verified to be true to the original texts, and then warehoused. The corpus is encoded in machine-readable form using Unicode and stored in XML format, with the data accompanied by metadata information. It was created from contemporary text, both typed and crawled.

The available Text Corpus details:

Domain                    Domain Word Count    Percentage
Aesthetics                          5233452        51.68%
Commerce                              66924         0.66%
Mass Media                          3354996        33.13%
Official Document                      1298         0.01%
Science and Technology               372790         3.68%
Social Sciences                     1097570        10.84%
Total                              10127030       100.00%

A detailed explanation of the Assamese Text Corpus will be available in the Assamese Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy L., Narayan Kumar Choudhary, Atreyee Sharma, Jahnobi Kalita, Samhita Bharadwaj, Taznin Hussain, Priyanshe Adhyapak, Syeda Mustafiza Tamim, Rajesha N., Manasa G. 2021. A Gold Standard Assamese Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10...
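
The percentage column of the text corpus details can be recomputed from the domain word counts. The following is a minimal Python sketch of that arithmetic; the names DOMAIN_WORDS and domain_shares are illustrative only and are not part of any LDC-IL tooling.

```python
# Word counts per domain, copied from the text corpus details in the listing.
DOMAIN_WORDS = {
    "Aesthetics": 5_233_452,
    "Commerce": 66_924,
    "Mass Media": 3_354_996,
    "Official Document": 1_298,
    "Science and Technology": 372_790,
    "Social Sciences": 1_097_570,
}


def domain_shares(counts):
    """Return each domain's share of the total word count, in percent."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    print(f"Total words: {sum(DOMAIN_WORDS.values())}")
    for name, share in domain_shares(DOMAIN_WORDS).items():
        print(f"{name:<24}{share:6.2f}%")
```

The word counts sum to 10127030, the listed total (written 1,01,27,030 in Indian digit grouping), and Aesthetics and Mass Media together account for roughly 85% of the words.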

Assamese Raw Speech Corpus

54:21:12 Hours | 32.5 GB | 304 Speakers | 37,570 Audio Segments | 48 kHz | 16 bit wav

Assamese is the official language of Assam. It is widely spoken in the state of Assam and in some parts of Arunachal Pradesh and Nagaland. According to the 2011 census, Assamese is spoken by 15 million speakers. As a widely spoken language, Assamese shows several dialectal variations. The regional dialects can be broadly divided into two groups, the Eastern Group and the Western Group. LDC-IL divided the Assamese-speaking areas into four regions (Xiboxagoria, Central Assam, Kamrupi, and Goalparia) and collected speech data from speakers in each region. The LDC-IL Assamese Speech dataset consists of several types of data: word lists, sentences, running texts, and date formats.

The available Speech Corpus details:

Total Speakers: 304 (154 Female and 150 Male)

Domain                       Audio Segments    Duration per Domain
Contemporary Text (News)                304               17:23:25
Creative Text                           304               11:44:37
Sentence                               7593                5:55:29
Date Format                             599                0:33:59
Command and Control Words              9118                4:56:49
Person Name                            6081                5:38:07
Place Name                             3044                1:58:33
Phonetically Balanced-W               46567                3:41:45
Form and Function-Word-W              53960                2:28:28

A detailed explanation of the Assamese Speech Corpus will be available in the Assamese Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy L., Narayan Kumar Choudhary, Atreyee Sharma, Jahnobi Kalita, Samhita Bharadwaj, Plabita Bora, Priyanshee Adhyapak, Mustafiza Tamim, Rajesha N., Manasa G. 2021. Assamese Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
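
The per-domain durations of the speech corpus can be checked against the headline figure of 54:21:12 hours. The sketch below assumes each duration is read as an h:mm:ss value; that reading of the listing's table is an assumption, and the helper names to_seconds and to_hms are illustrative only.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Convert a number of seconds back to an 'h:mm:ss' string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Durations per domain as read from the listing (assumed h:mm:ss).
DOMAIN_DURATIONS = {
    "Contemporary Text (News)": "17:23:25",
    "Creative Text": "11:44:37",
    "Sentence": "5:55:29",
    "Date Format": "0:33:59",
    "Command and Control Words": "4:56:49",
    "Person Name": "5:38:07",
    "Place Name": "1:58:33",
    "Phonetically Balanced-W": "3:41:45",
    "Form and Function-Word-W": "2:28:28",
}

if __name__ == "__main__":
    total = sum(to_seconds(d) for d in DOMAIN_DURATIONS.values())
    print(to_hms(total))
```

Under this reading the nine durations add up to exactly 54:21:12, matching the total stated for the corpus.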

Assamese Sentence Aligned Speech Corpus

Dataset Description

30:18:16 hours | 19.5 GB | 21,716 Audio Segments | 304 speakers

The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Assamese Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Assamese script. The dataset spans a duration of 30:18:16 (hh:mm:ss) and consists of read speech covering continuous text, representative sentences, and date formats. The data comes from 154 female and 150 male native Assamese speakers across diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Assamese Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:

1. Syeda Mustafiza Tamim, Priyanshe Adhyapak, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Assamese Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-53-5.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3

Assamese Text to Speech Corpus

44:49:34 hours | 28.85 GB | 32,594 Audio Segments | 2 Speakers

The LDC-IL Assamese Text to Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer in Assamese script. The dataset spans a duration of 44:49:34 (hh:mm:ss) and consists of read speech recorded in a studio setup. The data comes from one female and one male native Assamese speaker. A comprehensive explanation of the dataset can be found in the Assamese Text to Speech Documentation.

For any research-based citations, please use the following citations:

Syeda Mustafiza Tamim, Prangshu Manjul, Stephen Fernandes, Nithin S., Roopashri M. R., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. Assamese Text to Speech Corpus. Central Institute of Indian Languages, Mysore. 978-93-48633-45-3.

Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0...
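
The headline figures for the two segmented speech datasets in this listing (total duration and number of audio segments) give a rough idea of how long a typical segment is. The sketch below is plain arithmetic on those published numbers, not a measurement of the audio itself; the names CORPORA and mean_segment_seconds are illustrative only.

```python
def to_seconds(hms):
    """Convert an 'hh:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (total duration, audio segments) as stated in the listing.
CORPORA = {
    "Sentence Aligned Speech": ("30:18:16", 21_716),
    "Text to Speech": ("44:49:34", 32_594),
}


def mean_segment_seconds(duration, segments):
    """Average segment length in seconds: total duration / segment count."""
    return to_seconds(duration) / segments


if __name__ == "__main__":
    for name, (duration, segments) in CORPORA.items():
        mean = mean_segment_seconds(duration, segments)
        print(f"{name}: {mean:.2f} s per segment")
```

Both datasets come out at about five seconds per segment on average, which is consistent with sentence-length read speech.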