A Gold Standard Chhattisgarhi Raw Text Corpus

Dataset Description: 14,74,496 Words | 51 Titles | XML format | Aesthetics and Mass Media Domain

Chhattisgarhi, spoken by approximately 17 million people, carries profound cultural and historical significance within the region of Chhattisgarh. The Chhattisgarhi Raw Text Corpus offers a window onto the colloquialisms, idioms, regional vocabulary, and grammar that are essential to building frameworks for linguistic processing. It is an extensive repository of Chhattisgarhi textual material, broadly classified into literary and non-literary texts.

Data has been collected from books, magazines, newspapers and websites, verified to be true to the original texts, and then archived. The corpus is encoded in machine-readable form and stored in a standard format: the text is encoded in Unicode and stored in XML, with metadata embedded in the data. The corpus has been created from contemporary text that was either typed or crawled.

The available Text Corpus details:

Domains        Words        Percentage of Total Corpus
Aesthetics     14,35,667    97.04 %
Mass Media     38,829       2.6 %

A detailed explanation of the Chhattisgarhi Text Corpus will be available in the Chhattisgarhi Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:
1. Ankita Tiwari, Satyaendra Kumar Awasthi, Narayan Kumar Choudhary. 2023. A Gold Standard Chhattisgarhi Raw Text Corpus. Central Institute of Indian Languages, Mysore.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 1-10...
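Because the text corpus is described as Unicode text stored in XML with embedded metadata, a per-file word count could be sketched roughly as below. This is only an illustration under stated assumptions: the element names (`doc`, `metadata`, `title`, `body`, `p`) and the sample words are hypothetical placeholders, since the actual schema is given only in the corpus documentation, and splitting on whitespace is an assumed tokenisation that may not match how the published word counts were produced.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the tag names are illustrative only and are
# NOT taken from the LDC-IL schema, which is not shown in the listing.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<doc>
  <metadata><title>sample</title></metadata>
  <body>
    <p>एक दो तीन</p>
    <p>चार पांच</p>
  </body>
</doc>"""


def count_words(xml_text: str, body_tag: str = "body") -> int:
    """Count whitespace-separated tokens inside the (assumed) body element."""
    # Parse from UTF-8 bytes so the encoding declaration is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    body = root.find(body_tag)
    if body is None:
        return 0
    # itertext() yields every text fragment under the body, skipping the tags;
    # metadata text is excluded because it lives outside the body element.
    return sum(len(fragment.split()) for fragment in body.itertext())


if __name__ == "__main__":
    print(count_words(SAMPLE))
```

Keeping the metadata outside the counted element is a design choice in this sketch: it avoids inflating the word count with titles or source information.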

Chhattisgarhi Raw Speech Corpus

Dataset Description: 138:09:27 Hours | 88.9 GB | 140 Speakers | 359 Audio Segments | 48 kHz | 16 bit wav

LDC-IL has taken a positive step in its approach towards the mother tongues spoken in India, reflecting a wider effort to support and promote linguistic variety in the nation. The collection of Chhattisgarhi speech data is a major part of this approach. Developing language technology for Indian mother tongues in this way will contribute to their overall enrichment and empowerment.

The Chhattisgarhi Raw Speech Corpus consists of recordings of native Chhattisgarhi speakers from various parts of the state of Chhattisgarh, and it represents a wide range of Chhattisgarhi varieties as spoken in different locations by diverse speakers. The speakers, drawn from various age groups, read prompt texts extracted from literary and news sources. Spontaneous speech has also been collected.

A detailed explanation of the Chhattisgarhi Raw Speech Corpus will be available in the Chhattisgarhi Raw Speech Data Documentation.

For any research-based citations, please use the following citations:
1. Satyaendra Kumar Awasthi, Ankita Tiwari, Narayan Kumar Choudhary. 2023. Chhattisgarhi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
2. Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3
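The listing gives the audio format as 48 kHz, 16 bit wav. A minimal sketch of how a downloaded file's header could be checked against those two values, using Python's standard `wave` module, is shown below. The function names are hypothetical, the channel count is not stated in the listing and so is not checked, and the helper that builds a silent in-memory file exists only so the sketch can run without any corpus data.

```python
import io
import wave

EXPECTED_RATE_HZ = 48_000      # "48 kHz" in the listing
EXPECTED_SAMPLE_WIDTH = 2      # "16 bit" means 2 bytes per sample


def matches_listing(wav_file) -> bool:
    """Return True if the WAV header reports 48 kHz, 16-bit audio.

    `wav_file` may be a path or a binary file-like object.
    """
    with wave.open(wav_file, "rb") as wav:
        return (wav.getframerate() == EXPECTED_RATE_HZ
                and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH)


def make_silent_wav(rate_hz: int, seconds: float = 0.1) -> io.BytesIO:
    """Build a tiny in-memory mono WAV of silence (stand-in for a real file)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)    # mono is an assumption, not from the listing
        wav.setsampwidth(2)    # 16-bit samples
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * int(rate_hz * seconds))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(matches_listing(make_silent_wav(48_000)))
    print(matches_listing(make_silent_wav(44_100)))
```

For a real file, `matches_listing` would be called with the file's path instead of the in-memory buffer.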