
A Gold Standard Odia Raw Text Corpus


15,88,287 Words | 206 Titles | XML format | 05 Text Domains

Odia (formerly Oriya) is a major Indo-Aryan language spoken in the states of Odisha, West Bengal, Jharkhand, Chhattisgarh, and Andhra Pradesh. It is the official language of Odisha and Jharkhand. Odia is the sixth language to be granted Classical Language status by the Govt. of India.

The LDC-IL Odia Raw Text Corpus was developed with attention to factors such as text quality, representativeness, retrievable format, corpus size, and authenticity. It is encoded in machine-readable form and stored in a standard format: all text is Unicode-encoded and stored in XML, with metadata embedded in the data. The corpus was built by typing in contemporary texts.

The Odia raw text can be broadly classified into literary and non-literary texts. A large body of literary text is available in Odia, but knowledge/scientific texts are scarcer, so LDC-IL attempted to develop a balanced raw text corpus of Odia. Data was collected from books and newspapers and verified to be true to the original texts.

Raw Text Corpus details:

Domains                   Words       Percentage of Total Corpus
Aesthetics                5,11,887    32.23 %
Commerce                  19,616       1.24 %
Mass Media                8,02,100    50.50 %
Science and Technology    31,589       1.99 %
Social Sciences           2,23,095    14.05 %

A detailed explanation of the Odia Text Corpus will be available in the Odia Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Santosh Kumar Mohanty, Raja Kumar Naik, Pramod Kumar Rout & Kshirod Kumar Das. 2019. A Gold Standard Odia Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10...
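The per-domain percentages in the text corpus listing can be re-derived from the word counts. The following is a minimal sketch in Python, not part of the LDC-IL distribution: the variable and function names are illustrative, and the counts are copied from the listing with the Indian digit grouping removed (for example, 5,11,887 becomes 511887).

```python
# Word counts per domain, copied from the listing above
# (Indian digit grouping removed; names here are illustrative only).
domain_words = {
    "Aesthetics": 511887,
    "Commerce": 19616,
    "Mass Media": 802100,
    "Science and Technology": 31589,
    "Social Sciences": 223095,
}

# The domain counts should add up to the headline figure of 15,88,287 words.
total_words = sum(domain_words.values())


def share(words: int, total: int) -> float:
    """Return a domain's share of the corpus as a percentage, rounded to 2 places."""
    return round(100 * words / total, 2)


shares = {domain: share(words, total_words) for domain, words in domain_words.items()}

for domain, pct in shares.items():
    print(f"{domain:<24}{domain_words[domain]:>10}{pct:>8.2f} %")
print(f"{'Total':<24}{total_words:>10}")
```

Under these assumptions the domain counts sum to the headline word count, and each rounded share agrees with the percentage given in the listing.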


Odia Raw Speech Corpus


138:06:18 hours | 89 GB | 474 Speakers | 73,418 Audio segments | 48 kHz | 16 bit wav.

Odia is an Indo-Aryan language mainly spoken in the state of Odisha and also in some neighbouring states such as West Bengal, Jharkhand, Chhattisgarh and Andhra Pradesh. It has been granted Classical Language status by the Govt. of India. The LDC-IL Odia speech data was collected in the Central and Northern parts of Odisha from speakers of both genders and different age groups. The data consists of several types of datasets, made up of word lists, sentences, running texts and date formats.

Speech Corpus details:

Total Speakers: 474 (239 Female and 235 Male)

Domains                        Audio Segments    Duration
Contemporary Text (News)       449               42:49:56
Creative Text                  450               19:43:50
Sentence                       11,248             8:22:57
Date Format                    900                1:27:49
Command and Control Words      13,499            14:18:49
Person Name                    8,998              5:01:40
Place Name                     4,496             13:22:45
Most Frequent Word - Part      8,994              9:40:04
Most Frequent Word - Full Set  10,989            10:21:04
Phonetically Balanced          10,438            10:05:10
Form and Function - Word       2,957              2:52:14

A detailed explanation of the Odia Speech Corpus will be available in the Odia Raw Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Raja Kumar Naik, Pramod Kumar Rout, Kshirod Kumar Das & Santosh Kumar Mohanty. 2021. Odia Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. "LDC-IL Raw Speech Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
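The per-domain segment counts and durations in the speech corpus listing can be cross-checked against the headline totals (73,418 audio segments, 138:06:18 hours). The sketch below is an illustration only, not an LDC-IL tool. It assumes each fused figure in the listing splits into a segment count followed by an H:MM:SS duration (for example, "11,2488:22:57" read as 11,248 segments and 8:22:57). The helper names `to_seconds` and `to_hms` are hypothetical.

```python
# (domain, audio segments, duration as H:MM:SS), as read from the listing.
# The split between count and duration is an assumption about the fused figures.
domains = [
    ("Contemporary Text (News)", 449, "42:49:56"),
    ("Creative Text", 450, "19:43:50"),
    ("Sentence", 11248, "8:22:57"),
    ("Date Format", 900, "1:27:49"),
    ("Command and Control Words", 13499, "14:18:49"),
    ("Person Name", 8998, "5:01:40"),
    ("Place Name", 4496, "13:22:45"),
    ("Most Frequent Word - Part", 8994, "9:40:04"),
    ("Most Frequent Word - Full Set", 10989, "10:21:04"),
    ("Phonetically Balanced", 10438, "10:05:10"),
    ("Form and Function - Word", 2957, "2:52:14"),
]


def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to an H:MM:SS string."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(count for _, count, _ in domains)
total_duration = to_hms(sum(to_seconds(duration) for _, _, duration in domains))

print(total_segments, total_duration)
```

With this reading, both sums reproduce the headline figures, which supports the way the fused table cells are split here.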