Search - Tag

A Gold Standard Hindi Raw Text Corpus

requests (18)

1,03,17,177 Words | 1,223 Titles | XML format | 4 domains

Hindi is a major Indo-Aryan language, a descendant of Sanskrit, spoken in central and northern India, in the states of Bihar, Chhattisgarh, Delhi, Haryana, Himachal Pradesh, Jharkhand, Madhya Pradesh, Rajasthan, Uttarakhand, and Uttar Pradesh.

The LDC-IL Hindi Text Corpus was developed with attention to factors such as quality of the text, representativeness, retrievable format, size of corpus, authenticity, etc. For collecting text corpora, LDC-IL adopts a standard category list of domains and a predefined set of criteria. The Hindi text corpus can be broadly classified into literary and non-literary texts. Data has been collected from books, magazines, and newspapers, verified to be true to the original texts, and then warehoused. The Hindi Text Corpus is encoded in a machine-readable form and stored in a standard format: the primary encoding is Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary text, both typed and crawled.

The available Text Corpus details:

Domains | Words | Percentage of Total Corpus
Aesthetics | 38,22,697 | 37.05 %
Mass Media | 50,12,327 | 48.58 %
Science and Technology | 5,49,143 | 5.32 %
Social Sciences | 9,33,010 | 9.04 %

A detailed explanation of the Hindi Text Corpus will be available in the Hindi Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Jitendra Kumar Singh, Richa, Anjali Sinha, Dheeraj Kumar Mishra, Arimardan Kumar Tripathi, Aditi Debsharma, Satyaendra Kumar Awasthi & Madhupriya Pathak. 2019. A Gold Standard Hindi Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 1-10...
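The word counts in this entry use Indian digit grouping (lakh and crore), so "1,03,17,177" is 10,317,177. As a rough sanity check, the sketch below parses the per-domain figures quoted in the listing, sums them, and recomputes each domain's share of the corpus. The names `parse_indian_number` and `domain_shares` are hypothetical helpers introduced for illustration; they are not part of any LDC-IL tooling.

```python
# Word counts per domain, copied from the listing (Indian digit grouping).
DOMAIN_WORDS = {
    "Aesthetics": "38,22,697",
    "Mass Media": "50,12,327",
    "Science and Technology": "5,49,143",
    "Social Sciences": "9,33,010",
}
LISTED_TOTAL = "1,03,17,177"


def parse_indian_number(text: str) -> int:
    """Convert a lakh/crore-grouped string such as '1,03,17,177' to an int."""
    return int(text.replace(",", ""))


def domain_shares(words: dict) -> dict:
    """Return each domain's share of the summed word count, in percent."""
    counts = {name: parse_indian_number(value) for name, value in words.items()}
    total = sum(counts.values())
    return {name: 100 * count / total for name, count in counts.items()}


total_words = sum(parse_indian_number(v) for v in DOMAIN_WORDS.values())
shares = domain_shares(DOMAIN_WORDS)

if __name__ == "__main__":
    # Does the sum of the domains agree with the headline word count?
    print(total_words == parse_indian_number(LISTED_TOTAL))
    for name, share in shares.items():
        print(f"{name}: {share:.2f} %")
```

The percentages listed for the four domains are rounded to two decimals, so they need not add up to exactly 100 %.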

Hindi Raw Speech Corpus

requests (29)

121:00:06 Hours | 76.6 GB | 488 Speakers | 70686 Audio Segments | 48 kHz | 16-bit WAV

Hindi is a major Indo-Aryan language, a descendant of Sanskrit, spoken in central and northern India, in the states of Bihar, Chhattisgarh, Delhi, Haryana, Himachal Pradesh, Jharkhand, Madhya Pradesh, Rajasthan, Uttarakhand, and Uttar Pradesh.

The LDC-IL speech data is collected from the Awadhi, Bhojpuri, Magahi and Khariboli belts, from speakers of both genders and different age groups. The LDC-IL Hindi speech data totals 121:00:06 hours. The dataset consists of several types of data: word lists, sentences, running texts and date formats.

The available Speech Corpus details:

Total Speakers: 488 (234 Female and 254 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 457 | 37:22:29
Creative Text | 463 | 29:24:08
Sentence | 10173 | 8:41:17
Date Format | 764 | 0:46:56
Command and Control Words | 12284 | 8:34:51
Person Name | 8171 | 9:55:25
Place Name | 4085 | 3:14:44
Most Frequent Word - Part | 12315 | 8:09:10
Most Frequent Word - Full Set | 6994 | 4:30:14
Phonetically Balanced | 11986 | 8:23:43
Form and Function - Word | 2994 | 1:57:09

A detailed explanation of the Hindi Speech Corpus will be available in the Hindi Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Jitendra Kumar Singh, Richa, Anjali Sinha, Dheeraj Kumar Mishra, Arimardan Kumar Tripathi & Satyaendra Kumar Awasthi. 2019. Hindi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
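The headline figures of this entry (121:00:06 hours, 70686 audio segments) can be cross-checked against the per-domain rows. The following is a minimal sketch that sums the durations and segment counts quoted in the listing; `parse_duration` and `format_duration` are hypothetical helper names used only for this illustration.

```python
from datetime import timedelta

# (domain, audio segments, duration as h:mm:ss), copied from the listing.
SPEECH_DOMAINS = [
    ("Contemporary Text (News)", 457, "37:22:29"),
    ("Creative Text", 463, "29:24:08"),
    ("Sentence", 10173, "8:41:17"),
    ("Date Format", 764, "0:46:56"),
    ("Command and Control Words", 12284, "8:34:51"),
    ("Person Name", 8171, "9:55:25"),
    ("Place Name", 4085, "3:14:44"),
    ("Most Frequent Word - Part", 12315, "8:09:10"),
    ("Most Frequent Word - Full Set", 6994, "4:30:14"),
    ("Phonetically Balanced", 11986, "8:23:43"),
    ("Form and Function - Word", 2994, "1:57:09"),
]


def parse_duration(text: str) -> timedelta:
    """Parse an 'h:mm:ss' string; hours may exceed 24."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_duration(delta: timedelta) -> str:
    """Format a timedelta back to 'h:mm:ss' without rolling hours into days."""
    total_seconds = int(delta.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(segments for _, segments, _ in SPEECH_DOMAINS)
total_duration = sum(
    (parse_duration(duration) for _, _, duration in SPEECH_DOMAINS),
    timedelta(),
)

if __name__ == "__main__":
    print(total_segments)
    print(format_duration(total_duration))
```

Formatting through total seconds rather than `str(timedelta)` keeps the result in plain hours, which matches the way the listing reports durations longer than a day.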

Hindi Sentence Aligned Speech Corpus

requests (1)

Dataset Description: 72:34:52 hours | 45.9 GB | 42,275 Audio Segments | 473 speakers

The annotated speech corpus provides a wide range of linguistic information, especially useful for phonetic analysis. The LDC-IL Hindi Sentence Aligned Speech dataset comprises audio files in WAV format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. The dataset spans a duration of 72:34:52 (hh:mm:ss) and consists of read speech with continuous text, representative sentences, and date formats. The data is derived from 225 female and 248 male native Hindi speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Hindi Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:

1. Satyaendra Kumar Awasthi, Ankita Tiwari, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Hindi Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-28-3.

2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.

3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3