A Gold Standard Bengali Raw Text Corpus
42,37,440 Words | 1,460 Titles | XML format | 3 domains

Bengali is the official language of West Bengal and Tripura. It belongs to the Indo-Aryan language family and is influenced by Sanskrit. Greater use of Bengali has contributed to the growth of the language in terms of vocabulary and the number of styles and registers. Bengali is spoken over the whole of West Bengal, Tripura and Bangladesh and in some parts of Bihar, Odisha and Assam. Bengali refugees who settled in the Andamans after 1950 have also carried the language there.

The LDC-IL Bengali Text Corpus was developed according to criteria such as text quality, representativeness, retrievable format, corpus size and authenticity. To collect text, LDC-IL follows a standard category list of domains and a predefined set of criteria. The corpus can be broadly classified into literary and non-literary texts. Literary texts are abundant in Bengali while scientific texts are scarcer, so LDC-IL attempts to develop a balanced Bengali text corpus. Data has been collected from books, magazines and newspapers and verified against the original text.

The corpus is encoded in machine-readable form and stored in a standard format: the text is encoded in Unicode and stored as XML, with metadata embedded in the data. The corpus was created by typing in contemporary text.

The available Text Corpus details:

Domains | Words | Percentage of Total Corpus
Aesthetics | 40,37,854 | 95.29 %
Science and Technology | 76,231 | 1.80 %
Social Sciences | 1,23,355 | 2.91 %

A detailed explanation of the Bengali Text Corpus will be available in the Bengali Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Sonali Sutradhar, Arundhati Sengupta, Sankarshan Dutta, Priyanka Das & Saswati Karmakar. 2019. A Gold Standard Bengali Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 1-10.
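The domain percentages listed for the text corpus can be re-derived from the word counts. The following is a minimal sketch, not part of the LDC-IL tooling: the function names `parse_count` and `domain_shares` are illustrative, and the only data used are the three counts quoted in the corpus details. The counts use Indian digit grouping (for example 40,37,854), so the commas are stripped before conversion.

```python
# Word counts per domain, copied from the text corpus details.
# The strings keep the Indian digit grouping used on the page.
DOMAIN_WORDS = {
    "Aesthetics": "40,37,854",
    "Science and Technology": "76,231",
    "Social Sciences": "1,23,355",
}


def parse_count(text: str) -> int:
    """Convert a comma-grouped count (Indian or Western grouping) to an int."""
    return int(text.replace(",", ""))


def domain_shares(domain_words: dict) -> dict:
    """Return each domain's share of the total word count, in percent."""
    counts = {name: parse_count(value) for name, value in domain_words.items()}
    total = sum(counts.values())
    return {name: round(100 * count / total, 2) for name, count in counts.items()}


if __name__ == "__main__":
    total_words = sum(parse_count(value) for value in DOMAIN_WORDS.values())
    print("Total words:", total_words)
    for name, share in domain_shares(DOMAIN_WORDS).items():
        print(f"{name}: {share:.2f} %")
```

Run as a script, this prints a total of 4237440 words (42,37,440 in Indian grouping) and shares of 95.29 %, 1.80 % and 2.91 %, which agree with the figures listed for the corpus.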
Bengali Raw Speech Corpus
128:46:59 Hours | 81.2 GB | 476 Speakers | 73,470 Audio Segments | 48 kHz | 16 bit wav

Bengali is the official language of West Bengal and Tripura. It belongs to the Indo-Aryan language family and is influenced by Sanskrit. Greater use of Bengali has contributed to the growth of the language in terms of vocabulary and the number of styles and registers. Bengali is spoken over the whole of West Bengal, Tripura and Bangladesh and in some parts of Bihar, Odisha and Assam. Bengali refugees who settled in the Andamans after 1950 have also carried the language there.

LDC-IL Bengali speech data was collected from the regions of Standard Colloquial (Central Bengal) and Barendri (North Bengal). The data set comprises several types of recorded material: word lists, sentences, running texts and date formats.

The available Speech Corpus details:

Total Speakers: 476 (236 female and 240 male)

Domain | Audio Segments | Duration (h:mm:ss)
Contemporary Text (News) | 450 | 35:05:07
Creative Text | 448 | 20:16:13
Sentence | 11,239 | 16:05:22
Date Format | 414 | 0:26:48
Command and Control Words | 13,477 | 14:00:24
Person Name | 9,012 | 4:56:22
Place Name | 4,498 | 1:45:35
Most Frequent Word - Part | 13,525 | 13:33:14
Most Frequent Word - Full Set | 5,978 | 6:47:05
Phonetically Balanced | 9,489 | 10:23:08
Form and Function - Word | 4,940 | 5:27:41

A detailed explanation of the Bengali Speech Corpus will be available in the Bengali Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Sonali Sutradhar, Priyanka Biswas, Arundhati Sengupta, Sankarshan Dutta & Priyanka Das. 2019. Bengali Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
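The per-domain figures for the Bengali Raw Speech Corpus can be checked against the headline totals by summing the segment counts and the durations. The sketch below is illustrative only: the helper names (`hms_to_seconds`, `seconds_to_hms`, `corpus_totals`) are assumptions, and the data are copied from the domain list for this corpus.

```python
# (audio segments, duration as h:mm:ss) per domain, copied from the
# Bengali Raw Speech Corpus details.
SPEECH_DOMAINS = {
    "Contemporary Text (News)": (450, "35:05:07"),
    "Creative Text": (448, "20:16:13"),
    "Sentence": (11239, "16:05:22"),
    "Date Format": (414, "0:26:48"),
    "Command and Control Words": (13477, "14:00:24"),
    "Person Name": (9012, "4:56:22"),
    "Place Name": (4498, "1:45:35"),
    "Most Frequent Word - Part": (13525, "13:33:14"),
    "Most Frequent Word - Full Set": (5978, "6:47:05"),
    "Phonetically Balanced": (9489, "10:23:08"),
    "Form and Function - Word": (4940, "5:27:41"),
}


def hms_to_seconds(text: str) -> int:
    """Convert an 'h:mm:ss' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_hms(total_seconds: int) -> str:
    """Format a number of seconds as 'h:mm:ss' without wrapping at 24 hours."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def corpus_totals(domains: dict) -> tuple:
    """Return (total segments, total duration as 'h:mm:ss') over all domains."""
    total_segments = sum(segments for segments, _ in domains.values())
    total_seconds = sum(hms_to_seconds(duration) for _, duration in domains.values())
    return total_segments, seconds_to_hms(total_seconds)


if __name__ == "__main__":
    segments, duration = corpus_totals(SPEECH_DOMAINS)
    print(f"{segments} segments, {duration} total")
```

The totals come out to 73470 segments and 128:46:59, matching the summary line for the corpus. Plain integer seconds are used instead of a clock type so that durations longer than 24 hours are not wrapped.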
Bengali Sentence Aligned Speech Corpus
Dataset Description:

69:10:03 hours | 43.3 GB | 40,240 Audio Segments | 450 speakers

The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Bengali Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Bengali script. The dataset spans a duration of 69:10:03 (hh:mm:ss) and consists of read speech covering continuous text, representative sentences and date formats. The data is derived from 223 female and 227 male native Bengali speakers of diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Bengali Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:

1. Sonali Sutradhar, Poulami Das, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Bengali Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-48-1.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3
Indian English-Bengali Sentence Aligned Speech Corpus
Dataset Description:

09:21:08 hours | 5.53 GB | 5,676 Audio Segments | 52 speakers

The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Indian English-Bengali variant Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer in Roman script. The dataset spans a duration of 09:21:08 (hh:mm:ss) and consists of read speech covering continuous text, representative sentences and date formats. The data is derived from 26 female and 26 male native Bengali speakers of diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Indian English-Bengali variant Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:

1. Rejitha K. S., Poulami Das, Rajesha N., Manasa G., Srikanth D., Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Indian English-Bengali variant Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-43-6.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3
Indian English Raw Speech Corpus - Bengali Variant
25:47:11 Hours | 15.5 GB | 53 Speakers | 16,044 Audio Segments | 48 kHz | 16 bit wav

English developed from Anglo-Saxon, which was the prominent language of Britain in the Middle Ages, and colonists carried it to every corner of the world. English is the most visible legacy of the British in India: India was under the British Raj for almost two centuries, and English is part of the education system. Most Indian states use their own regional languages and do not share a common language, so English is used for inter-state communication.

LDC-IL has more than 25 hours of Indian English - Bengali Variant speech data. The data set comprises several types of recorded material: word lists, sentences, texts and date formats. Approximately 15 minutes of speech per speaker was taken from 27 female and 26 male speakers of different age groups whose mother tongue is Bengali. Each speaker recorded items randomly selected from a master dataset.

The available Speech Corpus details:

Total Speakers: 53 (27 female and 26 male)

Domain | Audio Segments | Duration (h:mm:ss)
Contemporary Text (News) | 52 | 6:03:15
Creative Text | 52 | 2:41:17
Sentence | 1300 | 1:29:35
Date Format | 104 | 0:08:56
Command and Control Words | 2882 | 3:09:13
Person Name | 1040 | 0:33:56
Place Name | 519 | 1:30:22
Most Frequent Word - Part | 1442 | 1:22:38
Most Frequent Word - Full Set | 5985 | 6:01:44
Phonetically Balanced | 1782 | 1:52:21
Form and Function - Word | 886 | 0:53:54

A detailed explanation of the Indian English Raw Speech Corpus - Bengali Variant will be available in the Indian English Raw Speech Corpus - Bengali Variant Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Kumar Choudhary, Arundhati Sengupta, Rejitha K. S., Rajesha N. & Manasa G. 2021. Indian English Raw Speech Corpus - Bengali Variant. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
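The raw speech corpora on this page are described as 48 kHz, 16 bit wav. A downloaded segment can be checked against that description with Python's standard `wave` module. This is a minimal sketch under stated assumptions: the function names are illustrative, the file name in the usage note is hypothetical, and the page does not state the channel count, so it is reported but not checked.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read basic header fields from an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        frame_rate = wav_file.getframerate()
        return {
            "sample_rate_hz": frame_rate,
            # getsampwidth() returns bytes per sample; convert to bits.
            "bits_per_sample": wav_file.getsampwidth() * 8,
            "channels": wav_file.getnchannels(),
            "duration_seconds": wav_file.getnframes() / frame_rate,
        }


def matches_corpus_format(info: dict,
                          sample_rate_hz: int = 48000,
                          bits_per_sample: int = 16) -> bool:
    """Check the sample rate and bit depth against the stated corpus format."""
    return (info["sample_rate_hz"] == sample_rate_hz
            and info["bits_per_sample"] == bits_per_sample)
```

For example, `matches_corpus_format(describe_wav("segment_0001.wav"))` would return `True` for a 48 kHz, 16 bit file, where `segment_0001.wav` stands in for any segment from the corpus. The `wave` module reads only uncompressed PCM data, so a file in another encoding would raise `wave.Error` rather than return a result.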