Search - Tag - Marathi


A Gold Standard Marathi Raw Text Corpus

requests (10)

21,57,109 Words | 678 Titles | XML format | 5 domains

Marathi is an Indo-Aryan language and the official language of the Indian state of Maharashtra. It is spoken primarily in Maharashtra and in parts of the neighboring states of Gujarat, Madhya Pradesh, Goa, and Karnataka (particularly the bordering districts of Belgaum, Bidar, Gulbarga, and Uttara Kannada), as well as in the union territories of Daman and Diu and Dadra and Nagar Haveli.

The LDC-IL Marathi Text Corpus was developed with attention to factors such as text quality, representativeness, retrievable format, corpus size, and authenticity. To collect text, LDC-IL follows a standard category list of domains and a predefined set of criteria. The Marathi texts can be broadly classified as literary and non-literary. A large amount of literary text is available in Marathi, while scientific text is scarcer, so LDC-IL attempts to build a balanced Marathi text corpus. The data was collected from books, magazines, and newspapers, verified to be true to the original texts, and then warehoused.

The corpus is encoded in machine-readable form and stored in a standard format: the text is encoded in Unicode, stored as XML, and embedded with metadata. It was created from contemporary text, both typed and crawled.

The available Text Corpus details:

Domain | Words | Percentage of Total Corpus
Aesthetics | 15,15,039 | 70.23 %
Commerce | 20,795 | 0.97 %
Mass Media | 3,63,120 | 16.83 %
Science and Technology | 55,902 | 2.59 %
Social Sciences | 2,02,253 | 9.38 %

A detailed explanation of the Marathi Text Corpus will be available in the Marathi Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Gajanan R Apine & Apurva P Betkekar. 2019. A Gold Standard Marathi Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 1-10.

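The domain breakdown of the text corpus can be cross-checked with a few lines of arithmetic. The following is only a minimal sketch in Python: the word counts are copied from the listing above, while the names `DOMAIN_WORDS`, `parse_count`, and `domain_shares` are hypothetical helpers introduced for illustration and are not part of any LDC-IL tooling.

```python
# Word counts per domain, copied from the listing (Indian digit grouping).
DOMAIN_WORDS = {
    "Aesthetics": "15,15,039",
    "Commerce": "20,795",
    "Mass Media": "3,63,120",
    "Science and Technology": "55,902",
    "Social Sciences": "2,02,253",
}


def parse_count(text: str) -> int:
    """Parse a count written with comma digit grouping (e.g. '15,15,039')."""
    return int(text.replace(",", ""))


def domain_shares(counts: dict) -> dict:
    """Return each domain's share of the total word count, in percent."""
    parsed = {name: parse_count(value) for name, value in counts.items()}
    total = sum(parsed.values())
    return {name: 100.0 * n / total for name, n in parsed.items()}


# Total word count across all five domains.
total_words = sum(parse_count(value) for value in DOMAIN_WORDS.values())
shares = domain_shares(DOMAIN_WORDS)

if __name__ == "__main__":
    print("Total words:", total_words)
    for name, pct in shares.items():
        print(f"{name}: {pct:.2f} %")
```

The per-domain counts add up to the 21,57,109 words stated in the summary line, and the computed shares agree with the listed percentages to within about 0.01 percentage points, which is consistent with rounding.
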

Marathi Raw Speech Corpus

requests (15)

89:17:25 Hours | 58 GB speech data | 307 Speakers | 58,544 Audio segments | 48 kHz | 16 bit wav

Marathi is an Indo-Aryan language that was already prevalent by the 9th century. Standard Marathi (Puneri) is the official language of the State of Maharashtra and is based on the dialects used by academics and the print media. Marathi is believed to have been influenced by Sanskrit. It is written in the Devanagari script, and its phoneme inventory is similar to that of many other Indo-Aryan languages.

The LDC-IL speech data was collected from the regions of Marathwada, Puneri, Vidarbha, and Goa, from speakers of both genders and different age groups. Each speaker recorded datasets that were randomly selected from a master dataset.

The available Speech Corpus details:

Total Speakers: 307 (156 Female and 151 Male)

Domain | Audio Segments | Duration (hh:mm:ss)
Contemporary Text (News) | 302 | 22:26:06
Creative Text | 302 | 13:37:34
Sentence | 7,555 | 6:49:58
Date Format | 604 | 0:39:57
Command and Control Words | 9,068 | 7:50:10
Person Name | 6,058 | 7:44:56
Place Name | 3,037 | 2:49:32
Most Frequent Word - Part | 9,104 | 7:22:57
Most Frequent Word - Full Set | 10,987 | 9:53:28
Phonetically Balanced | 4,609 | 4:10:47
Form and Function - Word | 6,918 | 5:52:00

A detailed explanation of the Marathi Speech Corpus will be available in the Marathi Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Gajanan R Apine & Apurva P Betkekar. 2019. Marathi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. "LDC-IL Raw Speech Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
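
The per-domain figures of the speech corpus can be checked against the headline totals by summing the segment counts and the hh:mm:ss durations. The sketch below is only an illustration in Python: the rows are copied from the listing above, and the names `SPEECH_DOMAINS`, `to_seconds`, and `to_hms` are hypothetical helpers, not part of any LDC-IL tooling.

```python
# (domain, audio segments, duration as h:mm:ss), copied from the listing.
SPEECH_DOMAINS = [
    ("Contemporary Text (News)", 302, "22:26:06"),
    ("Creative Text", 302, "13:37:34"),
    ("Sentence", 7555, "6:49:58"),
    ("Date Format", 604, "0:39:57"),
    ("Command and Control Words", 9068, "7:50:10"),
    ("Person Name", 6058, "7:44:56"),
    ("Place Name", 3037, "2:49:32"),
    ("Most Frequent Word - Part", 9104, "7:22:57"),
    ("Most Frequent Word - Full Set", 10987, "9:53:28"),
    ("Phonetically Balanced", 4609, "4:10:47"),
    ("Form and Function - Word", 6918, "5:52:00"),
]


def to_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to an 'h:mm:ss' string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Totals across all eleven domains.
total_segments = sum(count for _, count, _ in SPEECH_DOMAINS)
total_duration = to_hms(sum(to_seconds(d) for _, _, d in SPEECH_DOMAINS))

if __name__ == "__main__":
    print("Audio segments:", total_segments)
    print("Duration:", total_duration)
```

Summed this way, the rows give 58,544 audio segments and a duration of 89:17:25, matching the summary line of the entry.
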


Marathi Sentence Aligned Speech Corpus

requests (1)

Dataset Description: 41:34:04 hours | 26.7 GB | 23,234 Audio Segments | 302 speakers

The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Marathi Sentence Aligned Speech dataset comprises audio files in wav format, each accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. The dataset spans a duration of 41:34:04 (hh:mm:ss) and consists of read speech covering continuous text, representative sentences, and date formats. The data comes from 153 female and 149 male native Marathi speakers across diverse age groups and regions.

A comprehensive explanation of the dataset can be found in the Marathi Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:

1. Bhageshree K Khandale, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Marathi Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-92-4.

2. Rejitha K. S. and Narayan Kumar Choudhary (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.

3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3