Search - Tag - Maithili

A Gold Standard Maithili Raw Text Corpus Vol. II

requests (1)

8,11,680 Words | 54 Titles | XML format | 3 Domains | 21 Sub-categories

The Maithili Raw Text Corpus offers an unrivaled window into the colloquialisms, idioms, regional vocabulary, and grammar that are essential to building frameworks for linguistic processing. It is an extensive repository of Maithili textual material, broadly classified into literary and non-literary texts. Data has been collected from books and magazines, verified to be true to the original texts, and then warehoused. The corpus is encoded in machine-readable form and stored in a standard format: the major encoding is Unicode, the storage format is XML, and the data is embedded with metadata. The corpus has been created from contemporary text by typing and digitization. A detailed explanation of the Maithili Raw Text Corpus Vol. II will be available in the Maithili Text Corpus Documentation.

For any research-based citations, please use the following citations:

Shantanu Kumar, Ankita Tiwari, Rajesha N., Manasa G., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. A Gold Standard Maithili Raw Text Corpus Vol. II. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-01-9.

Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary. 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0...

Maithili Raw Speech Corpus

requests (18)

78:45:33 Hours | 49.2 GB | 306 Speakers | 45,198 Audio Segments | 48 kHz | 16 bit wav

Maithili is an Indo-Aryan language, a direct descendant of Sanskrit, spoken in the states of Bihar and Jharkhand and in parts of Nepal. It is one of the scheduled languages of India. The LDC-IL speech data is collected from the Sotipura, Bajjika, and Thethi geographic dialects, from speakers of both genders and of different age groups.

The available Speech Corpus details:

Total Speakers: 306 (150 Female and 156 Male)

Domain | Audio Segments | Duration (hh:mm:ss)
Contemporary Text (News) | 291 | 22:33:41
Creative Text | 294 | 15:34:55
Sentence | 7,451 | 07:08:48
Date Format | 585 | 00:31:41
Command and Control Words | 8,924 | 07:07:34
Person Name | 5,917 | 07:49:33
Place Name | 2,952 | 02:47:49
Most Frequent Word - Part | 8,699 | 06:56:24
Most Frequent Words - Full Set | 5,996 | 04:58:30
Phonetically Balanced Words | 3,040 | 02:26:27
Form and Function Words | 1,049 | 00:50:11

A detailed explanation of the Maithili Speech Corpus will be available in the Maithili Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Arun Kumar Singh, Dinesh Mishra & Atuleshwar Jha. 2019. Maithili Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. "LDC-IL Raw Speech Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
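The per-domain figures in this entry can be checked against the headline totals (45,198 audio segments, 78:45:33 hours). The following is a minimal sketch of that arithmetic in Python; the row values are copied from the listing, while the names `ROWS`, `to_seconds`, and `to_hms` are illustrative and are not part of any LDC-IL tooling.

```python
# Per-domain rows copied from the listing: (domain, audio segments, duration hh:mm:ss).
ROWS = [
    ("Contemporary Text (News)", 291, "22:33:41"),
    ("Creative Text", 294, "15:34:55"),
    ("Sentence", 7451, "07:08:48"),
    ("Date Format", 585, "00:31:41"),
    ("Command and Control Words", 8924, "07:07:34"),
    ("Person Name", 5917, "07:49:33"),
    ("Place Name", 2952, "02:47:49"),
    ("Most Frequent Word - Part", 8699, "06:56:24"),
    ("Most Frequent Words - Full Set", 5996, "04:58:30"),
    ("Phonetically Balanced Words", 3040, "02:26:27"),
    ("Form and Function Words", 1049, "00:50:11"),
]


def to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to hh:mm:ss (hours may exceed 24)."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


total_segments = sum(segments for _, segments, _ in ROWS)
total_duration = to_hms(sum(to_seconds(duration) for _, _, duration in ROWS))

print(total_segments, total_duration)  # 45198 78:45:33
```

Both sums agree with the headline figures, which suggests the per-domain rows are complete.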

Maithili Raw Speech Corpus Vol. II

requests (2)

109:09:50 hours | 206 Audio Segments | 122 Speakers

The LDC-IL Maithili Raw Speech dataset Vol. II comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. This dataset spans a duration of 109:09:50 (hh:mm:ss), consisting of read speech with continuous text and spontaneous speech, along with its transcription in Devanagari. The data is derived from 49 female and 73 male native Maithili speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Maithili Raw Speech Documentation.

For any research-based citations, please use the following citations:

Shantanu Kumar, Ankita Tiwari, Rajesha N., Manasa G., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. Maithili Raw Speech Corpus Vol. II. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-37-8.

Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0...

Maithili Sentence Aligned Speech Corpus

requests (9)

Dataset Description: 41:54:30 hours | 26 GB | 21,412 Audio Segments | 300 speakers

The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Maithili Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. This dataset spans a duration of 41:54:30 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 147 female and 153 male native Maithili speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Maithili Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:

1. Shantanu Kumar, Dinesh Mishra, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Maithili Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-96-2.

2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.

3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3 ..

Maithili Sentence Aligned Speech Corpus (Tirhuta Script)

requests (2)

41:54:30 hours | 26 GB | 21,412 Audio Segments | 300 speakers

The LDC-IL Maithili Sentence Aligned Speech Corpus (Tirhuta Script) dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Tirhuta script. This dataset spans a duration of 41:54:30 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 147 female and 153 male native Maithili speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the LDC-IL Maithili Sentence Aligned Speech Corpus (Tirhuta Script) Documentation.

For any research-based citations, please use the following citations:

Dinesh Mishra, Shantanu Kumar, Dr. Narayan Kumar Choudhary, Rajesha N., Prof. Shailendra Mohan. Maithili Sentence Aligned Speech Corpus (Tirhuta Script). Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-51-4.

Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0..

Maithili Text to Speech Corpus

requests (2)

30:59:20 hours | 19.56 GB | 32,260 Audio Segments | 2 Speakers

The LDC-IL Maithili Text to Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer in Devanagari script. This dataset spans a duration of 32:42:20 (hh:mm:ss), consisting of read speech recorded in a studio setup. The data is derived from one female and one male native Maithili speaker. A comprehensive explanation of the dataset can be found in the Maithili Text to Speech Documentation.

For any research-based citations, please use the following citations:

Shantanu Kumar, Dinesh Mishra, Saurabh Varik, Stephen Fernandes, Nithin S., Roopashri M. R., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. Maithili Text to Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-36-1.

Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0...

A Gold Standard Maithili Raw Text Corpus

requests (22)

53,16,552 Words | 499 Titles | XML format | 5 domains

Maithili is an Indo-Aryan language, a direct descendant of Sanskrit, spoken in the states of Bihar and Jharkhand and in parts of Nepal. It is one of the scheduled languages of India. The LDC-IL Maithili Text Corpus was developed according to factors such as quality of the text, representativeness, retrievable format, size of corpus, and authenticity. To collect text, LDC-IL adopts a standard category list of domains and a predefined set of criteria. The corpus of Maithili text can be broadly classified into literary and non-literary texts. A large amount of literary text is available in Maithili, but scientific text is scarce, so LDC-IL attempts to develop a balanced text corpus of Maithili. Data has been collected from books, magazines, and newspapers, verified to be true to the original texts, and then warehoused.

The Maithili Text Corpus is encoded in machine-readable form and stored in a standard format: the major encoding is Unicode, the storage format is XML, and the data is embedded with metadata. The corpus has been created from contemporary text by typing and crawling.

The available Text Corpus details:

Domain | Words | Percentage of Total Corpus
Aesthetics | 38,97,264 | 73.30 %
Commerce | 50,975 | 00.96 %
Mass Media | 12,53,090 | 23.57 %
Science and Technology | 3,136 | 00.06 %
Social Sciences | 1,12,087 | 02.11 %

A detailed explanation of the Maithili Raw Text Corpus will be available in the Maithili Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Arun Kumar Singh & Dinesh Mishra. 2019. A Gold Standard Maithili Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10...
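The word counts in this entry use Indian digit grouping (the last three digits, then groups of two), and the percentage column follows from the per-domain counts. The sketch below reproduces both in Python as a sanity check; the counts are copied from the listing, and the helper `indian_grouping` and the names `WORDS` and `shares` are illustrative only.

```python
# Per-domain word counts copied from the listing (written without separators).
WORDS = {
    "Aesthetics": 3897264,
    "Commerce": 50975,
    "Mass Media": 1253090,
    "Science and Technology": 3136,
    "Social Sciences": 112087,
}


def indian_grouping(n: int) -> str:
    """Format a non-negative integer as 12,34,567: last three digits, then pairs."""
    digits = str(n)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])


total_words = sum(WORDS.values())

# Share of each domain in the whole corpus, as a percentage with two decimals.
shares = {domain: f"{100 * count / total_words:.2f}" for domain, count in WORDS.items()}

print(indian_grouping(total_words))  # 53,16,552
for domain, count in WORDS.items():
    print(f"{domain} | {indian_grouping(count)} | {shares[domain]} %")
```

The total matches the headline figure of 53,16,552 words, and the computed shares match the listed percentages (the listing pads values below 10 with a leading zero).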