Maithili Raw Speech Corpus
78:45:33 Hours | 49.2 GB | 306 Speakers | 45,198 Audio Segments | 48 kHz | 16 bit wav
Maithili is an Indo-Aryan language, a direct descendant of Sanskrit, spoken in the Indian states of Bihar and Jharkhand and in parts of Nepal. It is one of the scheduled languages of India. The LDC-IL speech data was collected from speakers of the Sotipura, Bajjika and Thethi geographic dialects, covering both genders and different age groups.
The available Speech Corpus details:
Total Speakers 306 (150 Female and 156 Male)
Each Domain Duration
Contemporary Text (News)
Command and Control Words
Most Frequent Word - Part
Most Frequent Words-FullSet
Phonetically Balanced Words
Form and Function Words
A detailed explanation of the Maithili Speech Corpus will be available in the Maithili Speech Data Documentation.
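The audio format stated at the top of this page (48 kHz, 16 bit wav) can be checked against a downloaded segment. The following is a minimal sketch using Python's standard `wave` module; the function name `wav_matches_spec` and the file path in the usage note are hypothetical and not part of the LDC-IL distribution.

```python
import wave


def wav_matches_spec(path, expected_rate=48000, expected_bits=16):
    """Return True if the WAV header reports the expected sample rate and bit depth.

    The defaults mirror the figures quoted on this page (48 kHz, 16 bit).
    """
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()        # samples per second
        bits = wav_file.getsampwidth() * 8    # bytes per sample -> bits
    return rate == expected_rate and bits == expected_bits
```

For example, `wav_matches_spec("segment.wav")` would return `True` for a file whose header matches the stated format, where `segment.wav` stands in for any segment from the corpus.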
For research citations, please use the following:
- Ramamoorthy, L., Narayan Choudhary, Arun Kumar Singh, Dinesh Mishra & Atuleshwar Jha. 2019. Maithili Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
- Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
- Authors: Ramamoorthy L., Narayan Choudhary, Dinesh Mishra, Arun Kumar Singh, Atuleshwar Jha
- Corpus Type: Raw Corpus
- Catalogue Number: 1139
- ISBN: 978-81-7343-238-5
- Data Source: On Field
- Duration: 78:45:33
- # of Audio Segments: 45,198
- Release Date: 04/04/2019
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.
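
The Duration and # of Audio Segments fields listed above imply an average segment length. The sketch below works through that arithmetic; the helper name `hms_to_seconds` is an illustrative choice, and the result is only a mean, since the page does not give the distribution of segment lengths.

```python
def hms_to_seconds(hms):
    """Convert an 'H:MM:SS' duration string to a total number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Figures taken from the catalogue entry on this page.
total_seconds = hms_to_seconds("78:45:33")
segment_count = 45198

mean_segment_seconds = total_seconds / segment_count
print(f"Total: {total_seconds} s, mean segment: {mean_segment_seconds:.2f} s")
```

With these figures the total is 283,533 seconds, which gives a mean of roughly 6.27 seconds per segment.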