Dataset Description
78:45:33 Hours | 49.2 GB | 306 Speakers | 45,198 Audio Segments | 48 kHz | 16-bit WAV
Maithili is an Indo-Aryan language, a direct descendant of Sanskrit, spoken in the Indian states of Bihar and Jharkhand and in parts of Nepal. It is one of the scheduled languages of India. The LDC-IL speech data was collected from speakers of the Sotipura, Bajjika and Thethi geographic dialects, covering both genders and different age groups.
The available Speech Corpus details:
Total Speakers: 306 (150 Female and 156 Male)
Domain | Audio Segments | Duration (hh:mm:ss) |
Contemporary Text (News) | 291 | 22:33:41 |
Creative Text | 294 | 15:34:55 |
Sentence | 7,451 | 07:08:48 |
Date Format | 585 | 00:31:41 |
Command and Control Words | 8,924 | 07:07:34 |
Person Name | 5,917 | 07:49:33 |
Place Name | 2,952 | 02:47:49 |
Most Frequent Word - Part | 8,699 | 06:56:24 |
Most Frequent Words-FullSet | 5,996 | 04:58:30 |
Phonetically Balanced Words | 3,040 | 02:26:27 |
Form and Function Words | 1,049 | 00:50:11 |
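The per-domain figures can be cross-checked against the headline totals (45,198 audio segments, 78:45:33). The following is only a minimal sketch: the values are copied by hand from the table above, and the helper names `to_seconds` and `to_hms` are illustrative, not part of any LDC-IL tooling.

```python
# Per-domain figures copied from the table: (domain, segments, "HH:MM:SS").
DOMAINS = [
    ("Contemporary Text (News)", 291, "22:33:41"),
    ("Creative Text", 294, "15:34:55"),
    ("Sentence", 7451, "07:08:48"),
    ("Date Format", 585, "00:31:41"),
    ("Command and Control Words", 8924, "07:07:34"),
    ("Person Name", 5917, "07:49:33"),
    ("Place Name", 2952, "02:47:49"),
    ("Most Frequent Word - Part", 8699, "06:56:24"),
    ("Most Frequent Words-FullSet", 5996, "04:58:30"),
    ("Phonetically Balanced Words", 3040, "02:26:27"),
    ("Form and Function Words", 1049, "00:50:11"),
]


def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to HH:MM:SS (hours may exceed 24)."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


total_segments = sum(segments for _, segments, _ in DOMAINS)
total_duration = to_hms(sum(to_seconds(duration) for _, _, duration in DOMAINS))

print(total_segments)   # 45198
print(total_duration)   # 78:45:33
```

Both sums agree with the totals stated in the dataset description.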
A detailed explanation of the Maithili Speech Corpus will be available in the Maithili Speech Data Documentation.
For research citations, please use the following:
- Ramamoorthy, L., Narayan Choudhary, Arun Kumar Singh, Dinesh Mishra & Atuleshwar Jha. 2019. Maithili Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
- Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
Item specifics
- Authors: Ramamoorthy L., Narayan Choudhary, Dinesh Mishra, Arun Kumar Singh, Atuleshwar Jha
- Corpus Type: Raw Corpus
- Catalogue Number: 1139
- ISBN: 978-81-7343-238-5
- Data Source: On Field
- Duration: 78:45:33
- # of Audio Segments: 45,198
- Release Date: 04-Apr-2019
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.