176:53:28 hours of 113 Gigabytes speech data | 456 Speakers | 77443 Audio segments | 48 kHz | 16 bit wav
Bodo, one of the scheduled languages of India, is one of the tonal languages of the world. It has two clearly distinguishable tones, known as Low and High. The language belongs to the Tibeto-Burman linguistic family. It is the language of the Bodos, one of the major tribes of the Indian state of Assam.
The LDC-IL speech data was collected from the Chirang, Baksa, Sonitpur, Udalguri, Kamrup, Barpeta, and Kokrajhar districts of the state of Assam, India, covering the Bwrdwnari, Eastern, and Standard dialects. The data was collected from speakers of both genders and different age groups.
The LDC-IL Bodo speech data set consists of several types of data: word lists, sentences, running texts, and date formats.
Details of the available speech corpus:
- Total of 456 speakers (220 female and 236 male)
- Contemporary Text (News) - 411 Audio Segments - 53:47:56 Hours
- Creative Text - 413 Audio Segments - 26:43:07 Hours
- Sentence - 10257 Audio Segments - 9:38:58 Hours
- Date - 938 Audio Segments - 1:16:54 Hours
- Command and Control Words - 12348 Audio Segments - 14:19:32 Hours
- Person Name - 8222 Audio Segments - 14:49:44 Hours
- Place Name - 4115 Audio Segments - 5:17:14 Hours
- Most Frequent Word-Part - 12397 Audio Segments - 14:34:05 Hours
- Most Frequent Word-Full - 15999 Audio Segments - 20:07:33 Hours
- Phonetically Balanced - 5960 Audio Segments - 7:50:00 Hours
- Form and Function Word - 6383 Audio Segments - 8:28:25 Hours
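The per-category figures above can be checked against the stated totals (77443 audio segments, 176:53:28 hours). The following is a minimal sketch in Python; the names `CATEGORIES`, `to_seconds`, and `to_hms` are illustrative and are not part of any LDC-IL tooling. The values are copied from the list above.

```python
# (category, audio segments, duration as H:MM:SS), copied from the list above
CATEGORIES = [
    ("Contemporary Text (News)", 411, "53:47:56"),
    ("Creative Text", 413, "26:43:07"),
    ("Sentence", 10257, "9:38:58"),
    ("Date", 938, "1:16:54"),
    ("Command and Control Words", 12348, "14:19:32"),
    ("Person Name", 8222, "14:49:44"),
    ("Place Name", 4115, "05:17:14"),
    ("Most Frequent Word-Part", 12397, "14:34:05"),
    ("Most Frequent Word-Full", 15999, "20:07:33"),
    ("Phonetically Balanced", 5960, "7:50:00"),
    ("Form and Function Word", 6383, "8:28:25"),
]


def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to an H:MM:SS string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(count for _, count, _ in CATEGORIES)
total_seconds = sum(to_seconds(duration) for _, _, duration in CATEGORIES)

print(total_segments)         # 77443
print(to_hms(total_seconds))  # 176:53:28
```

Both sums agree with the totals given for the corpus as a whole.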
A much more detailed explanation of the Bodo Speech Corpus will be available in the Bodo Speech Data Documentation.
For research-based citations, please use the following:
- Ramamoorthy, L., Narayan Choudhary, Bridul Basumatary & Farson Daimary. 2019. Bodo Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
- Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: ” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
- Authors: Ramamoorthy L., Narayan Choudhary, Bridul Basumatary, Farson Daimary
- Corpus Type: Raw Corpus
- Catalogue Number: 1112
- ISBN: 978-81-7343-211-8
- Data Source: On Field
- Duration: 176:53:28
- Number of Audio Segments: 77443
- Release Date: 04/04/2019
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.