A Gold Standard Bodo Raw Text Corpus

requests (24)

29,15,544 Words | 80 Titles | XML format | 5 domains

Bodo is a major tribal language of the Tibeto-Burman language family. It is spoken in Assam and other parts of North-East India. It is one of the major languages of Assam and an official language of the Bodoland Territorial Area Districts. Several rivers in the North-East region, such as the Dihing, Dibru, Dihong, and Dikrai, are named after Bodo words, which indicates how widely related ethnocultural groups, with their cultural identities, were distributed across the region. Bodo is written in Devanagari.

The Bodo text corpus is extracted from contemporary text sources. LDC-IL developed it with attention to factors such as text quality, representativeness, retrievable format, corpus size, and authenticity. For collecting text, LDC-IL adopts a standard category list of domains and a predefined set of criteria. The Bodo text was collected from libraries in Assam, mostly in Kokrajhar, Chirang, Baksa, Udalguri, and Guwahati. LDC-IL aims to develop a balanced text corpus of Bodo. Data was collected from books, magazines, and newspapers and verified against the original text. The corpus is encoded in machine-readable form and stored in a standard format: the main encoding used is Unicode, and the data is stored in XML format with embedded metadata. The corpus was created from contemporary text by typing and by crawling.

The available Text Corpus Details:

Domains                   Words        Percentage of Total Corpus
Aesthetics                4,74,960     16.29 %
Commerce                  25,064       0.86 %
Mass Media                16,79,511    57.61 %
Science and Technology    1,72,151     5.90 %
Social Sciences           5,63,858     19.34 %

A detailed explanation of the Bodo Text Corpus will be available in the Bodo Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Bridul Basumatary & Farson Daimary. 2019. A Gold Standard Bodo Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10...
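The word counts in the domain table above use Indian digit grouping (for example 29,15,544 for 2,915,544), which can trip up naive number parsing. The following is a minimal sketch, not part of the LDC-IL distribution, showing how the listed percentages can be recomputed from the listed counts. The names `DOMAIN_WORDS`, `parse_grouped`, and `domain_shares` are hypothetical and introduced only for illustration.

```python
# Word counts per domain, copied from the listing (Indian digit grouping).
DOMAIN_WORDS = {
    "Aesthetics": "4,74,960",
    "Commerce": "25,064",
    "Mass Media": "16,79,511",
    "Science and Technology": "1,72,151",
    "Social Sciences": "5,63,858",
}


def parse_grouped(number: str) -> int:
    """Convert a comma-grouped number such as '29,15,544' to an int.

    Dropping the commas works for both Indian and Western grouping.
    """
    return int(number.replace(",", ""))


def domain_shares(words: dict) -> dict:
    """Return each domain's share of the total word count, in percent."""
    counts = {domain: parse_grouped(n) for domain, n in words.items()}
    total = sum(counts.values())
    return {domain: round(100 * c / total, 2) for domain, c in counts.items()}


TOTAL_WORDS = sum(parse_grouped(n) for n in DOMAIN_WORDS.values())
SHARES = domain_shares(DOMAIN_WORDS)

if __name__ == "__main__":
    print(f"Total words: {TOTAL_WORDS}")
    for domain, share in SHARES.items():
        print(f"{domain}: {share:.2f} %")
```

The per-domain counts sum to the headline figure of 29,15,544 words, and the recomputed shares agree with the percentages in the listing.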

Bodo Raw Speech Corpus

requests (13)

176:53:28 hours | 113 GB | 456 Speakers | 77443 Audio segments | 48 kHz | 16 bit wav

Bodo, one of the scheduled languages of India, is one of the tonal languages of the world. It has two clearly distinguishable tones, known as Low and High. The language belongs to the Tibeto-Burman language family. It is the language of the Bodos, a major tribe of the Indian state of Assam.

The LDC-IL Bodo speech data was collected from the Chirang, Baksa, Sonitpur, Udalguri, Kamrup, Barpeta, and Kokrajhar districts of Assam, India, and covers the Bwrdwnari, Eastern, and Standard dialects. The data was collected from speakers of both genders and of different age groups.

The available Speech Corpus details:

Total Speakers: 456 (220 Female and 236 Male)

Domains                          Audio Segments    Each Domain Duration
Contemporary Text (News)         411               53:47:56
Creative Text                    413               26:47:07
Sentence                         10,257            09:16:54
Date Format                      938               01:58:08
Command and Control Words        12,348            14:19:32
Person Name                      8,222             14:49:44
Place Name                       4,115             05:17:14
Most Frequent Word - Part        12,397            14:34:05
Most Frequent Word - Full Set    6,994             04:30:14
Phonetically Balanced            15,999            20:07:33
Form and Function - Word         6,383             08:28:25

A detailed explanation of the Bodo Speech Corpus will be available in the Bodo Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Bridul Basumatary & Farson Daimary. 2019. Bodo Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
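The listing describes the audio as 48 kHz, 16 bit wav. As a minimal sketch of how a downloaded segment could be checked against that description, the following uses Python's standard `wave` module. The function names and the file name `segment.wav` are hypothetical, and the helper `write_silence` exists only to produce a test file; the mono channel count it uses is an assumption, since the listing does not state the number of channels.

```python
import os
import tempfile
import wave

EXPECTED_RATE_HZ = 48_000   # "48 kHz" from the listing
EXPECTED_SAMPLE_BYTES = 2   # "16 bit" is 2 bytes per sample


def matches_listed_format(path: str) -> bool:
    """Return True if the WAV file at `path` is 48 kHz with 16-bit samples."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == EXPECTED_RATE_HZ
                and wav.getsampwidth() == EXPECTED_SAMPLE_BYTES)


def write_silence(path: str, rate_hz: int, sample_bytes: int,
                  seconds: float = 0.1) -> None:
    """Write a short silent WAV file (mono is an assumption, see lead-in)."""
    frames = int(rate_hz * seconds)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_bytes)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00" * frames * sample_bytes)


if __name__ == "__main__":
    # Demonstrate the check on a synthetic file, not on real corpus data.
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "segment.wav")  # hypothetical file name
        write_silence(sample, rate_hz=48_000, sample_bytes=2)
        print(matches_listed_format(sample))
```

The check reads only the WAV header fields, so it stays fast even when run over a large number of segments.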