Central Institute of Indian Languages

A Gold Standard Nepali Raw Text Corpus

requests (12)

70,57,524 Words | 1,347 Titles | XML format | 6 Domains

Nepali is one of the official languages of the states of West Bengal and Sikkim, and one of the 22 scheduled languages of India. It is spoken in most of the North-Eastern states of India and in other states such as Delhi, Uttaranchal, Uttar Pradesh, Bihar and Jharkhand. Nepali is also an official language of Nepal, and about a quarter of the population of Bhutan speaks it.

The Nepali Text Corpus is encoded in a machine-readable form and stored in a standard format. The main encoding used is Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary texts that were either typed or crawled.

The available Text Corpus details:

Domain | Words | Percentage of Total Corpus
Aesthetics | 40,72,977 | 57.71 %
Commerce | 30,354 | 0.43 %
Mass Media | 22,71,064 | 32.18 %
Official Documents | 2,426 | 0.03 %
Science and Technology | 80,306 | 1.14 %
Social Sciences | 6,00,397 | 8.51 %

A detailed explanation of the Nepali Raw Text Corpus will be available in the Nepali Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Samar Sinha, Jeena Rai, Umesh Chamling Rai & Rupesh Rai. 2019. A Gold Standard Nepali Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.
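The word counts in these listings use Indian digit grouping (lakhs and crores), which many number parsers do not expect. The following is a minimal Python sketch, not part of the LDC-IL tooling, that parses the Nepali domain figures quoted above and recomputes the total and the percentage shares. The function and variable names are illustrative assumptions.

```python
def parse_indian_number(text: str) -> int:
    """Parse a count written with Indian digit grouping, e.g. '70,57,524'."""
    # Grouping commas (and stray spaces, as in '15, 88, 287') carry no value.
    return int(text.replace(",", "").replace(" ", ""))


# Domain word counts copied from the Nepali listing above.
NEPALI_DOMAINS = {
    "Aesthetics": "40,72,977",
    "Commerce": "30,354",
    "Mass Media": "22,71,064",
    "Official Documents": "2,426",
    "Science and Technology": "80,306",
    "Social Sciences": "6,00,397",
}


def domain_shares(domains: dict[str, str]) -> tuple[int, dict[str, float]]:
    """Return the total word count and each domain's share in percent."""
    counts = {name: parse_indian_number(value) for name, value in domains.items()}
    total = sum(counts.values())
    shares = {name: round(100 * count / total, 2) for name, count in counts.items()}
    return total, shares


total, shares = domain_shares(NEPALI_DOMAINS)
print(total)
for name, share in shares.items():
    print(f"{name}: {share} %")
```

The same check can be run on the other text corpus tables by swapping in their domain figures.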

A Gold Standard Odia Raw Text Corpus

requests (14)

15,88,287 Words | 206 Titles | XML format | 5 Text Domains

Odia (formerly Oriya) is a major Indo-Aryan language spoken in the states of Odisha, West Bengal, Jharkhand, Chhattisgarh, and Andhra Pradesh. It is the official language of Odisha and Jharkhand. Odia is the sixth language to be designated a Classical language by the Govt. of India.

The LDC-IL Odia Raw Text Corpus has been developed according to factors such as quality of the text, representativeness, retrievable format, size of corpus, authenticity, etc. It is encoded in a machine-readable form and stored in a standard format. All text is encoded with Unicode-compatible fonts and stored in XML format, with embedded metadata. The corpus has been developed from contemporary texts by typing.

The corpus of Odia raw text can be broadly classified into literary and non-literary texts. A huge amount of literary text is available in Odia, but knowledge and scientific texts are fewer, so LDC-IL attempted to develop a balanced raw text corpus of Odia. Data has been collected from books and newspapers and verified to be true to the original texts.

The available Raw Text Corpus details:

Domain | Words | Percentage of Total Corpus
Aesthetics | 5,11,887 | 32.23 %
Commerce | 19,616 | 1.24 %
Mass Media | 8,02,100 | 50.50 %
Science and Technology | 31,589 | 1.99 %
Social Sciences | 2,23,095 | 14.05 %

A detailed explanation of the Odia Text Corpus will be available in the Odia Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Santosh Kumar Mohanty, Raja Kumar Naik, Pramod Kumar Rout & Kshirod Kumar Das. 2019. A Gold Standard Odia Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.

A Gold Standard Punjabi Raw Text Corpus

requests (19)

1,01,25,770 Words | 2,470 Titles | XML format | 5 Domains

Punjabi is the principal and administrative language of Punjab. It is an Indo-Aryan language, spoken not only in the Indian state of Punjab but also in Lehnda Punjab in Pakistan. The language is written in two scripts: Gurmukhi is used in Eastern Punjab, and Shahmukhi, a variant of the Perso-Arabic script, is used in Lehnda Punjab (Pakistan). The LDC-IL Punjabi text corpus is collected in the Gurmukhi script, reflecting contemporary usage.

The Punjabi Text Corpus is encoded in a machine-readable form and stored in a standard format. The main encoding used is Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary texts that were either typed or crawled. The LDC-IL Punjabi Text Corpus contains 1,01,25,770 words drawn from 2,470 different titles. The five major domains are Aesthetics, Science & Technology, Social Science, Commerce and Mass Media.

The available Text Corpus details:

Domain | Words | Percentage of Total Corpus
Aesthetics | 41,90,199 | 41.38 %
Commerce | 56,205 | 0.56 %
Mass Media | 42,74,922 | 42.22 %
Science and Technology | 3,84,078 | 3.79 %
Social Sciences | 12,20,366 | 12.05 %

A detailed explanation of the Punjabi Text Corpus will be available in the Punjabi Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Poonam Dhillon, Sarbjeet Kaur & Sandeep Singh. 2019. A Gold Standard Punjabi Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.

A Gold Standard Rajasthani Raw Text Corpus

requests (6)

11,99,502 Words | 74 Titles | XML format | 3 Domains | 27 Sub-categories

Rajasthani is a broad linguistic category that encompasses a variety of dialects, including Marwari, Mewari, Mewati, Dhundhari, Harauti, Bagri, Wagdi, and Malvi, spoken across different regions of Rajasthan. The Government of India classifies Rajasthani as a Western Indo-Aryan variant of Hindi, primarily spoken within the state.

The Government of India established the Linguistic Data Consortium for Indian Languages (LDC-IL) to support language development efforts. The LDC-IL Rajasthani Text Corpus is created based on key factors such as text quality, representativeness, retrievability, corpus size, and authenticity. For text collection, LDC-IL follows a standardized domain-based categorization and predefined criteria. The Rajasthani text corpus is broadly divided into literary and non-literary texts, with an emphasis on maintaining a balanced dataset. The collected data, sourced from books and magazines, undergoes verification for accuracy before being stored.

The Rajasthani Text Corpus is encoded in a machine-readable form and stored in a standard format. The main encoding used is Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary texts that were either typed or crawled.

A detailed explanation of the Rajasthani Raw Text Corpus will be available in the Rajasthani Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ankita Tiwari, Rajesha N., Manasa G., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. A Gold Standard Rajasthani Raw Text Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-93-4.

Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary. 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0.
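The text corpora are described as Unicode text stored in XML with embedded metadata, but the listing does not give the schema. The sketch below only illustrates how such a file might be read with Python's standard library. Every element name in it (`document`, `metadata`, `title`, `domain`, `text`) is a hypothetical placeholder, as is the sample content; the real tag names should be taken from the corpus documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the element names and the text are placeholders,
# NOT the actual LDC-IL schema or corpus content.
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<document>
  <metadata>
    <title>Placeholder Title</title>
    <domain>Aesthetics</domain>
  </metadata>
  <text>एक दो तीन</text>
</document>
"""


def summarise_document(xml_string: str) -> dict:
    """Extract the assumed metadata fields and a whitespace-based word count."""
    # Encode to bytes so the parser honours the encoding declaration.
    root = ET.fromstring(xml_string.encode("utf-8"))
    body = root.findtext("text", default="")
    return {
        "title": root.findtext("metadata/title", default=""),
        "domain": root.findtext("metadata/domain", default=""),
        # Splitting on whitespace is a rough word count; the corpus
        # documentation may define words differently.
        "words": len(body.split()),
    }


summary = summarise_document(SAMPLE_XML)
print(summary)
```

Whitespace splitting is only an approximation, so counts produced this way may differ from the published figures.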

A Gold Standard Telugu Raw Text Corpus

requests (13)

30,10,993 Words | 859 Titles | XML format | 6 Domains

Telugu is a highly agglutinative and morphologically rich language. The actual pattern of language use in natural texts provides evidence of the language's traits.

The Government of India set up the Linguistic Data Consortium for Indian Languages to help those who work in the field of language development. The LDC-IL Telugu Text Corpus has been developed according to factors such as quality of the text, representativeness, retrievable format, size of corpus, authenticity, etc. For collecting the text corpus, LDC-IL adopts a standard category list of domains and a prior set of criteria. The corpus of Telugu text can be broadly classified into literary and non-literary texts. A huge amount of literary text is available in Telugu, but scientific texts are fewer, so LDC-IL attempts to develop balanced text corpora of Telugu. Data has been collected from books, magazines, and newspapers, verified to be true to the original texts, and then stored.

The Telugu Text Corpus is encoded in a machine-readable form and stored in a standard format. The main encoding used is Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary texts that were either typed or crawled.

The available Text Corpus details:

Domain | Words | Percentage of Total Corpus
Aesthetics | 1,687,968 | 56.06 %
Commerce | 45,130 | 1.50 %
Mass Media | 14,656 | 0.49 %
Official Documents | 6,708 | 0.22 %
Science and Technology | 415,102 | 13.79 %
Social Sciences | 841,429 | 27.95 %

A detailed explanation of the Telugu Raw Text Corpus will be available in the Telugu Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Thirupal C Reddy & Gangaraju H. 2019. A Gold Standard Telugu Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.

A Gold Standard Telugu Raw Text Corpus Vol. II

requests (4)

30,13,530 Words | 160 Titles | XML format | 6 Domains | 29 Sub-categories

Telugu is a highly agglutinative and morphologically rich language. The actual pattern of language use in natural texts provides evidence of the language's traits.

The Government of India set up the Linguistic Data Consortium for Indian Languages to help those who work in the field of language development. The LDC-IL Telugu Text Corpus has been developed according to factors such as quality of the text, representativeness, retrievable format, size of corpus, authenticity, etc. For collecting the text corpus, LDC-IL adopts a standard category list of domains and a prior set of criteria. The corpus of Telugu text can be broadly classified into literary and non-literary texts. A huge amount of literary text is available in Telugu, but scientific texts are fewer, so LDC-IL attempts to develop balanced text corpora of Telugu. Data has been collected from books, magazines, and government websites, verified to be true to the original texts, and then stored.

The Telugu Text Corpus is encoded in a machine-readable form and stored in a standard format. The main encoding used is Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary texts that were either typed or crawled.

A detailed explanation of the Telugu Raw Text Corpus will be available in the Telugu Text Corpus Documentation.

For any research-based citations, please use the following citations:

Dr. Modugu Kasimbabu, Rajesha N., Manasa G., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. A Gold Standard Raw Text Corpus Vol. II. Central Institute of Indian Languages, Mysore. 978-93-48633-12-5.

Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.

A Gold Standard Urdu Raw Text Corpus

requests (16)

5161927 Words | 739 Titles | XML format | 5 Domains

Urdu is one of the prominent languages of the Indian sub-continent. It belongs to the Indo-Aryan family, is influenced by Arabic and Persian, and is written in the Perso-Arabic script. Regionally, Urdu is used mostly in the northern, north-western and eastern parts of India and in Pakistan, although it is understood and occasionally spoken in the rest of India. Urdu arose in the 10th century A.D. as a result of occupation relations, ethnic exchanges, relocations, and military expeditions. In India, Urdu developed in close contact with Persian, which was the language of administration and education during the period of Muslim rule.

The LDC-IL Urdu Text Corpus has been developed according to factors such as quality of the text, representativeness, retrievable format, size of corpus, authenticity, etc. For collecting the text corpus, LDC-IL adopts a standard category list of domains and a prior set of criteria. The corpus of Urdu text can be broadly classified into literary and non-literary texts. A huge amount of literary text is available in Urdu, but scientific texts are fewer, so LDC-IL attempts to develop balanced text corpora of Urdu. Data has been collected from books, magazines, and newspapers and verified to be true to the original text.

The available Text Corpus details:

Domain | Words | Percentage of Total Corpus
Aesthetics | 2616382 | 50.69 %
Commerce | 28601 | 0.55 %
Mass Media | 843477 | 16.34 %
Science and Technology | 348082 | 6.74 %
Social Sciences | 1325385 | 25.68 %

A detailed explanation of the Urdu Text Corpus will be available in the Urdu Text Corpus documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Mansoor Khan, Shahnawaz Alam, Bi Bi Mariyam & Rushda Idris Khan. 2019. A Gold Standard Urdu Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.

Assamese Raw Speech Corpus

requests (19)

54:21:12 Hours | 32.5 GB | 304 Speakers | 37,570 Audio Segments | 48 kHz | 16 bit wav

Assamese is the official language of Assam. It is widely spoken in the state of Assam and in some parts of Arunachal Pradesh and Nagaland. According to the 2011 census, Assamese is spoken by 15 million speakers. As a widely spoken language, Assamese has several dialectal variations. The regional dialects can be broadly divided into two groups: the Eastern Group and the Western Group.

LDC-IL divided the Assamese-speaking areas into four regions (Xiboxagoria, Central Assam, Kamrupi and Goalparia) and collected speech data from each region. The LDC-IL Assamese Speech data set consists of different types of datasets made up of word lists, sentences, running texts and date formats.

The available Speech Corpus details:

Total Speakers: 304 (154 Female and 150 Male)

Domain | Audio Segments | Duration of Each Domain
Contemporary Text (News) | 304 | 17:23:25
Creative Text | 304 | 11:44:37
Sentence | 7593 | 5:55:29
Date Format | 599 | 0:33:59
Command and Control Words | 9118 | 4:56:49
Person Name | 6081 | 5:38:07
Place Name | 3044 | 1:58:33
Phonetically Balanced-W | 465673:41:45
Form and Function-Word-W | 539602:28:28

A detailed explanation of the Assamese Speech Corpus will be available in the Assamese Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy L., Narayan Kumar Choudhary, Atreyee Sharma, Jahnobi Kalita, Samhita Bharadwaj, Plabita Bora, Priyanshee Adhyapak, Mustafiza Tamim, Rajesha N., Manasa G. 2021. Assamese Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Assamese Sentence Aligned Speech Corpus

requests (6)

Dataset Description: 30:18:16 hours | 19.5 GB | 21,716 Audio Segments | 304 speakers

The annotated speech corpus gives a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Assamese Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Assamese script. This dataset spans a duration of 30:18:16 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 154 female and 150 male native Assamese speakers, encompassing diverse age groups and regions.

A comprehensive explanation of the dataset can be found in the Assamese Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:

1. Syeda Mustafiza Tamim, Priyanshe Adhyapak, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Assamese Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. 978-81-19411-53-5.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3

Assamese Text to Speech Corpus

requests (2)

44:49:34 hours | 28.85 GB | 32,594 Audio Segments | 2 Speakers

The LDC-IL Assamese Text to Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer in Assamese script. This dataset spans a duration of 44:49:34 (hh:mm:ss), consisting of read speech recorded in a studio setup. The data is derived from one female and one male native Assamese speaker.

A comprehensive explanation of the dataset can be found in the Assamese Text to Speech Documentation.

For any research-based citations, please use the following citations:

Syeda Mustafiza Tamim, Prangshu Manjul, Stephen Fernandes, Nithin S., Roopashri M. R., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. Assamese Text to Speech Corpus. Central Institute of Indian Languages, Mysore. 978-93-48633-45-3.

Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. 978-93-48633-33-0.

Bengali Raw Speech Corpus

requests (19)

128:46:59 Hours | 81.2 GB | 476 Speakers | 73,470 Audio Segments | 48 kHz | 16 bit wav

Bengali is the official language of West Bengal and Tripura. It belongs to the Indo-Aryan language family and is influenced by Sanskrit. Greater use of Bengali has contributed to the growth of the language in terms of vocabulary and the number of styles and registers. Bengali is spoken over the whole of West Bengal, Tripura and Bangladesh and in some parts of Bihar, Odisha and Assam. Bengali refugees who settled in Andaman after 1950 have also carried the language there.

LDC-IL Bengali Speech data is collected from the regions of Standard Colloquial (Central Bengal) and Barendri (North Bengal). The LDC-IL Bengali Speech data set consists of different types of datasets made up of word lists, sentences, running texts and date formats.

The available Speech Corpus details:

Total Speakers: 476 (236 Female and 240 Male)

Domain | Audio Segments | Duration of Each Domain
Contemporary Text (News) | 450 | 35:05:07
Creative Text | 448 | 20:16:13
Sentence | 11,239 | 16:05:22
Date Format | 414 | 0:26:48
Command and Control Words | 13,477 | 14:00:24
Person Name | 9,012 | 4:56:22
Place Name | 4,498 | 1:45:35
Most Frequent Word - Part | 13,525 | 13:33:14
Most Frequent Word - Full Set | 5,978 | 6:47:05
Phonetically Balanced | 9,489 | 10:23:08
Form and Function - Word | 4,940 | 5:27:41

A detailed explanation of the Bengali Speech Corpus will be available in the Bengali Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Sonali Sutradhar, Priyanka Biswas, Arundhati Sengupta, Sankarshan Dutta & Priyanka Das. 2019. Bengali Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
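The per-domain durations in the Bengali speech table are given as hh:mm:ss strings, so checking them against the headline figure needs a small amount of time arithmetic. The following Python sketch, with illustrative function names that are not part of any LDC-IL tool, sums the durations quoted above and formats the result the same way.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Format a number of seconds as 'h:mm:ss' without wrapping at 24 hours."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Per-domain durations copied from the Bengali Raw Speech Corpus listing.
BENGALI_DURATIONS = [
    "35:05:07",  # Contemporary Text (News)
    "20:16:13",  # Creative Text
    "16:05:22",  # Sentence
    "0:26:48",   # Date Format
    "14:00:24",  # Command and Control Words
    "4:56:22",   # Person Name
    "1:45:35",   # Place Name
    "13:33:14",  # Most Frequent Word - Part
    "6:47:05",   # Most Frequent Word - Full Set
    "10:23:08",  # Phonetically Balanced
    "5:27:41",   # Form and Function - Word
]

total_duration = to_hms(sum(to_seconds(d) for d in BENGALI_DURATIONS))
print(total_duration)
```

For the Bengali figures the computed total agrees with the 128:46:59 hours stated at the top of the listing.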

Bengali Sentence Aligned Speech Corpus

requests (7)

Dataset Description: 69:10:03 hours | 43.3 GB | 40,240 Audio Segments | 450 speakers

The annotated speech corpus gives a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Bengali Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Bengali script. This dataset spans a duration of 69:10:03 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 223 female and 227 male native Bengali speakers, encompassing diverse age groups and regions.

A comprehensive explanation of the dataset can be found in the Bengali Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:

1. Sonali Sutradhar, Poulami Das, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Bengali Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. 978-81-19411-48-1.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3

Bodo Raw Speech Corpus

requests (14)

176:53:28 Hours | 113 GB | 456 Speakers | 77443 Audio Segments | 48 kHz | 16 bit wav

Bodo, one of the scheduled languages of India, is one of the tonal languages of the world. It has two clearly distinguishable tones, known as Low and High. The language belongs to the Tibeto-Burman linguistic family. It is the language of the Bodos, a major tribe of the Indian state of Assam.

The LDC-IL Bodo speech data is collected from the Chirang, Baksa, Sonitpur, Udalguri, Kamrup, Barpeta and Kokrajhar districts of the state of Assam, covering the Bwrdwnari, Eastern, and Standard dialects. The data is collected from both genders and from different age groups.

The available Speech Corpus details:

Total Speakers: 456 (220 Female and 236 Male)

Domain | Audio Segments | Duration of Each Domain
Contemporary Text (News) | 411 | 53:47:56
Creative Text | 413 | 26:47:07
Sentence | 10,257 | 09:16:54
Date Format | 938 | 01:58:08
Command and Control Words | 12,348 | 14:19:32
Person Name | 8,222 | 14:49:44
Place Name | 4,115 | 05:17:14
Most Frequent Word - Part | 12,397 | 14:34:05
Most Frequent Word - Full Set | 6,994 | 04:30:14
Phonetically Balanced | 15,999 | 20:07:33
Form and Function - Word | 6,383 | 08:28:25

A detailed explanation of the Bodo Speech Corpus will be available in the Bodo Speech Data Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Bridul Basumatary & Farson Daimary. 2019. Bodo Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Chhattisgarhi Raw Speech Corpus

requests (9)

Dataset Description: 138:09:27 Hours | 88.9 GB | 140 Speakers | 359 Audio Segments | 48 kHz | 16 bit wav

LDC-IL has taken a positive step in its approach towards the mother tongues spoken in India, reflecting a greater effort to support and promote linguistic variety in the nation. The collection of Chhattisgarhi speech data is a major effort in this approach. This step towards developing language technology for Indian mother tongues will contribute to their overall enrichment and empowerment.

The Chhattisgarhi raw speech corpus is made up of recordings of native Chhattisgarhi speakers from various parts of the state of Chhattisgarh, and it represents a wide range of Chhattisgarhi varieties as spoken in different locations by diverse speakers. Speakers from various age groups read prompt texts extracted from literary and news sources. Spontaneous speech has also been collected.

A detailed explanation of the Chhattisgarhi Raw Speech Corpus will be available in the Chhattisgarhi Raw Speech Data Documentation.

For any research-based citations, please use the following citations:

1. Satyaendra Kumar Awasthi, Ankita Tiwari, Narayan Kumar Choudhary. 2023. Chhattisgarhi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
2. Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3

Dogri Raw Speech Corpus

requests (16)

17:10:26 Hours | 11 GB speech data | 61 Speakers | 12,036 Audio Segments | 48 kHz | 16 bit wav

Dogri, the language of the Dogras, belongs to the Indo-Aryan group and is the first major language of the multilingual Jammu region of the state of Jammu & Kashmir. It derives its name from ‘Duggar’, the ancient title of this region. Dogri is a morphologically rich language with a predominant word order of Subject-Object-Verb (SOV) and, like many Indian languages, the flexibility to rearrange constituents.

Dogri had its own script, “Dogare Akkhar” or “Dogare”, based on the Takri script, which is closely related to the Sharada script used for Kashmiri. This was the official script during the reign of Maharaja Ranbir Singh (1857-1885 AD). After independence, the state government constituted a committee on 29th October, 1953, headed by Sh. Girdhari Lal Dogra. The committee presented a report, and the state government accordingly decided to adopt the Devanagari as well as the Persian script for Dogri; this was incorporated in the State Constitution in 1957.

The LDC-IL speech data is collected from Jammu, from both genders and different age groups. The LDC-IL Dogri Speech data set consists of different types of datasets made up of words, sentences, running texts and date formats. Each speaker recorded items randomly selected from a master dataset.

The available Speech Corpus details:

Total Speakers: 61 (30 Female and 31 Male)

Domain | Audio Segments | Duration of Each Domain
Contemporary Text (News) | 60 | 4:27:51
Creative Text | 61 | 2:51:42
Sentence | 1527 | 1:24:48
Date Format | 122 | 0:14:07
Command and Control Words | 1830 | 1:24:31
Person Name | 1222 | 1:23:41
Place Name | 609 | 0:29:10
Most Frequent Word - Part | 1831 | 1:18:06
Most Frequent Word - Full Set | 2000 | 1:16:27
Phonetically Balanced | 2050 | 1:50:38
Form and Function - Word | 724 | 0:29:25

A detailed explanation of the Dogri Speech Corpus will be available in the Dogri Raw Speech Documentation.

For any research-based citations, please use the following citations:

Narayan Kumar Choudhary, Sunil Kumar Choudhary, Rajesha N., Manasa G. 2021. Dogri Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
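The raw speech corpora are listed as 48 kHz, 16 bit wav audio. A downloaded segment could be checked against that specification with Python's standard `wave` module, as in the sketch below. The function names are illustrative assumptions, and because the listing does not describe the file layout or channel count, the demonstration runs on a generated one-second silent file rather than on real corpus data.

```python
import io
import wave


def describe_wav(source, expected_rate=48_000, expected_sample_width=2):
    """Read a wav header and report whether it matches 48 kHz / 16 bit.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        sample_width = wav.getsampwidth()  # bytes per sample
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {
        "rate": rate,
        "bits": sample_width * 8,
        "channels": channels,
        "seconds": frames / rate,
        "matches_spec": rate == expected_rate
        and sample_width == expected_sample_width,
    }


def make_silent_wav(seconds=1, rate=48_000, sample_width=2, channels=1):
    """Build an in-memory wav of silence, used here only as stand-in data."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (seconds * rate * sample_width * channels))
    buffer.seek(0)
    return buffer


info = describe_wav(make_silent_wav())
print(info)
```

Summing the `seconds` value over all files in a domain would give a duration that can be compared with the table above.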