Speech

Speech type resource

Kashmiri Raw Speech Corpus

requests (10)

28:10:07 Hours | 18 GB speech data | 150 Speakers | 16,380 Audio segments | 48 kHz | 16 bit wav.

The Kashmiri language belongs to the Dardic group of the Indo-Aryan family. It is known by the names ‘Kashur’ and ‘Kashmiri’. It is primarily spoken in the Kashmir valley and the Pir-Panchal range of the Jammu region. Kashmiri has two types of dialects: regional dialects and social dialects. Apart from the Kashmiri spoken in the valley itself, other varieties are spoken outside the valley, and these are considered regional dialects of Kashmiri: Kishtawari, Poguli and Rambani. Kashmiri also has three social dialects, known by the names Yamraz, Marak and Kamraz.

The LDC-IL speech data was collected in the Kashmir Valley, from Pulwama, Srinagar, and Anantnag, from both genders and different age groups. The LDC-IL Kashmiri Speech data consists of different types of datasets made up of words, sentences, running texts and date formats. Each speaker recorded datasets randomly selected from a master dataset.

The available Speech Corpus details:
Total Speakers: 150 (78 Female and 72 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 147 | 3:56:57
Creative Text | 148 | 12:41:33
Sentence | 3704 | 2:40:24
Date Format | 281 | 0:10:36
Command and Control Words | 4288 | 3:04:32
Person Name | 2065 | 1:53:21
Place Name | 1468 | 1:04:37
Most Frequent Word - Part | 4279 | 2:38:07

A detailed explanation of the Kashmiri Speech Corpus will be available in the Kashmiri Speech Data Documentation.

For any research-based citations, please use the following citations:
Narayan Kumar Choudhary, Shahid Mushtaq Bhatt, Rajesha N., Manasa G. 2021. Kashmiri Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
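The per-domain figures listed for the Kashmiri corpus can be checked against its headline totals. The following is a minimal sketch of that arithmetic; the helper names `to_seconds` and `to_hms` are illustrative and are not part of any LDC-IL tooling.

```python
def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Format seconds as h:mm:ss without wrapping hours at 24."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# (domain, audio segments, duration) as listed for the Kashmiri Raw Speech Corpus.
KASHMIRI_DOMAINS = [
    ("Contemporary Text (News)", 147, "3:56:57"),
    ("Creative Text", 148, "12:41:33"),
    ("Sentence", 3704, "2:40:24"),
    ("Date Format", 281, "0:10:36"),
    ("Command and Control Words", 4288, "3:04:32"),
    ("Person Name", 2065, "1:53:21"),
    ("Place Name", 1468, "1:04:37"),
    ("Most Frequent Word - Part", 4279, "2:38:07"),
]

total_segments = sum(segments for _, segments, _ in KASHMIRI_DOMAINS)
total_duration = to_hms(sum(to_seconds(d) for _, _, d in KASHMIRI_DOMAINS))

print(total_segments)  # 16380
print(total_duration)  # 28:10:07
```

Both results agree with the headline figures of 16,380 audio segments and 28:10:07 hours.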

Konkani Raw Speech Corpus

requests (12)

156:37:51 Hours | 100 GB | 504 Speakers | 72,938 Audio Segments | 48 kHz | 16 bit wav.

Konkani belongs to the Indo-European family of languages. Konkani is the official language of Goa. However, the language is spoken widely across four states: Maharashtra, Goa, Karnataka and Kerala. Konkani is the only Indian language written in five different scripts: Devanagari, Roman, Kannada, Malayalam, and Persian-Arabic.

The LDC-IL speech data was collected from the regions of North Goa, South Goa, Karwar (Karnataka) and Sindhudurg (Maharashtra), from both genders and different age groups. Approximately 15 to 20 minutes of speech per speaker was taken from 267 female and 237 male native speakers. Each speaker recorded datasets randomly selected from a master dataset.

The available Speech Corpus details:
Total Speakers: 504 (267 Female and 237 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 477 | 49:52:09
Creative Text | 480 | 22:09:05
Sentence | 12,050 | 15:51:11
Date Format | 953 | 01:50:39
Command and Control Words | 14,944 | 16:11:02
Person Name | 9,588 | 15:55:43
Place Name | 4,812 | 05:31:03
Most Frequent Word - Part | 16,376 | 16:03:13
Most Frequent Word - Full Set | 5,998 | 05:55:07
Phonetically Balanced | 2,975 | 02:49:36
Form and Function - Word | 4,285 | 04:29:03

A detailed explanation of the Konkani Speech Corpus will be available in the Konkani Speech Data Documentation.

For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Saurabh Varik & Rashmi Shet Tanawade. 2019. Konkani Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
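The stated figure of roughly 15 to 20 minutes of speech per speaker can be cross-checked from the Konkani headline numbers. This is only a rough sketch: it assumes the total duration is spread evenly over all 504 speakers, which the listing does not claim, and the helper name `to_seconds` is illustrative.

```python
def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Headline figures as listed for the Konkani Raw Speech Corpus.
konkani_total_duration = "156:37:51"
konkani_speakers = 504

# Mean recording time per speaker, in minutes (assumes an even split).
average_minutes = to_seconds(konkani_total_duration) / konkani_speakers / 60

print(round(average_minutes, 1))  # 18.6
```

The mean of about 18.6 minutes falls inside the 15 to 20 minute range given in the description.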

Konkani Sentence Aligned Speech Corpus

requests (1)

Dataset Description: 83:19:42 hours | 53.5 GB | 34,091 Audio Segments | 487 speakers

The annotated speech corpus gives a wide range of linguistic information, especially useful for analysing phonetics. The LDC-IL Konkani Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. This dataset spans a duration of 83:19:42 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 259 female and 228 male native Konkani speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Konkani Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:
1. Saurabh Varik, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Konkani Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-62-7.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3..

Maithili Raw Speech Corpus

requests (13)

78:45:33 Hours | 49.2 GB | 306 Speakers | 45,198 Audio Segments | 48 kHz | 16 bit wav.

Maithili is an Indo-Aryan language, a direct descendant of Sanskrit, which is spoken in the states of Bihar and Jharkhand and in parts of Nepal. It is one of the scheduled languages of India. The LDC-IL speech data was collected from the Sotipura, Bajjika and Thethi geographic dialects, from both genders and different age groups.

The available Speech Corpus details:
Total Speakers: 306 (150 Female and 156 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 291 | 22:33:41
Creative Text | 294 | 15:34:55
Sentence | 7,451 | 07:08:48
Date Format | 585 | 00:31:41
Command and Control Words | 8,924 | 07:07:34
Person Name | 5,917 | 07:49:33
Place Name | 2,952 | 02:47:49
Most Frequent Word - Part | 8,699 | 06:56:24
Most Frequent Words - Full Set | 5,996 | 04:58:30
Phonetically Balanced Words | 3,040 | 02:26:27
Form and Function Words | 1,049 | 00:50:11

A detailed explanation of the Maithili Speech Corpus will be available in the Maithili Speech Data Documentation.

For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Arun Kumar Singh, Dinesh Mishra & Atuleshwar Jha. 2019. Maithili Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
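The raw corpora in this listing are described as 48 kHz, 16 bit wav. A segment's header could be compared against those values with Python's standard `wave` module, as in the sketch below. The function names and the file name `demo_segment.wav` are hypothetical, and the demo file is synthetic mono silence generated on the spot (the channel count is an assumption; the listing does not state it), not a file from the corpus.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return (sample_rate_hz, bits_per_sample, channels, duration_seconds)."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        bits_per_sample = wav_file.getsampwidth() * 8
        channels = wav_file.getnchannels()
        duration_seconds = wav_file.getnframes() / sample_rate
    return sample_rate, bits_per_sample, channels, duration_seconds


def matches_listing(path, expected_rate=48_000, expected_bits=16):
    """Check a file against the sample rate and bit depth given in the listing."""
    sample_rate, bits_per_sample, _, _ = describe_wav(path)
    return sample_rate == expected_rate and bits_per_sample == expected_bits


# Demo on a synthetic file: one second of 16 bit mono silence at 48 kHz.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_segment.wav")
    with wave.open(demo_path, "wb") as demo:
        demo.setnchannels(1)       # mono is an assumption for the demo only
        demo.setsampwidth(2)       # 2 bytes per sample = 16 bit
        demo.setframerate(48_000)
        demo.writeframes(b"\x00\x00" * 48_000)
    demo_info = describe_wav(demo_path)
    demo_ok = matches_listing(demo_path)

print(demo_info)  # (48000, 16, 1, 1.0)
print(demo_ok)    # True
```

For real data, `matches_listing` would be called on the path of a downloaded segment in place of the synthetic file.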

Maithili Sentence Aligned Speech Corpus

requests (4)

Dataset Description: 41:54:30 hours | 26 GB | 21,412 Audio Segments | 300 speakers

The annotated speech corpus gives a wide range of linguistic information, especially useful for analysing phonetics. The LDC-IL Maithili Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. This dataset spans a duration of 41:54:30 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 147 female and 153 male native Maithili speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Maithili Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:
1. Shantanu Kumar, Dinesh Mishra, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Maithili Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-96-2.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3..

Malayalam Raw Speech Corpus

requests (17)

164:01:02 Hours | 105 GB | 458 Speakers | 43,670 Audio Segments | 48 kHz | 16 bit wav.

Malayalam is the official language of Kerala and the Laccadive Islands. It belongs to the Dravidian language family. Because the language of the Travancore, Cochin, and Malabar regions has been shaped by different internal and external factors, LDC-IL treats Malayalam as having three distinct varieties and therefore collected speech data from Thiruvananthapuram, Ernakulam, and Kozhikode. LDC-IL has 164 hours of Malayalam speech data. The LDC-IL Malayalam Speech data set consists of different types of datasets made up of word lists, sentences, running texts and date formats. Approximately 15 minutes of speech per speaker was taken from 231 female and 227 male native speakers of different age groups. Each speaker recorded datasets randomly selected from a master dataset.

The available Speech Corpus details:
Total Speakers: 458 (231 Female and 227 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 449 | 71:29:21
Creative Text | 449 | 54:41:20
Sentence | 7,452 | 06:56:46
Date Format | 598 | 00:53:45
Command and Control Words | 8,923 | 07:09:37
Person Name | 5,819 | 05:26:33
Place Name | 2,906 | 02:28:24
Most Frequent Word - Part | 8,763 | 06:51:31
Most Frequent Word - Full Set | 1,979 | 02:08:58
Phonetically Balanced | 3,096 | 02:40:09
Form and Function - Word | 3,236 | 03:14:38

A detailed explanation of the Malayalam Speech Corpus will be available in the Malayalam Speech Data Documentation.

For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Saritha S.L., Rejitha K.S., Sajila S. & Midhun P. G. 2019. Malayalam Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...

Malayalam Sentence Aligned Speech Corpus

requests (4)

Dataset Description: 123:29:55 hours | 79.6 GB | 89,269 Audio Segments | 451 speakers

The annotated speech corpus gives a wide range of linguistic information, especially useful for analysing phonetics. The LDC-IL Malayalam Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Malayalam script. This dataset spans a duration of 123:29:55 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 229 female and 222 male native Malayalam speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Malayalam Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:
1. Rejitha K. S., Sajila S., Saritha S.L., Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Malayalam Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-58-0.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3..

Manipuri Raw Speech Corpus

requests (9)

156:28:32 hours | 100 GB | 620 Speakers | 66,231 Audio segments | 48 kHz | 16 bit wav.

Manipuri is the administrative language of Manipur. The LDC-IL speech data for Manipuri aims to capture the distinctive characteristics of speech across the regional dialects of Manipur. To that end, linguistic features that identify regional tones and intonations, phonemic distributions, and varying pronunciations of regional and non-regional vocabulary items, such as person names and place names, are covered according to the dataset's standard parameters. For this read speech corpus, the subset assigned to each speaker is randomly generated from the entire dataset, so each speaker reads one random set. A limited number of full sets are read in their entirety by selected speakers in each age group.

The data was collected through fieldwork from three regional dialects, namely Imphal, Kakching, and Awang Sekmai. The age groups selected for fieldwork are ‘16 to 20’, ‘21 to 50’, and ‘above 50 years’. Data was collected from an equal number of male and female speakers in each age group.

The available Speech Corpus details:
Total Speakers: 620 (310 Female and 310 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 530 | 59:47:22
Creative Text | 588 | 53:59:03
Sentence | 10,979 | 10:01:41
Date Format | 866 | 01:12:04
Command and Control Words | 13,129 | 08:00:02
Person Name | 8,789 | 07:14:04
Place Name | 4,394 | 02:46:29
Most Frequent Word - Part | 13,167 | 06:48:50
Most Frequent Word - Full Set | 6,992 | 02:48:42
Phonetically Balanced | 4,518 | 02:25:53
Form and Function - Word | 2,279 | 01:23:50

A detailed explanation of the Manipuri Speech Corpus will be available in the Manipuri Raw Speech Corpus Documentation.

For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Amom Nandaraj Meetei, Yumnam Premila Chanu & Longjam Anand Singh. 2019. Manipuri Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...

Marathi Raw Speech Corpus

requests (17)

89:17:25 Hours | 58 GB speech data | 307 Speakers | 58,544 Audio segments | 48 kHz | 16 bit wav.

Marathi is an Indo-Aryan language that has been prevalent since the 9th century. Standard Marathi (Puneri) is the official language of the State of Maharashtra. Standard Marathi is based on dialects used by academics and the print media. Marathi is believed to have been influenced by Sanskrit. It is written in the Devanagari script. The phoneme inventory of Marathi is similar to that of many other Indo-Aryan languages.

The LDC-IL speech data was collected from the regions of Marathwada, Puneri, Vidarbha, and Goa, from both genders and different age groups. Each speaker recorded datasets randomly selected from a master dataset.

The available Speech Corpus details:
Total Speakers: 307 (156 Female and 151 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 302 | 22:26:06
Creative Text | 302 | 13:37:34
Sentence | 7,555 | 6:49:58
Date Format | 604 | 0:39:57
Command and Control Words | 9,068 | 7:50:10
Person Name | 6,058 | 7:44:56
Place Name | 3,037 | 2:49:32
Most Frequent Word - Part | 9,104 | 7:22:57
Most Frequent Word - Full Set | 10,987 | 9:53:28
Phonetically Balanced | 4,609 | 4:10:47
Form and Function - Word | 6,918 | 5:52:00

A detailed explanation of the Marathi Speech Corpus will be available in the Marathi Speech Data Documentation.

For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Gajanan R Apine & Apurva P Betkekar. 2019. Marathi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...

Marathi Sentence Aligned Speech Corpus

requests (3)

Dataset Description: 41:34:04 hours | 26.7 GB | 23,234 Audio Segments | 302 speakers

The annotated speech corpus gives a wide range of linguistic information, especially useful for analysing phonetics. The LDC-IL Marathi Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. This dataset spans a duration of 41:34:04 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 153 female and 149 male native Marathi speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Marathi Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:
1. Bhageshree K Khandale, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Marathi Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-92-4.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3..

Nepali Raw Speech Corpus

requests (8)

87:14:44 Hours | 56.5 GB | 350 Speakers | 48,975 Audio Segments | 48 kHz | 16 bit wav.

Nepali belongs to the Indo-Aryan language family. Nepali is the official language of Nepal and of the Indian states of West Bengal and Sikkim. It is also spoken in the states of Uttaranchal, Assam, Arunachal Pradesh, Manipur, Mizoram and Bihar, as well as in other countries such as Myanmar and Bhutan. It is written in Devanagari script. The LDC-IL Nepali speech data was collected from the regions of Darjeeling, Assam and Dehradun, from both genders and different age groups. The LDC-IL Nepali Speech data set consists of different types of datasets made up of word lists, sentences, running texts and date formats.

The available Speech Corpus details:
Total Speakers: 350 (187 Female and 163 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 343 | 14:33:19
Creative Text | 341 | 19:46:34
Sentence | 8,583 | 13:45:34
Date Format | 1,029 | 00:57:20
Command and Control Words | 10,308 | 08:44:19
Person Name | 6,878 | 09:15:04
Place Name | 3,398 | 03:20:06
Most Frequent Word - Part | 10,292 | 08:51:06
Most Frequent Word - Full Set | 2,994 | 03:41:39
Phonetically Balanced | 3,321 | 03:00:08
Form and Function - Word | 1,488 | 01:19:35

A detailed explanation of the Nepali Speech Corpus will be available in the Nepali Speech Data Documentation.

For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Samar Sinha, Jeena Rai, Umesh Chamling Rai & Rupesh Rai. 2019. Nepali Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...

Nepali Sentence Aligned Speech Corpus

requests (3)

Dataset Description: 43:04:23 hours | 27.7 GB | 21,481 Audio Segments | 346 speakers

The annotated speech corpus gives a wide range of linguistic information, especially useful for analysing phonetics. The LDC-IL Nepali Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Devanagari script. This dataset spans a duration of 43:04:23 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 187 female and 159 male native Nepali speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Nepali Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:
1. Umesh Chamling Rai, Rupesh Rai, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Nepali Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-98-6.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3..

Odia Raw Speech Corpus

requests (7)

138:06:18 hours | 89 GB | 474 Speakers | 73,418 Audio segments | 48 kHz | 16 bit wav.

Odia is an Indo-Aryan language mainly spoken in the state of Odisha and also in some neighbouring states such as West Bengal, Jharkhand, Chhattisgarh and Andhra Pradesh. It has been given Classical Language status by the Govt. of India. The LDC-IL Odia speech data was collected from the Central and Northern parts of Odisha, from both genders and different age groups. This data consists of different types of datasets made up of word lists, sentences, running texts and date formats.

The available Speech Corpus details:
Total Speakers: 474 (239 Female and 235 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 449 | 42:49:56
Creative Text | 450 | 19:43:50
Sentence | 11,248 | 8:22:57
Date Format | 900 | 1:27:49
Command and Control Words | 13,499 | 14:18:49
Person Name | 8,998 | 5:01:40
Place Name | 4,496 | 13:22:45
Most Frequent Word - Part | 8,994 | 9:40:04
Most Frequent Word - Full Set | 10,989 | 10:21:04
Phonetically Balanced | 10,438 | 10:05:10
Form and Function - Word | 2,957 | 2:52:14

A detailed explanation of the Odia Speech Corpus will be available in the Odia Raw Speech Data Documentation.

For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Raja Kumar Naik, Pramod Kumar Rout, Kshirod Kumar Das & Santosh Kumar Mohanty. 2021. Odia Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...

Punjabi Raw Speech Corpus

requests (11)

101:09:28 Hours | 65.5 GB | 467 Speakers | 76,230 Audio Segments | 48 kHz | 16 bit wav.

Punjabi is an Indo-Aryan language. It is a tonal language with three tones: high-falling, low-rising, and level (neutral). Punjabi is spoken not only in India but also in Pakistan, where it is called Shahmukhi Punjabi; this corpus covers only Indian Gurmukhi Punjabi. The Punjabi language has four different dialects, spoken in different sub-regions of Punjab. The LDC-IL Punjabi Speech data set consists of different types of datasets made up of word lists, sentences, running texts and date formats. Each speaker recorded datasets randomly selected from a master dataset. LDC-IL collected speech data from the Malwa, Doab and Puadh regions.

The available Speech Corpus details:
Total Speakers: 467 (234 Female and 233 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 448 | 27:07:41
Creative Text | 446 | 19:29:15
Sentence | 11,168 | 08:58:33
Date Format | 887 | 00:27:53
Command and Control Words | 13,274 | 07:49:16
Person Name | 8,949 | 10:28:40
Place Name | 4,473 | 03:17:02
Most Frequent Word - Part | 8,889 | 05:21:56
Most Frequent Word - Full Set | 3,988 | 02:52:44
Phonetically Balanced | 13,939 | 08:56:04
Form and Function - Word | 9,769 | 06:24:07

A detailed explanation of the Punjabi Speech Corpus will be available in the Punjabi Speech Data Documentation.

For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Poonam Dhillon & Sarbjeet Kaur. 2019. Punjabi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...

Tamil Raw Speech Corpus

requests (8)

139:11:41 Hours | 86 GB speech data | 452 Speakers | 60,287 Audio segments | 48 kHz | 16 bit wav.

Tamil is one of the longest-surviving classical languages in the world. It is one of the prominent languages of the Dravidian language family. Tamil is widely spoken in the state of Tamil Nadu, the Union Territory of Pondicherry, and Sri Lanka; in East Asian countries such as Burma, Malaysia, Singapore, Indonesia, Indochina and Fiji; in South Africa and British Guiana; and in islands such as Mauritius and Madagascar. It has official status in the Indian state of Tamil Nadu and the Indian Union Territory of Pondicherry, and in countries such as Sri Lanka and Singapore. Tamil has its own script. The language is highly agglutinative in nature. Tamil has phonological simplicity, morphological parity and primitiveness. All affixes in Tamil are separable and significant. Tamil lacks a nominative case termination and arbitrary words.

The LDC-IL speech data was collected from the regions of Kongu, Kumari, Madurai, Nellai, Salem and Thanjai, from both genders and different age groups. Each speaker recorded datasets randomly selected from a master dataset.

The available Speech Corpus details:
Total Speakers: 452 (214 Female and 219 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 433 | 57:53:48
Creative Text | 429 | 14:21:31
Sentence | 10,764 | 14:51:03
Date Format | 842 | 01:20:17
Command and Control Words | 12,882 | 12:57:06
Person Name | 8,755 | 03:57:29
Place Name | 4,002 | 10:34:38
Most Frequent Word - Part | 12,813 | 11:14:05
Most Frequent Word - Full Set | 2,000 | 02:26:05
Phonetically Balanced | 3,860 | 04:55:10
Form and Function - Word | 3,507 | 04:40:29

A detailed explanation of the Tamil Speech Corpus will be available in the Tamil Speech Data Documentation.

For any research-based citations, please use the following citations:
Ramamoorthy, L., Narayan Choudhary, Thennarasu S, Prem Kumar L R, Amudha R, Prabagaran R, Srikanth D. 2021. Tamil Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Narayan Choudhary, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...

Showing 16 to 30 of 37 (3 Pages)