Search - Tag - Speech


Tamil Sentence Aligned Speech Corpus

requests (1)

Dataset Description: 74:57:59 hours | 46.4 GB | 48,572 Audio Segments | 433 Speakers

The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Tamil Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Tamil script. This dataset spans a duration of 74:57:59 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 214 female and 219 male native Tamil speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Tamil Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:

1. Amudha R., Kamaraj S., Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Tamil Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-26-9.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3


Telugu Raw Speech Corpus

requests (12)

22:43:59 Hours | 15 GB | 80 Speakers | 10,510 Audio Segments | 48 kHz | 16 bit wav

Telugu is the official language of the states of Telangana and Andhra Pradesh. It belongs to the Dravidian language family and, among the Dravidian languages, is spoken by the largest population. Telugu is agglutinative in nature, and its vocabulary is heavily influenced by Sanskrit. LDC-IL treats Telugu as having three distinct varieties and therefore collected speech data from Telangana, Rayalaseema and Coastal Andhra. The LDC-IL Telugu Speech dataset consists of different types of datasets made up of word lists, sentences, running texts and date formats. Each speaker recorded these datasets, which were randomly selected from a master dataset. Speech is in .wav format and metadata is in .txt format.

The available Speech Corpus details:

Total Speakers: 80 (24 Female and 56 Male)

Domain | Audio Segments | Duration
Contemporary Text (News) | 77 | 8:28:19
Creative Text | 77 | 7:01:16
Sentence | 1,828 | 1:20:55
Date Format | 142 | 0:13:58
Command and Control Words | 2,170 | 1:43:49
Person Name | 1,438 | 1:09:31
Place Name | 707 | 0:33:24
Most Frequent Word - Part | 2,162 | 1:31:24
Most Frequent Word - Full Set | 1,909 | 0:41:23

A detailed explanation of the Telugu Speech Corpus will be available in the Telugu Speech Data Documentation.

For any research-based citations, please use the following citations:

1. Ramamoorthy, L., Narayan Choudhary & Rajesha N. 2019. Telugu Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
2. Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
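The per-domain figures listed for the Telugu corpus can be cross-checked against the headline totals (10,510 audio segments, 22:43:59). The following is a minimal sketch of that arithmetic; the helper names `to_seconds` and `to_hms` and the `TELUGU_DOMAINS` mapping are illustrative and not part of any LDC-IL tooling.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' duration string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert whole seconds back to an 'h:mm:ss' string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Domain -> (audio segments, duration), copied from the Telugu listing.
TELUGU_DOMAINS = {
    "Contemporary Text (News)": (77, "8:28:19"),
    "Creative Text": (77, "7:01:16"),
    "Sentence": (1828, "1:20:55"),
    "Date Format": (142, "0:13:58"),
    "Command and Control Words": (2170, "1:43:49"),
    "Person Name": (1438, "1:09:31"),
    "Place Name": (707, "0:33:24"),
    "Most Frequent Word - Part": (2162, "1:31:24"),
    "Most Frequent Word - Full Set": (1909, "0:41:23"),
}

total_segments = sum(count for count, _ in TELUGU_DOMAINS.values())
total_duration = to_hms(sum(to_seconds(d) for _, d in TELUGU_DOMAINS.values()))

print(total_segments, total_duration)  # 10510 22:43:59
```

The same check can be applied to the other raw speech listings by swapping in their domain rows.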


Urdu Raw Speech Corpus

requests (7)

99:18:21 Hours | 64.2 GB | 499 Speakers | 88,708 Audio Segments | 48 kHz | 16 bit wav

Urdu is one of the Modern Indo-Aryan languages of India. It evolved from Shaurseni Apabhramsha and uses the Perso-Arabic script. The language spoken in a region is influenced by the other languages of that region, the mother tongue of the speaker, and similar factors. Reading speed, loudness, frequency and so on also differ with factors such as age and gender. The Linguistic Data Consortium for Indian Languages (LDC-IL) collected the speech corpus through fieldwork. This read speech data was collected from male and female native speakers of various age groups. The data includes texts, sentences, date formats, and different word lists.

The available Speech Corpus details:

Total Speakers: 499 (252 Female and 247 Male)

Domain | Audio Segments | Duration
Contemporary Text (News) | 431 | 25:35:02
Creative Text | 433 | 19:40:11
Sentence | 10,646 | 8:00:38
Date Format | 846 | 0:43:37
Command and Control Words | 13,580 | 9:21:01
Person Name | 6,577 | 2:55:41
Place Name | 4,273 | 1:09:17
Most Frequent Word - Part | 12,802 | 7:46:28
Most Frequent Word - Full Set | 18,927 | 11:38:30
Phonetically Balanced Vocabulary | 13,646 | 8:13:20
Form and Function Word | 6,547 | 4:14:36

A detailed explanation of the Urdu Speech Corpus will be available in the Urdu Speech Data Documentation.

For any research-based citations, please use the following citations:

1. Ramamoorthy, L., Narayan Choudhary, Mansoor Khan, Shahnawaz Alam, Bi Bi Mariyam & Rushda Idris Khan. 2019. Urdu Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
2. Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
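The headline figures for the Urdu raw corpus (99:18:21 of audio at 48 kHz, 16 bit) allow a rough estimate of the uncompressed PCM payload. The sketch below works through that arithmetic. The channel count is not stated in the listing, so it is treated as an explicit assumption and both mono and stereo are computed; the function name `pcm_bytes` is illustrative, and WAV header overhead and metadata files are ignored.

```python
def pcm_bytes(duration_s: int, sample_rate_hz: int = 48_000,
              bits_per_sample: int = 16, channels: int = 1) -> int:
    """Uncompressed PCM payload size in bytes (no file headers)."""
    bytes_per_sample = bits_per_sample // 8
    return duration_s * sample_rate_hz * bytes_per_sample * channels


# 99:18:21 from the Urdu listing, expressed in seconds.
urdu_duration_s = 99 * 3600 + 18 * 60 + 21

# ASSUMPTION: the listing does not give the channel count, so try both.
for channels in (1, 2):
    size = pcm_bytes(urdu_duration_s, channels=channels)
    print(f"channels={channels}: {size / 1e9:.2f} GB (decimal), "
          f"{size / 2**30:.2f} GiB (binary)")
```

Comparing these estimates with the listed 64.2 GB gives a quick plausibility check before downloading, though the result depends on the assumed channel count and on which unit convention the listing uses.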


Urdu Sentence Aligned Speech Corpus

requests (1)

Dataset Description: 50:09:56 hours | 32.3 GB | 32,384 Audio Segments | 434 Speakers

The annotated speech corpus provides a wide range of linguistic information that is especially useful for phonetic analysis. The LDC-IL Urdu Sentence Aligned Speech dataset comprises audio files in wav format, accompanied by a corresponding textual layer containing phonetically normalized and orthographically normalized annotations in Perso-Arabic script. This dataset spans a duration of 50:09:56 (hh:mm:ss), consisting of read speech with continuous text, representative sentences, and date formats. The data is derived from 214 female and 219 male native Urdu speakers, encompassing diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Urdu Sentence Aligned Speech Documentation.

For any research-based citations, please use the following citations:

1. Mansoor Khan, Shahnawaz Alam, Bi Bi Mariyam, Rajesha N., Manasa G., Srikanth D., Stephen Fernandes, Nithin S., Narayan Kumar Choudhary, Shailendra Mohan. 2023. Urdu Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-87-0.
2. Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2023. Compendium of LDC-IL Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-81-19411-34-4.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3


Multilingual Raw Speech Corpus

requests (11)

97:43:54 Hours | 62.2 GB speech data | 1,916 Speakers | 1,916 Audio Segments | 48 kHz | 16 bit wav

The LDC-IL Multi-Lingual Raw Speech Corpus dataset is extracted from the raw speech corpora published by LDC-IL in various Indian languages. It is built to address the needs of applications such as language identification modules, which require samples from multiple languages; to explore cross-linguistic variation and diatopic comparison in order to determine what generalizations are possible about the types of variable features; and to build multilingual phoneme sets and models. The Multi-Lingual speech dataset is sampled from the content type ‘Creative Text-T2’, which is extracted mainly from literary sources. The creative text of the LDC-IL Speech dataset comprises essays and short stories. One of these essays or short stories, selected randomly from a dataset, is assigned to a speaker to read aloud. The same story may be read by multiple speakers.

The available Speech Corpus details:

Total Speakers: 1,916 (958 Female and 958 Male)

Assamese | 2:33:40 68 1.64 | 2:34:33 64 1.65 | 5:08:13 132 3.30
Bengali | 2:38:34 56 1.59 | 2:47:32 61 1.69 | 5:26:06 117 3.29
Bodo | 2:30:39 42 1.61 | 2:41:04 40 1.72 | 5:11:43 82 3.34
Dogri | 1:16:44 30 0.84 | 1:35:00 31 1.01 | 2:51:44 61 1.84
Gujarati | 2:32:10 45 1.63 | 2:30:40 42 1.61 | 5:02:50 87 3.25
Hindi | 2:37:28 44 1.66 | 2:30:18 44 1.57 | 5:07:46 88 3.23
Kannada | 2:37:06 45 1.68 | 2:32:50 48 1.63 | 5:09:56 93 3.32
Kashmiri | 2:32:26 30 1.63 | 2:39:46 29 1.71 | 5:12:12 59 3.34
Konkani | 2:50:24 62 1.82 | 2:41:25 62 1.74 | 5:31:49 124 3.57
Maithili | 2:46:28 54 1.71 | 2:53:31 50 2.00 | ...


Indian English Raw Speech Corpus - Kannada Variant

requests (5)

23:43:04 Hours | 15.3 GB | 56 Speakers | 14,455 Audio Segments | 48 kHz | 16 bit wav

English developed from Anglo-Saxon, the prominent language of Britain in the Middle Ages, and was carried to every corner of the world by colonists. It is the most visible legacy of the British in India: India was under the British Raj for almost two centuries, and English is part of the education system. Most Indian states use their own regional languages and share no common language, so English is used for inter-state communication. LDC-IL has 23 hours of Indian English – Kannada Variant speech data. The LDC-IL Indian English Speech dataset consists of different types of datasets made up of word lists, sentences, texts and date formats. Approximately 15 minutes of speech per speaker was recorded from 29 female and 27 male speakers of different age groups whose mother tongue is Kannada. Each speaker recorded these datasets, which were randomly selected from a master dataset.

The available Speech Corpus details:

Total Speakers: 56 (29 Female and 27 Male)

Domain | Audio Segments | Duration
Contemporary Text (News) | 52 | 7:19:31
Creative Text | 58 | 3:57:15
Sentence | 1,522 | 1:54:10
Date Format | 106 | 0:04:32
Command and Control Words | 2,543 | 1:55:43
Person Name | 2,040 | 0:39:43
Place Name | 762 | 2:38:49
Most Frequent Word - Part | 1,563 | 1:09:10
Most Frequent Word - Full Set | 3,999 | 2:49:55
Phonetically Balanced | 1,194 | 0:49:21
Form and Function - Word | 616 | 0:24:55

A detailed explanation of the Indian English Raw Speech Corpus - Kannada Variant will be available in the Indian English Raw Speech Corpus - Kannada Variant Documentation.

For any research-based citations, please use the following citations:

1. Ramamoorthy L., Narayan Kumar Choudhary, Bharatha Raju A., Rejitha KS, Rajesha N., Manasa G. 2021. Indian English Raw Speech Corpus - Kannada Variant. Central Institute of Indian Languages, Mysore.
2. Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
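The raw speech corpora on this page are listed as 48 kHz, 16 bit wav. Once a segment has been obtained, those parameters can be confirmed from the file header with Python's standard `wave` module. This is a minimal sketch: `describe_wav` and `matches_listing` are illustrative helper names, and because no corpus file paths appear in the listing, the demonstration runs on a synthetic one-second silent file rather than real data.

```python
import os
import tempfile
import wave


def describe_wav(path: str) -> dict:
    """Read the basic audio parameters from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        frame_rate = wav.getframerate()
        return {
            "sample_rate_hz": frame_rate,
            "bits_per_sample": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / frame_rate,
        }


def matches_listing(info: dict, sample_rate_hz: int = 48_000,
                    bits_per_sample: int = 16) -> bool:
    """True if the file matches the sample rate and bit depth in the listing."""
    return (info["sample_rate_hz"] == sample_rate_hz
            and info["bits_per_sample"] == bits_per_sample)


# Demonstration on a synthetic file (NOT corpus data): one second of
# mono silence written at 48 kHz / 16 bit, then read back.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)       # 2 bytes = 16 bits per sample
        out.setframerate(48_000)
        out.writeframes(b"\x00\x00" * 48_000)
    demo_info = describe_wav(demo_path)

print(demo_info, matches_listing(demo_info))
```

For real use, `describe_wav` would be called on each downloaded segment; the `wave` module only handles uncompressed PCM, which is consistent with the format stated in the listings.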