Raw Corpus for Speech
Bengali is the official language of West Bengal and Tripura. It belongs to the Indo-Aryan language family. The LDC-IL Bengali speech data set consists of different types of word lists along with a sentence list, running text, and date formats. Approximately 15 minutes of speech (per speaker) has been taken from 223 female and 227 male native speakers wit..
176:53:28 hours of 113 Gigabytes speech data | 456 Speakers | 77443 Audio segments | 48 kHz | 16 bit wav. Bodo, one of the scheduled languages of India, is one of the tonal languages of the world. There are two clearly distinguishable tones in Bodo, known as Low and High. The language belongs to the Tibeto-Burman linguistic ..
Hindi is a major Indo-Aryan language, a descendant of Sanskrit, spoken in central and northern India. LDC-IL Hindi speech data of 118 hours. The LDC-IL Hindi speech data set consists of different types of datasets made up of word lists, sentences, running texts, and date formats. Approximately 15 minutes of speech (per ..
179:32:52 hours of 115 Gigabytes speech data | 656 Speakers | 99109 Audio segments | 48 kHz | 16 bit wav. Kannada is one of the ancient Indian languages and belongs to the Dravidian family. It has its own script. The language in a region is influenced by other languages of the region, the mother tongue of the speaker, etc. The reading speed, lo..
156:37:51 hours of 100 Gigabytes speech data | 504 Speakers | 72,938 Audio segments | 48 kHz | 16 bit wav. Konkani belongs to the Indo-European family of languages and is the official language of Goa. However, the language is spoken widely across four states: Maharashtra, Goa, Karnataka, and Kerala. Konkani is the only Indian language wr..
LDC-IL Maithili raw speech data of 72:02:12 (hh:mm:ss). The LDC-IL Maithili speech data set consists of different types of datasets made up of word lists, sentences, running texts, and date formats. The data is taken from 149 female and 151 male native speakers of different age groups. Each speaker reco..
164 hours; 43670 segments; 458 speakers. Malayalam is the official language of Kerala and the Laccadive Islands. It belongs to the Dravidian language family. According to the formation of Kerala, the language of the Travancore, Cochin, and Malabar regions is influenced by different internal and external factors, so LDC-IL considered Ma..
156:28:32 hours of Manipuri Raw Speech Corpus | 100 GB | 620 Speakers | 66,231 Audio segments | 48 kHz | 16 bit wav. Manipuri is the administrative language of Manipur. The development of LDC-IL speech data for Manipuri lies in capturing all the distinctive characteristics of speech shared by different regional dialec..
89:17:25 hours of 58 Gigabytes speech data | 307 Speakers | 58544 Audio segments | 48 kHz | 16 bit wav. Marathi is an Indo-Aryan language that has been prevalent since the 9th century. Standard Marathi (Puneri) is the official language of the state of Maharashtra. Standard Marathi is based on dialects used by academics and..
87:14:44 hours of 56.5 Gigabytes speech data | 350 Speakers | 48975 Audio segments | 48 kHz | 16 bit wav. Nepali belongs to the Indo-Aryan language family. Nepali is the official language of Nepal and of the Indian states of West Bengal and Sikkim, and is spoken in the states of Uttarakhand, Assam, Arunachal Pradesh, Manipur, Mizoram, and Bihar, and as..
Punjabi is one of the Indo-Aryan languages. Punjabi is a tonal language with three tones: high-falling, low-rising, and level (neutral). 101:09:28 hours of Punjabi speech data | 76,240 audio segments | 467 speakers | 65.5 GB | 48 kHz | 16 bit wav. LDC-IL Punjabi speech data of 101 hours. The LDC-IL Punjabi speech data set consists of di..
22:43:59 hours of 15 Gigabytes speech data | 80 Speakers | 10510 Audio segments | 48 kHz | 16 bit wav. Approximately 15 minutes of speech (per speaker) has been taken from 24 female and 56 male native speakers of different age groups. Each speaker recorded these datasets, which are randomly selected from a master dataset. Corpus Details: Total speakers 8..
99:18:21 hours, 64.2 Gigabytes of speech data | 499 Speakers | 88,708 Audio Segments | 48 kHz | 16 bit wav. Urdu is one of the modern Indo-Aryan languages of India. It evolved from Shaurseni Apabhramsha. It uses the Perso-Arabic script. The language in a region is influenced by other languages of the region, mother tongue of the s..
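
The entries above give a duration in hh:mm:ss, a sample rate of 48 kHz, and a 16 bit sample depth, which is enough to roughly sanity-check a listed size. The Python sketch below is only an illustration: the function names `parse_duration` and `pcm_size_bytes` are hypothetical, the channel count is not stated in the listing and is treated as an assumption, and WAV header overhead is ignored. It uses the first listed duration, 176:53:28, as sample input.

```python
def parse_duration(text: str) -> int:
    """Convert an 'hh:mm:ss' duration string into total seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def pcm_size_bytes(seconds: int, sample_rate: int = 48_000,
                   bits_per_sample: int = 16, channels: int = 1) -> int:
    """Estimate uncompressed PCM payload size in bytes (WAV headers ignored)."""
    return seconds * sample_rate * (bits_per_sample // 8) * channels


if __name__ == "__main__":
    total = parse_duration("176:53:28")
    # The channel count is an assumption: the listing does not state it,
    # so both mono and stereo estimates are printed for comparison.
    for channels in (1, 2):
        size = pcm_size_bytes(total, channels=channels)
        print(f"{channels} channel(s): {size / 1e9:.1f} GB, {size / 2**30:.1f} GiB")
```

Comparing the printed estimates with the size given in an entry can hint at how the recordings are stored, but the listing itself remains the authority on the actual figures.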