Multilingual Raw Speech Corpus
97:43:54 Hours | 62.2 GB speech data | 1,916 Speakers | 1,916 Audio segments | 48 kHz | 16 bit wav.

The LDC-IL Multi-Lingual Raw Speech Corpus dataset is extracted from the raw speech corpora published by LDC-IL in various Indian languages. It is built for applications such as language identification modules, which require samples from multiple languages; for exploring cross-linguistic variation and diatopic comparison, to determine what generalizations are possible about the types of variable features; and for building multilingual phoneme sets and models.

The Multi-Lingual speech dataset is sampled from the content type ‘Creative Text-T2’, which is extracted mainly from literary sources. The creative text of the LDC-IL Speech dataset comprises essays and short stories. One of these essays or short stories, selected randomly from a data set, is assigned to a speaker to read aloud. The same story may be read by multiple speakers.

The available Speech Corpus details:

Total Speakers 1,916 (958 Female and 958 Male)

Each row lists duration (H:MM:SS), number of speakers and data size for the two speaker groups, followed by the combined total.

Language   Duration  Speakers  Size   Duration  Speakers  Size   Total duration  Total speakers  Total size
Assamese   2:33:40   68        1.64   2:34:33   64        1.65   5:08:13         132             3.30
Bengali    2:38:34   56        1.59   2:47:32   61        1.69   5:26:06         117             3.29
Bodo       2:30:39   42        1.61   2:41:04   40        1.72   5:11:43         82              3.34
Dogri      1:16:44   30        0.84   1:35:00   31        1.01   2:51:44         61              1.84
Gujarati   2:32:10   45        1.63   2:30:40   42        1.61   5:02:50         87              3.25
Hindi      2:37:28   44        1.66   2:30:18   44        1.57   5:07:46         88              3.23
Kannada    2:37:06   45        1.68   2:32:50   48        1.63   5:09:56         93              3.32
Kashmiri   2:32:26   30        1.63   2:39:46   29        1.71   5:12:12         59              3.34
Konkani    2:50:24   62        1.82   2:41:25   62        1.74   5:31:49         124             3.57
Maithili   2:46:28   54        1.71   2:53:31   50        2.00   ...
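
The per-language figures give a duration for each of two speaker groups and then a combined duration. As a minimal sketch of how those H:MM:SS values can be checked against each other, the Python snippet below converts durations to seconds and confirms that the two group durations add up to the listed total. The helper names (`to_seconds`, `to_hms`, `check_rows`) and the `ROWS` sample are illustrative assumptions, not part of any LDC-IL tooling; only three rows from the listing are included.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string to a whole number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to an 'H:MM:SS' string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Hypothetical sample: (first group, second group, listed total),
# copied from three rows of the corpus details.
ROWS = {
    "Assamese": ("2:33:40", "2:34:33", "5:08:13"),
    "Bengali": ("2:38:34", "2:47:32", "5:26:06"),
    "Dogri": ("1:16:44", "1:35:00", "2:51:44"),
}


def check_rows(rows: dict) -> dict:
    """Return, per language, whether the two group durations sum to the total."""
    return {
        language: to_hms(to_seconds(first) + to_seconds(second)) == total
        for language, (first, second, total) in rows.items()
    }


if __name__ == "__main__":
    for language, ok in check_rows(ROWS).items():
        print(language, "consistent" if ok else "mismatch")
```

The same conversion could be applied to the remaining rows; the size columns are rounded to two decimals, so small differences between the group sizes and the listed total are to be expected there.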