97:43:54 Hours | 62.2 GB speech data | 1,916 Speakers | 1,916 Audio segments | 48 kHz | 16-bit WAV.
The LDC-IL Multi-Lingual Raw Speech Corpus is extracted from the raw speech corpora published by LDC-IL in various Indian languages. It is built to serve applications such as language identification modules, which require samples from multiple languages; to support the study of cross-linguistic variation and diatopic comparison, in order to determine what generalizations are possible about the types of variable features; and to help build multilingual phoneme sets and models.
The Multi-Lingual speech dataset is sampled from the ‘Creative Text-T2’ content type, which is drawn mainly from literary sources. The creative text of the LDC-IL Speech dataset comprises essays and short stories. One of these essays or short stories, selected at random from the dataset, is assigned to each speaker to read aloud. The same story may be read by multiple speakers.
The available Speech Corpus details:
Total Speakers 1916 (958 Female and 958 Male)
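As a quick sanity check after download, a file's header can be compared against the stated format (48 kHz, 16 bit). The sketch below is illustrative only: it uses Python's standard `wave` module, the function name `check_wav_format` is hypothetical, and it assumes the files are uncompressed PCM WAV. The channel count is not stated in this listing, so it is reported but not checked.

```python
import wave

# Values taken from the corpus summary: 48 kHz sampling rate, 16-bit samples.
EXPECTED_RATE_HZ = 48_000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits = 2 bytes


def check_wav_format(path):
    """Read a WAV header and compare it with the stated corpus format.

    Returns (ok, info), where ok is True when both the sampling rate and the
    sample width match, and info holds the values found in the header.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()
        frames = wav.getnframes()

    info = {
        "rate_hz": rate,
        "sample_width_bytes": width,
        "channels": channels,  # reported only; not stated in the listing
        "duration_s": frames / rate if rate else 0.0,
    }
    ok = rate == EXPECTED_RATE_HZ and width == EXPECTED_SAMPLE_WIDTH_BYTES
    return ok, info
```

Calling `check_wav_format("some_segment.wav")` on a downloaded file (the file name here is a placeholder) returns the match flag together with the header values, which makes mismatched or resampled files easy to spot.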
A detailed explanation of the Multi-Lingual Raw Speech Corpus will be available in the Multilingual Raw Speech Documentation.
For research citations, please use the following details:
- Authors: Narayan Kumar Choudhary, Rajesha N., Manasa G.
- Corpus Type: Raw Corpus
- Catalogue Number: 1281
- ISBN: 978-81-948885-3-6
- Data Source: On Field
- Duration: 97:43:54
- Number of Audio Segments: 1916
- Release Date: 15/06/2021
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.
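The duration and segment count listed above give a rough idea of how long each recording is. The following is a minimal sketch of that arithmetic; the helper name `hms_to_seconds` is hypothetical, and the result is only an average, since individual segment lengths are not listed here.

```python
def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Figures copied from the catalogue entry.
TOTAL_DURATION = "97:43:54"
AUDIO_SEGMENTS = 1916

total_seconds = hms_to_seconds(TOTAL_DURATION)
average_seconds = total_seconds / AUDIO_SEGMENTS

print(f"Total duration: {total_seconds} s")
print(f"Average segment length: {average_seconds:.1f} s")
```

With these figures the average comes to roughly three minutes per segment, which is consistent with each speaker reading a single essay or short story.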