Multilingual Raw Speech Corpus

Catalogue Number: 1281

Dataset Description

97:43:54 hours | 62.2 GB speech data | 1,916 speakers | 1,916 audio segments | 48 kHz | 16-bit WAV
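As a rough cross-check of the figures above, uncompressed PCM size can be estimated as duration × sample rate × bytes per sample × channels. The sketch below applies that formula to the listed duration. The channel count is not stated on this page, so it is left as a parameter (an assumption, not a documented property), and WAV header overhead is ignored.

```python
def pcm_bytes(duration_s: int, sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Estimate raw PCM payload size in bytes (file headers not included)."""
    return duration_s * sample_rate_hz * (bits_per_sample // 8) * channels


# Listed total duration 97:43:54, converted to seconds.
duration_s = 97 * 3600 + 43 * 60 + 54

# The channel count is an assumption; both common layouts are shown.
for channels in (1, 2):
    size = pcm_bytes(duration_s, 48_000, 16, channels)
    print(f"{channels} channel(s): {size / 1e9:.2f} GB, {size / 2**30:.2f} GiB")
```

Comparing the two estimates with the listed 62.2 GB may hint at the channel layout and at the unit convention used, but that is an inference and should be confirmed against the corpus documentation.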

The LDC-IL Multilingual Raw Speech Corpus is extracted from the raw speech corpora published by LDC-IL in various Indian languages. It is built for applications that require samples from multiple languages, such as language identification modules; for exploring cross-linguistic variation and diatopic comparison, to determine what generalizations are possible about the types of variable features; and for building multilingual phoneme sets and models.

The sample for the Multilingual speech dataset is drawn from the content type ‘Creative Text-T2’, which is extracted mainly from literary sources. The creative text of the LDC-IL speech dataset comprises essays and short stories. One of these essays or short stories, selected at random from a data set, is assigned to a speaker to read aloud. The same story may be read by multiple speakers.
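The assignment described above can be pictured as sampling with replacement: each speaker receives one text drawn at random, so a text can be given to more than one speaker. The following is only a minimal sketch of that idea. The function name, the speaker and text identifiers, and the uniform draw are all hypothetical; the actual selection procedure used by LDC-IL is not specified here.

```python
import random


def assign_texts(speakers, texts, seed=None):
    """Give each speaker one text drawn at random, with replacement."""
    rng = random.Random(seed)  # a seeded generator keeps the sketch reproducible
    return {speaker: rng.choice(texts) for speaker in speakers}


# Hypothetical identifiers, for illustration only.
speakers = [f"speaker_{i:03d}" for i in range(1, 7)]
texts = ["essay_A", "story_B", "story_C"]

assignment = assign_texts(speakers, texts, seed=0)
for speaker, text in assignment.items():
    print(speaker, "->", text)
```

With more speakers than texts, at least one text is necessarily read by several speakers, which matches the note that the same story may be read by multiple speakers.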

The available speech corpus details:

Total speakers: 1,916 (958 female and 958 male)

Each row gives duration (H:MM:SS), number of speakers, and size (GB) for each of the two gender subsets (958 speakers each), followed by the language total.

Language    Subset 1                     Subset 2                     Total
            Duration  Speakers  GB       Duration  Speakers  GB       Duration  Speakers  GB
Assamese    2:33:40   68        1.64     2:34:33   64        1.65     5:08:13   132       3.30
Bengali     2:38:34   56        1.59     2:47:32   61        1.69     5:26:06   117       3.29
Bodo        2:30:39   42        1.61     2:41:04   40        1.72     5:11:43   82        3.34
Dogri       1:16:44   30        0.84     1:35:00   31        1.01     2:51:44   61        1.84
Gujarati    2:32:10   45        1.63     2:30:40   42        1.61     5:02:50   87        3.25
Hindi       2:37:28   44        1.66     2:30:18   44        1.57     5:07:46   88        3.23
Kannada     2:37:06   45        1.68     2:32:50   48        1.63     5:09:56   93        3.32
Kashmiri    2:32:26   30        1.63     2:39:46   29        1.71     5:12:12   59        3.34
Konkani     2:50:24   62        1.82     2:41:25   62        1.74     5:31:49   124       3.57
Maithili    2:46:28   54        1.71     2:53:31   50        2.00     5:39:59   104       3.48
Malayalam   2:38:16   68        1.69     2:28:17   61        1.59     5:06:33   129       3.29
Manipuri    2:15:42   29        1.45     2:44:43   32        1.76     5:00:25   61        3.22
Marathi     2:38:26   56        1.70     2:41:57   58        1.73     5:20:23   114       3.43
Nepali      2:51:09   44        1.83     2:58:41   52        1.91     5:49:50   96        3.75
Odia        2:38:24   63        1.70     2:32:10   60        1.63     5:10:34   123       3.33
Punjabi     2:41:13   67        1.72     2:35:40   62        1.66     5:16:53   129       3.40
Tamil       2:35:24   78        1.57     2:45:20   70        1.66     5:20:44   148       3.24
Telugu      2:06:18   24        1.33     3:00:40   38        1.93     5:06:58   62        3.27
Urdu        2:20:22   53        1.50     2:48:54   54        1.81     5:09:16   107       3.31
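The per-language figures can be checked against the headline numbers in the dataset description. The sketch below copies the last three values of each row above (total duration, total speakers, total size) and sums them. The helper names are illustrative, and treating the final value in each row as size in GB is an assumption, adopted because those values add up to the listed 62.2 GB.

```python
# (language, total duration H:MM:SS, total speakers, total size in GB),
# copied from the last three values of each table row.
ROWS = [
    ("Assamese", "5:08:13", 132, 3.30), ("Bengali", "5:26:06", 117, 3.29),
    ("Bodo", "5:11:43", 82, 3.34), ("Dogri", "2:51:44", 61, 1.84),
    ("Gujarati", "5:02:50", 87, 3.25), ("Hindi", "5:07:46", 88, 3.23),
    ("Kannada", "5:09:56", 93, 3.32), ("Kashmiri", "5:12:12", 59, 3.34),
    ("Konkani", "5:31:49", 124, 3.57), ("Maithili", "5:39:59", 104, 3.48),
    ("Malayalam", "5:06:33", 129, 3.29), ("Manipuri", "5:00:25", 61, 3.22),
    ("Marathi", "5:20:23", 114, 3.43), ("Nepali", "5:49:50", 96, 3.75),
    ("Odia", "5:10:34", 123, 3.33), ("Punjabi", "5:16:53", 129, 3.40),
    ("Tamil", "5:20:44", 148, 3.24), ("Telugu", "5:06:58", 62, 3.27),
    ("Urdu", "5:09:16", 107, 3.31),
]


def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total: int) -> str:
    """Convert seconds back to H:MM:SS."""
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_seconds = sum(to_seconds(duration) for _, duration, _, _ in ROWS)
total_speakers = sum(speakers for _, _, speakers, _ in ROWS)
total_gb = round(sum(size for _, _, _, size in ROWS), 2)

print(to_hms(total_seconds), total_speakers, total_gb)  # 97:43:54 1916 62.2
```

The 19 language totals reproduce the listed 97:43:54 duration, 1,916 speakers, and 62.2 GB.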


A detailed explanation of the Multi-Lingual Raw Speech Corpus will be available in the Multilingual Raw Speech Documentation.

 

For any research-based citations, please use the following citations: 

Item specifics

  • Authors: Narayan Kumar Choudhary, Rajesha N., Manasa G.
  • Corpus Type: Raw Corpus
  • Catalogue Number: 1281
  • ISBN: 978-81-948885-3-6
  • Data Source: On Field
  • Duration: 97:43:54
  • # of Audio Segments: 1,916
  • Release Date: 15-Jun-2021
  • Terms and Conditions: General instructions for use of the resources provided by LDC-IL (Commercial User; Non-Commercial User)
  • Related documents: LDC-IL Raw Text Corpora: An Overview; LDC-IL Raw Speech Corpora: An Overview
