Telugu Raw Speech Corpus

Catalogue Number: 1173



Dataset Description

22:43:59 hours (15 GB) of speech data | 80 speakers | 10510 audio segments | 48 kHz | 16-bit WAV

Approximately 15 minutes of speech per speaker was recorded from 24 female and 56 male native speakers of different age groups. Each speaker recorded items that were randomly selected from a master dataset.
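The stated audio format (48 kHz, 16 bit, WAV) can be checked per file with Python's standard `wave` module. The following is a minimal sketch, not part of the LDC-IL distribution: the function names are hypothetical, and the channel count is only reported because the description does not state it.

```python
import wave

# Values taken from the dataset description above.
EXPECTED_RATE_HZ = 48_000      # "48 kHz"
EXPECTED_BITS_PER_SAMPLE = 16  # "16 bit"


def describe_wav(path):
    """Read basic format facts from the header of a WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "sample_rate_hz": rate,
            "bits_per_sample": wav.getsampwidth() * 8,  # bytes -> bits
            "channels": wav.getnchannels(),             # not stated in the description
            "duration_s": frames / rate if rate else 0.0,
        }


def matches_stated_format(info):
    """True if the file has the sample rate and bit depth the description gives."""
    return (
        info["sample_rate_hz"] == EXPECTED_RATE_HZ
        and info["bits_per_sample"] == EXPECTED_BITS_PER_SAMPLE
    )
```

Calling `describe_wav` on each segment and summing `duration_s` would also give a per-speaker duration to compare against the approximate 15 minutes mentioned above.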

Corpus Details:

  • Total speakers: 80 (24 female and 56 male)
  • Speech in .wav format; metadata in .txt format
  • Contemporary Text (News) - 77 Audio Segments - 8:28:19 hours
  • Creative Text - 77 Audio Segments - 7:10:35 hours
  • Sentence - 1828 Audio Segments - 1:39:00 hours
  • Date - 142 Audio Segments - 0:14:49 hours
  • Command and Control Words - 2170 Audio Segments - 1:43:49 hours
  • Person Name - 1438 Audio Segments - 1:09:31 hours
  • Place Name - 707 Audio Segments - 0:33:24 hours
  • Most Frequent Word-Part - 2162 Audio Segments - 1:33:31 hours
  • Most Frequent Word-FullSet - 1909 Audio Segments - 0:41:23 hours
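The per-category figures above can be tallied as a quick consistency check. The sketch below only restates the listed numbers; the variable and function names are illustrative. The segment counts add up to the catalogue total of 10510. The listed category durations sum to 23:14:21, which differs from the stated overall duration of 22:43:59; the page does not explain the difference.

```python
# (category, audio segments, duration as h:mm:ss), copied from the list above.
CATEGORIES = [
    ("Contemporary Text (News)", 77, "8:28:19"),
    ("Creative Text", 77, "7:10:35"),
    ("Sentence", 1828, "1:39:00"),
    ("Date", 142, "0:14:49"),  # 142 is read as the segment count for Date
    ("Command and Control Words", 2170, "1:43:49"),
    ("Person Name", 1438, "1:09:31"),
    ("Place Name", 707, "0:33:24"),
    ("Most Frequent Word-Part", 2162, "1:33:31"),
    ("Most Frequent Word-FullSet", 1909, "0:41:23"),
]


def to_seconds(hms):
    """Convert an h:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Convert a number of seconds back to an h:mm:ss string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(count for _, count, _ in CATEGORIES)
total_duration = to_hms(sum(to_seconds(d) for _, _, d in CATEGORIES))

print(total_segments)  # 10510
print(total_duration)  # 23:14:21
```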


Telugu is the official language of the states of Telangana and Andhra Pradesh. It belongs to the Dravidian language family and has the largest number of speakers among the Dravidian languages. Telugu is agglutinative, and its vocabulary is heavily influenced by Sanskrit. LDC-IL treats Telugu as having three distinct regional varieties and therefore collected speech data from Telangana, Rayalaseema and Coastal Andhra. The LDC-IL Telugu speech dataset consists of several types of data: word lists, sentences, running texts and date formats.

A more detailed description of the Telugu speech corpus will be available in the Telugu Speech Data Documentation.

For research-based citations, please use the following:

  • Ramamoorthy, L., Narayan Choudhary & Rajesha N. 2019. Telugu Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
  • Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. "LDC-IL Raw Speech Corpora: An Overview", in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Item specifics

  • Authors: Ramamoorthy L., Narayan Choudhary, Rajesha N.
  • Corpus Type: Raw Corpus
  • Catalogue Number: 1173
  • ISBN: 978-81-7343-272-9
  • Data Source: On Field
  • Duration: 22:43:59
  • # of Audio Segments: 10510
  • Release Date: 04/04/2019
  • Terms and Conditions: General instructions for use of the resources provided by LDC-IL (Commercial User, Non-Commercial User)
