Dataset Description
139:11:41 hours (HH:MM:SS) | 86 GB speech data | 452 speakers | 60,287 audio segments | 48 kHz | 16-bit WAV
Tamil is one of the longest-surviving classical languages in the world and one of the prominent languages of the Dravidian language family. It is widely spoken in the state of Tamil Nadu and the Union Territory of Pondicherry, in Sri Lanka, in Southeast Asian countries such as Burma, Malaysia, Singapore and Indonesia, and in Indochina, Fiji, South Africa, British Guiana and islands such as Mauritius and Madagascar. It is an official language of the Indian state of Tamil Nadu and the Indian Union Territory of Pondicherry, as well as of Sri Lanka and Singapore. Tamil has its own script. The language is highly agglutinative. It is characterized by phonological simplicity, morphological parity and primitiveness. All affixes in Tamil are separable and significant. The language has no nominative case termination and no arbitrary words.
The LDC-IL speech data was collected in the Kongu, Kumari, Madurai, Nellai, Salem and Thanjai regions, from speakers of both genders and of different age groups. Each speaker recorded datasets that were randomly selected from a master dataset.
The available Speech Corpus details:
Total Speakers 452 (214 Female and 219 Male)
Domain | Audio Segments | Duration (HH:MM:SS) |
Contemporary Text (News) | 433 | 57:53:48 |
Creative Text | 429 | 14:21:31 |
Sentence | 10,764 | 14:51:03 |
Date Format | 842 | 01:20:17 |
Command and Control Words | 12,882 | 12:57:06 |
Person Name | 8,755 | 03:57:29 |
Place Name | 4,002 | 10:34:38 |
Most Frequent Word - Part | 12,813 | 11:14:05 |
Most Frequent Word - Full Set | 2,000 | 02:26:05 |
Phonetically Balanced | 3,860 | 04:55:10 |
Form and Function - Word | 3,507 | 04:40:29 |
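The per-domain figures above can be cross-checked against the corpus totals stated on this page (60,287 audio segments, 139:11:41 total duration). The following is a minimal sketch of that arithmetic, not part of the LDC-IL distribution. The names `DOMAINS`, `to_seconds` and `to_hms` are illustrative, and the values are copied by hand from the table.

```python
# Per-domain figures copied from the table: (domain, segments, "HH:MM:SS").
DOMAINS = [
    ("Contemporary Text (News)", 433, "57:53:48"),
    ("Creative Text", 429, "14:21:31"),
    ("Sentence", 10764, "14:51:03"),
    ("Date Format", 842, "01:20:17"),
    ("Command and Control Words", 12882, "12:57:06"),
    ("Person Name", 8755, "03:57:29"),
    ("Place Name", 4002, "10:34:38"),
    ("Most Frequent Word - Part", 12813, "11:14:05"),
    ("Most Frequent Word - Full Set", 2000, "02:26:05"),
    ("Phonetically Balanced", 3860, "04:55:10"),
    ("Form and Function - Word", 3507, "04:40:29"),
]


def to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to 'H:MM:SS' (hours may exceed 24)."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(segments for _, segments, _ in DOMAINS)
total_duration = to_hms(sum(to_seconds(duration) for _, _, duration in DOMAINS))

print(total_segments)   # 60287
print(total_duration)   # 139:11:41
```

Both sums agree with the totals listed for the corpus, so the table rows account for the full release.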
- Ramamoorthy, L., Narayan Choudhary, Thennarasu S, Prem Kumar L R, Amudha R, Prabagaran R, Srikanth D. 2021. Tamil Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
- Narayan Choudhary, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
Item specifics
- Authors: Ramamoorthy L., Narayan Kumar Choudhary, Thennarasu S, Prem Kumar L R, Amudha R, Prabagaran R, Srikanth D.
- Corpus Type: Raw Corpus
- Catalogue Number: 1283
- ISBN: 978-93-91386-01-6
- Data Source: On Field
- Duration: 139:11:41
- # of Audio Segments: 60,287
- Release Date: 15-Jun-2021
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.