Central Institute of Indian Languages

A Gold Standard Tamil Raw Text Corpus

requests (3)

1,09,31,902 Words | 1,963 Titles | XML format | 6 text domains. Tamil is one of the longest-surviving classical languages in the world. It belongs to the Dravidian language family. The Tamil Text Corpus is encoded in a machine-readable form and stored in a standard format. The major encoding used is Unicode, and the data is stored in XML fo..

A Gold Standard Telugu Raw Text Corpus

requests (2)

30,10,993 Words | 859 Titles | XML format | 6 domains. Telugu is a highly agglutinative and morphologically rich language. The actual pattern of language use in natural texts reveals evidence of language traits. The Government of India set up the Linguistic Data Consortium for Indian Languages to help those who endeavor in the language dev..

A Gold Standard Urdu Raw Text Corpus

requests (5)

51,61,927 Words | 739 Titles | XML format | 5 domains. Urdu is one of the prominent languages used in the Indian subcontinent. It belongs to the Indo-Aryan family. Urdu is influenced by Arabic and Persian and is written in the Perso-Arabic script. On the other hand, region-wise, the Urdu language has co-existed side by side mostly in the no..

Bengali Raw Speech Corpus

requests (2)

128:46:59 Hours | 81.2 GB | 476 Speakers | 73,470 Audio Segments | 48 kHz | 16 bit wav. Bengali is the official language of West Bengal and Tripura. It belongs to the Indo-Aryan language family. Bengali is influenced by Sanskrit. Greater use of Bengali has contributed to the growth of the language in terms of vocabulary and the number of s..

Bodo Raw Speech Corpus

requests (4)

176:53:28 Hours | 113 GB | 456 Speakers | 77,443 Audio Segments | 48 kHz | 16 bit wav. Bodo, one of the scheduled languages of India, is one of the tonal languages of the world. There are two clearly distinguishable tones in Bodo, known as Low and High. The language belongs to the Tibeto-Burman linguistic fami..

Hindi Raw Speech Corpus

requests (15)

118:40:03 Hours | 75.1 GB | 489 Speakers | 73,695 Audio Segments | 48 kHz | 16 bit wav. Hindi is a major Indo-Aryan language, a descendant of Sanskrit, spoken in central and northern India, in the states of Bihar, Chhattisgarh, Delhi, Haryana, Himachal Pradesh, Jharkhand, Madhya Pradesh, Rajasthan, Uttarakhand, and Uttar Pradesh..

Kannada Raw Speech Corpus

requests (4)

179:32:52 Hours | 115 GB | 656 Speakers | 99,109 Audio Segments | 48 kHz | 16 bit wav. Kannada is one of the ancient Indian languages belonging to the Dravidian family. It has its own script. The language in a region is influenced by other languages of the region, the mother tongue of the speaker, etc. The reading speed, loudness, frequency etc al..

Konkani Raw Speech Corpus

requests (5)

156:37:51 Hours | 100 GB | 504 Speakers | 72,938 Audio Segments | 48 kHz | 16 bit wav. Konkani belongs to the Indo-European family of languages. Konkani is the official language of Goa. However, the language is spoken widely across four states: Maharashtra, Goa, Karnataka and Kerala. Konkani is the only Indian language written in..

Maithili Raw Speech Corpus

requests (4)

72:02:12 Hours | 44.8 GB | 300 Speakers | 35,109 Audio Segments | 48 kHz | 16 bit wav. Maithili is an Indo-Aryan language, a direct descendant of Sanskrit, which is spoken in the states of Bihar and Jharkhand and in part of Nepal. It is one of the scheduled languages of India. The LDC-IL speech data is collected from geographic dialects of So..

Malayalam Raw Speech Corpus

requests (3)

164:01:02 Hours | 65.5 GB | 458 Speakers | 43,670 Audio Segments | 48 kHz | 16 bit wav. Malayalam is the official language of Kerala and the Laccadive Islands. It belongs to the Dravidian language family. According to the formation of Kerala, the language of the Travancore, Cochin, and Malabar regions is influenced by different internal and exter..

Manipuri Raw Speech Corpus

requests (2)

156:28:32 Hours | 100 GB | 620 Speakers | 66,231 Audio Segments | 48 kHz | 16 bit wav. Manipuri is the administrative language of Manipur. The development of LDC-IL Speech Data for Manipuri lies in capturing all the distinctive characteristics of speech shared by different regional dialects of Manipur. In order to do ..

Marathi Raw Speech Corpus

requests (5)

89:17:25 Hours | 58 GB speech data | 307 Speakers | 58,544 Audio Segments | 48 kHz | 16 bit wav. Marathi is an Indo-Aryan language. The Marathi language was prevalent in the 9th century. Standard Marathi (Puneri) is the official language of the State of Maharashtra. Standard Marathi is based on dialects used by academics an..

Nepali Raw Speech Corpus

requests (2)

87:14:44 Hours | 56.5 GB | 350 Speakers | 48,975 Audio Segments | 48 kHz | 16 bit wav. Nepali belongs to the Indo-Aryan language family. Nepali is the official language of Nepal and of the Indian states of West Bengal and Sikkim, and is spoken in the states of Uttaranchal, Assam, Arunachal Pradesh, Manipur, Mizoram and Bihar, as well as in other cou..

Punjabi Raw Speech Corpus

requests (4)

101:09:28 Hours | 65.5 GB | 467 Speakers | 76,230 Audio Segments | 48 kHz | 16 bit wav. Punjabi is one of the Indo-Aryan languages. Punjabi is a tonal language with three tones: high-falling, low-rising, and level (neutral). Punjabi is spoken not only in India; it is also a language of Pakistan, where it is called Shahmukhi Punjabi. He..

Telugu Raw Speech Corpus

requests (3)

22:43:59 Hours | 15 GB | 80 Speakers | 10,510 Audio Segments | 48 kHz | 16 bit wav. Telugu is the official language of the states of Telangana and Andhra Pradesh. It belongs to the Dravidian language family. Among the Dravidian languages, Telugu is spoken by the largest population. Telugu is agglutinative in nature and its vocabulary is ..
