Gujarati Raw Speech Corpus

Catalogue Number: 1276

Dataset Description

57:17:08 Hours | 37 GB | 204 Speakers | 25,712 Audio Segments | 48 kHz | 16-bit WAV
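A minimal sketch of how a single audio segment could be checked against the stated format (48 kHz, 16 bit), using only Python's standard `wave` module. The function names and any file path passed to them are hypothetical; the corpus's actual file naming and channel count are not stated in this description.

```python
import wave

EXPECTED_RATE_HZ = 48_000       # "48 kHz" from the description above
EXPECTED_SAMPLE_WIDTH = 2       # "16 bit" means 2 bytes per sample


def describe_wav(path):
    """Return (sample_rate_hz, sample_width_bytes, channels, duration_seconds)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
        return rate, wav.getsampwidth(), wav.getnchannels(), duration


def matches_stated_format(path):
    """True if the file has the sample rate and bit depth stated above."""
    rate, width, _channels, _duration = describe_wav(path)
    return rate == EXPECTED_RATE_HZ and width == EXPECTED_SAMPLE_WIDTH
```

Calling `matches_stated_format` on each downloaded segment (the path is whatever the corpus distribution uses) gives a quick sanity check before further processing.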

Gujarati is one of the major literary languages of India and is the official language of the state of Gujarat and of the union territories of Daman and Diu and Dadra and Nagar Haveli. For convenience, LDC-IL treats Gujarati as having four dialects: South Gujarat, Central Gujarat, North Gujarat and Saurashtra.

LDC-IL has 57:17:08 hours of Gujarati raw speech data. The LDC-IL Gujarati Raw Speech dataset consists of several types of material: word lists, sentences, texts and date formats. Approximately 15 minutes of speech per speaker was recorded from 96 female and 108 male mother-tongue speakers of Gujarati from different age groups. Each speaker recorded items randomly selected from a master dataset.
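As a rough cross-check of the per-speaker figure, the total duration can be divided by the number of speakers. This is only an arithmetic sketch using the numbers stated above; the helper name is illustrative.

```python
def hms_to_seconds(text):
    """Convert an H:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


TOTAL_DURATION = "57:17:08"   # total corpus duration stated above
SPEAKERS = 96 + 108           # female + male speakers

average_minutes = hms_to_seconds(TOTAL_DURATION) / SPEAKERS / 60
print(f"{average_minutes:.1f} minutes per speaker")  # 16.8 minutes per speaker
```

The average comes out slightly above the approximate 15 minutes quoted per speaker.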

The available speech corpus details:

Total Speakers: 204 (96 Female and 108 Male)

Domain                            Audio Segments   Duration (H:MM:SS)
Contemporary Text (News)                     204             15:21:28
Creative Text                                202             11:34:29
Sentence                                    5081              5:48:32
Date                                         404              0:41:39
Command and Control Words                   6006              7:17:22
Person Name                                 4079              6:36:02
Place Name                                  2041              2:33:20
Most Frequent Word - Part                   4236              5:18:47
Most Frequent Word - Full Set               2000              1:13:39
Phonetically Balanced                       1378              0:51:50
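The per-domain durations listed above can be added up to confirm the stated corpus total of 57:17:08. The following is a small sketch using only the values from the table; the helper names are illustrative.

```python
from datetime import timedelta

# Per-domain durations (H:MM:SS) copied from the table above.
DOMAIN_DURATIONS = {
    "Contemporary Text (News)": "15:21:28",
    "Creative Text": "11:34:29",
    "Sentence": "5:48:32",
    "Date": "0:41:39",
    "Command and Control Words": "7:17:22",
    "Person Name": "6:36:02",
    "Place Name": "2:33:20",
    "Most Frequent Word - Part": "5:18:47",
    "Most Frequent Word - Full Set": "1:13:39",
    "Phonetically Balanced": "0:51:50",
}


def parse_hms(text):
    """Parse an H:MM:SS string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_hms(delta):
    """Format a timedelta as H:MM:SS, letting hours exceed 24."""
    total = int(delta.total_seconds())
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total = sum((parse_hms(d) for d in DOMAIN_DURATIONS.values()), timedelta())
print(format_hms(total))  # 57:17:08
```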



A detailed explanation of the Gujarati Raw Speech Corpus will be available in the Gujarati Raw Speech Documentation. 

For research citations, please cite the following:

  • Ramamoorthy L., Narayan Kumar Choudhary, Mona Parakh, Hiren Gadhavi R., Solanki Maheshkumar R., Rejitha K. S., Rajesha N., Manasa G. 2021. Gujarati Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

  • Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Item specifics

  • Authors: Ramamoorthy L., Narayan Kumar Choudhary, Mona Parakh, Hiren Gadhavi R., Solanki Maheshkumar R., Rejitha K. S., Rajesha N., Manasa G.
  • Corpus Type: Raw Corpus
  • Catalogue Number: 1276
  • ISBN: 978-81-948885-0-5
  • Data Source: On Field
  • Duration: 57:17:08
  • Number of Audio Segments: 25,712
  • Release Date: 15-Jun-2021
  • Terms and Conditions: General instructions for use of the resources provided by LDC-IL.
    Commercial User
    Non-Commercial User
    LDC-IL Raw Text Corpora: An Overview
    LDC-IL Raw Speech Corpora: An Overview
