Punjabi Raw Speech Corpus

Catalogue Number: 1165

Overview

101:09:28 Hours | 65.5 GB | 467 Speakers | 76,230 Audio Segments | 48 kHz | 16-bit WAV
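
As a rough illustration of the listed audio format, the sketch below checks whether a WAV file's header reports 48 kHz sampling and 16-bit samples, using Python's standard `wave` module. The function names and any file path passed to them are hypothetical; the corpus's actual file layout and channel count are not stated here, so neither is assumed.

```python
import wave

# Values taken from the overview line: "48 kHz" and "16 bit".
EXPECTED_SAMPLE_RATE_HZ = 48_000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits = 2 bytes per sample


def matches_listed_format(path):
    """Return True if the WAV header at `path` reports 48 kHz, 16-bit samples."""
    with wave.open(path, "rb") as wav_file:
        return (
            wav_file.getframerate() == EXPECTED_SAMPLE_RATE_HZ
            and wav_file.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
        )


def duration_seconds(path):
    """Return the duration of the WAV file at `path` in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()
```

Calling `matches_listed_format` on each audio segment after download would be one way to confirm that the files agree with the catalogue entry.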

Dataset Description

Punjabi is an Indo-Aryan language. It is tonal, with three tones: high-falling, low-rising, and level (neutral). Punjabi is spoken not only in India but also in Pakistan, where it is written in the Shahmukhi script. This corpus covers only Indian Punjabi, written in Gurmukhi. The language has four dialects, spoken in different sub-regions of Punjab. The LDC-IL Punjabi speech dataset comprises several types of material: word lists, sentences, running texts, and date formats. Each speaker recorded items randomly selected from a master dataset. LDC-IL collected the speech data from the Malwa, Doab, and Puadh regions.


The available Speech Corpus details:


Total Speakers: 467 (234 female and 233 male)

Domain                          Audio Segments   Duration per Domain (hh:mm:ss)
Contemporary Text (News)                   448   27:07:41
Creative Text                              446   19:29:15
Sentence                                11,168   08:58:33
Date Format                                887   00:27:53
Command and Control Words               13,274   07:49:16
Person Name                              8,949   10:28:40
Place Name                               4,473   03:17:02
Most Frequent Word - Part                8,889   05:21:56
Most Frequent Word - Full Set            3,988   02:52:44
Phonetically Balanced                   13,939   08:56:04
Form and Function - Word                 9,769   06:24:07
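
The per-domain figures listed above can be tallied with a short script. The sketch below is only an illustration: the values are copied by hand from this page, and the helper names are hypothetical. The segment counts add up to the listed 76,230. The durations add up to 101:13:11, a few minutes more than the listed total of 101:09:28; the page does not explain the difference.

```python
# (domain, audio segments, duration as "hh:mm:ss"), copied from this page.
DOMAINS = [
    ("Contemporary Text (News)", 448, "27:07:41"),
    ("Creative Text", 446, "19:29:15"),
    ("Sentence", 11168, "08:58:33"),
    ("Date Format", 887, "00:27:53"),
    ("Command and Control Words", 13274, "07:49:16"),
    ("Person Name", 8949, "10:28:40"),
    ("Place Name", 4473, "03:17:02"),
    ("Most Frequent Word - Part", 8889, "05:21:56"),
    ("Most Frequent Word - Full Set", 3988, "02:52:44"),
    ("Phonetically Balanced", 13939, "08:56:04"),
    ("Form and Function - Word", 9769, "06:24:07"),
]


def to_seconds(hms):
    """Convert an "hh:mm:ss" string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Convert a number of seconds back to an "hh:mm:ss" string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


total_segments = sum(count for _, count, _ in DOMAINS)
total_seconds = sum(to_seconds(duration) for _, _, duration in DOMAINS)

print(f"Total segments: {total_segments}")
print(f"Total duration: {to_hms(total_seconds)}")
```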


A detailed explanation of the Punjabi Speech Corpus will be available in the Punjabi Speech Data Documentation. 

For research citations, please use the following:

  • Ramamoorthy, L., Narayan Choudhary, Poonam Dhillon & Sarbjeet Kaur. 2019. Punjabi Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
  • Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. "LDC-IL Raw Speech Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Item specifics

  • Authors: Ramamoorthy L., Narayan Choudhary, Poonam Dhillon, Sarbjeet Kaur
  • Corpus Type: Raw Corpus
  • Catalogue Number: 1165
  • ISBN: 978-81-7343-264-4
  • Data Source: On Field
  • Duration: 101:09:28
  • Number of Audio Segments: 76,230
  • Release Date: 04-Apr-2019
  • Terms and Conditions: General instructions for use of the resources provided by LDC-IL.
Commercial User
Non-Commercial User
LDC-IL Raw Text Corpora: An Overview
LDC-IL Raw Speech Corpora: An Overview
