Search - Tag - Manipuri

A Gold Standard Manipuri Raw Text Corpus

requests (21)

61,45,278 words | 4,31,27,842 characters | 6 Domains

The Manipuri Text Corpus is encoded in machine-readable form and stored in a standard format: the text is encoded in Unicode, stored as XML, and embedded with metadata. The corpus was typed in from contemporary texts. The LDC-IL Manipuri Text Corpus contains 61,45,278 words drawn from 1,202 different titles. The six major domains are Aesthetics, Commerce, Mass Media, Official Documents, Science & Technology, and Social Sciences.

The available Text Corpus details:

Domain | Words | Percentage of Total Corpus
Aesthetics | 37,72,994 | 61.40 %
Commerce | 18,450 | 0.30 %
Mass Media | 7,75,261 | 12.62 %
Official | 4,42,950 | 7.21 %
Science and Technology | 3,04,545 | 4.96 %
Social Sciences | 8,31,078 | 13.52 %

A detailed explanation of the Manipuri Text Corpus will be available in the Manipuri Text Corpus Documentation.

For research-based citations, please use the following:

Ramamoorthy, L., Narayan Choudhary, Amom Nandaraj Meetei, Yumnam Premila Chanu, Longjam Anand Singh & M. Bidyarani Devi. 2019. A Gold Standard Manipuri Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 1-10.
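The domain percentages in the table above follow directly from the word counts. As a minimal sketch (the names `DOMAIN_WORDS`, `parse_count`, and `domain_shares` are illustrative and not part of any LDC-IL tooling), the shares can be recomputed like this, taking care to strip the Indian-style digit grouping before converting to integers:

```python
# Word counts per domain, copied from the corpus table (Indian digit grouping).
DOMAIN_WORDS = {
    "Aesthetics": "37,72,994",
    "Commerce": "18,450",
    "Mass Media": "7,75,261",
    "Official": "4,42,950",
    "Science and Technology": "3,04,545",
    "Social Sciences": "8,31,078",
}


def parse_count(text: str) -> int:
    """Turn a comma-grouped count such as '37,72,994' into an int."""
    return int(text.replace(",", ""))


def domain_shares(table: dict[str, str]) -> tuple[int, dict[str, float]]:
    """Return the total word count and each domain's share in percent."""
    counts = {name: parse_count(value) for name, value in table.items()}
    total = sum(counts.values())
    shares = {name: round(100 * n / total, 2) for name, n in counts.items()}
    return total, shares


if __name__ == "__main__":
    total, shares = domain_shares(DOMAIN_WORDS)
    print(f"total words: {total}")
    for name, pct in shares.items():
        print(f"{name}: {pct:.2f} %")
```

The six counts sum to the stated corpus size of 61,45,278 words, and the rounded shares agree with the percentages listed.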

Manipuri Raw Speech Corpus

requests (12)

156:28:32 hours | 100 GB | 620 Speakers | 66,231 Audio segments | 48 kHz | 16-bit WAV

Manipuri is the administrative language of Manipur. The LDC-IL Speech Data for Manipuri aims to capture the distinctive characteristics of speech across the regional dialects of Manipur. To that end, the dataset is built on standard parameters that cover linguistic features such as regional tones and intonation, phonemic distribution, and the varied pronunciations of regional and non-regional vocabulary items such as person names and place names.

As this is a read speech corpus, the subset to be read by each speaker is randomly generated from the full dataset, so that each speaker reads one random set. A limited number of full sets are read in their entirety by selected speakers in each age group. The data was collected through fieldwork from three regional dialects: Imphal, Kakching, and Awang Sekmai. The age groups selected for fieldwork are ‘16 to 20’, ‘21 to 50’, and ‘above 50 years’. Data was collected from equal numbers of male and female speakers in each age group.

The available Speech Corpus details:

Total Speakers: 620 (310 Female and 310 Male)

Domain | Audio Segments | Duration (hh:mm:ss)
Contemporary Text (News) | 530 | 59:47:22
Creative Text | 588 | 53:59:03
Sentence | 10,979 | 10:01:41
Date Format | 866 | 01:12:04
Command and Control Words | 13,129 | 08:00:02
Person Name | 8,789 | 07:14:04
Place Name | 4,394 | 02:46:29
Most Frequent Word - Part | 13,167 | 06:48:50
Most Frequent Word - Full Set | 6,992 | 02:48:42
Phonetically Balanced | 4,518 | 02:25:53
Form and Function - Word | 2,279 | 01:23:50

A detailed explanation of the Manipuri Speech Corpus will be available in the Manipuri Raw Speech Corpus Documentation.

For research-based citations, please use the following:

Ramamoorthy, L., Narayan Choudhary, Amom Nandaraj Meetei, Yumnam Premila Chanu & Longjam Anand Singh. 2019. Manipuri Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
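The summary line states that the audio is 48 kHz, 16-bit WAV. A minimal sketch of how a downloaded segment could be checked against that format, using only Python's standard `wave` module, is given here. The function names and the single-channel placeholder file are assumptions for illustration; the listing does not say how files are named or how many channels they contain.

```python
import os
import tempfile
import wave

# Format stated in the corpus summary line.
EXPECTED_RATE_HZ = 48_000
EXPECTED_BITS = 16


def describe_wav(path: str) -> dict:
    """Read the header of a PCM WAV file and report its basic format."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "sample_rate_hz": rate,
            "bits_per_sample": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
        }


def matches_corpus_format(info: dict) -> bool:
    """True if sample rate and bit depth match the stated corpus format."""
    return (
        info["sample_rate_hz"] == EXPECTED_RATE_HZ
        and info["bits_per_sample"] == EXPECTED_BITS
    )


def write_silence(path: str, seconds: float = 1.0) -> None:
    """Write a silent mono placeholder file (assumption: stands in for a real segment)."""
    frames = int(seconds * EXPECTED_RATE_HZ)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)                   # mono is an assumption
        wav.setsampwidth(EXPECTED_BITS // 8)  # 2 bytes per sample
        wav.setframerate(EXPECTED_RATE_HZ)
        wav.writeframes(b"\x00\x00" * frames)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "placeholder_segment.wav")
    write_silence(demo_path)
    info = describe_wav(demo_path)
    print(info, "matches stated format:", matches_corpus_format(info))
```

With real data, `describe_wav` would be pointed at a segment from the corpus instead of the generated placeholder.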

Manipuri Sentence Aligned Speech Corpus (Bengali Script)

requests (1)

116:34:24 hours | 75.9 GB | 60,819 Audio Segments | 589 speakers

The LDC-Manipuri Sentence Aligned Speech dataset comprises audio files in WAV format, accompanied by a corresponding textual layer containing orthographically normalized annotation in Bengali script. The dataset spans a duration of 116:34:24 (hh:mm:ss) and consists of read speech covering continuous text, representative sentences, and date formats. The data comes from 295 female and 294 male native Manipuri speakers across diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Manipuri Sentence Aligned Speech Documentation.

For research-based citations, please use the following:

Amom Nandaraj Meetei, Yumnam Premila Chanu, Rajesha N., Manasa G., Stephen Fernandes, Nithin S., Roopashri M.R., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. Manipuri Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN 978-93-48633-68-2.

Manipuri Sentence Aligned Speech Corpus (Meetei Mayek)

requests (1)

116:34:24 hours | 75.9 GB | 60,819 Audio Segments | 589 speakers

The LDC-Manipuri Sentence Aligned Speech dataset comprises audio files in WAV format, accompanied by a corresponding textual layer containing orthographically normalized annotation in Meetei Mayek. The dataset spans a duration of 116:34:24 (hh:mm:ss) and consists of read speech covering continuous text, representative sentences, and date formats. The data comes from 295 female and 294 male native Manipuri speakers across diverse age groups and regions. A comprehensive explanation of the dataset can be found in the Manipuri Sentence Aligned Speech Documentation.

For research-based citations, please use the following:

Amom Nandaraj Meetei, Yumnam Premila Chanu, Rajesha N., Manasa G., Stephen Fernandes, Nithin S., Roopashri M.R., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. Manipuri Sentence Aligned Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN 978-93-48633-96-5.
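The duration for the sentence aligned corpus is given in hh:mm:ss with an hour field above 24, which many time parsers reject. As a rough sketch (the helper names are illustrative, and the averages are derived here from the listed totals rather than stated by the catalogue), the figure can be converted to seconds by hand to estimate the mean segment length:

```python
# Totals copied from the summary line of the sentence aligned corpus.
TOTAL_DURATION = "116:34:24"  # hh:mm:ss
AUDIO_SEGMENTS = 60_819
SPEAKERS = 589


def hms_to_seconds(stamp: str) -> int:
    """Convert 'hh:mm:ss' to seconds; the hour field may exceed 24."""
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def mean_segment_seconds(duration: str, segments: int) -> float:
    """Average length of one audio segment, in seconds."""
    return hms_to_seconds(duration) / segments


if __name__ == "__main__":
    mean_len = mean_segment_seconds(TOTAL_DURATION, AUDIO_SEGMENTS)
    per_speaker = AUDIO_SEGMENTS / SPEAKERS
    print(f"mean segment length: {mean_len:.2f} s")
    print(f"mean segments per speaker: {per_speaker:.1f}")
```

These are averages only; the listing gives no per-speaker or per-segment breakdown, so actual segment lengths may vary widely.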