Search - Tag - Gujarati


A Gold Standard Gujarati Raw Text Corpus

requests (16)

28,62,413 Words | 1,364 Titles | XML format | 06 Text Domains

Gujarati is a major Indo-Aryan language and the administrative language of Gujarat and of the union territories of Daman and Diu and Dadra and Nagar Haveli. The LDC-IL Gujarati Raw Text Corpus was developed with attention to factors such as text quality, representativeness, retrievable format, corpus size and authenticity. It is encoded in machine-readable form and stored in a standard format: all text was typed with Unicode-compatible fonts and is stored as XML, with metadata embedded in the data. The corpus was built by typing in contemporary texts. Gujarati raw text can be broadly classified into literary and non-literary texts. A large amount of literary text is available in Gujarati, but knowledge and scientific texts are scarcer, so LDC-IL attempted to develop a balanced raw text corpus. The data was collected from books and newspapers and verified to be true to the original texts.

The available Text Corpus details:

Domains | Words | Percentage of Total Corpus
Aesthetics | 7,42,260 | 25.93 %
Commerce | 43,733 | 1.53 %
Mass Media | 10,70,099 | 37.38 %
Official Document | 29,599 | 1.03 %
Science and Technology | 6,43,737 | 22.49 %
Social Sciences | 3,32,985 | 11.63 %

A detailed explanation of the Gujarati Text Corpus will be available in the Gujarati Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Mona Parakh, Purva S Dholakia, Gadhavi R Hiren & Maheshkumar R Solanki. 2019. A Gold Standard Gujarati Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10...
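The domain percentages in the text corpus listing can be rechecked from the word counts. The following is a minimal sketch, not part of the LDC-IL distribution: the counts are copied from the listing, while the names `domain_words` and `domain_shares` are illustrative assumptions.

```python
# Word counts per domain, copied from the text corpus listing.
# The variable and function names are illustrative, not LDC-IL identifiers.
domain_words = {
    "Aesthetics": 742_260,
    "Commerce": 43_733,
    "Mass Media": 1_070_099,
    "Official Document": 29_599,
    "Science and Technology": 643_737,
    "Social Sciences": 332_985,
}


def domain_shares(counts):
    """Return (total words, {domain: share of total in percent, 2 decimals})."""
    total = sum(counts.values())
    shares = {name: round(100 * n / total, 2) for name, n in counts.items()}
    return total, shares


total, shares = domain_shares(domain_words)
print(total)  # 2862413, i.e. 28,62,413 in Indian digit grouping
for name, pct in shares.items():
    print(f"{name}: {pct} %")
```

The computed total agrees with the 28,62,413 words stated in the header, and each rounded share agrees with the published percentage.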


Gujarati Raw Speech Corpus

requests (15)

57:17:08 Hours | 37 GB | 204 Speakers | 25,712 Audio Segments | 48 kHz | 16 bit wav

Gujarati is one of the major literary languages of India and is the official language of Gujarat state and of the union territories of Daman and Diu and Dadra and Nagar Haveli. For convenience, LDC-IL treats Gujarati as having four dialects: South Gujarat, Central Gujarat, North Gujarat and Saurashtra. LDC-IL has 57:17:08 hours of Gujarati raw speech data. The LDC-IL Gujarati Raw Speech data set consists of different types of datasets made up of word lists, sentences, texts and date formats. Approximately 15 minutes of speech per speaker was recorded from 96 female and 108 male mother-tongue speakers of Gujarati from different age groups. Each speaker recorded items randomly selected from a master dataset.

The available Speech Corpus details:

Total Speakers: 204 (96 Female and 108 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 204 | 15:21:28
Creative Text | 202 | 11:34:29
Sentence | 5081 | 5:48:32
Date | 404 | 0:41:39
Command and Control Words | 6006 | 7:17:22
Person Name | 4079 | 6:36:02
Place Name | 2041 | 2:33:20
Most Frequent Word - Part | 4236 | 5:18:47
Most Frequent Word - Full Set | 2000 | 1:13:39
Phonetically Balanced | 1378 | 0:51:50

A detailed explanation of the Gujarati Raw Speech Corpus will be available in the Gujarati Raw Speech Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Kumar Choudhary, Mona Parakh, Hiren Gadhavi R, Solanki Maheshkumar R, Rejitha K. S., Rajesha N. & Manasa G. 2021. Gujarati Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
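The per-domain durations of this speech corpus can be summed and compared with the stated total of 57:17:08. The sketch below assumes each duration in the listing's domain table reads as H:MM:SS once separated from the audio segment count. The helper names `parse_hms` and `format_hms` are illustrative, not part of any LDC-IL tooling.

```python
from datetime import timedelta

# Per-domain durations from the listing, read as H:MM:SS (an assumption
# about how the segment counts and durations separate).
durations = [
    "15:21:28",  # Contemporary Text (News)
    "11:34:29",  # Creative Text
    "5:48:32",   # Sentence
    "0:41:39",   # Date
    "7:17:22",   # Command and Control Words
    "6:36:02",   # Person Name
    "2:33:20",   # Place Name
    "5:18:47",   # Most Frequent Word - Part
    "1:13:39",   # Most Frequent Word - Full Set
    "0:51:50",   # Phonetically Balanced
]


def parse_hms(text):
    """Convert an 'H:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_hms(delta):
    """Format a timedelta as 'H:MM:SS' with hours allowed to exceed 24."""
    total = int(delta.total_seconds())
    return f"{total // 3600}:{total % 3600 // 60:02d}:{total % 60:02d}"


total_duration = sum((parse_hms(d) for d in durations), timedelta())
print(format_hms(total_duration))  # 57:17:08
```

Under that reading, the durations add up exactly to the corpus total given in the header.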


Gujarati Raw Speech Corpus (Mono Recordings)

requests (13)

64:44:02 Hours | 7.1 GB | 233 Speakers | 26,223 Audio Segments | 16 kHz | 16 bit wav

Gujarati is one of the major literary languages of India and is the official language of Gujarat state and of the union territories of Daman and Diu and Dadra and Nagar Haveli. For convenience, LDC-IL treats Gujarati as having four dialects: South Gujarat, Central Gujarat, North Gujarat and Saurashtra. LDC-IL has 64:44:02 hours of Gujarati raw speech data as mono recordings. The LDC-IL Gujarati Raw Speech data set consists of different types of datasets made up of word lists, sentences, texts and date formats. Approximately 15 minutes of speech per speaker was recorded from 124 female and 109 male mother-tongue speakers of Gujarati from different age groups. Each speaker recorded items randomly selected from a master dataset.

The available Speech Corpus details:

Total Speakers: 233 (124 Female and 109 Male)

Domains | Audio Segments | Each Domain Duration
Contemporary Text (News) | 233 | 12:52:46
Creative Text | 232 | 13:30:15
Sentence | 5824 | 7:12:17
Date Format | 466 | 0:59:31
Command and Control Words | 6985 | 9:43:07
Person Name | 4644 | 8:34:44
Place Name | 2322 | 3:17:06
Phonetically Balanced | 4131 | 6:28:15
Form and Function - Word | 1386 | 2:06:01

A detailed explanation of the Gujarati Raw Speech Corpus (Mono Recordings) will be available in the Gujarati Raw Speech (Mono Recordings) Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Kumar Choudhary, Mona Parakh, Rejitha K. S., Rajesha N. & Manasa G. 2021. Gujarati Raw Speech Corpus (Mono Recordings). Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174...
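A downloaded audio segment can be checked against the advertised format (mono, 16 kHz, 16 bit wav) with Python's standard `wave` module. The following is a minimal sketch: the corpus's file names and directory layout are not given in this listing, so the example inspects a synthetic one-second WAV built in memory, and `describe_wav` is an illustrative helper name.

```python
import io
import wave


def describe_wav(source):
    """Return (channels, sample rate in Hz, bits per sample, duration in s)."""
    with wave.open(source, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8
        duration = wav.getnframes() / rate
    return channels, rate, bits, duration


# Synthetic stand-in for a corpus segment: one second of 16-bit mono silence
# at 16 kHz. A real check would pass the path of a downloaded .wav file.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)       # mono
    out.setsampwidth(2)       # 2 bytes = 16 bits per sample
    out.setframerate(16000)   # 16 kHz
    out.writeframes(b"\x00\x00" * 16000)
buffer.seek(0)

info = describe_wav(buffer)
print(info)  # (1, 16000, 16, 1.0)
```

Summing the returned durations over all segments of a domain would give a figure comparable to the Each Domain Duration column.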