A Gold Standard Chhattisgarhi Raw Text Corpus Vol. II
22,19,592 Words | 55 Titles | XML format | 4 Domains | 28 Sub-categories

Chhattisgarhi, spoken by approximately 17 million people, carries profound cultural and historical significance within the region of Chhattisgarh. The Chhattisgarhi Raw Text Corpus documents the colloquialisms, idioms, regional vocabulary, and grammar that are essential to building frameworks for linguistic processing. It is an extensive repository of the linguistic elements found in Chhattisgarhi textual materials. The texts can be broadly classified as literary and non-literary. Data has been collected from books, magazines, newspapers, and websites, verified to be true to the original texts, and then warehoused.

The Chhattisgarhi Text Corpus is encoded in a machine-readable form and stored in a standard format. The primary encoding is Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary text by typing and web crawling. A detailed explanation of the Chhattisgarhi Raw Text Corpus Vol. II will be available in the Chhattisgarhi Text Corpus Documentation.

For any research-based citations, please use the following citations:
Ankita Tiwari, Dr. Satyaendra Kumar Awasthi, Shantanu Kumar, Rajesha N., Manasa G., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. A Gold Standard Chhattisgarhi Raw Text Corpus Vol. II. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-16-3.
Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary. 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0...
A Gold Standard Kashmiri Raw Text Corpus Vol. II
10,13,658 Words | 123 Titles | XML format | 6 Domains | 59 Sub-categories

A Gold Standard Kashmiri Raw Text Corpus Vol. II is a comprehensive collection of Kashmiri language texts, comprising 10,13,658 words and 57,28,547 characters. The corpus includes extracts from books, newspapers, and magazines, providing a diverse range of linguistic data. It serves as a valuable resource for linguistic research, language processing applications, and the preservation of the Kashmiri language. This volume covers six major domains, whereas the previous volume covered only two: Aesthetics and Social Sciences. The six major domains are Aesthetics, Commerce, Mass Media, Official Document, Science and Technology, and Social Science.

The corpus has been meticulously compiled and is available through the Linguistic Data Consortium for Indian Languages (LDC-IL). Researchers and developers can use this resource to improve their understanding of the Kashmiri language and to build applications for it. A detailed explanation of the Kashmiri Text Corpus will be available in the Kashmiri Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:
Dr. Zargar Adil Ahmad, Bi Bi Mariyam, Rajesha N., Manasa G., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. A Gold Standard Kashmiri Raw Text Corpus Vol. II. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-27-9.
Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0...
A Gold Standard Maithili Raw Text Corpus Vol. II
8,11,680 Words | 54 Titles | XML format | 3 Domains | 21 Sub-categories

The Maithili Raw Text Corpus documents the colloquialisms, idioms, regional vocabulary, and grammar that are essential to building frameworks for linguistic processing. It is an extensive repository of the linguistic elements found in Maithili textual materials. The texts can be broadly classified as literary and non-literary. Data has been collected from books and magazines, verified to be true to the original texts, and then warehoused.

The Maithili Text Corpus is encoded in a machine-readable form and stored in a standard format. The primary encoding is Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary text by typing and digitization. A detailed explanation of the Maithili Raw Text Corpus Vol. II will be available in the Maithili Text Corpus Documentation.

For any research-based citations, please use the following citations:
Shantanu Kumar, Ankita Tiwari, Rajesha N., Manasa G., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. A Gold Standard Maithili Raw Text Corpus Vol. II. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-01-9.
Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary. 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0...
A Gold Standard Rajasthani Raw Text Corpus
11,99,502 Words | 74 Titles | XML format | 3 Domains | 27 Sub-categories

Rajasthani is a broad linguistic category that encompasses a variety of dialects, including Marwari, Mewari, Mewati, Dhundhari, Harauti, Bagri, Wagdi, and Malvi, spoken across different regions of Rajasthan. The Government of India classifies Rajasthani as a Western Indo-Aryan variant of Hindi, primarily spoken within the state. The Government of India established the Linguistic Data Consortium for Indian Languages (LDC-IL) to support language development efforts. The LDC-IL Rajasthani Text Corpus has been created with attention to key factors such as text quality, representativeness, retrievability, corpus size, and authenticity. For text collection, LDC-IL follows a standardized domain-based categorization and predefined criteria. The Rajasthani text corpus is broadly divided into literary and non-literary texts, with an emphasis on maintaining a balanced dataset. The collected data, sourced from books and magazines, is verified for accuracy before being stored.

The Rajasthani Text Corpus is encoded in a machine-readable form and stored in a standard format. The primary encoding is Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary text by typing and web crawling. A detailed explanation of the Rajasthani Raw Text Corpus will be available in the Rajasthani Text Corpus Documentation.

For any research-based citations, please use the following citations:
Ankita Tiwari, Rajesha N., Manasa G., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. A Gold Standard Rajasthani Raw Text Corpus. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-93-4.
Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary. 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0...
A Gold Standard Telugu Raw Text Corpus Vol. II
30,13,530 Words | 160 Titles | XML format | 6 Domains | 29 Sub-categories

Telugu is a highly agglutinative and morphologically rich language, and the patterns of language use in natural texts provide evidence of these traits. The Government of India set up the Linguistic Data Consortium for Indian Languages (LDC-IL) to support those working in the field of language development. The LDC-IL Telugu Text Corpus has been developed with attention to factors such as text quality, representativeness, retrievable format, corpus size, and authenticity. For collecting the text corpus, LDC-IL adopts a standard category list of domains and a predefined set of criteria. The Telugu texts can be broadly classified as literary and non-literary. A large body of literary text is available in Telugu, but scientific texts are scarcer, so LDC-IL attempts to develop a balanced text corpus of Telugu. Data has been collected from books, magazines, and government websites, verified to be true to the original texts, and then stored.

The Telugu Text Corpus is encoded in a machine-readable form and stored in a standard format. The primary encoding is Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary text by typing and web crawling. A detailed explanation of the Telugu Raw Text Corpus will be available in the Telugu Text Corpus Documentation.

For any research-based citations, please use the following citations:
Dr. Modugu Kasimbabu, Rajesha N., Manasa G., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. A Gold Standard Telugu Raw Text Corpus Vol. II. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-12-5.
Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0...