Raw Corpus

Raw Corpus for Text

A Gold Standard Assamese Raw Text Corpus

requests (13)

1,01,27,030 Words | 1,084 Titles | XML format | 6 domains

Assamese, or Oxomiya, is the language spoken by the natives of the state of Assam in Northeast India. It is also the official language of Assam. It is spoken in some parts of Arunachal Pradesh, Nagaland and other Northeast Indian states. Small pockets of Assamese speakers can also be found in Bhutan and Bangladesh.

The LDC-IL Assamese Text Corpus was developed with attention to factors such as quality of the text, representativeness, retrievable format, size of corpus and authenticity. To collect text, LDC-IL adopts a standard category list of domains and a predefined set of criteria. The corpus of Assamese text can be broadly classified into literary and non-literary texts. A large body of literary text is available in Assamese, but scientific texts are scarcer; LDC-IL therefore attempts to develop a balanced text corpus of Assamese. Data has been collected from books, magazines and newspapers, verified to be true to the original texts, and then warehoused.

The Assamese Text Corpus is encoded in a machine-readable form and stored in a standard format. The encoding used is primarily Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary text by typing and crawling.

The available Text Corpus details:

Domain | Word Count | Percentage
Aesthetics | 5233452 | 51.68%
Commerce | 66924 | 0.66%
Mass Media | 3354996 | 33.13%
Official Document | 1298 | 0.01%
Science and Technology | 372790 | 3.68%
Social Sciences | 1097570 | 10.84%
Total | 10127030 | 100.00%

A detailed explanation of the Assamese Text Corpus will be available in the Assamese Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy L., Narayan Kumar Choudhary, Atreyee Sharma, Jahnobi Kalita, Samhita Bharadwaj, Taznin Hussain, Priyanshe Adhyapak, Syeda Mustafiza Tamim, Rajesha N., Manasa G. 2021. A Gold Standard Assamese Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.
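
The percentage column in a domain table can be rechecked from the word counts alone. The sketch below is only an illustration using the Assamese figures listed above; the function name `domain_percentages` and the two-decimal rounding are assumptions, not part of any LDC-IL tooling.

```python
# Word counts per domain, copied from the Assamese table in this listing.
ASSAMESE_DOMAINS = {
    "Aesthetics": 5233452,
    "Commerce": 66924,
    "Mass Media": 3354996,
    "Official Document": 1298,
    "Science and Technology": 372790,
    "Social Sciences": 1097570,
}
STATED_TOTAL = 10127030  # the "1,01,27,030 Words" figure in the entry header


def domain_percentages(counts):
    """Return each domain's share of the summed word count, rounded to 2 decimals."""
    total = sum(counts.values())
    return {name: round(100 * words / total, 2) for name, words in counts.items()}


if __name__ == "__main__":
    # The per-domain counts should add up to the stated corpus size.
    print("sum matches stated total:", sum(ASSAMESE_DOMAINS.values()) == STATED_TOTAL)
    for name, share in domain_percentages(ASSAMESE_DOMAINS).items():
        print(f"{name}: {share:.2f}%")
```

The same check can be run against any other domain table in this listing by swapping in its counts and stated total.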

A Gold Standard Bengali Raw Text Corpus

requests (23)

42,37,440 Words | 1,460 Titles | XML format | 3 domains

Bengali is the official language of West Bengal and Tripura. It belongs to the Indo-Aryan language family and is influenced by Sanskrit. Greater use of Bengali has contributed to the growth of the language in terms of vocabulary and the number of styles and registers. Bengali is spoken throughout West Bengal, Tripura and Bangladesh and in some parts of Bihar, Orissa and Assam. Bengali refugees who settled in the Andamans after 1950 have also carried the language there.

The LDC-IL Bengali Text Corpus was developed with attention to factors such as quality of the text, representativeness, retrievable format, size of corpus and authenticity. To collect text, LDC-IL adopts a standard category list of domains and a predefined set of criteria. The corpus of Bengali text can be broadly classified into literary and non-literary texts. A large body of literary text is available in Bengali, but scientific texts are scarcer; LDC-IL therefore attempts to develop a balanced text corpus of Bengali. Data has been collected from books, magazines and newspapers and verified to be true to the original text.

The Bengali Text Corpus is encoded in a machine-readable form and stored in a standard format. The encoding used is primarily Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary text by typing.

The available Text Corpus details:

Domains | Words | Percentage of Total Corpus
Aesthetics | 40,37,854 | 95.29 %
Science and Technology | 76,231 | 1.80 %
Social Sciences | 1,23,355 | 2.91 %

A detailed explanation of the Bengali Text Corpus will be available in the Bengali Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Sonali Sutradhar, Arundhati Sengupta, Sankarshan Dutta, Priyanka Das & Saswati Karmakar. 2019. A Gold Standard Bengali Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.
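
Since the corpora are described as Unicode text stored in XML with embedded metadata, a reader might load one document along the following lines. This is a rough sketch only: the element names (`document`, `metadata`, `title`, `domain`, `text`) are hypothetical placeholders, because the actual LDC-IL schema is not given in this listing, and whitespace splitting is an assumed stand-in for whatever word-counting rule the corpus builders used.

```python
import xml.etree.ElementTree as ET

# Hypothetical file layout: these element names are placeholders,
# NOT the real LDC-IL schema, which this listing does not describe.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<document>
  <metadata>
    <title>Sample title</title>
    <domain>Aesthetics</domain>
  </metadata>
  <text>আমি বাংলায় গান গাই</text>
</document>
""".encode("utf-8")


def read_document(xml_bytes):
    """Parse one XML document and return its metadata and whitespace-split words."""
    root = ET.fromstring(xml_bytes)  # bytes input lets the parser honour the declared encoding
    body = root.findtext("text", default="")
    return {
        "title": root.findtext("metadata/title"),
        "domain": root.findtext("metadata/domain"),
        "words": body.split(),  # assumption: words are separated by whitespace
    }


if __name__ == "__main__":
    doc = read_document(SAMPLE)
    print(doc["domain"], len(doc["words"]))
```

For real files, the tag paths would need to be replaced with those given in the corpus documentation.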

A Gold Standard Bodo Raw Text Corpus

requests (26)

29,15,544 Words | 80 Titles | XML format | 5 domains

Bodo is a major tribal language that belongs to the Tibeto-Burman language family. It is spoken in Assam and other parts of North-East India. Bodo is one of the major languages of Assam and an official language in the Bodoland Territorial Area Districts. Several rivers in the North-East region, such as the Dihing, Dibru, Dihong and Dikrai, were named after Bodo words, which reveals the three-dimensional distribution arrangement of connected ethnocultural groups with their cultural personae and occurrence. Bodo is written in Devanagari.

The Bodo text corpus is extracted from contemporary text sources. The LDC-IL Bodo Text Corpus was developed with attention to factors such as quality of the text, representativeness, retrievable format, size of corpus and authenticity. To collect text, LDC-IL adopts a standard category list of domains and a predefined set of criteria. The Bodo text was collected from various libraries in Assam, mostly in Kokrajhar, Chirang, Baksa, Udalguri and Guwahati. LDC-IL attempts to develop a balanced text corpus of Bodo. Data has been collected from books, magazines and newspapers and verified to be true to the original text.

The Bodo Text Corpus is encoded in a machine-readable form and stored in a standard format. The encoding used is primarily Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary text by typing and crawling.

The available Text Corpus details:

Domains | Words | Percentage of Total Corpus
Aesthetics | 4,74,960 | 16.29 %
Commerce | 25,064 | 0.86 %
Mass Media | 16,79,511 | 57.61 %
Science and Technology | 1,72,151 | 5.90 %
Social Sciences | 5,63,858 | 19.34 %

A detailed explanation of the Bodo Text Corpus will be available in the Bodo Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Bridul Basumatary & Farson Daimary. 2019. A Gold Standard Bodo Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.

A Gold Standard Chhattisgarhi Raw Text Corpus Vol. II

requests (8)

22,19,592 Words | 55 Titles | XML format | 4 Domains | 28 Sub-categories

Chhattisgarhi, a language of approximately 17 million people, carries profound cultural and historical significance within the region of Chhattisgarh. The Chhattisgarhi Raw Text Corpus offers an unrivaled window into the colloquialisms, idioms, regional vocabularies and grammar that are essential to establishing frameworks for linguistic processing. It is an extensive repository of the linguistic elements found in Chhattisgarhi textual materials.

The corpus of Chhattisgarhi text can be broadly classified into literary and non-literary texts. Data has been collected from books, magazines, newspapers and websites, verified to be true to the original texts, and then warehoused. The Chhattisgarhi Text Corpus is encoded in a machine-readable form and stored in a standard format. The encoding used is primarily Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary text by typing and crawling.

A detailed explanation of the Chhattisgarhi Raw Text Corpus Vol. II will be available in the Chhattisgarhi Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ankita Tiwari, Dr. Satyaendra Kumar Awasthi, Shantanu Kumar, Rajesha N., Manasa G., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. A Gold Standard Chhattisgarhi Raw Text Corpus Vol. II. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-16-3.

Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary. 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0.

A Gold Standard Dogri Raw Text Corpus

requests (26)

8,01,771 Words | 183 Titles | XML format | 05 Text Domains

Dogri is an Indo-Aryan language spoken by about five million people in India and Pakistan, particularly in the Jammu region of Jammu and Kashmir and in Himachal Pradesh, and also in northern Punjab and other parts of Jammu and Kashmir. Dogri was originally written in the Dogri script, which is very close to the Takri script. The language is now more commonly written in Devanagari in India, and in the Nastaʿliq form of Perso-Arabic in Pakistan and Pakistani-administered Kashmir. Dogri has several varieties, all with greater than 80% lexical similarity (within Jammu and Kashmir). Before gaining language status, Dogri was classified by the Census of India as one of the many varieties of Punjabi, such as Majhi or Doabi.

The Dogri text corpus was collected from various libraries in Jammu and Kashmir, mostly in Jammu. The greater part of the text was taken from the library of the Department of Dogri, Jammu University; the Jammu University Library; the J&K Academy of Arts, Culture and Languages; and Dogri Sansatha-Jammu.

The available Text Corpus details:

Domains | Words | Percentage of Total Corpus
Aesthetics | 5,94,609 | 74.16 %
Commerce | 1,350 | 0.17 %
Mass Media | 1,56,756 | 19.55 %
Science and Technology | 2,730 | 0.34 %
Social Sciences | 46,326 | 5.78 %

A detailed explanation of the Dogri Text Corpus will be available in the Dogri Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary & Sunil Kumar. 2019. A Gold Standard Dogri Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.

A Gold Standard Gujarati Raw Text Corpus

requests (19)

28,62,413 Words | 1,364 Titles | XML format | 06 Text Domains

Gujarati is a major Indo-Aryan language and the administrative language of Gujarat and of the Union territories of Daman and Diu and Dadra and Nagar Haveli.

The LDC-IL Gujarati Raw Text Corpus was developed with attention to factors such as quality of the text, representativeness, retrievable format, size of corpus and authenticity. It is encoded in a machine-readable form and stored in a standard format. All text is typed in Unicode-compatible fonts and stored in XML format with embedded metadata. The corpus has been developed from contemporary texts by typing.

The corpus of Gujarati raw text can be broadly classified into literary and non-literary texts. A large body of literary text is available in Gujarati, but knowledge and scientific texts are scarcer; LDC-IL therefore attempted to develop a balanced raw text corpus of Gujarati. Data has been collected from books and newspapers and verified to be true to the original texts.

The available Text Corpus details:

Domains | Words | Percentage of Total Corpus
Aesthetics | 7,42,260 | 25.93 %
Commerce | 43,733 | 1.53 %
Mass Media | 10,70,099 | 37.38 %
Official Document | 29,599 | 1.03 %
Science and Technology | 6,43,737 | 22.49 %
Social Sciences | 3,32,985 | 11.63 %

A detailed explanation of the Gujarati Text Corpus will be available in the Gujarati Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Mona Parakh, Purva S Dholakia, Gadhavi R Hiren & Maheshkumar R Solanki. 2019. A Gold Standard Gujarati Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.
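
Word counts in this listing use Indian digit grouping (the last three digits, then groups of two, so 28,62,413 is 2,862,413), and scraped copies sometimes carry stray spaces inside a figure. A small helper along these lines could normalise such figures before any arithmetic; the function names are illustrative assumptions, not an existing library API.

```python
import re


def parse_indian_number(text):
    """Convert a digit string with Indian-style grouping (and stray spaces) to an int."""
    digits = re.sub(r"[,\s]", "", text)
    if not digits.isdecimal():
        raise ValueError(f"not a grouped number: {text!r}")
    return int(digits)


def format_indian_number(value):
    """Format a non-negative int with Indian grouping: last three digits, then pairs."""
    s = str(value)
    if len(s) <= 3:
        return s
    head, tail = s[:-3], s[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])  # peel off two digits at a time from the right
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])


if __name__ == "__main__":
    n = parse_indian_number("28, 62,413")
    print(n, format_indian_number(n))
```

Negative values and decimals are deliberately out of scope, since the listing only reports whole word and character counts.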

A Gold Standard Hindi Raw Text Corpus

requests (31)

1,03,17,177 Words | 1,223 Titles | XML format | 4 domains

Hindi is a major Indo-Aryan language, a descendant of Sanskrit, spoken in central and northern India in the states of Bihar, Chhattisgarh, Delhi, Haryana, Himachal Pradesh, Jharkhand, Madhya Pradesh, Rajasthan, Uttarakhand and Uttar Pradesh.

The LDC-IL Hindi Text Corpus was developed with attention to factors such as quality of the text, representativeness, retrievable format, size of corpus and authenticity. To collect text, LDC-IL adopts a standard category list of domains and a predefined set of criteria. The corpus of Hindi text can be broadly classified into literary and non-literary texts. Data has been collected from books, magazines and newspapers, verified to be true to the original texts, and then warehoused.

The Hindi Text Corpus is encoded in a machine-readable form and stored in a standard format. The encoding used is primarily Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary text by typing and crawling.

The available Text Corpus details:

Domains | Words | Percentage of Total Corpus
Aesthetics | 38,22,697 | 37.05 %
Mass Media | 50,12,327 | 48.58 %
Science and Technology | 5,49,143 | 5.32 %
Social Sciences | 9,33,010 | 9.04 %

A detailed explanation of the Hindi Text Corpus will be available in the Hindi Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Jitendra Kumar Singh, Richa, Anjali Sinha, Dheeraj Kumar Mishra, Arimardan Kumar Tripathi, Aditi Debsharma, Satyaendra Kumar Awasthi & Madhupriya Pathak. 2019. A Gold Standard Hindi Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.

A Gold Standard Kashmiri Raw Text Corpus

requests (24)

4,66,054 Words | 108 Titles | XML format | 2 domains

Kashmiri is one of the 22 scheduled languages of India and is part of the Eighth Schedule in the constitution of Jammu and Kashmir. It belongs to the Dardic group of the Indo-Aryan language family. Like other Indo-Aryan languages, Kashmiri comprises many dialects. The Kashmiri language was traditionally written in the Sharda script after the 8th century A.D. Over time, however, the Devanagari and Perso-Arabic scripts were adapted to write it.

Kashmiri text can be broadly classified into two types: literary and non-literary. LDC-IL tried to cover all the categories in the standard list. Some categories, such as Novel, Short Stories, Criticism and Literature, have a large number of books, while others, such as Epic, Letters, Administration, Botany, Physics, Chemistry, Zoology and Legislature, have very few.

The Kashmiri text has been typed in Unicode using the InScript keyboard and saved in XML files. Metadata is provided along with the data. The corpus has been developed from the available contemporary text. The Kashmiri Text Corpus in LDC-IL comprises 4,66,054 words and 26,46,948 characters, drawn from books, newspapers and magazines. The two major domains represented are Aesthetics and Social Sciences.

The available Text Corpus details:

Domains | Words | Percentage of Total Corpus
Aesthetics | 4,00,474 | 85.93 %
Social Sciences | 65,580 | 14.07 %

A detailed explanation of the Kashmiri Text Corpus will be available in the Kashmiri Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary & Shahid Mushtaq Bhat. 2019. A Gold Standard Kashmiri Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.

A Gold Standard Kashmiri Raw Text Corpus Vol. II

requests (3)

10,13,658 Words | 123 Titles | XML format | 6 domains | 59 sub-categories

A Gold Standard Kashmiri Raw Text Corpus Vol. II is a collection of Kashmiri language texts comprising 10,13,658 words and 57,28,547 characters. The corpus includes extracts from books, newspapers and magazines, providing a diverse range of linguistic data. It serves as a resource for linguistic research, language processing applications and the preservation of the Kashmiri language.

This volume covers six major domains, whereas the previous volume covered only two, Aesthetics and Social Sciences. The six domains are Aesthetics, Commerce, Mass Media, Official Document, Science and Technology, and Social Science. The corpus is available through the Linguistic Data Consortium for Indian Languages (LDC-IL).

A detailed explanation of the Kashmiri Text Corpus will be available in the Kashmiri Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Dr. Zargar Adil Ahmad, Bi Bi Mariyam, Rajesha N., Manasa G., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. A Gold Standard Kashmiri Raw Text Corpus Vol. II. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-27-9.

Rejitha K. S. and Narayan Kumar Choudhary. (ed.). 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0.

A Gold Standard Konkani Raw Text Corpus

requests (24)

39,95,611 Words | 282 Titles | XML format | 4 domains

Konkani is the principal and administrative language of Goa. It is an Indo-Aryan language belonging to the Indo-European family of languages. Konkani is spoken widely in the western coastal region of India known as Konkan. This region consists of the Konkan division of Maharashtra, the state of Goa, and the Uttara Kannada (formerly North Canara), Udupi and Dakshina Kannada (formerly South Canara) districts of Karnataka, together with many districts in Kerala (such as Kasargod, Kochi, Alappuzha, Trivandrum and Kottayam).

The LDC-IL Konkani Text Corpus was developed with attention to factors such as quality of the text, representativeness, retrievable format, size of corpus and authenticity. To collect text, LDC-IL adopts a standard category list of domains and a predefined set of criteria. The corpus of Konkani text can be broadly classified into literary and non-literary texts. A large body of literary text is available in Konkani, but scientific texts are scarcer; LDC-IL therefore attempts to develop a balanced text corpus of Konkani. Data has been collected from books, magazines and newspapers, verified to be true to the original texts, and then warehoused.

The Konkani Text Corpus is encoded in a machine-readable form and stored in a standard format. The encoding used is primarily Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary text by typing and crawling.

The available Text Corpus details:

Domains | Words | Percentage of Total Corpus
Aesthetics | 17,70,477 | 44.31 %
Mass Media | 20,16,151 | 50.46 %
Science and Technology | 1,04,471 | 2.61 %
Social Sciences | 1,04,512 | 2.62 %

A detailed explanation of the Konkani Text Corpus will be available in the Konkani Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Saurabh Varik, Rashmi Shet Tanawade & Yashwant D Gawas. 2019. A Gold Standard Konkani Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.

A Gold Standard Maithili Raw Text Corpus Vol. II

requests (1)

8,11,680 Words | 54 Titles | XML format | 3 Domains | 21 Sub-categories

The Maithili Raw Text Corpus offers an unrivaled window into the colloquialisms, idioms, regional vocabularies and grammar that are essential to establishing frameworks for linguistic processing. It is an extensive repository of the linguistic elements found in Maithili textual materials.

The corpus of Maithili text can be broadly classified into literary and non-literary texts. Data has been collected from books and magazines, verified to be true to the original texts, and then warehoused. The Maithili Text Corpus is encoded in a machine-readable form and stored in a standard format. The encoding used is primarily Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary text by typing and digitization.

A detailed explanation of the Maithili Raw Text Corpus Vol. II will be available in the Maithili Text Corpus Documentation.

For any research-based citations, please use the following citations:

Shantanu Kumar, Ankita Tiwari, Rajesha N., Manasa G., Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2025. A Gold Standard Maithili Raw Text Corpus Vol. II. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-01-9.

Dr. Rejitha K. S., Dr. Narayan Kumar Choudhary. 2025. LDC-IL Corpus Insights. Central Institute of Indian Languages, Mysore. ISBN: 978-93-48633-33-0.

A Gold Standard Malayalam Raw Text Corpus

requests (15)

63,70,954 Words | 1,119 Titles | XML format | 6 domains

Malayalam is a highly agglutinative and morphologically rich language. The actual patterns of language use in natural texts provide evidence of the language's traits. The Government of India set up the Linguistic Data Consortium for Indian Languages to help those working in the field of language development.

The LDC-IL Malayalam Text Corpus was developed with attention to factors such as quality of the text, representativeness, retrievable format, size of corpus and authenticity. To collect text, LDC-IL adopts a standard category list of domains and a predefined set of criteria. The corpus of Malayalam text can be broadly classified into literary and non-literary texts. A large body of literary text is available in Malayalam, but scientific texts are scarcer; LDC-IL therefore attempts to develop a balanced text corpus of Malayalam. Data has been collected from books, magazines and newspapers, verified to be true to the original texts, and then stored.

The Malayalam Text Corpus is encoded in a machine-readable form and stored in a standard format. The encoding used is primarily Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary text by typing and crawling. The LDC-IL Malayalam Text Corpus contains 63,70,954 words drawn from 1,119 different titles. The six major domains are Aesthetics, Commerce, Official Documents, Social Sciences, Mass Media, and Science & Technology.

The available Text Corpus details:

Domains | Words | Percentage of Total Corpus
Aesthetics | 25,77,090 | 40.45 %
Commerce | 3,13,135 | 4.92 %
Official Documents | 7,733 | 0.12 %
Mass Media | 21,35,621 | 13.74 %
Science and Technology | 16,79,511 | 33.52 %
Social Sciences | 8,75,568 | 7.25 %

A detailed explanation of the Malayalam Text Corpus will be available in the Malayalam Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Saritha S.L., Rejitha K.S. & Sajila S. 2019. A Gold Standard Malayalam Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.

A Gold Standard Manipuri Raw Text Corpus

requests (25)

61,45,278 Words | 4,31,27,842 characters | 6 Domains

The Manipuri Text Corpus is encoded in a machine-readable form and stored in a standard format. The encoding used is primarily Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary texts by typing. The LDC-IL Manipuri Text Corpus contains 61,45,278 words drawn from 1,202 different titles. The six major domains are Aesthetics, Commerce, Mass Media, Official Documents, Science & Technology, and Social Sciences.

The available Text Corpus details:

Domains | Words | Percentage of Total Corpus
Aesthetics | 37,72,994 | 61.40 %
Commerce | 18,450 | 0.30 %
Mass Media | 7,75,261 | 12.62 %
Official Documents | 4,42,950 | 7.21 %
Science and Technology | 3,04,545 | 4.96 %
Social Sciences | 8,31,078 | 13.52 %

A detailed explanation of the Manipuri Text Corpus will be available in the Manipuri Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Amom Nandaraj Meetei, Yumnam Premila Chanu, Longjam Anand Singh & M. Bidyarani Devi. 2019. A Gold Standard Manipuri Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.

A Gold Standard Marathi Raw Text Corpus

requests (20)

21,57,109 Words | 678 Titles | XML format | 5 domains

Marathi is an Indo-Aryan language and the official language of the state of Maharashtra. It is primarily spoken in Maharashtra and in parts of the neighbouring states of Gujarat, Madhya Pradesh, Goa and Karnataka (particularly the bordering districts of Belgaum, Bidar, Gulbarga and Uttara Kannada), as well as in the Union territories of Daman and Diu and Dadra and Nagar Haveli.

The LDC-IL Marathi Text Corpus was developed with attention to factors such as quality of the text, representativeness, retrievable format, size of corpus and authenticity. To collect text, LDC-IL adopts a standard category list of domains and a predefined set of criteria. The corpus of Marathi text can be broadly classified into literary and non-literary texts. A large body of literary text is available in Marathi, but scientific texts are scarcer; LDC-IL therefore attempts to develop a balanced text corpus of Marathi. Data has been collected from books, magazines and newspapers, verified to be true to the original texts, and then warehoused.

The Marathi Text Corpus is encoded in a machine-readable form and stored in a standard format. The encoding used is primarily Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary text by typing and crawling.

The available Text Corpus details:

Domains | Words | Percentage of Total Corpus
Aesthetics | 15,15,039 | 70.23 %
Commerce | 20,795 | 0.97 %
Mass Media | 3,63,120 | 16.83 %
Science and Technology | 55,902 | 2.59 %
Social Sciences | 2,02,253 | 9.38 %

A detailed explanation of the Marathi Text Corpus will be available in the Marathi Raw Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Gajanan R Apine & Apurva P Betkekar. 2019. A Gold Standard Marathi Raw Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.

A Gold Standard Nepali Raw Text Corpus

requests (12)

70,57,524 Words | 1,347 Titles | XML format | 6 domains

Nepali is one of the official languages of the states of West Bengal and Sikkim. It is one of the 22 scheduled languages of India. It is spoken in most of the North-Eastern states of India and also in other states and territories such as Delhi, Uttaranchal, Uttar Pradesh, Bihar and Jharkhand. Nepali is also an official language of Nepal. About a quarter of the population of Bhutan speaks Nepali.

The Nepali Text Corpus is encoded in a machine-readable form and stored in a standard format. The encoding used is primarily Unicode, and the data is stored in XML format with embedded metadata. The corpus has been created from contemporary text by typing and crawling.

The available Text Corpus details:

Domains | Words | Percentage of Total Corpus
Aesthetics | 40,72,977 | 57.71 %
Commerce | 30,354 | 0.43 %
Mass Media | 22,71,064 | 32.18 %
Official Documents | 2,426 | 0.03 %
Science and Technology | 80,306 | 1.14 %
Social Sciences | 6,00,397 | 8.51 %

A detailed explanation of the Nepali Raw Text Corpus will be available in the Nepali Text Corpus Documentation.

For any research-based citations, please use the following citations:

Ramamoorthy, L., Narayan Choudhary, Samar Sinha, Jeena Rai, Umesh Chamling Rai & Rupesh Rai. 2019. A Gold Standard Nepali Text Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.