4237440 Words | 1460 Titles | XML format | 3 domains
Bengali is the official language of West Bengal and Tripura. It belongs to the Indo-Aryan language family and is influenced by Sanskrit. Greater use of Bengali has contributed to the growth of the language in terms of vocabulary and the number of styles and registers. Bengali is spoken o..
2915544 words | 80 Titles | XML format | 5 text domains
Bodo is a major tribal language that belongs to the Tibeto-Burman language family. It is spoken in Assam and other parts of North-East India. Bodo is one of the major languages of Assam and an official language in the Bodoland Territorial Area Districts. Several rivers lik..
801771 Words | 183 Titles | XML format | 5 domains
Dogri is an Indo-Aryan language spoken by about five million people in India and Pakistan, particularly in the Jammu region of Jammu and Kashmir and in Himachal Pradesh, as well as in northern Punjab and other parts of Jammu and Kashmir. Dogri was originally written using the Dogri script ..
Gujarati Raw Text Corpus of 28,62,413 Words | 1,369 Titles | Data and Metadata in XML format | 6 Text Domains
Gujarati is a major Indo-Aryan language and the administrative language of Gujarat and of the Union territories of Daman and Diu and Dadra and Nagar Haveli. The LDC-IL Gujarati Raw Text Corpus was developed according to various factors such as qua..
10317177 Words | 1223 Titles | XML format | 4 domains
Hindi is a major Indo-Aryan language, a descendant of Sanskrit, spoken in central and northern India, in the states of Bihar, Chhattisgarh, Delhi, Haryana, Himachal Pradesh, Jharkhand, Madhya Pradesh, Rajasthan, Uttarakhand, and Uttar Pradesh. LDC-IL Hindi Text Corpus de..
77,63,124 words | 1772 Titles | Data and Metadata in XML format | 6 text domains
Kannada is one of the ancient Indian languages and belongs to the Dravidian family. It has its own script. Although Kannada is considered a classical language because of its long literary history, the Kannada text corpus is extracted from contempora..
466,054 Words | 108 Titles | XML format | 2 domains
Kashmiri is one of the 22 scheduled languages of India and is part of the Eighth Schedule in the constitution of Jammu and Kashmir. It belongs to the Dardic group of the Indo-Aryan language family. Like other Indo-Aryan languages, Kashmiri comprises many dialects. Kashmiri language w..
39,95,611 Words | 282 Titles | XML format | 4 domains
Konkani is the principal and administrative language of Goa. It is an Indo-Aryan language belonging to the Indo-European family of languages and is spoken along the western coast of India. The Konkani language is spoken widely in the western coastal region of India, which is known as Konk..
53,16,552 Words | 499 Titles | XML format | 5 domains
Maithili is an Indo-Aryan language, a direct descendant of Sanskrit, which is spoken in the states of Bihar and Jharkhand and in parts of Nepal. It is one of the scheduled languages of India. The LDC-IL Maithili Text Corpus was developed according to various factors such a..
63,70,954 Words | 1,119 Titles | XML format | 6 domains
Malayalam is a highly agglutinative and morphologically rich language. The actual patterns of language use in natural texts provide evidence of the language's traits. The Government of India set up the Linguistic Data Consortium for Indian Languages to help those who endeavor in the language developm..
6145278 words | 43127842 characters | 6 Domains
The Manipuri Text Corpus is encoded in machine-readable form and stored in a standard format. The text is encoded in Unicode and stored in XML format, with metadata embedded alongside the data. The corpus has been created by typing in contemporary texts. LDC-IL Manip..
2157078 Words | 678 Titles | XML format | 5 domains
Marathi is an Indo-Aryan language. It is the official language of the state of Maharashtra in India. Marathi is primarily spoken in Maharashtra and parts of the neighboring states of Gujarat, Madhya Pradesh, Goa, Karnataka (particularly the bordering districts of Belgaum, Bidar, Gulbarga and ..
70,57,524 Words | 1,347 Titles | XML format | 6 domains
Nepali is one of the official languages of West Bengal and Sikkim. It is one of the 22 scheduled languages of India. It is spoken in most of the North-Eastern states of India and also in other states such as Delhi, Uttaranchal, Uttar Pradesh, Bihar, and Jharkhand. Nepali is also an ..
15,88,287 Words | 206 Titles | Data and Metadata in XML format | 5 Text Domains
Odia (formerly Oriya) is a major Indo-Aryan language spoken in the states of Odisha, West Bengal, Jharkhand, Chhattisgarh and Andhra Pradesh. It is an official language of Odisha and Jharkhand. Odia is the sixth Classical Status Language as de..
1,01,25,770 Words | 2,470 Titles | XML format | 5 domains
Punjabi is the principal and administrative language of Punjab. Punjabi is spoken not only in Punjab in India; it is also a language of Lehnda Punjab in Pakistan. Punjabi is an Indo-Aryan language. The same Punjabi language is written in two scripts, the Gurmukhi script and ..