Text type resource
42,37,440 Words | 1,460 Titles | XML format | 3 domains
Bengali is the official language of West Bengal and Tripura. It belongs to the Indo-Aryan language family and is influenced by Sanskrit. Wider use of Bengali has contributed to the growth of the language in vocabulary and in the number of styles and registers. Bengali is spoke..
29,15,544 Words | 80 Titles | XML format | 5 domains
Bodo is a major tribal language that belongs to the Tibeto-Burman language family. It is spoken in Assam and other parts of North-East India. Bodo is one of the major languages of Assam and an official language in the Bodoland Territorial Area Districts. Several rivers lik..
8,01,771 Words | 183 Titles | XML format | 5 text domains
Dogri is an Indo-Aryan language spoken by about five million people in India and Pakistan, chiefly in the Jammu region of Jammu and Kashmir and in Himachal Pradesh, and also in northern Punjab and other parts of Jammu and Kashmir. Dogri was originally written using the Dogri..
28,62,413 Words | 1,369 Titles | XML format | 6 text domains
Gujarati is a major Indo-Aryan language and the administrative language of Gujarat and of the Union territories of Daman and Diu and Dadra and Nagar Haveli. The LDC-IL Gujarati Raw Text Corpus was developed according to various factors such as quality of the text, representativeness..
1,03,17,177 Words | 1,223 Titles | XML format | 4 domains
Hindi is a major Indo-Aryan language, a descendant of Sanskrit, spoken in central and northern India, in the states of Bihar, Chhattisgarh, Delhi, Haryana, Himachal Pradesh, Jharkhand, Madhya Pradesh, Rajasthan, Uttarakhand, and Uttar Pr..
77,63,124 Words | 1,772 Titles | Data and metadata in XML format | 6 text domains
Kannada is one of the ancient languages of India and belongs to the Dravidian family. It has its own script. Although Kannada is considered a classical language because of its long literary history, the Kannada text corpus is extracted from contem..
4,66,054 Words | 108 Titles | XML format | 2 domains
Kashmiri is one of the 22 scheduled languages of India, listed in the Eighth Schedule of the Constitution of India. It belongs to the Dardic group of the Indo-Aryan language family. Like other Indo-Aryan languages, Kashmiri comprises many dialects. The Kas..
39,95,611 Words | 282 Titles | XML format | 4 domains
Konkani is the principal and administrative language of Goa. It is an Indo-Aryan language belonging to the Indo-European family of languages and is spoken widely along the western coastal region of India known as Konkan. T..
53,16,552 Words | 499 Titles | XML format | 5 domains
Maithili is an Indo-Aryan language, a direct descendant of Sanskrit, spoken in the states of Bihar and Jharkhand and in parts of Nepal. It is one of the scheduled languages of India. The LDC-IL Maithili Text Corpus was developed according to various factors such as quality of the text, repre..
63,70,954 Words | 1,119 Titles | XML format | 6 domains
Malayalam is a highly agglutinative and morphologically rich language. The actual pattern of language use in natural texts gives evidence of the language's traits. The Government of India set up the Linguistic Data Consortium for Indian Languages to help those who endeavor in the language developm..
61,45,278 Words | 4,31,27,842 Characters | 6 domains
The Manipuri Text Corpus is encoded in a machine-readable form and stored in a standard format. The text is encoded primarily in Unicode and stored in XML format, with metadata embedded alongside the data. The corpus has been created from contemporary texts in a typed meth..
21,57,078 Words | 678 Titles | XML format | 5 domains
Marathi is an Indo-Aryan language and the official language of the state of Maharashtra. It is spoken primarily in Maharashtra (India) and in parts of the neighboring states of Gujarat, Madhya Pradesh, Goa, Karnataka (particularly the bordering districts of Belgaum, Bidar, Gulbarga, and U..
70,57,524 Words | 1,347 Titles | XML format | 6 domains
Nepali is one of the official languages of West Bengal and Sikkim. It is one of the 22 scheduled languages of India. It is spoken in most of the North-Eastern states of India and also in other states such as Delhi, Uttarakhand, Uttar Pradesh, Bihar, and Jharkhand. Nepali is als..
15,88,287 Words | 206 Titles | XML format | 5 text domains
Odia (formerly Oriya) is a major Indo-Aryan language spoken in the states of Odisha, West Bengal, Jharkhand, Chhattisgarh, and Andhra Pradesh. It is the official language of Odisha and Jharkhand. Odia is the sixth language to be designated a Classical Language by the Govt...
1,01,25,770 Words | 2,470 Titles | XML format | 5 domains
Punjabi is the principal and administrative language of Punjab. It is spoken not only in Punjab in India; it is also a language of Lehnda Punjab in Pakistan. Punjabi is an Indo-Aryan language. The same Punjabi language is written in two scripts, in Gurmukh..