A Gold Standard Kannada Raw Text Corpus

Owner Central Institute of Indian Languages

Catalogue Number: 1127

Tags: Kannada Raw Text Corpus

Dataset Description

77,63,124 words | 1772 Titles | Data and Metadata in XML format | 6 text domains

Kannada is one of the ancient languages of India and belongs to the Dravidian family. It has its own script. Although Kannada is recognised as a classical language because of its long literary history, this corpus is drawn from contemporary text sources. To keep the corpus balanced, the text was either keyed in and proofread from book extracts across various domains or crawled from news websites. The corpus is encoded in Unicode, and the data and its metadata are provided in XML format.
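Because the corpus is Unicode-encoded Kannada text, a simple sanity check on any extract is to measure how many of its characters fall in the Unicode Kannada block (U+0C80 to U+0CFF). The following is a minimal sketch only: the function names are illustrative and are not part of the corpus distribution or of any LDC-IL tool.

```python
# Unicode Kannada block boundaries (U+0C80..U+0CFF).
KANNADA_START = 0x0C80
KANNADA_END = 0x0CFF


def is_kannada_char(ch: str) -> bool:
    """Return True if the single character lies in the Kannada block."""
    return KANNADA_START <= ord(ch) <= KANNADA_END


def kannada_ratio(text: str) -> float:
    """Share of non-whitespace characters that are in the Kannada block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_kannada_char(ch) for ch in chars) / len(chars)


if __name__ == "__main__":
    # "Kannada" written in Kannada script, followed by a Latin word.
    print(kannada_ratio("ಕನ್ನಡ"))
    print(kannada_ratio("ಕನ್ನಡ corpus"))
```

A low ratio on a sample would suggest mixed-script text, digits, or punctuation-heavy content rather than an encoding problem as such, so the result is best read as a rough indicator.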

Details of the available text corpus by domain:

Domains	Words	Percentage of Total Corpus
Aesthetics	37,78,723	48.68 %
Commerce	2,07,053	2.67 %
Mass Media	26,81,611	34.54 %
Official Document	5,357	0.07 %
Science and Technology	2,43,166	3.13 %
Social Sciences	8,47,214	10.91 %
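The word counts on this page use Indian digit grouping (lakhs), so 77,63,124 is 7,763,124 in Western grouping, which matches the Word Count given under Item specifics. The sketch below shows how such figures can be parsed and how a domain's share of the total can be recomputed. It uses three rows from the table as sample input; the helper names are illustrative assumptions, not part of any corpus tooling.

```python
def parse_indian_number(text: str) -> int:
    """Parse a count written with Indian digit grouping, e.g. '77,63,124'."""
    return int(text.replace(",", "").strip())


def share_percent(words: int, total: int) -> float:
    """Percentage of the total corpus, rounded to two decimal places."""
    return round(100 * words / total, 2)


# Total word count as stated on this page.
TOTAL = parse_indian_number("77,63,124")

# A few rows copied from the domain table (sample only, not the full table).
SAMPLE_ROWS = {
    "Aesthetics": "37,78,723",
    "Commerce": "2,07,053",
    "Social Sciences": "8,47,214",
}

if __name__ == "__main__":
    for domain, raw in SAMPLE_ROWS.items():
        words = parse_indian_number(raw)
        print(f"{domain}: {words} words, {share_percent(words, TOTAL)} %")
```

Recomputing shares this way is a quick consistency check between the Words and Percentage columns.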

A detailed explanation of the Kannada Text Corpus will be available in the Kannada Raw Text Corpus Documentation.

When citing this corpus in research, please use the following references:

Ramamoorthy, L., Narayan Choudhary, Vijayalaxmi F. Patil, Chetan Suryakant Baji, Malini N Abhyankar, Rajesha N. & Manasa G. 2019. A Gold Standard Kannada Raw Text Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.

Item specifics

Authors: Ramamoorthy L., Narayan Choudhary, Vijayalaxmi F. Patil, Chetan Suryakant Baji, Rajesha N., Manasa G., Sunitha Rajendra, Reshma S., Kavitha L., Malini N. Abhyankar
Corpus Type: Raw Corpus
Catalogue Number: 1127
ISBN: 978-81-7343-226-2
Data Source: Typed+Cleaned
Character Count: 64909781
Word Count: 7763124
Release Date: 04-Apr-2019
Terms and Conditions: General instructions for use of the resources provided by LDC-IL.
