A Gold Standard Kannada Raw Text Corpus

Catalogue Number: 1127


Dataset Description

Kannada text corpus of 77,63,124 words | 1772 titles | Data and metadata in XML format | 6 text domains


Kannada is one of the ancient Indian languages and belongs to the Dravidian family. It has its own script. Although Kannada is considered a classical language because of its long literary history, this text corpus is drawn from contemporary sources. To keep the corpus balanced, the text was collected either by keying in and proofreading extracts from books in various domains or by crawling news websites. The corpus is encoded in Unicode, and the data and its metadata are stored in XML format.
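
Because the corpus is Unicode text wrapped in XML, it can be read with a standard XML parser. The following is only a minimal sketch: the element name `body`, the function `count_words`, and the sample document are hypothetical placeholders, since the actual schema is defined in the corpus documentation and is not reproduced on this page. Counting whitespace-separated tokens is likewise an assumption and may not match how the published word counts were produced.

```python
import unicodedata
import xml.etree.ElementTree as ET


def count_words(xml_text: str, text_tag: str) -> int:
    """Count whitespace-separated tokens inside every <text_tag> element.

    `text_tag` is a placeholder: substitute the element that holds the
    running text in the real corpus files.
    """
    root = ET.fromstring(xml_text)
    total = 0
    for element in root.iter(text_tag):
        # Join all nested text and normalise to NFC so that equivalent
        # Unicode sequences are treated alike.
        text = unicodedata.normalize("NFC", "".join(element.itertext()))
        total += len(text.split())
    return total


# Hypothetical sample; the real files use their own element names.
sample = (
    "<doc>"
    "<meta><title>ಕನ್ನಡ</title></meta>"
    "<body>ಕನ್ನಡ ಒಂದು ಭಾಷೆ</body>"
    "</doc>"
)

print(count_words(sample, "body"))
```

Passing the tag name as a parameter keeps metadata fields such as the title out of the count and avoids hard-coding a schema that this page does not specify.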

 

The corpus is distributed across the six text domains as follows.

 

  • Aesthetics Domain - 37,78,723 words - 48.68% of the corpus
  • Commerce Domain - 2,07,053 words - 2.67% of the corpus
  • Mass Media Domain - 26,81,611 words - 34.54% of the corpus
  • Official Document Domain - 5,357 words - 0.07% of the corpus
  • Science and Technology Domain - 2,43,166 words - 3.13% of the corpus
  • Social Sciences Domain - 8,47,214 words - 10.91% of the corpus
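
The word counts above use Indian digit grouping (lakhs), so 77,63,124 is 7,763,124. As a quick check, the sketch below strips the separators, sums the six domain counts, and recomputes each domain's share of the total. The names `parse_grouped`, `domains`, `counts`, and `shares` are illustrative only and are not part of the corpus or its tooling.

```python
def parse_grouped(number: str) -> int:
    """Convert a comma-grouped figure such as '37,78,723' to an integer."""
    return int(number.replace(",", ""))


# Domain word counts exactly as listed on this page.
domains = {
    "Aesthetics": "37,78,723",
    "Commerce": "2,07,053",
    "Mass Media": "26,81,611",
    "Official Document": "5,357",
    "Science and Technology": "2,43,166",
    "Social Sciences": "8,47,214",
}

counts = {name: parse_grouped(value) for name, value in domains.items()}
total = sum(counts.values())

# Share of each domain as a percentage, rounded to two decimals.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

print(f"Total words: {total}")
for name, share in shares.items():
    print(f"{name}: {counts[name]} words, {share:.2f}%")
```

The recomputed total equals the stated corpus size of 77,63,124 words, and the rounded shares agree with the percentages listed for each domain.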

A more detailed description of the Kannada text corpus is available in the Kannada Raw Text Corpus Documentation.

For research citations, please cite the following:

  • Ramamoorthy, L., Narayan Choudhary, Vijayalaxmi F. Patil, Chetan Suryakant Baji, Malini N. Abhyankar, Rajesha N. & Manasa G. 2019. A Gold Standard Kannada Raw Text Corpus. Central Institute of Indian Languages, Mysore.
  • Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.

Item specifics

  • Authors: Ramamoorthy L., Narayan Choudhary, Vijayalaxmi F. Patil, Chetan Suryakant Baji, Rajesha N., Manasa G., Sunitha Rajendra, Reshma S., Kavitha L., Malini N. Abhyankar
  • Corpus Type: Raw Corpus
  • Catalogue Number: 1127
  • ISBN: 978-81-7343-226-2
  • Data Source: Typed+Cleaned
  • Character Count: 64909781
  • Word Count: 7763124
  • Release Date: 04/04/2019
  • Terms and Conditions: General instructions for use of the resources provided by LDC-IL (Commercial User, Non-Commercial User).
