Tamil Parts of Speech Annotated Corpus
Dataset Description
2131256 Tags | 1750935 Words | 172089 Sentences
The Linguistic Data Consortium for Indian Languages (LDC-IL) has developed Parts-of-Speech annotated corpora for the Scheduled Indian languages, including this annotated text corpus for Tamil. The corpus is annotated with Part-of-Speech (PoS) tags based on the Bureau of Indian Standards (BIS) PoS Tagset. The Tamil PoS annotated corpus was tagged automatically and then verified by linguistic experts to ensure accuracy and consistency. It is a significant resource for natural language processing and linguistic research.
The Tamil PoS annotated corpus contains 2131256 Part-of-Speech tags.
For research use, please cite the following:
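The headline counts can be turned into simple per-unit averages as a quick sanity check on the corpus size. The sketch below is illustrative only: it uses the three figures stated on this page, and the function name `corpus_ratios` is hypothetical, not part of any LDC-IL tooling. The page does not explain why tags outnumber words, so the code only reports the ratio and does not interpret it.

```python
# Counts as stated in the dataset description on this page.
TAGS = 2_131_256
WORDS = 1_750_935
SENTENCES = 172_089


def corpus_ratios(tags: int, words: int, sentences: int) -> dict[str, float]:
    """Return per-unit averages derived from the headline counts.

    Hypothetical helper for illustration; not part of the corpus distribution.
    """
    return {
        "tags_per_word": tags / words,
        "words_per_sentence": words / sentences,
        "tags_per_sentence": tags / sentences,
    }


if __name__ == "__main__":
    # Print each average rounded to two decimal places.
    for name, value in corpus_ratios(TAGS, WORDS, SENTENCES).items():
        print(f"{name}: {value:.2f}")
```

These averages describe only the published totals; they say nothing about the distribution of sentence lengths or of individual tags within the corpus.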
1. Dr. Amudha R, Dr. Kamaraj S, Dr. Prem Kumar L. R., and Dr. Narayan Choudhary. 2026. Tamil Parts of Speech Annotated Corpus. Central Institute of Indian Languages, Mysore. 978-81-69175-98-2.
2. Rejitha K. S. and Narayan Kumar Choudhary (eds.). 2026. LDC-IL Parts of Speech Annotated Corpus Based on BIS Framework. Central Institute of Indian Languages, Mysore. 978-81-69175-60-9.
Item specifics
- Authors: Dr. Amudha R, Dr. Kamaraj S, Dr. Prem Kumar L. R., Dr. Narayan Choudhary
- Corpus Type: Parts of Speech Annotated Text Corpus
- Data Source: Annotated
- Word Count: 1750935
- Tag Count: 2131256
