61,45,278 words | 4,31,27,842 characters | 6 Domains
The LDC-IL Manipuri Text Corpus is machine-readable and stored in a standard format: the text is encoded in Unicode and stored as XML, with metadata embedded in the data. The corpus was created by typing in contemporary texts. It contains 6145278 words drawn from 1202 different titles. The six major domains are Aesthetics, Commerce, Mass Media, Official Documents, Science & Technology, and Social Sciences.
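As a rough illustration of working with Unicode text stored in XML, the sketch below counts words and characters in a small in-memory document. It is not the LDC-IL tooling: the element names in `sample` and the helper `count_words_and_chars` are hypothetical, the sample text is an English placeholder, and the counting convention (whitespace-separated words, characters counted without whitespace) is an assumption that may differ from how the published figures were computed.

```python
import xml.etree.ElementTree as ET


def count_words_and_chars(xml_text):
    """Return (word_count, char_count) for all text found in an XML string."""
    root = ET.fromstring(xml_text)
    # itertext() walks every element, so no particular tag names are assumed.
    # Note that this also picks up any metadata stored as element text.
    words = []
    for chunk in root.itertext():
        words.extend(chunk.split())
    # Assumption: characters are counted without whitespace.
    chars = sum(len(word) for word in words)
    return len(words), chars


# Hypothetical sample; real corpus files will use their own schema.
sample = (
    "<doc><title>Sample title</title>"
    "<body><p>first sentence here</p><p>second one</p></body></doc>"
)

if __name__ == "__main__":
    word_count, char_count = count_words_and_chars(sample)
    print(word_count, char_count)
```

Because `len()` on a Python string counts Unicode code points, a character count for Manipuri text produced this way reflects code points rather than visual glyphs.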
The available Text Corpus Details:
Percentage of Total
Science and Technology
- Ramamoorthy, L., Narayan Choudhary, Amom Nandaraj Meetei, Yumnam Premila Chanu, Longjam Anand Singh & M. Bidyarani Devi. 2019. A Gold Standard Manipuri Raw Text Corpus. Central Institute of Indian Languages, Mysore.
- Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.
- Authors: Ramamoorthy L., Narayan Choudhary, Amom Nandaraj Meetei, Yumnam Premila Chanu, Longjam Anand Singh, Bidyarani Devi M
- Corpus Type: Raw Corpus
- Catalogue Number: 1146
- ISBN: 978-81-7343-245-3
- Data Source: Typed+Cleaned
- Character Count: 43127842
- Word Count: 6145278
- Release Date: 04/04/2019
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.