Frequently Asked Questions

Frequently Asked Questions about Linguistic Data Consortium for Indian Languages (LDC-IL)

1. What is LDC-IL?

Ans. The mandate of LDC-IL is to cover as many languages as possible in its endeavor to help Indian languages absorb technology, and to develop tools and data for language technology. For more details about LDC-IL, its history and its other activities, please visit the LDC-IL home page at www.ldcil.org

2. Is LDC-IL related to ELRA or to the LDC of the University of Pennsylvania?

Ans. There is no connection. However, this consortium has been set up along the lines of those institutions and shares the same goals for Indic languages and other related languages.

3. How does LDC-IL ensure the quality of the data?

Ans. LDC-IL has several mechanisms, with checks and balances, to ensure that the quality of the data meets the needs of the developer community working in language technology. For a general description of the LDC-IL text and speech datasets, please see the respective overview documents: "LDC-IL Raw Speech Corpora: An Overview" and "LDC-IL Raw Text Corpora: An Overview".

4. Which datasets are available on the LDC-IL data portal?

Ans. Currently, the LDC-IL data portal offers 31 datasets across different Indian languages, as listed below.

A Gold Standard Raw Text Corpus:

Bengali	Bodo	Dogri	Gujarati	Hindi	Kannada	Kashmiri	Konkani	Maithili
Malayalam	Manipuri	Marathi	Nepali	Odia	Punjabi	Tamil	Telugu	Urdu


Raw Speech Corpus:

Bengali	Bodo	Hindi	Kannada	Konkani	Maithili	Malayalam	Manipuri
Marathi	Nepali	Punjabi	Telugu	Urdu

5. How recent is the data?

Ans. The LDC-IL text and speech corpora are of contemporary origin. Generally, the LDC-IL text data dates from 1990 onwards. The text is prose only, and no classical or older literature is part of the datasets. The speech datasets have been collected in the field from 2007 onwards.

6. Does it contain social media data? What is the source of the data?

Ans. The current text datasets do not include text from social media. However, there are plans to aggregate text from social media as well, probably as a separate dataset.

7. What sort of applications benefit from this data?

Ans. The text datasets may be used for many kinds of language modeling tasks using machine learning. Additionally, as the datasets are representative, they can be used for various types of linguistic analysis and may be useful in several sub-disciplines of language and linguistic studies and language technology. The speech datasets can be used for Automatic Speech Recognition and Text-to-Speech systems, as well as for phonetic, phonological and acoustic analysis.

8. Are the data available for free?

Ans. Some datasets are available for free to research students (Research Scholars) and recognized academic/research Institutions of India, on the condition that derivatives prepared using these datasets are not used for commercial purposes. For the full terms and conditions, please see the Non-Commercial Undertaking.

9. What are the terms and conditions for free usage of the datasets?

Ans. For individual researchers affiliated to recognized academic and research Institutes, use is limited to the period of their affiliation to that Institute (or, subsequently, to other research Institutes). Once they cease to be affiliated to any research or non-commercial Institute, they should destroy the datasets and must not use them for any commercial purposes. For the full terms and conditions of non-commercial use, please see the Non-Commercial Undertaking.

10. Are the datasets available for commercial use?

Ans. Yes, all the datasets are available for commercial use. The terms and conditions of commercial use are given in the Commercial Undertaking.

11. Are there any discounts for commercial use?

Ans. Discounts or subsidies in pricing are offered to different tiers of commercial users who contribute to promoting Indian languages through language technology. A detailed description is available on the LDC-IL website.

The pricing for each category of commercial user is noted below; the last three categories are subsidized:

User Category	Price
MNCs and Foreign Entities	Base Price
Non-MNC Indian Company	80% of Base Price
MSME/Entities from SAARC Countries	60% of Base Price
Startups/MSMEs with a turnover of less than 5 crores	20% of Base Price

As noted above, there are no subsidies for MNCs and entities not belonging to India. For the other three categories, subsidies are given. To avail of the subsidies, users must provide the necessary documents to prove their eligibility for the subsidized categories.

As of now, if you are seeking a subsidized price under any of the three categories of Startup, MSME or Non-MNC Indian company, you may have to submit your application and the details of your company, along with the necessary documents (e.g. incorporation document, DIPP registration documents, etc.), to this Institute.
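
The tier percentages above amount to a simple calculation. The following is a minimal sketch in Python, not part of the LDC-IL portal: the function name, the category labels and the example base price of 10000 are hypothetical and chosen only for illustration. Actual base prices are set by LDC-IL.

```python
# Percentage of the base price payable by each category, taken from the
# table above. The dictionary keys are hypothetical labels for this sketch,
# not identifiers used by LDC-IL.
PRICE_PERCENT = {
    "mnc_or_foreign": 100,
    "non_mnc_indian": 80,
    "msme_or_saarc": 60,
    "startup_or_small_msme": 20,
}


def payable_price(base_price, category):
    """Return the amount payable for a dataset with the given base price."""
    if category not in PRICE_PERCENT:
        raise ValueError(f"unknown category: {category}")
    return base_price * PRICE_PERCENT[category] / 100


# Example with a hypothetical base price of 10000.
for category in PRICE_PERCENT:
    print(category, payable_price(10000, category))
```

Eligibility for a subsidized category still has to be established through the documents described above; the sketch only covers the arithmetic.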

12. What is the process of user registration approval?

Ans. Commercial users are self-approved upon verification of their email addresses. Non-Commercial users are approved by LDC-IL staff upon verification of the documents they submit. If you submit incorrect documents, your email address may be blocked.

13. What are the documents required to successfully register on the portal?

Ans. The documents required vary by user type. Students, researchers and faculty members need to submit a copy of their valid identity card as well as a letter, duly forwarded by the Head of the Department or other competent authority, stating the need for the dataset and the purpose it will serve. The name of the dataset user must also appear on the request letter forwarded by the competent authority. If the request comes from the Head of the Institution, it must be on the institution's letterhead and must state the target group of users responsible for managing the datasets.

Commercial entities seeking subsidized pricing need to submit their incorporation documents along with the certifications proving that they belong to the specified categories (e.g. certified company returns, balance sheet, other certifications). These must be submitted as hard copies when sending the signed undertaking to LDC-IL at the specified address.

14. Can we upload the documents online?

Ans. Non-Commercial users can upload their IDs while requesting registration. Other documents need to be submitted as hard copies after raising the data request. The necessary documents, along with the Undertaking (duly attested) and the cheque/DD, should be sent to "The Director, Central Institute of Indian Languages, Manasagangotri, Mysuru - 570006 Karnataka". Also, superscribe the cover with “Request for LDC-IL dataset”. You can send this application either before or after making the data request on the data portal.

15. Is there any verification process for users registering on the data portal?

Ans. Registration on the data distribution portal is not automatic, as there are user categories for which verification is required. All users, except those willing to purchase the datasets at the base price, need to upload the necessary documents on the portal. Users seeking datasets also need to send hard copies of the documents to LDC-IL by post before they are verified and activated on the portal.

16. How many days will be required to get the data?

Ans. After payment is processed, it may take a minimum of 7 to 10 working days for the data to be made available to you for download. If the data size is larger (e.g. above 2 GB, as is the case for all the speech data), it may take longer, as the data is delivered on physical media (chargeable separately) and couriered to the requester's address.

17. How can the user receive the data?

Ans. Commercial and non-commercial users can obtain the data in the following ways:

Sl. No.	Data Size	Link/Physical Media	Charges *
1	Up to 2 GB	Weblink	No charges
2	2 GB to 110 GB	128 GB Pen Drive	Rs. 2000/- (within India)
3	Exceeding 110 GB	External Hard Disk	Rs. 4500/- (within India)

· Charges include physical media expenses as well as shipping charges.

· Shipping charges may be higher for foreign countries; these will be communicated after the data request is received.
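
The delivery table above can be read as a simple size-based rule. The sketch below is illustrative only and is not part of the LDC-IL portal: the function name is hypothetical, and it assumes that a dataset of exactly 2 GB is delivered by weblink and one of exactly 110 GB on a pen drive, which the table implies but does not state outright. The charges shown apply within India only.

```python
def delivery_option(size_gb):
    """Return (delivery medium, charge in rupees within India) for a data size in GB.

    Thresholds and charges follow the table above; the handling of the
    exact boundary values (2 GB and 110 GB) is an assumption of this sketch.
    """
    if size_gb <= 0:
        raise ValueError("data size must be positive")
    if size_gb <= 2:
        return ("Weblink", 0)
    if size_gb <= 110:
        return ("128 GB Pen Drive", 2000)
    return ("External Hard Disk", 4500)


# Example: hypothetical dataset sizes in GB.
for size in (1.5, 40, 250):
    print(size, delivery_option(size))
```

For deliveries outside India, the shipping component is communicated separately, so the charge returned here would not apply as-is.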

18. Can the payments be made online?

Ans. As of now, online payment facilities are not available. Until they are, all payments must be made by Cheque or Demand Draft drawn in favour of “MHRD HIGHER CAS CLG, NEW DELHI”. In exceptional cases, we may provide a mechanism for NEFT/RTGS/Bank Transfer (in which case you need to write to us separately).

19. What is the refund policy?

Ans. After you have made the payment and it has been realized (i.e. deposited into the Government Account), no refund will be entertained.

20. As a researcher, how long can I use this data?

Ans. If you are an academic license holder, you can use the datasets as long as you are affiliated to a not-for-profit research organization. The moment you graduate or start working for a commercial organization (one that is not a not-for-profit organization or a recognized research organization in India), you should stop using the datasets.

21. I have a sizeable and interesting linguistic resource dataset that I want to publish. Can LDC-IL help with this?

Ans. LDC-IL is at present a Government-run body, but it aims to become a consortium (an autonomous body) and stand on its own. If you have a resource that you think would be of use to the language technology development community, you can share it with us. LDC-IL will help you either publish it separately or make it a part of the existing datasets, giving you credit for your contribution. For more information, please contact LDC-IL.

22. Would this data be updated in the future? If yes, how often?

Ans. The data may be updated as often as new content relevant to the respective datasets arrives. Updated datasets may be released in subsequent editions of the respective titles. Existing commercial customers may be notified of such updates and, upon payment of any differential amount, may obtain the updated dataset.