Frequently Asked Questions about Linguistic Data Consortium for Indian Languages (LDC-IL)

 

1.     What is LDC-IL?

Ans. The Linguistic Data Consortium for Indian Languages (LDC-IL) is a scheme of the Department of Higher Education, Ministry of Human Resource Development (MHRD), Govt. of India. It is implemented by the Central Institute of Indian Languages (CIIL), a subordinate office of the Department of Higher Education, MHRD, located at Manasgangotri, Mysore, Karnataka, INDIA. The mandate of LDC-IL is to cover as many languages as possible in its endeavor to help Indian languages absorb technology and develop into vehicles of language technology.

 

2.     Are data available for free?

Ans. Some datasets are available free of charge to recognized academic and research institutions of India, on the condition that derivatives prepared using these datasets are not used for commercial purposes.

 

3.     Is it available for commercial use?

Ans. Yes, all the datasets are available for commercial use.

 

4.     What are the terms and conditions for free usage of the datasets?

Ans. Individual researchers affiliated to recognized academic and research institutes may use the datasets only during their tenure at that institute (or subsequently at other research institutes). Once they cease to be affiliated to any non-commercial research institute, they should destroy the datasets and not use them for any commercial purposes.

 

5.     Are there any discounts for commercial use?

Ans. Discounts (subsidies) on pricing are offered to different tiers of commercial users who contribute to promoting Indian languages through language technology. A detailed description is available on the LDC-IL website.

Subsidies are given to three categories of users as noted below:

User category                                          Price
MNCs and Foreign Entities                              Base Price
Non-MNC Indian Company                                 80% of Base Price
MSME/Entities from SAARC Countries                     60% of Base Price
Startups/MSMEs with a turnover of less than 5 crores   20% of Base Price

 

As noted above, there are no subsidies for MNCs and entities not belonging to India. Subsidies are given for the other three categories. To avail of the subsidies, users have to provide the necessary documents to prove their eligibility for the subsidized categories.

As of now, if you are seeking a subsidized price under any of the categories of Startup, MSME or non-MNC Indian company, you may have to apply to this Institute to be considered as such.

To do so, send an application with the details of your company and the necessary documents (e.g. incorporation document, DIPP registration documents etc.) to "Director, Central Institute of Indian Languages, Manasgangotri, Mysuru - 570006 Karnataka". You can send this application either before or after making the data request on the data portal. However, you may hold off on making your payment / drawing the DD/Cheque until approval comes from this office.
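
The tier percentages listed in this answer reduce to simple arithmetic. The following Python sketch is illustrative only: the tier labels, the function name `payable_price`, and the example base price are assumptions made for this example and are not part of any LDC-IL system. Actual prices and eligibility should be confirmed with LDC-IL.

```python
# Percentage of the base price payable per user category, as listed in this FAQ.
# The dictionary keys are hypothetical labels chosen for this sketch.
PRICE_PERCENT = {
    "mnc_or_foreign_entity": 100,    # Base Price, no subsidy
    "non_mnc_indian_company": 80,    # 80% of Base Price
    "msme_or_saarc_entity": 60,      # 60% of Base Price
    "startup_or_small_msme": 20,     # turnover of less than 5 crores
}


def payable_price(base_price_rs: float, tier: str) -> float:
    """Return the price payable for a dataset, given its base price and tier."""
    if base_price_rs < 0:
        raise ValueError("base price must not be negative")
    if tier not in PRICE_PERCENT:
        raise ValueError(f"unknown tier: {tier!r}")
    return base_price_rs * PRICE_PERCENT[tier] / 100


if __name__ == "__main__":
    # Rs. 50,000 is an invented base price used only for illustration.
    for tier in PRICE_PERCENT:
        print(tier, payable_price(50_000, tier))
```

With the invented base price of Rs. 50,000, a non-MNC Indian company would pay Rs. 40,000 and an eligible startup Rs. 10,000.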


 

6.     Is there any verification process for users registering on the data portal?

Ans. Registration on the data distribution portal is not automatic, as verification is required for some user categories. All users, except those willing to purchase the datasets at the base price, need to upload the necessary documents on the portal. Users seeking datasets also need to send hard copies of the documents to LDC-IL by post before they are verified and activated on the portal.

 

7.     What is the process of user registration approval?

Ans. After you complete the registration process on the data portal, your user account will be approved by CIIL.

 

8.     Can the payments be made online?

Ans. As of now, online payment facilities are not available. Until they are, all payments must be made by Cheque or Demand Draft drawn in favour of “MHRD HIGHER CAS CLG, NEW DELHI”.

 

9.     What are the documents required to successfully register on the portal?

Ans. The documents required vary by user type. Students, researchers and faculty members need to submit a copy of their valid identity card as well as a letter (duly forwarded by the Head of the Department/Competent Authority) stating the need for the dataset and the purpose of its use. The names of the dataset users also need to be listed in the request letter forwarded by the competent authority. If the request comes from the Head of an Institution, it must be on the institution's letterhead and state the target group of users responsible for management of the datasets.

Commercial entities seeking subsidized pricing need to submit their incorporation documents along with the certifications proving that they belong to the specified categories (e.g. certified company returns, balance sheet, other certifications).

 

10.     As a researcher, how long can I use this data?

Ans. If you are an academic license holder, you can use the datasets as long as you are affiliated to a not-for-profit research organization. The moment you graduate or start working for a commercial organization (one that is neither a not-for-profit organization nor a recognized research organization), you should stop using the datasets.

 

11.     I have a sizeable, interesting linguistic resource dataset that I want to publish. Can LDC-IL help with it?

Ans. LDC-IL is at present a government-run body, but it aims to become a consortium and stand on its own. If you have a resource that you think would be of use to the language technology development community, you can share it with us. LDC-IL will help you either publish it separately or make it a part of the existing datasets, giving you credit for your contribution. For more information, please contact LDC-IL.


12.     Is LDC-IL related to the European Language Resources Association (ELRA) of Europe or the Linguistic Data Consortium (LDC) of the University of Pennsylvania, USA?

Ans. There is no connection. However, this consortium has been set up along the lines of those institutions and shares the same goals for Indic languages and other related languages.

 

13.     Which datasets are available on the LDC-IL data portal?

Ans. Currently the LDC-IL data portal is releasing 31 datasets for different Indian languages, as listed below.

A Gold Standard Raw Text Corpus:
Bengali, Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Odia, Punjabi, Tamil, Telugu, Urdu

Raw Speech Corpus:
Bengali, Bodo, Hindi, Kannada, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Punjabi, Telugu, Urdu


14.     How does LDC-IL ensure the quality of the data?

Ans. LDC-IL has several mechanisms, with checks and balances, to ensure that the quality of the data meets the needs of the developer community working in the area of language technology. For a generic description of the text and speech datasets of LDC-IL, please see the respective generic documentation: GENERIC LDC-IL Raw Speech Corpus Documentation and GENERIC LDC-IL Raw Text Corpus Documentation.


15.    How many days will be required to get the data?

Ans. After payment is processed, it typically takes 7 to 10 working days to make the data available to you for download, though it may happen sooner.

 

16.     How recent is the data?

Ans. The LDC-IL text and speech corpora are of contemporary origin. Generally, the LDC-IL text data does not go back beyond the year 1990. The text is prose only, and no classical or older literature is part of the datasets. The speech datasets have been collected in the field from 2007 onwards.

 

17.     Does it contain social media data? What is the source of the data?

Ans. The current text datasets do not include text from social media, but there are plans to aggregate text from social media as well.

 

18.     What sort of applications benefit from this data?

Ans. The text datasets may be used for several types of language modeling using machine learning. Additionally, as the datasets are representative ones, they can also be used for several types of linguistic analysis and may be useful in several sub-disciplines of language and linguistic studies and language technology. The speech datasets can be used for Automatic Speech Recognition and Text to Speech Systems as well as other types of phonetic, phonological and acoustic analysis.

 

19.     Would this data be updated in the future? If yes, how often?             

Ans. The data may be updated whenever new content relevant to the respective datasets arrives. Updated datasets may be released in subsequent editions of the respective titles.


20.     How can the user receive the data?

Ans. Commercial and non-commercial users can obtain the data in the following ways:


Sl. No.   Data Size          Link/Physical Media   Charges *
1         Up to 2 GB         Web link              No charges
2         2 GB to 110 GB     128 GB Pen Drive      Rs. 2000/- (within India)
3         Exceeding 110 GB   External Hard Disk    Rs. 4500/- (within India)


*  Charges include physical media expenses as well as shipping charges.

*  Shipping charges may be extra for foreign countries and will be communicated after the data request is received.

*  Separate DDs/Cheques should be sent for (a) data charges and (b) physical media and shipping charges.
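
The size thresholds in the table above can be read as a simple lookup. The following is a minimal Python sketch, not an official tool: the function name `delivery_option` is hypothetical, the handling of sizes that fall exactly on 2 GB or 110 GB is an assumption (the table does not say which row applies), and the charges shown are the within-India figures only.

```python
from typing import Tuple


def delivery_option(size_gb: float) -> Tuple[str, int]:
    """Return (delivery medium, charge in Rs. within India) for a data size in GB.

    Assumption: a size of exactly 2 GB is treated as a web link and exactly
    110 GB as a pen drive, since the FAQ table leaves the boundaries open.
    """
    if size_gb <= 0:
        raise ValueError("data size must be positive")
    if size_gb <= 2:
        return ("Web link", 0)
    if size_gb <= 110:
        return ("128 GB Pen Drive", 2000)
    return ("External Hard Disk", 4500)


if __name__ == "__main__":
    # Example sizes are invented for illustration.
    for size in (1.5, 50, 500):
        medium, charge = delivery_option(size)
        print(f"{size} GB -> {medium}, Rs. {charge}")
```

Shipping outside India may cost more, so the returned charge should be treated as a lower bound for foreign requests.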