
Choosing Transfer Languages for Cross-Lingual Learning | Data Science by ODS.ai 🦜


Given a particular low-resource language and NLP task, how can we determine which languages we should transfer from?
If we train models on the top K transfer languages suggested by the ranking model and pick the best one, how good is the best model expected to be?

In the era of transfer learning, we no longer have to collect massive datasets for every language: starting from an already pretrained model, we can achieve good scores by training on far less data. But how should we choose the language to transfer knowledge from? Is it okay to transfer from English to Chinese, or from Russian to Turkish?

The paper investigates this question. The features the authors created to identify the best transfer language are the following:

* Dataset Size: as simple as it sounds — do we have enough data in the transfer language relative to the target language?
* Type-Token Ratio: lexical diversity of both languages;
* Word Overlap and Subword Overlap: a measure of similarity between the languages; it helps if they share as many words (or subwords) as possible;
* Geographic distance: do the languages come from territories that are close on the Earth's surface?
* Genetic distance: how close are they in the language genealogical tree?
* Inventory distance: do they sound similar, i.e. how close are their phoneme inventories?
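The first three features are corpus statistics, so they are easy to sketch in code. Below is a minimal illustration (not the paper's implementation) of type-token ratio and word overlap, the latter using one common definition — shared vocabulary divided by the sum of the two vocabulary sizes; all function names are illustrative.

```python
def type_token_ratio(tokens):
    """Lexical diversity: number of unique types divided by total tokens."""
    return len(set(tokens)) / len(tokens)

def word_overlap(tokens_a, tokens_b):
    """Vocabulary similarity: |V_a ∩ V_b| / (|V_a| + |V_b|)."""
    vocab_a, vocab_b = set(tokens_a), set(tokens_b)
    return len(vocab_a & vocab_b) / (len(vocab_a) + len(vocab_b))

# Toy corpora standing in for the transfer and target languages.
transfer = "the cat sat on the mat".split()
target = "the dog sat near a mat".split()

print(round(type_token_ratio(transfer), 2))   # 5 types / 6 tokens
print(round(word_overlap(transfer, target), 2))
```

Subword overlap would be computed the same way after segmenting both corpora with a shared subword model (e.g. BPE); the remaining three features are language-level distances taken from typological databases rather than from the training corpora.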

The idea is pretty simple and clear, but very important for studies of multilingual models.

The post is based on a reading task from the Multilingual NLP course by CMU (linked from the post).