IDEA Lab

Automatically Filtering Irrelevant Words for Applications in Language Acquisition.

Gihad Sohsah, Emrah Akkurt, Ilkin Safarli, Muhammed Unal, Onur Guzey

Published in:

IEEE International Conference on Machine Learning and Applications (ICMLA), 2014.

Abstract

Building one's vocabulary in a language is an important component of language acquisition. Children learn their native language by being immersed in the language that is used in their environment. In second language acquisition, however, learners are often exposed to vocabulary that others have selected specifically to aid language acquisition, for example in textbooks and word lists. In this paper, we present a machine-learning-based method for automatically selecting words that are relevant to the language acquisition task. Word relevancy is determined using data collected for this purpose from 30 practicing English as a Second Language teachers. We demonstrate the viability of this approach using words from two major corpora, although in practice any corpus, such as the Google Books corpus, can be utilized.
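As a rough illustration of the idea in this abstract, the sketch below turns teacher judgments into relevance labels by majority vote and then fits a single log-frequency cut-off that is applied to words no teacher has rated. It is not the method from the paper: the words, votes, frequency counts, function names and the choice of a one-feature threshold classifier are all assumptions made for the example.

```python
import math

# Hypothetical teacher votes: word -> list of 1 (relevant) / 0 (irrelevant).
VOTES = {
    "house": [1, 1, 1],
    "water": [1, 1, 0],
    "the": [1, 1, 1],
    "xylem": [0, 0, 1],
    "hereinafter": [0, 0, 0],
}

# Hypothetical occurrences per million tokens (illustrative, not real corpus data).
FREQ_PER_MILLION = {
    "house": 500.0, "water": 400.0, "the": 50000.0,
    "xylem": 0.5, "hereinafter": 0.8,
    "garden": 60.0, "phloem": 0.3,
}


def majority_label(word_votes):
    """Return 1 if more than half of the votes mark the word as relevant."""
    return 1 if sum(word_votes) * 2 > len(word_votes) else 0


def log_freq(word):
    """Single feature used here: log10 of the corpus frequency."""
    return math.log10(FREQ_PER_MILLION[word])


def fit_threshold(features, labels):
    """Choose the cut (midpoint between neighbouring values) with fewest errors."""
    xs = sorted(features)
    cuts = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best_cut, best_errors = cuts[0], len(labels) + 1
    for cut in cuts:
        # A word is predicted relevant when its feature is at or above the cut.
        errors = sum((x >= cut) != bool(y) for x, y in zip(features, labels))
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut


labeled_words = list(VOTES)
labels = [majority_label(VOTES[w]) for w in labeled_words]
cut = fit_threshold([log_freq(w) for w in labeled_words], labels)


def is_relevant(word):
    """Apply the learned cut to any word that has a corpus frequency."""
    return 1 if log_freq(word) >= cut else 0


print(is_relevant("garden"), is_relevant("phloem"))
```

A real system would use more features and a stronger classifier; the point here is only the flow from expert votes to labels to predictions for unrated words.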

Classification of word levels with usage frequency, expert opinions and machine learning

Gihad Sohsah, Muhammed Unal, Onur Guzey

Published in:

British Journal of Educational Technology (BJET), 2015.

Abstract

Educational applications for language teaching can utilize the language levels of words to target proficiency levels of students. This paper and the accompanying data provide a methodology for making educational standard-aligned language-level predictions for all English words. The methodology involves collecting expert opinions on language levels and extending these opinions to other words using machine learning and data from a large corpus. Common European Framework of Reference for Languages (CEFR) level predictions for about 50 000 words, which can be readily used in educational applications, are also provided. For applications where the cost of misclassification varies, machine learning model parameters and algorithm selection must be adjusted. To support this adjustment, a large number of expert opinions, collected in a survey of 30 practicing language teachers, are also released. The overall methodology can be applied to low-resource languages, where CEFR-level classifications may not exist, by adding a comparable survey and corpus. The data are released with a Creative Commons Attribution license to enable free mixing, sharing and even use in commercial applications.
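The remark about varying misclassification costs can be made concrete with a small sketch. Given a model's predicted probabilities over CEFR levels for one word, the code below compares the most probable level with the level that minimizes expected cost under an asymmetric cost function. The probabilities, the cost function and the `too_easy_penalty` parameter are hypothetical and chosen only for illustration; the paper's own models and costs are not reproduced here.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def cost(true_idx, pred_idx, too_easy_penalty=2.0):
    """Hypothetical cost of predicting pred_idx when true_idx is correct.

    Labelling a word as easier than it really is costs `too_easy_penalty`
    per level; labelling it as harder costs 1 per level.
    """
    if pred_idx < true_idx:
        return (true_idx - pred_idx) * too_easy_penalty
    return float(pred_idx - true_idx)


def expected_cost(probs, pred_idx, too_easy_penalty=2.0):
    """Average cost of predicting pred_idx under the given level probabilities."""
    return sum(p * cost(t, pred_idx, too_easy_penalty) for t, p in enumerate(probs))


def most_likely_level(probs):
    """Plain argmax decision, ignoring costs."""
    return LEVELS[max(range(len(LEVELS)), key=lambda i: probs[i])]


def min_cost_level(probs, too_easy_penalty=2.0):
    """Decision that minimizes expected cost instead of maximizing probability."""
    best = min(range(len(LEVELS)),
               key=lambda i: expected_cost(probs, i, too_easy_penalty))
    return LEVELS[best]


# Hypothetical classifier output for a single word (sums to 1).
word_probs = [0.05, 0.40, 0.35, 0.15, 0.05, 0.00]

print(most_likely_level(word_probs))  # argmax decision
print(min_cost_level(word_probs))     # cost-sensitive decision
```

With these made-up numbers the two decisions differ: the most probable level is A2, while the cost-sensitive choice moves up to B1 because underestimating a word's difficulty is penalized more heavily.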