Automatic language identification in texts: A survey

T Jauhiainen, M Lui, M Zampieri, T Baldwin… - Journal of Artificial …, 2019 - jair.org
Language identification ("LI") is the problem of determining the natural language that a
document or part thereof is written in. Automatic LI has been extensively researched for over …

Language identification: The long and the short of the matter

T Baldwin, M Lui - … technologies: The 2010 annual conference of …, 2010 - aclanthology.org
Language identification is the task of identifying the language a given document is
written in. This paper describes a detailed examination of what models perform best under …

Cross-domain feature selection for language identification

M Lui, T Baldwin - … of 5th international joint conference on natural …, 2011 - aclanthology.org
We show that transductive (cross-domain) learning is an important consideration in building
a general-purpose language identification system, and develop a feature selection method …

Automatic detection and language identification of multilingual documents

M Lui, JH Lau, T Baldwin - Transactions of the Association for …, 2014 - direct.mit.edu
Language identification is the task of automatically detecting the language(s)
present in a document based on the content of the document. In this work, we address the …

Accurate language identification of Twitter messages

M Lui, T Baldwin - Proceedings of the 5th workshop on language …, 2014 - aclanthology.org
We present an evaluation of “off-the-shelf” language identification systems as applied to
microblog messages from Twitter. A key challenge is the lack of an adequate corpus of …

Language identification in web pages

B Martins, MJ Silva - Proceedings of the 2005 ACM symposium on …, 2005 - dl.acm.org
This paper discusses the problem of automatically identifying the language of a given Web
document. Previous experiments in language guessing focused on analyzing "coherent" text …

Reconsidering Language Identification for Written Language Resources

B Hughes, T Baldwin, S Bird, J Nicholson… - …, 2006 - minerva-access.unimelb.edu.au
The task of identifying the language in which a given document (ranging from a sentence to
thousands of pages) is written has been relatively well studied over several decades …

Automatic language classification by means of syntactic dependency networks

O Abramov, A Mehler - Journal of Quantitative Linguistics, 2011 - Taylor & Francis
This article presents an approach to automatic language classification by means of linguistic
networks. Networks of 11 languages were constructed from dependency treebanks, and the …

Factors that affect the accuracy of text-based language identification

GR Botha, E Barnard - Computer Speech & Language, 2012 - Elsevier
The classification accuracy of text-based language identification depends on several factors,
including the size of the text fragment to be identified, the amount of training data available …

Language identification based on string kernels

C Kruengkrai, P Srichaivattana… - … , 2005. ISCIT 2005., 2005 - ieeexplore.ieee.org
In this paper, we propose a novel approach for automatically identifying the language of a
given text based on the concept of string kernels. Our approach can identify the language …