Automatic language identification in texts: A survey
Language identification ("LI") is the problem of determining the natural language that a
document or part thereof is written in. Automatic LI has been extensively researched for over …
Language identification: The long and the short of the matter
Abstract Language identification is the task of identifying the language a given document is
written in. This paper describes a detailed examination of what models perform best under …
Cross-domain feature selection for language identification
We show that transductive (cross-domain) learning is an important consideration in building
a general-purpose language identification system, and develop a feature selection method …
Automatic detection and language identification of multilingual documents
Abstract Language identification is the task of automatically detecting the language(s)
present in a document based on the content of the document. In this work, we address the …
Accurate language identification of twitter messages
We present an evaluation of “off-the-shelf” language identification systems as applied to
microblog messages from Twitter. A key challenge is the lack of an adequate corpus of …
Language identification in web pages
This paper discusses the problem of automatically identifying the language of a given Web
document. Previous experiments in language guessing focused on analyzing "coherent" text …
Reconsidering Language Identification for Written Language Resources
The task of identifying the language in which a given document (ranging from a sentence to
thousands of pages) is written has been relatively well studied over several decades …
Automatic language classification by means of syntactic dependency networks
This article presents an approach to automatic language classification by means of linguistic
networks. Networks of 11 languages were constructed from dependency treebanks, and the …
Factors that affect the accuracy of text-based language identification
The classification accuracy of text-based language identification depends on several factors,
including the size of the text fragment to be identified, the amount of training data available …
Language identification based on string kernels
C Kruengkrai, P Srichaivattana… - … , 2005. ISCIT 2005., 2005 - ieeexplore.ieee.org
In this paper, we propose a novel approach for automatically identifying the language of a
given text based on the concept of string kernels. Our approach can identify the language …