Shallow Text Analysis and Machine Learning for Authorship Attribution
K Luyckx, W Daelemans - CLIN, 2004 - cnts.ua.ac.be
Abstract
Current advances in shallow parsing and machine learning allow us to use results from these fields in a methodology for Authorship Attribution. We report on experiments with a corpus that consists of newspaper articles about national current affairs by different journalists from the Belgian newspaper De Standaard. Because the documents are in a similar genre, register, and range of topics, token-based (e.g., sentence length) and lexical features (e.g., vocabulary richness) can be kept roughly constant over the different authors. This allows us to focus on the use of syntax-based features as possible predictors of an author’s style, as well as on those token-based features that are more predictive of author style than of topic or register. These style characteristics are not under the author’s conscious control and are therefore good clues for Authorship Attribution. Machine Learning methods (TiMBL and the WEKA software package) are used to select informative combinations of syntactic, token-based and lexical features and to predict authorship of unseen documents. The combination of these features can be considered an implicit profile that characterizes the style of an author.
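To make the two feature families named in the abstract concrete, the sketch below computes one token-based feature (mean sentence length) and one lexical feature (type-token ratio, a common proxy for vocabulary richness). It is only an illustration under stated assumptions: the function name `style_features`, the regex-based sentence splitting and tokenization, and the choice of type-token ratio are all hypothetical and are not taken from the paper, which does not specify its exact feature definitions here.

```python
import re


def style_features(text: str) -> dict:
    """Compute two simple style features from raw text (illustrative only)."""
    # Assumption: sentences end in ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumption: a token is a run of letters/digits, optionally with one
    # internal apostrophe; case is folded so "The" and "the" are one type.
    tokens = re.findall(r"[^\W_]+(?:'[^\W_]+)?", text.lower())
    if not sentences or not tokens:
        return {"mean_sentence_length": 0.0, "type_token_ratio": 0.0}
    return {
        # Token-based feature: average number of tokens per sentence.
        "mean_sentence_length": len(tokens) / len(sentences),
        # Lexical feature: distinct tokens divided by total tokens.
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }


if __name__ == "__main__":
    print(style_features("The cat sat. The dog sat down."))
```

Feature values like these would form part of a per-document vector passed to a classifier. Note that type-token ratio depends on text length, so comparing documents of very different sizes would call for a length-normalized measure.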