Learning linguistic features from natural text data by independent component analysis

J Väyrynen - 2005 - aaltodoc.aalto.fi
Abstract
The analysis of natural language is an important field for language technology. The symbolic nature of written language can be encoded in numeric form and analyzed using statistical signal processing methods. In this thesis, it is assumed that word usage statistics, namely word frequencies in different contexts, contain linguistic information that can be extracted using statistical feature extraction methods. Independent component analysis, an unsupervised statistical method for blind source separation, is applied to extract features for words from a text corpus. A study of how closely the emergent features match traditional syntactic word categories shows that independent component analysis extracts features that resemble linguistic categories more closely than features extracted with principal component analysis.
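
The numeric encoding mentioned in the abstract, word frequencies in different contexts, can be illustrated with a minimal sketch. The abstract does not say how contexts are defined, so the symmetric window of neighbouring words, the `window` parameter, and the function names `context_counts` and `to_matrix` are all assumptions made for illustration, not the thesis's actual preprocessing.

```python
from collections import Counter


def context_counts(tokens, window=1):
    """For each word, count the words occurring within `window` positions of it.

    Assumption: a context is a symmetric window of neighbouring words.
    """
    counts = {}
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        ctx = counts.setdefault(word, Counter())
        for j in range(lo, hi):
            if j != i:
                ctx[tokens[j]] += 1
    return counts


def to_matrix(counts):
    """Turn the nested counts into a dense word-by-context matrix (list of rows)."""
    vocab = sorted(counts)
    # Counter returns 0 for context words that were never seen with a word.
    matrix = [[counts[w][c] for c in vocab] for w in vocab]
    return vocab, matrix


# Toy corpus, purely illustrative.
tokens = "the cat sat on the mat".split()
vocab, matrix = to_matrix(context_counts(tokens, window=1))
for word, row in zip(vocab, matrix):
    print(f"{word:>4} {row}")
```

A matrix of this kind, built from a real corpus, is the sort of input to which a feature extraction method such as independent component analysis or principal component analysis could then be applied.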