Expanding n-gram training data for language models based on morpho-syntactic transformations

L Verwimp, J Pelemans, H van Hamme… - Computational Linguistics in the Netherlands Journal, 2015 - clinjournal.org
Abstract
The subject of this paper is the expansion of n-gram training data with the aid of morphosyntactic transformations, in order to create a larger amount of reliable n-grams for Dutch language models. The main aim of this technique is to alleviate a classical problem for language models: data sparsity. Moreover, since language models for automatic speech recognition are usually trained on written language resources while they are tested on spoken language, certain patterns that are typical for spontaneous spoken language will be under-represented and patterns characteristic of written language will be over-represented. By adding transformed n-grams, we hope to adapt the language model such that it better matches spoken language. We investigate whether a language model trained on the expanded data performs better than a baseline n-gram model with modified Kneser-Ney smoothing in terms of perplexity and word error rate. Several alternatives for the probability estimation of the transformed n-grams are explored, and an approach to deal with separable verbs in Dutch is also discussed.
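The core idea of the abstract can be illustrated with a minimal sketch: apply a transformation rule to observed n-grams and add the transformed variants to the counts with a reduced weight. The transformation rule, the example n-grams, and the fixed down-weighting factor below are all hypothetical simplifications; the paper explores several alternatives for estimating the probability of transformed n-grams, none of which is reproduced here.

```python
from collections import Counter

def expand_ngrams(ngram_counts, transform, weight=0.5):
    """Add a transformed variant of each n-gram with a down-weighted count.

    ngram_counts : Counter mapping n-gram tuples to counts
    transform    : function mapping an n-gram to a transformed n-gram,
                   or None if the rule does not apply
    weight       : fraction of the original count assigned to the new
                   n-gram (one simple choice among many possible schemes)
    """
    expanded = Counter(ngram_counts)
    for ngram, count in ngram_counts.items():
        new = transform(ngram)
        if new is not None and new not in ngram_counts:
            expanded[new] += count * weight
    return expanded

# Hypothetical rule loosely inspired by Dutch separable verbs: in spoken
# word order the particle can split off from the verb stem, e.g. the
# infinitive "opbellen" surfacing as "bel ... op". This toy rule only
# handles one hard-coded trigram and is purely illustrative.
def split_separable_verb(ngram):
    if ngram == ("ik", "wil", "opbellen"):
        return ("ik", "bel", "op")
    return None

counts = Counter({("ik", "wil", "opbellen"): 4, ("ik", "zie", "hem"): 2})
expanded = expand_ngrams(counts, split_separable_verb)
# The transformed trigram now appears with half the original count.
```

The design choice sketched here, giving transformed n-grams a fraction of the source n-gram's count before smoothing, is only one way to fold the expanded data into estimation; the paper compares several such strategies against a modified Kneser-Ney baseline.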