Towards Environmental Preference Based Speech Enhancement For Individualised Multi-Modal Hearing Aids
J Kirton-Wingate, S Ahmed, A Hussain… - arXiv preprint arXiv …, 2024 - arxiv.org
arXiv preprint arXiv:2402.16757, 2024
Since the advent of Deep Learning (DL), Speech Enhancement (SE) models have performed well under a variety of noise conditions. However, such systems may still introduce sonic artefacts, sound unnatural, and restrict the user's ability to hear ambient sounds that may be important. Hearing Aid (HA) users may wish to customise their SE systems to suit their personal preferences and day-to-day lifestyle. In this paper, we introduce a preference learning based SE (PLSE) model for future multi-modal HAs that can contextually exploit audio information to improve listening comfort, based on the preferences of the user. The proposed system estimates the signal-to-noise ratio (SNR) as a basic objective speech quality measure, which quantifies the relative amount of background noise present in speech and correlates directly with the intelligibility of the signal. Additionally, to provide contextual information, we predict the acoustic scene in which the user is situated. Both tasks are performed by a multi-task DL model which, by jointly leveraging a shared encoded feature space, outperforms models that infer the acoustic scene or the SNR separately. These environmental inferences are exploited in a preference elicitation framework, which learns a set of linear predictive functions to determine the target SNR of an audio-visual (AV) SE system. By greatly reducing noise in challenging listening conditions, and by scaling the output of the SE model in a novel way, we are able to provide HA users with contextually individualised SE. Preliminary results suggest an improvement over the non-individualised baseline model for some participants.
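As a rough illustration of the quantities the abstract mentions, the sketch below computes an SNR in decibels, maps an estimated SNR and acoustic scene to a target SNR with a linear function, and remixes an enhanced speech estimate with residual noise so that the output reaches that target. The abstract does not describe how the scaling is done or what features the preference functions use, so everything here is an assumption made for illustration: the function names (`snr_db`, `predict_target_snr`, `remix_to_target_snr`), the feature layout (estimated SNR followed by a one-hot scene vector), the weights, and the idea of reintroducing attenuated noise are all hypothetical and are not taken from the paper.

```python
import numpy as np


def snr_db(speech, noise):
    """SNR in dB: 10 * log10(mean speech power / mean noise power)."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    return 10.0 * np.log10(p_speech / p_noise)


def predict_target_snr(est_snr_db, scene_index, weights, bias):
    """Hypothetical linear preference function (not the paper's model).

    Features are [estimated SNR, one-hot acoustic scene]; `weights` therefore
    has 1 + number_of_scenes entries.
    """
    scene = np.zeros(len(weights) - 1)
    scene[scene_index] = 1.0
    features = np.concatenate(([est_snr_db], scene))
    return float(features @ weights + bias)


def remix_to_target_snr(speech, noise, target_snr_db):
    """Scale the noise so that speech + gain * noise has the target SNR.

    Scaling noise amplitude by `gain` changes the SNR by -20 * log10(gain) dB.
    """
    gain = 10.0 ** ((snr_db(speech, noise) - target_snr_db) / 20.0)
    return speech + gain * noise, gain


# Synthetic signals standing in for an SE model's speech and noise estimates.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
speech = np.sin(2.0 * np.pi * 220.0 * t)
noise = rng.normal(scale=0.5, size=t.shape)

# Illustrative weights only: one SNR coefficient plus three scene offsets.
weights = np.array([0.5, 2.0, -1.0, 0.0])
target = predict_target_snr(est_snr_db=4.0, scene_index=1, weights=weights, bias=3.0)

output, gain = remix_to_target_snr(speech, noise, target)
print(f"target SNR: {target:.1f} dB, noise gain: {gain:.3f}")
```

In this reading, a user who prefers more ambient sound in a given scene would simply end up with a lower target SNR, and hence a larger noise gain, without the SE model itself being retrained.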