Using transformers for multimodal emotion recognition: Taxonomies and state of the art review

S Hazmoune, F Bougamouza - Engineering Applications of Artificial …, 2024 - Elsevier
Emotion recognition is an aspect of human-computer interaction, affective computing, and
social robotics. Conventional unimodal approaches for emotion recognition, depending on …

Reproducing whisper-style training using an open-source toolkit and publicly available data

Y Peng, J Tian, B Yan, D Berrebbi… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
Pre-training speech models on large volumes of data has achieved remarkable success.
OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised …

OWSM v3.1: Better and faster open whisper-style speech models based on e-branchformer

Y Peng, J Tian, W Chen, S Arora, B Yan, Y Sudo… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent studies have advocated for fully open foundation models to promote transparency
and open science. As an initial step, the Open Whisper-style Speech Model (OWSM) …

COLLD: Contrastive Layer-to-Layer Distillation for Compressing Multilingual Pre-Trained Speech Encoders

HJ Chang, N Dong, R Mavlyutov… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Large-scale self-supervised pre-trained speech encoders outperform conventional
approaches in speech recognition and translation tasks. Due to the high cost of developing …

Lookahead when it matters: Adaptive non-causal transformers for streaming neural transducers

G Strimel, Y Xie, BJ King, M Radfar… - International …, 2023 - proceedings.mlr.press
Streaming speech recognition architectures are employed for low-latency, real-time
applications. Such architectures are often characterized by their causality. Causal …

Model and Method for Providing Resilience to Resource-Constrained AI-System

V Moskalenko, V Kharchenko, S Semenov - Sensors, 2024 - mdpi.com
Artificial intelligence technologies are becoming increasingly prevalent in resource-
constrained, safety-critical embedded systems. Numerous methods exist to enhance the …

Speech Recognition Transformers: Topological-lingualism Perspective

S Singh, M Singh, V Kadyan - arXiv preprint arXiv:2408.14991, 2024 - arxiv.org
Transformers have evolved with great success in various artificial intelligence tasks. Thanks
to our recent prevalence of self-attention mechanisms, which capture long-term …

Improving vision-inspired keyword spotting using dynamic module skipping in streaming conformer encoder

A Bittar, P Dixon, M Samragh, K Nishu… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Using a vision-inspired keyword spotting framework, we propose an architecture with input-
dependent dynamic depth capable of processing streaming audio. Specifically, we extend a …

CTC Blank Triggered Dynamic Layer-Skipping for Efficient CTC-Based Speech Recognition

J Hou, P Wang, J Zhang, M Yang… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
Deploying end-to-end speech recognition models with limited computing resources remains
challenging, despite their impressive performance. Given the gradual increase in model size …