Towards practical and efficient image-to-speech captioning with vision-language pre-training and multi-modal tokens

M Kim, J Choi, S Maiti, JH Yeo… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
In this paper, we propose methods to build a powerful and efficient Image-to-Speech
captioning (Im2Sp) model. To this end, we start with importing the rich knowledge related to …

Multilingual visual speech recognition with a single model by learning with discrete visual speech units

M Kim, JH Yeo, J Choi, SJ Park, YM Ro - arXiv preprint arXiv:2401.09802, 2024 - arxiv.org
This paper explores sentence-level Multilingual Visual Speech Recognition with a single
model for the first time. As the massive multilingual modeling of visual data requires huge …

TMT: Tri-modal translation between speech, image, and text by processing different modalities as different languages

M Kim, J Jung, H Rha, S Maiti, S Arora, X Chang… - arXiv preprint arXiv …, 2024 - arxiv.org
The capability to jointly process multi-modal information is becoming an essential task.
However, the limited number of paired multi-modal data and the large computational …

[HTML] Integrating IoT and visual question answering in smart cities: Enhancing educational outcomes

T Gao, G Wang - Alexandria Engineering Journal, 2024 - Elsevier
Emerging as a paradigmatic shift in urban development, smart cities harness the potential of
advanced information and communication technologies to seamlessly integrate urban …

Forging Tokens for Improved Storage-efficient Training

M Lee, S Park, B Heo, D Han, H Shim - arXiv preprint arXiv:2312.10105, 2023 - arxiv.org
Recent advancements in Deep Neural Network (DNN) models have significantly improved
performance across computer vision tasks. However, achieving highly generalizable and …

Machine Perceptual Quality: Evaluating the Impact of Severe Lossy Compression on Audio and Image Models

D Jacobellis, D Cummings, NJ Yadwadkar - arXiv preprint arXiv …, 2024 - arxiv.org
In the field of neural data compression, the prevailing focus has been on optimizing
algorithms for either classical distortion metrics, such as PSNR or SSIM, or human …

[PDF] Reducing Annotation and Computation Costs for Efficient Compressed Video Action Recognition

寺尾颯人 - 2024 - eprints.lib.hokudai.ac.jp
As described in Chapter 2, deep networks have shown remarkable progress in video
classification [Hara et al., 2017, Tran et al., 2018, Feichtenhofer, 2020, Feichtenhofer et al …

[PDF] SeiT++: Masked Token Modeling Improves Storage-efficient Training

M Lee, S Park, B Heo, D Han, H Shim - ecva.net
Recent advancements in Deep Neural Network (DNN) models have significantly improved
performance across computer vision tasks. However, achieving highly generalizable and …

[PDF] Research Statement: Scalable and Reliable Machine Learning with Language-guided Representation Learning

S Chun - sanghyukchun.github.io
Ensuring the real-world applicability of machine learning (ML) models poses a primary
challenge, namely, the ability to generalize effectively to unseen scenarios encountered …

[PDF] SeiT++: Masked Token Modeling Improves Storage-efficient Training (Supplementary Material)

MTM MAGE - ecva.net
To demonstrate the effectiveness of our token augmentation strategies, we explore another
token-based learning approach, MAGE [7]. MAGE introduced a unified training framework …