Towards practical and efficient image-to-speech captioning with vision-language pre-training and multi-modal tokens
In this paper, we propose methods to build a powerful and efficient Image-to-Speech
captioning (Im2Sp) model. To this end, we start with importing the rich knowledge related to …
Multilingual visual speech recognition with a single model by learning with discrete visual speech units
This paper explores sentence-level Multilingual Visual Speech Recognition with a single
model for the first time. As the massive multilingual modeling of visual data requires huge …
TMT: Tri-modal translation between speech, image, and text by processing different modalities as different languages
The capability to jointly process multi-modal information is becoming an essential task.
However, the limited number of paired multi-modal data and the large computational …
Integrating IoT and visual question answering in smart cities: Enhancing educational outcomes
T Gao, G Wang - Alexandria Engineering Journal, 2024 - Elsevier
Emerging as a paradigmatic shift in urban development, smart cities harness the potential of
advanced information and communication technologies to seamlessly integrate urban …
Forging Tokens for Improved Storage-efficient Training
Recent advancements in Deep Neural Network (DNN) models have significantly improved
performance across computer vision tasks. However, achieving highly generalizable and …
Machine Perceptual Quality: Evaluating the Impact of Severe Lossy Compression on Audio and Image Models
In the field of neural data compression, the prevailing focus has been on optimizing
algorithms for either classical distortion metrics, such as PSNR or SSIM, or human …
Reducing Annotation and Computation Costs for Efficient Compressed Video Action Recognition
寺尾颯人 - 2024 - eprints.lib.hokudai.ac.jp
As described in Chapter 2, deep networks have shown remarkable progress in video
classification [Hara et al., 2017, Tran et al., 2018, Feichtenhofer, 2020, Feichtenhofer et al …
Research Statement: Scalable and Reliable Machine Learning with Language-guided Representation Learning
S Chun - sanghyukchun.github.io
Ensuring the real-world applicability of machine learning (ML) models poses a primary
challenge, namely, the ability to generalize effectively to unseen scenarios encountered …
SeiT++: Masked Token Modeling Improves Storage-efficient Training (Supplementary Material)
MTM MAGE - ecva.net
To demonstrate the effectiveness of our token augmentation strategies, we explore another
token-based learning approach, MAGE [7]. MAGE introduced a unified training framework …