A systematic review on affective computing: Emotion models, databases, and recent advances
Affective computing conjoins the research topics of emotion recognition and sentiment
analysis, and can be realized with unimodal or multimodal data, consisting primarily of …
A review of multimodal human activity recognition with special emphasis on classification, applications, challenges and future directions
Human activity recognition (HAR) is one of the most important and challenging problems in
computer vision. It has critical applications in a wide variety of tasks, including gaming …
VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training
Pre-training video transformers on extra large-scale datasets is generally required to
achieve premier performance on relatively small datasets. In this paper, we show that video …
ST-Adapter: Parameter-efficient image-to-video transfer learning
Capitalizing on large pre-trained models for various downstream tasks of interest has
recently emerged with promising performance. Due to the ever-growing model size, the …
SimAM: A simple, parameter-free attention module for convolutional neural networks
In this paper, we propose a conceptually simple but very effective attention module for
Convolutional Neural Networks (ConvNets). In contrast to existing channel-wise and spatial …
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …
All in one: Exploring unified video-language pre-training
Mainstream Video-Language Pre-training models consist of three parts, a video
encoder, a text encoder, and a video-text fusion Transformer. They pursue better …
SwinBERT: End-to-end transformers with sparse attention for video captioning
The canonical approach to video captioning dictates a caption generation model to learn
from offline-extracted dense video features. These feature extractors usually operate on …
Frozen in time: A joint video and image encoder for end-to-end retrieval
Our objective in this work is video-text retrieval, in particular a joint embedding that enables
efficient text-to-video retrieval. The challenges in this area include the design of the visual …
ActionCLIP: A new paradigm for video action recognition
The canonical approach to video action recognition dictates a neural model to do a classic
and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined …