Human action recognition from various data modalities: A review
Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …
Conversational agents in therapeutic interventions for neurodevelopmental disorders: a survey
Neurodevelopmental Disorders (NDD) are a group of conditions with onset in the
developmental period characterized by deficits in the cognitive and social areas …
PaLM-E: An embodied multimodal language model
Large language models excel at a wide range of complex tasks. However, enabling general
inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We …
RT-1: Robotics transformer for real-world control at scale
A Brohan, N Brown, J Carbajal, Y Chebotar… - arXiv preprint arXiv …, 2022 - arxiv.org
By transferring knowledge from large, diverse, task-agnostic datasets, modern machine
learning models can solve specific downstream tasks either zero-shot or with small task …
RT-2: Vision-language-action models transfer web knowledge to robotic control
We study how vision-language models trained on Internet-scale data can be incorporated
directly into end-to-end robotic control to boost generalization and enable emergent …
GroupViT: Semantic segmentation emerges from text supervision
Grouping and recognition are important components of visual scene understanding, e.g., for
object detection and semantic segmentation. With end-to-end deep learning systems …
Expanding language-image pretrained models for general video recognition
Contrastive language-image pretraining has shown great success in learning visual-textual
joint representation from web-scale data, demonstrating remarkable “zero-shot” …
mPLUG-2: A modularized multi-modal foundation model across text, image and video
Recent years have witnessed a big convergence of language, vision, and multi-modal
pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized …
Masked feature prediction for self-supervised visual pre-training
We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training
of video models. Our approach first randomly masks out a portion of the input sequence and …
Florence: A new foundation model for computer vision
Automated visual understanding of our diverse and open world demands computer vision
models to generalize well with minimal customization for specific tasks, similar to human …