Human action recognition from various data modalities: A review

Z Sun, Q Ke, H Rahmani, M Bennamoun… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …

Conversational agents in therapeutic interventions for neurodevelopmental disorders: a survey

F Catania, M Spitale, F Garzotto - ACM Computing Surveys, 2023 - dl.acm.org
Neurodevelopmental Disorders (NDD) are a group of conditions with onset in the
developmental period characterized by deficits in the cognitive and social areas …

Palm-e: An embodied multimodal language model

D Driess, F Xia, MSM Sajjadi, C Lynch… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models excel at a wide range of complex tasks. However, enabling general
inference in the real world, eg, for robotics problems, raises the challenge of grounding. We …

Rt-1: Robotics transformer for real-world control at scale

A Brohan, N Brown, J Carbajal, Y Chebotar… - arXiv preprint arXiv …, 2022 - arxiv.org
By transferring knowledge from large, diverse, task-agnostic datasets, modern machine
learning models can solve specific downstream tasks either zero-shot or with small task …

[HTML][HTML] Rt-2: Vision-language-action models transfer web knowledge to robotic control

B Zitkovich, T Yu, S Xu, P Xu, T Xiao… - … on Robot Learning, 2023 - proceedings.mlr.press
We study how vision-language models trained on Internet-scale data can be incorporated
directly into end-to-end robotic control to boost generalization and enable emergent …

Groupvit: Semantic segmentation emerges from text supervision

J Xu, S De Mello, S Liu, W Byeon… - Proceedings of the …, 2022 - openaccess.thecvf.com
Grouping and recognition are important components of visual scene understanding, eg, for
object detection and semantic segmentation. With end-to-end deep learning systems …

Expanding language-image pretrained models for general video recognition

B Ni, H Peng, M Chen, S Zhang, G Meng, J Fu… - … on Computer Vision, 2022 - Springer
Contrastive language-image pretraining has shown great success in learning visual-textual
joint representation from web-scale data, demonstrating remarkable “zero-shot” …

mplug-2: A modularized multi-modal foundation model across text, image and video

H Xu, Q Ye, M Yan, Y Shi, J Ye, Y Xu… - International …, 2023 - proceedings.mlr.press
Recent years have witnessed a big convergence of language, vision, and multi-modal
pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized …

Masked feature prediction for self-supervised visual pre-training

C Wei, H Fan, S Xie, CY Wu, A Yuille… - Proceedings of the …, 2022 - openaccess.thecvf.com
Abstract We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training
of video models. Our approach first randomly masks out a portion of the input sequence and …

Florence: A new foundation model for computer vision

L Yuan, D Chen, YL Chen, N Codella, X Dai… - arXiv preprint arXiv …, 2021 - arxiv.org
Automated visual understanding of our diverse and open world demands computer vision
models to generalize well with minimal customization for specific tasks, similar to human …