A comprehensive study of ChatGPT: advancements, limitations, and ethical considerations in natural language processing and cybersecurity

M Alawida, S Mejri, A Mehmood, B Chikhaoui… - Information, 2023 - mdpi.com
This paper presents an in-depth study of ChatGPT, a state-of-the-art language model that is
revolutionizing generative text. We provide a comprehensive analysis of its architecture …

A survey on multi-modal summarization

A Jangra, S Mukherjee, A Jatowt, S Saha… - ACM Computing …, 2023 - dl.acm.org
The new era of technology has brought us to the point where it is convenient for people to
share their opinions over an abundance of platforms. These platforms have a provision for …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
The Transformer is a promising neural network learner and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Zero-shot video question answering via frozen bidirectional language models

A Yang, A Miech, J Sivic, I Laptev… - Advances in Neural …, 2022 - proceedings.neurips.cc
Video question answering (VideoQA) is a complex task that requires diverse multi-modal
data for training. Manual annotation of questions and answers for videos, however, is tedious …

Orca: A distributed serving system for Transformer-Based generative models

GI Yu, JS Jeong, GW Kim, S Kim, BG Chun - 16th USENIX Symposium …, 2022 - usenix.org
Large-scale Transformer-based models trained for generation tasks (e.g., GPT-3) have
recently attracted huge interest, emphasizing the need for system support for serving models …

MERLOT Reserve: Neural script knowledge through vision and language and sound

R Zellers, J Lu, X Lu, Y Yu, Y Zhao… - Proceedings of the …, 2022 - openaccess.thecvf.com
As humans, we navigate a multimodal world, building a holistic understanding from all our
senses. We introduce MERLOT Reserve, a model that represents videos jointly over time …

Language models with image descriptors are strong few-shot video-language learners

Z Wang, M Li, R Xu, L Zhou, J Lei… - Advances in …, 2022 - proceedings.neurips.cc
The goal of this work is to build flexible video-language models that can generalize to
various video-to-text tasks from few examples. Existing few-shot video-language learners …

Revealing single frame bias for video-and-language learning

J Lei, TL Berg, M Bansal - arXiv preprint arXiv:2206.03428, 2022 - arxiv.org
Training an effective video-and-language model intuitively requires multiple frames as
model inputs. However, it is unclear whether using multiple frames is beneficial to …

Point-Bind & Point-LLM: Aligning point cloud with multi-modality for 3D understanding, generation, and instruction following

Z Guo, R Zhang, X Zhu, Y Tang, X Ma, J Han… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image,
language, audio, and video. Guided by ImageBind, we construct a joint embedding space …

Linearly mapping from image to text space

J Merullo, L Castricato, C Eickhoff, E Pavlick - arXiv preprint arXiv …, 2022 - arxiv.org
The extent to which text-only language models (LMs) learn to represent features of the
non-linguistic world is an open question. Prior work has shown that pretrained LMs can be taught …