A comprehensive study of ChatGPT: advancements, limitations, and ethical considerations in natural language processing and cybersecurity

M Alawida, S Mejri, A Mehmood, B Chikhaoui… - Information, 2023 - mdpi.com
This paper presents an in-depth study of ChatGPT, a state-of-the-art language model that is
revolutionizing generative text. We provide a comprehensive analysis of its architecture …

A survey on multi-modal summarization

A Jangra, S Mukherjee, A Jatowt, S Saha… - ACM Computing …, 2023 - dl.acm.org
The new era of technology has brought us to the point where it is convenient for people to
share their opinions over an abundance of platforms. These platforms have a provision for …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
The Transformer is a promising neural network learner and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Zero-shot video question answering via frozen bidirectional language models

A Yang, A Miech, J Sivic, I Laptev… - Advances in Neural …, 2022 - proceedings.neurips.cc
Video question answering (VideoQA) is a complex task that requires diverse multi-modal
data for training. Manual annotation of questions and answers for videos, however, is tedious …

Orca: A distributed serving system for Transformer-Based generative models

GI Yu, JS Jeong, GW Kim, S Kim, BG Chun - 16th USENIX Symposium …, 2022 - usenix.org
Large-scale Transformer-based models trained for generation tasks (e.g., GPT-3) have
recently attracted huge interest, emphasizing the need for system support for serving models …

MERLOT Reserve: Neural script knowledge through vision and language and sound

R Zellers, J Lu, X Lu, Y Yu, Y Zhao… - Proceedings of the …, 2022 - openaccess.thecvf.com
As humans, we navigate a multimodal world, building a holistic understanding from all our
senses. We introduce MERLOT Reserve, a model that represents videos jointly over time …

Language models with image descriptors are strong few-shot video-language learners

Z Wang, M Li, R Xu, L Zhou, J Lei… - Advances in …, 2022 - proceedings.neurips.cc
The goal of this work is to build flexible video-language models that can generalize to
various video-to-text tasks from few examples. Existing few-shot video-language learners …

Revealing single frame bias for video-and-language learning

J Lei, TL Berg, M Bansal - arXiv preprint arXiv:2206.03428, 2022 - arxiv.org
Training an effective video-and-language model intuitively requires multiple frames as
model inputs. However, it is unclear whether using multiple frames is beneficial to …

Point-Bind & Point-LLM: Aligning point cloud with multi-modality for 3D understanding, generation, and instruction following

Z Guo, R Zhang, X Zhu, Y Tang, X Ma, J Han… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image,
language, audio, and video. Guided by ImageBind, we construct a joint embedding space …

Linearly mapping from image to text space

J Merullo, L Castricato, C Eickhoff, E Pavlick - arXiv preprint arXiv …, 2022 - arxiv.org
The extent to which text-only language models (LMs) learn to represent features of the
non-linguistic world is an open question. Prior work has shown that pretrained LMs can be taught …