A comprehensive study of ChatGPT: advancements, limitations, and ethical considerations in natural language processing and cybersecurity
This paper presents an in-depth study of ChatGPT, a state-of-the-art language model that is
revolutionizing generative text. We provide a comprehensive analysis of its architecture …
A survey on multi-modal summarization
The new era of technology has brought us to the point where it is convenient for people to
share their opinions over an abundance of platforms. These platforms have a provision for …
Multimodal learning with transformers: A survey
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
Zero-shot video question answering via frozen bidirectional language models
Video question answering (VideoQA) is a complex task that requires diverse multi-modal
data for training. Manual annotation of question and answers for videos, however, is tedious …
Orca: A distributed serving system for Transformer-based generative models
Large-scale Transformer-based models trained for generation tasks (e.g., GPT-3) have
recently attracted huge interest, emphasizing the need for system support for serving models …
MERLOT Reserve: Neural script knowledge through vision and language and sound
As humans, we navigate a multimodal world, building a holistic understanding from all our
senses. We introduce MERLOT Reserve, a model that represents videos jointly over time …
Language models with image descriptors are strong few-shot video-language learners
The goal of this work is to build flexible video-language models that can generalize to
various video-to-text tasks from few examples. Existing few-shot video-language learners …
Revealing single frame bias for video-and-language learning
Training an effective video-and-language model intuitively requires multiple frames as
model inputs. However, it is unclear whether using multiple frames is beneficial to …
Point-Bind & Point-LLM: Aligning point cloud with multi-modality for 3D understanding, generation, and instruction following
We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image,
language, audio, and video. Guided by ImageBind, we construct a joint embedding space …
Linearly mapping from image to text space
The extent to which text-only language models (LMs) learn to represent features of the non-
linguistic world is an open question. Prior work has shown that pretrained LMs can be taught …