One small step for generative AI, one giant leap for AGI: A complete survey on ChatGPT in AIGC era
OpenAI has recently released GPT-4 (aka ChatGPT plus), which is demonstrated to be one
small step for generative AI (GAI), but one giant leap for artificial general intelligence (AGI) …
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
LLaMA-Adapter V2: Parameter-efficient visual instruction model
How to efficiently transform large language models (LLMs) into instruction followers is
recently a popular research direction, while training LLM for multi-modal reasoning remains …
Multimodal foundation models: From specialists to general-purpose assistants
Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …
Frozen in time: A joint video and image encoder for end-to-end retrieval
Our objective in this work is video-text retrieval, in particular a joint embedding that enables
efficient text-to-video retrieval. The challenges in this area include the design of the visual …
HallusionBench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models
We introduce "HallusionBench," a comprehensive benchmark designed for the evaluation of
image-context reasoning. This benchmark presents significant challenges to advanced large …
Large language models are visual reasoning coordinators
Visual reasoning requires multimodal perception and commonsense cognition of the world.
Recently, multiple vision-language models (VLMs) have been proposed with excellent …
Meshed-memory transformer for image captioning
Transformer-based architectures represent the state of the art in sequence modeling tasks
like machine translation and language understanding. Their applicability to multi-modal …
RSTNet: Captioning with adaptive attention on visual and non-visual words
Recent progress on visual question answering has explored the merits of grid features for
vision language tasks. Meanwhile, transformer-based models have shown remarkable …
Eigen-CAM: Class activation map using principal components
MB Muhammad, M Yeasin - 2020 international joint conference …, 2020 - ieeexplore.ieee.org
Deep neural networks are ubiquitous due to the ease of developing models and their
influence on other domains. At the heart of this progress is convolutional neural networks …