MM-LLMs: Recent advances in multimodal large language models
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …
Scaffolding coordinates to promote vision-language coordination in large multi-modal models
State-of-the-art Large Multi-Modal Models (LMMs) have demonstrated exceptional
capabilities in vision-language tasks. Despite their advanced functionalities, the …
Dual-View Visual Contextualization for Web Navigation
Automatic web navigation aims to build a web agent that can follow language instructions to
execute complex and diverse tasks on real-world websites. Existing work primarily takes …
Towards general computer control: A multimodal agent for Red Dead Redemption II as a case study
Despite the success in specific tasks and scenarios, existing foundation agents, empowered
by large models (LMs) and advanced tools, still cannot generalize to different scenarios …
VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?
Multimodal Large Language models (MLLMs) have shown promise in web-related tasks, but
evaluating their performance in the web domain remains a challenge due to the lack of …
Large Language Model-based Human-Agent Collaboration for Complex Task Solving
In recent developments within the research community, the integration of Large Language
Models (LLMs) in creating fully autonomous agents has garnered significant interest …
AndroidWorld: A dynamic benchmarking environment for autonomous agents
C Rawles, S Clinckemaillie, Y Chang, J Waltz… - arXiv preprint arXiv …, 2024 - arxiv.org
Autonomous agents that execute human tasks by controlling computers can enhance
human productivity and application accessibility. Yet, progress in this field will be driven by …
Automating the Enterprise with Foundation Models
M Wornow, A Narayan, K Opsahl-Ong… - arXiv preprint arXiv …, 2024 - arxiv.org
Automating enterprise workflows could unlock $4 trillion/year in productivity gains. Despite
being of interest to the data management community for decades, the ultimate vision of end …
MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning
While multi-modal large language models (MLLMs) have shown significant progress on
many popular visual reasoning benchmarks, whether they possess abstract visual …
Do Multimodal Foundation Models Understand Enterprise Workflows? A Benchmark for Business Process Management Tasks
Existing ML benchmarks lack the depth and diversity of annotations needed for evaluating
models on business process management (BPM) tasks. BPM is the practice of documenting …