MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

C Fu, YF Zhang, S Yin, B Li, X Fang, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …

Foundations and recent trends in multimodal mobile agents: A survey

B Wu, Y Li, M Fang, Z Song, Z Zhang, Y Wei… - arXiv preprint arXiv …, 2024 - arxiv.org
Mobile agents are essential for automating tasks in complex and dynamic mobile
environments. As foundation models evolve, the demands for agents that can adapt in real …

Navigating the digital world as humans do: Universal visual grounding for GUI agents

B Gou, R Wang, B Zheng, Y Xie, C Chang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) are transforming the capabilities of graphical
user interface (GUI) agents, facilitating their transition from controlled simulations to …

Ferret-UI 2: Mastering universal user interface understanding across platforms

Z Li, K You, H Zhang, D Feng, H Agrawal, X Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building a generalist model for user interface (UI) understanding is challenging due to
various foundational issues, such as platform diversity, resolution variation, and data …

Windows Agent Arena: Evaluating multi-modal OS agents at scale

R Bonatti, D Zhao, F Bonacci, D Dupont… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) show remarkable potential to act as computer agents,
enhancing human productivity and software accessibility in multi-modal tasks that require …

TinyAgent: Function calling at the edge

LE Erdogan, N Lee, S Jha, S Kim, R Tabrizi… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent large language models (LLMs) have enabled the development of advanced agentic
systems that can integrate various tools and APIs to fulfill user queries through function …

ShowUI: One vision-language-action model for generalist GUI agent

KQ Lin, L Li, D Gao, Z Yang, Z Bai, W Lei… - … 2024 Workshop on …, 2024 - openreview.net
Graphical User Interface (GUI) automation holds significant promise for enhancing human
productivity by assisting with digital tasks. While recent Large Language Models (LLMs) and …

DistRL: An asynchronous distributed reinforcement learning framework for on-device control agents

T Wang, Z Wu, J Liu, J Hao, J Wang, K Shao - arXiv preprint arXiv …, 2024 - arxiv.org
On-device control agents, especially on mobile devices, are responsible for operating
mobile devices to fulfill users' requests, enabling seamless and intuitive interactions …

Turn every application into an agent: Towards efficient human-agent-computer interaction with API-first LLM-based agents

J Lu, Z Zhang, F Yang, J Zhang, L Wang, C Du… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) have enabled LLM-based agents to directly
interact with application user interfaces (UIs), enhancing agents' performance in complex …

OSCAR: Operating system control via state-aware reasoning and re-planning

X Wang, B Liu - arXiv preprint arXiv:2410.18963, 2024 - arxiv.org
Large language models (LLMs) and large multimodal models (LMMs) have shown great
potential in automating complex tasks like web browsing and gaming. However, their ability …