Foundations and recent trends in multimodal mobile agents: A survey

B Wu, Y Li, M Fang, Z Song, Z Zhang, Y Wei… - arXiv preprint arXiv …, 2024 - arxiv.org
Mobile agents are essential for automating tasks in complex and dynamic mobile
environments. As foundation models evolve, the demands for agents that can adapt in real …

Ferret-UI 2: Mastering universal user interface understanding across platforms

Z Li, K You, H Zhang, D Feng, H Agrawal, X Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building a generalist model for user interface (UI) understanding is challenging due to
various foundational issues, such as platform diversity, resolution variation, and data …

MindSearch: Mimicking human minds elicits deep AI searcher

Z Chen, K Liu, Q Wang, J Liu, W Zhang, K Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Information seeking and integration is a complex cognitive task that consumes enormous
time and effort. Inspired by the remarkable progress of Large Language Models, recent …

Caution for the environment: Multimodal agents are susceptible to environmental distractions

X Ma, Y Wang, Y Yao, T Yuan, A Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper investigates the faithfulness of multimodal large language model (MLLM) agents
in the graphical user interface (GUI) environment, aiming to address the research question …

DistRL: An asynchronous distributed reinforcement learning framework for on-device control agents

T Wang, Z Wu, J Liu, J Hao, J Wang, K Shao - arXiv preprint arXiv …, 2024 - arxiv.org
On-device control agents, especially on mobile devices, are responsible for operating
mobile devices to fulfill users' requests, enabling seamless and intuitive interactions …

Agent-E: From autonomous web navigation to foundational design principles in agentic systems

T Abuelsaad, D Akkil, P Dey, A Jagmohan… - arXiv preprint arXiv …, 2024 - arxiv.org
AI Agents are changing the way work gets done, both in consumer and enterprise domains.
However, the design patterns and architectures to build highly capable agents or multi-agent …

Do multimodal foundation models understand enterprise workflows? A benchmark for business process management tasks

M Wornow, A Narayan, B Viggiano, IS Khare… - arXiv e …, 2024 - ui.adsabs.harvard.edu
Existing ML benchmarks lack the depth and diversity of annotations needed for evaluating
models on business process management (BPM) tasks. BPM is the practice of documenting …

AutoGLM: Autonomous foundation agents for GUIs

X Liu, B Qin, D Liang, G Dong, H Lai, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present AutoGLM, a new series in the ChatGLM family, designed to serve as foundation
agents for autonomous control of digital devices through Graphical User Interfaces (GUIs) …

Towards a science exocortex

KG Yager - Digital Discovery, 2024 - pubs.rsc.org
Artificial intelligence (AI) methods are poised to revolutionize intellectual work, with
generative AI enabling automation of text analysis, text generation, and simple decision …

Lightweight Neural App Control

F Christianos, G Papoudakis, T Coste, J Hao… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces a novel mobile phone control architecture, termed "app agents", for
efficient interactions and controls across various Android apps. The proposed Lightweight …