MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …
Foundations and recent trends in multimodal mobile agents: A survey
Mobile agents are essential for automating tasks in complex and dynamic mobile
environments. As foundation models evolve, the demands for agents that can adapt in real …
Navigating the digital world as humans do: Universal visual grounding for GUI agents
Multimodal large language models (MLLMs) are transforming the capabilities of graphical
user interface (GUI) agents, facilitating their transition from controlled simulations to …
Ferret-UI 2: Mastering universal user interface understanding across platforms
Building a generalist model for user interface (UI) understanding is challenging due to
various foundational issues, such as platform diversity, resolution variation, and data …
Windows Agent Arena: Evaluating multi-modal OS agents at scale
Large language models (LLMs) show remarkable potential to act as computer agents,
enhancing human productivity and software accessibility in multi-modal tasks that require …
TinyAgent: Function calling at the edge
Recent large language models (LLMs) have enabled the development of advanced agentic
systems that can integrate various tools and APIs to fulfill user queries through function …
ShowUI: One vision-language-action model for generalist GUI agent
Graphical User Interface (GUI) automation holds significant promise for enhancing human
productivity by assisting with digital tasks. While recent Large Language Models (LLMs) and …
DistRL: An asynchronous distributed reinforcement learning framework for on-device control agents
On-device control agents, especially on mobile devices, are responsible for operating
mobile devices to fulfill users' requests, enabling seamless and intuitive interactions …
Turn every application into an agent: Towards efficient human-agent-computer interaction with API-first LLM-based agents
Multimodal large language models (MLLMs) have enabled LLM-based agents to directly
interact with application user interfaces (UIs), enhancing agents' performance in complex …
OSCAR: Operating system control via state-aware reasoning and re-planning
Large language models (LLMs) and large multimodal models (LMMs) have shown great
potential in automating complex tasks like web browsing and gaming. However, their ability …