Why macOS Is Underrepresented in Public AI Research Datasets

By Mariya Hirna, AI Research Manager at MacPaw


Last month the MacPaw Research team was at the DATA-FM Workshop within ICLR 2026 in Rio de Janeiro, where we presented our paper GUIrilla: A Scalable Framework for Automated Desktop UI Exploration. ICLR is one of the leading machine learning conferences in the world, and this was our first paper accepted there. I want to use this post to explain what we presented, why we think it matters now, and how other researchers and developers can use what we have released.

The short version of the argument is this. Over the past year, computer-use AI has moved from research demos into mainstream products. Major AI labs shipped computer-use products this spring, with explicit support for desktop control on macOS. Gartner predicts that 40% of enterprise applications will embed task-specific AI agents by the end of 2026, up from less than 5% in 2025. The category is no longer experimental.

What is less visible is that almost none of the open research data behind these systems comes from macOS. That has consequences for how well agents work on Mac. It is also the problem GUIrilla was built to address.

Why computer-use AI needs Mac data, and why the data has not been there

Computer-use AI learns by watching what happens on a screen. The training data is typically a mix of screenshots, accessibility metadata, and reasoning explanations, paired with tasks. Models trained on this data learn what buttons look like, how windows behave, how applications connect their screens, and what sequences of actions accomplish what goals. The more diverse and representative the data, the better the resulting agent performs across real applications.
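
To make that mix concrete, here is a minimal sketch of what one training sample might look like as a data structure. The class and field names (UIElement, TrainingSample, bbox, and so on) are illustrative assumptions, not the schema of any particular dataset.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UIElement:
    """One node of accessibility metadata (field names are hypothetical)."""
    role: str                          # e.g. "button", "text_field"
    label: str                         # human-readable name, may be empty
    bbox: tuple[int, int, int, int]    # x, y, width, height in pixels


@dataclass
class TrainingSample:
    """A task paired with what the agent sees and what it should do."""
    task: str                          # natural-language instruction
    screenshot_path: str               # path to the screen capture
    elements: list[UIElement] = field(default_factory=list)
    reasoning: Optional[str] = None    # optional explanation of the step
    action: Optional[str] = None       # e.g. "click:Save"


def actionable_labels(sample: TrainingSample) -> list[str]:
    """Return labels of elements an agent could plausibly click."""
    clickable = {"button", "menu_item", "checkbox"}
    return [e.label for e in sample.elements if e.role in clickable]


sample = TrainingSample(
    task="Save the current document",
    screenshot_path="screens/editor_001.png",
    elements=[
        UIElement("text_field", "Title", (20, 40, 300, 24)),
        UIElement("button", "Save", (340, 40, 60, 24)),
    ],
    reasoning="The Save button is visible in the toolbar.",
    action="click:Save",
)
print(actionable_labels(sample))
```

Keeping the screenshot and the structured elements side by side in one record is what lets the same sample serve both vision-based and structure-based models.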

The problem is that the open research datasets the field relies on are heavily skewed toward Windows and Android. In our analysis of OS-ATLAS, one of the largest publicly available synthetic datasets for computer-use AI with over 13 million GUI elements across multiple platforms, macOS accounts for just 0.06% of all samples. That is not a typo. Out of every ten thousand interface samples in the dataset, six are from Mac.
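
The arithmetic behind that figure is easy to check. The sketch below converts the 0.06% share into a per-ten-thousand rate and, purely for a sense of scale, applies it to a round total of 13 million. That total is an assumption for illustration: the dataset is described as having over 13 million GUI elements, so the result is only a rough order of magnitude.

```python
from fractions import Fraction

# Share of macOS samples, taken from the 0.06% figure in the text.
macos_share = Fraction("0.06") / 100

# Rate per ten thousand samples.
per_ten_thousand = macos_share * 10_000

# Illustrative assumption: treat 13 million as the total sample count.
assumed_total = 13_000_000
approx_macos_samples = macos_share * assumed_total

print(per_ten_thousand)        # 6
print(approx_macos_samples)    # 7800
```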

The reason is mostly technical. macOS does not expose its application interfaces in the same way that Windows or Android do. The accessibility APIs that exist are powerful, but working with them at scale requires specialist platform knowledge. Tooling to automate this kind of collection has not existed in any practical, open-source form. The field has built desktop AI with almost no Mac coverage, even though the platform itself is widely used.

The reason this matters in 2026 is that products are now shipping to Mac users at scale. Recent benchmarks such as OSWorld show major progress on computer-use task completion. Run the same agents against macOS-specific workflows, however, and they are working from a fraction of the underlying training data. Closing that gap requires the same thing every other progress curve in AI has required, which is more and better open data.

What MacPaw Research built

GUIrilla comes out of our work in human-computer interaction, one of the broader directions MacPaw Research is involved in alongside our core focus on Local LLM Inference and AI Memory. The data gap on Mac was a direct blocker for the kind of HCI and agentic AI research we wanted to do, so we built the infrastructure ourselves.

We built three things and open-sourced all of them.

The first is GUIrilla, the framework the paper is named for. It is an automated system that installs macOS applications, navigates through them screen by screen, and maps everything it finds without any human annotation. The framework produces a graph-based representation of an application: which screens exist, how they connect, what interactive elements live on each, and what actions transition between them. The full implementation is on GitHub.
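
One way to picture a graph-based representation like this is as a mapping from each screen to the actions available on it and the screens those actions lead to. The sketch below is an illustration under that assumption, not the framework's actual data format; the screen names, action strings, and the find_action_path helper are all hypothetical.

```python
from collections import deque

# Hypothetical app graph: screen -> list of (action, next_screen) edges.
app_graph = {
    "main": [("click:Preferences", "preferences"), ("click:New", "editor")],
    "preferences": [("click:Advanced", "advanced"), ("click:Close", "main")],
    "editor": [("click:Close", "main")],
    "advanced": [],
}


def find_action_path(graph, start, goal):
    """Breadth-first search for the shortest action sequence, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        screen, actions = queue.popleft()
        if screen == goal:
            return actions
        for action, next_screen in graph.get(screen, []):
            if next_screen not in visited:
                visited.add(next_screen)
                queue.append((next_screen, actions + [action]))
    return None  # goal is unreachable from start


print(find_action_path(app_graph, "main", "advanced"))
```

With the application stored this way, a question such as "how do I get from the main window to the advanced settings" becomes an ordinary shortest-path query.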

The second is GUIrilla-Task, the dataset that came out of running the framework at scale. It contains 27,171 tasks across 1,108 macOS applications, each paired with screenshots and structured interface data. We believe it is the largest publicly available dataset of Mac app interactions released to date. It is hosted on Hugging Face, free to use under permissive terms.

The third, and probably the most practical for the broader developer community, is macapptree. This is a small Python library that lets any developer or researcher extract the accessibility metadata of any Mac application in a clean, readable format: buttons, menus, text fields, view hierarchies, and how screens connect. It is the same structural layer that Apple originally built for screen readers, exposed in a format that AI systems and developers can actually work with, and it requires no specialist Mac platform knowledge to use. We released it alongside the paper, and researchers outside MacPaw have already begun using it, because no comparable tool existed for Mac before. The code is on GitHub.
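
Once accessibility metadata is available as a tree, working with it is mostly a matter of walking nested nodes. The sketch below deliberately does not call macapptree; it assumes a tree has already been extracted into nested dictionaries with "role", "name", and "children" keys, which is a hypothetical layout chosen for illustration. The role strings are standard macOS accessibility roles.

```python
# Roles treated as interactive for this illustration.
INTERACTIVE_ROLES = {"AXButton", "AXTextField", "AXMenuItem", "AXCheckBox"}


def iter_elements(node, depth=0):
    """Yield (depth, node) for every node in the tree, parents first."""
    yield depth, node
    for child in node.get("children", []):
        yield from iter_elements(child, depth + 1)


def interactive_elements(root):
    """Collect (role, name) pairs for elements a user could act on."""
    return [
        (node["role"], node.get("name", ""))
        for _, node in iter_elements(root)
        if node["role"] in INTERACTIVE_ROLES
    ]


# Hypothetical tree for a calculator-style window.
calculator_tree = {
    "role": "AXWindow",
    "name": "Calculator",
    "children": [
        {
            "role": "AXGroup",
            "name": "",
            "children": [
                {"role": "AXStaticText", "name": "0"},
                {"role": "AXButton", "name": "7"},
                {"role": "AXButton", "name": "="},
            ],
        },
    ],
}

for role, name in interactive_elements(calculator_tree):
    print(role, name)
```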

How developers and researchers can use this

The most direct entry point depends on what readers are working on.

For researchers training computer-use agents, GUIrilla-Task is a drop-in expansion of the macOS coverage in any existing computer-use training pipeline. Combined with existing datasets like OS-ATLAS or AndroidWorld, it provides the macOS slice that has been missing.
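
In practice, combining sources usually means deciding how much of each platform goes into a batch. The following is a minimal sketch of weighted mixing in plain Python. The pool contents, the weights, and the sample_mixed_batch helper are illustrative assumptions rather than part of any released pipeline.

```python
import random


def sample_mixed_batch(pools, weights, batch_size, seed=0):
    """Draw a batch where each item's platform is chosen by weight."""
    rng = random.Random(seed)  # seeded for reproducible batches
    platforms = list(pools)
    chosen = rng.choices(
        platforms,
        weights=[weights[p] for p in platforms],
        k=batch_size,
    )
    # For each chosen platform, pick one sample from that platform's pool.
    return [(p, rng.choice(pools[p])) for p in chosen]


# Hypothetical per-platform pools of task identifiers.
pools = {
    "windows": ["win_task_1", "win_task_2"],
    "android": ["android_task_1", "android_task_2"],
    "macos": ["mac_task_1", "mac_task_2"],
}
weights = {"windows": 0.4, "android": 0.3, "macos": 0.3}

batch = sample_mixed_batch(pools, weights, batch_size=8)
print(batch)
```

Sampling by weight rather than concatenating the sources keeps a small macOS pool from being drowned out by much larger Windows and Android pools.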

For researchers building UI-understanding benchmarks, the dataset includes both screenshots and structured accessibility data, which means it supports both vision-based and structure-based models. The GUIrilla-Trees companion dataset, which we released in March, extends this with large-scale accessibility tree data specifically.

For developers building anything that needs to programmatically understand a Mac application, macapptree is the lightest-weight option available. The original paper includes practical examples of using it for screen representation, vision-based accessibility generation, and UI search use cases.

Everything is open-source under permissive licenses. The paper is available on arXiv, and the full collection of datasets and models lives on the MacPaw Research page on Hugging Face.

Why this matters for the Mac ecosystem

The performance of computer-use AI on Mac is a research problem before it is a product problem. The models that ship in consumer and enterprise products are downstream of the data and tooling that exist in the open research community. If macOS continues to be underrepresented in that research, the agents that operate on Mac will continue to lag the agents that operate on Windows and Android, regardless of how good the underlying models become.

The broader shift the industry is calling Software 3.0 is, in practice, the shift to systems where AI agents take actions on behalf of users rather than only chatting with them. That shift cannot happen well on Mac without open, high-quality data about how Mac applications actually work. GUIrilla, GUIrilla-Task, and macapptree are our contribution to making that possible. We hope they are useful to others working in the same direction.

About MacPaw Research

MacPaw Research is the research unit of MacPaw, a global technology company founded in Kyiv, Ukraine, with offices in Boston, MA, and the EU, that builds a digital ecosystem for Mac users. Its core focus is deep and applied research in Local LLM Inference and AI Memory, with broader directions such as human-computer interaction also in scope. The team is reachable through the MacPaw Research site.
