AI Agents Can Only Automate 2.5% of Real Remote Jobs – New Study Reveals the Surprising Gap

Summary:

  • A new study introduces the Remote Labour Index (RLI), testing AI's ability to replace human freelancers in real-world remote work projects

  • The highest-performing AI model, Manus, achieved an automation rate of just 2.5%, with other leading systems scoring between 0.8% and 2.1%

  • The RLI evaluated 240 end-to-end freelance projects across 23 categories, representing over 6,000 hours of human labour valued at $140,000+

  • Common AI failures include incomplete outputs, corrupted files, poor professional quality, and inconsistencies across deliverables

  • Despite steady progress, AI systems remain far from automating economically valuable remote work, with over 97% of projects not meeting client standards

A groundbreaking study has introduced the Remote Labour Index (RLI), a new benchmark designed to test whether artificial intelligence can truly replace human freelancers. The findings reveal that today’s most advanced AI agents can complete less than 3% of real-world remote work projects, with the highest-performing model achieving an automation rate of just 2.5%.

The Remote Labour Index: A Real-World Test

Developed by researchers from the Center for AI Safety and Scale AI, the RLI is a dataset of 240 end-to-end freelance projects sourced from real professionals across multiple industries. Unlike many AI tests that focus on narrow tasks, the RLI evaluates full projects drawn directly from online freelance platforms, each including a brief, input files, and a ‘gold-standard’ human deliverable.
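To make the dataset’s shape concrete, here is a minimal sketch of one RLI record as a Python dataclass. The field names and example values are assumptions for illustration; the article does not publish the study’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class RLITask:
    """Illustrative record for one RLI project (field names are
    assumptions, not the study's actual schema)."""
    category: str            # one of the 23 remote-work categories
    brief: str               # the client's project description
    input_files: list[str] = field(default_factory=list)  # supplied assets
    gold_deliverable: str = ""  # the human freelancer's accepted work
    hours: float = 0.0       # human time the project took
    price_usd: float = 0.0   # what the client was paid

# A hypothetical data-visualisation project:
task = RLITask(
    category="data visualisation",
    brief="Build an interactive dashboard from the attached sales CSV.",
    input_files=["sales_2024.csv"],
    gold_deliverable="dashboard_final.html",
    hours=12.0,
    price_usd=450.0,
)
```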

The projects span 23 categories of remote work and represent more than 6,000 hours of human labour valued at over $140,000. The average project took nearly 29 hours to complete and cost about $633.

AI Performance: Far from Human-Level

Despite rapid advances in reasoning and knowledge benchmarks, frontier AI systems remain far from automating economically valuable remote work. The highest-performing model, Manus, achieved an automation rate of 2.5%, meaning it produced work comparable to human freelancers on only a handful of projects. Other leading systems, including GPT-5, Claude Sonnet 4.5, Grok 4, ChatGPT agent, and Gemini 2.5 Pro, scored between 0.8% and 2.1%.
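In raw counts those percentages are small. A back-of-the-envelope sketch in Python, assuming the reported rates are simple fractions of the 240 projects, puts Manus at roughly six accepted projects and the other systems at two to five:

```python
# Illustrative arithmetic for the automation rates reported in the article.
# The function is a sketch; the study's own scoring is a manual, holistic
# accept/reject judgment per project.

def automation_rate(accepted: int, total: int = 240) -> float:
    """Share of the 240 RLI projects whose AI output a client would accept."""
    return accepted / total

# Manus's 2.5% corresponds to roughly 6 of 240 projects:
print(f"{automation_rate(6):.1%}")   # -> 2.5%
# The 0.8%-2.1% range for other systems maps to about 2-5 projects:
print(f"{automation_rate(2):.1%}")   # -> 0.8%
print(f"{automation_rate(5):.1%}")   # -> 2.1%
```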

In practical terms, this means more than 97% of the projects—ranging from 3D product rendering and architectural design to game development, data visualisation, and video production—were not completed at a level that would be accepted by a paying client.

Common AI Failures

Researchers manually evaluated AI outputs against human work, using a holistic standard: would a reasonable client accept the AI’s submission as commissioned work? Inter-annotator agreement among evaluators exceeded 94%, suggesting strong reliability in the scoring process.
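The article does not say which agreement statistic the researchers used, so the sketch below assumes simple percent agreement on binary accept/reject labels; a chance-corrected measure such as Cohen’s kappa would be a common alternative.

```python
# Minimal sketch of percent agreement between two annotators' accept/reject
# judgments. Whether the study used raw agreement or a chance-corrected
# statistic is not stated in the article, so this is an assumption.

def percent_agreement(labels_a: list[bool], labels_b: list[bool]) -> float:
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical judgments on ten submissions (True = client would accept):
a = [False, False, True, False, False, False, False, True, False, False]
b = [False, False, True, False, False, False, True, True, False, False]
print(f"{percent_agreement(a, b):.0%}")  # -> 90%
```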

The study found recurring weaknesses in AI-generated deliverables, including:

  • Incomplete or truncated outputs
  • Corrupted or unusable files
  • Poor professional quality
  • Inconsistencies across assets

For instance, some AI systems produced videos far shorter than requested, child-like graphics for design tasks, or floor plans that failed to match supplied sketches.

Steady Progress but a Long Way to Go

While AI performed better on certain creative and text-heavy tasks such as audio editing, report writing, and basic data visualisation, these represented a small slice of the broader remote work economy.

The study’s Elo scoring system, which ranks models by their relative performance, indicates steady, incremental progress: newer models consistently outperformed older ones. Even so, every AI system fell well below the human baseline score of 1,000 on the benchmark.
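For readers unfamiliar with Elo, the standard update rule is sketched below. The K-factor, ratings, and match outcome are illustrative assumptions; the study’s exact formulation is not described in the article.

```python
# Minimal sketch of the standard Elo update rule. The RLI's scoring is
# described as ranking models against a human baseline fixed at 1,000;
# the parameters below are illustrative assumptions.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated ratings after a match; score_a is 1 (win), 0.5, or 0."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

# Hypothetical comparison: a model rated 800 loses a head-to-head
# judgment against the human baseline at 1,000.
model, human = elo_update(800.0, 1000.0, score_a=0.0)
print(round(model), round(human))  # model drops to ~792, human rises to ~1008
```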

Implications for the Future of Remote Work

The findings may temper fears of immediate large-scale displacement of freelance digital workers, while also revealing the need to track AI progress using real-world economic metrics rather than theoretical performance alone. The researchers argue that the RLI provides a more economically grounded measure of AI capability than previous tests, offering policymakers and businesses a clearer picture of automation risks.

As one researcher noted, “Despite rapid progress on other AI benchmarks, current systems remain far from capable of autonomously handling the diverse and complex demands of the remote labor market.”
