Coral outperforms SOTA LLMs with SLMs using a horizontal scaling paradigm

Proving small agents outperform Anthropic & OpenAI's largest models.

Introduction

Today Coral introduces a new way of scaling AI systems: horizontal scaling. In line with Nvidia's recent thesis, our graph-based paradigm is backed by our results on the GAIA benchmark, where Coral:

  • Beats Microsoft's Magentic-UI by an incredible 34%
  • Beats the UK Gov's AISI Inspect ReAct, the previous top-scoring HAL GAIA system, which relies on Claude 3.5 Sonnet

Vertical scaling

Vertical scaling is the traditional approach to AI, where we build larger and larger models to handle more complex tasks. This has led to impressive results, but it also comes with limitations: it is more expensive, slower, and can lead to scheming and alignment failures.

How it works

Inspired by CAMEL's OWL, we built an application that uses Coral to instantiate a graph of agents for each task.

Each agent is created dynamically via a single POST request from a GAIA Application to the Coral Server, which handles parallel execution and secure agent communication using the Model Context Protocol (MCP).
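As a rough illustration of the "single POST request" step, the sketch below builds (but does not send) such a request with Python's standard library. The server address, the `/agents` route, and the payload fields are hypothetical placeholders chosen for this example, not Coral's documented API.

```python
import json
import urllib.request

# Hypothetical values: the real Coral Server address, route, and payload
# schema are not given in this article.
CORAL_SERVER_URL = "http://localhost:5555"
CREATE_AGENT_PATH = "/agents"


def build_create_agent_request(name: str, model: str, tools: list[str]) -> urllib.request.Request:
    """Build one POST request asking the server to instantiate one agent."""
    payload = {"name": name, "model": model, "tools": tools}
    return urllib.request.Request(
        url=CORAL_SERVER_URL + CREATE_AGENT_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_agent_request(
    name="search_agent",
    model="gpt-4.1-mini-2025-04-14",
    tools=["JinaBrowsingToolkit", "WolframAlpha"],
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```

In a setup like this, the application would issue one such request per agent in the task's graph and leave execution and messaging to the server.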

Jump to technical details →

Our results on the GAIA Benchmark

The GAIA Benchmark is an evaluation suite for general-purpose AI assistants, with questions grouped into three difficulty levels (Level 1 to Level 3).

Read the GAIA benchmark paper →

Princeton HAL

For comparison, we reference two separate GAIA leaderboards. The Princeton HAL leaderboard offers stronger verification because entrants submit their agent code rather than just their answers. Although the GAIA answers for the test set are not publicly available, we consider this leaderboard the most authentic for holistically evaluating AI systems and paradigms.

GAIA Huggingface

At the time of publishing, Princeton HAL's leaderboard has few competing systems. Here we present our performance versus other systems.

[Chart: Princeton HAL results for Level 1, Level 2, and Level 3]

[Chart: GAIA Huggingface results (mini models only), overall]

Performance Visualization

Task performance across different categories of the GAIA benchmark, visualized to highlight patterns and insights.

Accuracies

Level 1: 71.70%
Level 2: 63.95%
Level 3: 23.08%
Overall: 60.00%

165 results
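The overall figure is a question-weighted average of the per-level accuracies, not a simple mean of the three percentages. The short check below reproduces the reported numbers under one assumption: that the 165 results split into 53, 86, and 26 questions for Levels 1, 2, and 3. Those per-level counts, and the correct-answer counts derived from them, are not stated in this article; they are simply the values consistent with the reported percentages.

```python
# Assumed per-level question counts (not stated in the article);
# they sum to the 165 results reported above.
questions_per_level = {"Level 1": 53, "Level 2": 86, "Level 3": 26}

# Correct-answer counts implied by the reported accuracies under that assumption.
correct_per_level = {"Level 1": 38, "Level 2": 55, "Level 3": 6}


def accuracy(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to two decimals."""
    return round(100 * correct / total, 2)


per_level = {
    level: accuracy(correct_per_level[level], total)
    for level, total in questions_per_level.items()
}
total_questions = sum(questions_per_level.values())
total_correct = sum(correct_per_level.values())
overall = accuracy(total_correct, total_questions)

print(per_level)
print(total_correct, "/", total_questions, "=", overall)
```

Because Level 2 carries the most questions under this split, it pulls the overall score closer to 63.95% than an unweighted mean of the three levels would.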


GAIA Question - 00d579ea-0889-4fd9-a771-2c8d79835c8d

Assuming scientists in the famous youtube video The Thinking Machine (Artificial Intelligence in the 1960s) were interviewed the same year, what is the name of the scientist predicting the sooner thinking machines or robots? Answer using the format First name Last name

Correct answer: Claude Shannon
Our answer: Claude Shannon
  1. The video “The Thinking Machine (Artificial Intelligence in the 1960s)” features interviews with three AI pioneers: Jerome Wiesner, Oliver Selfridge, and Claude Shannon. (Certainty: 95%; Source: Search agent results.)
  2. Among these, the only explicit near-term timeline prediction found is from Claude Shannon, who famously stated in his 1950 paper: “Within about ten years, machines should be capable of translating languages and performing certain intellectual tasks.” (Certainty: 90%; Source: Evidence #2.)
  3. No evidence was found in the video’s context or secondary sources that Jerome Wiesner or Oliver Selfridge made a shorter-term prediction about thinking machines or robots (e.g., sub-15-year horizon). (Certainty: 80%; Source: absence in searches and review by Web agent.)
  4. Therefore, Claude Shannon is the scientist from that set who predicted the sooner advent of thinking machines or robots. Overall confidence: 85%.

=== EVIDENCE AND SOURCES ===
[1] SOURCE: Search results show the YouTube video "The Thinking Machine (Artificial Intelligence in the 1960s)" features interviews with AI pioneers Jerome Wiesner, Oliver Selfridge, and Claude Shannon.
[2] QUOTE: Claude Shannon’s familiar statement from his 1950 paper: “Within about ten years, machines should be capable of translating languages and performing certain intellectual tasks.”


Why Small Models?

Because Coordination > Size

We intentionally use small models in GAIA—proving that intelligent orchestration beats brute-force scale.

Benchmark Level     Coral Agents (GPT-4.1 mini)
Level 1             71.7%
Level 2             63.95%
Level 3             23.08%
Overall Accuracy    60.0%

The takeaway: Coral lets agents collaborate freely, so a system of small models can outperform vertically scaled giants.

How far can it scale?

Coral is not a general-purpose chatbot wrapper. It's a protocol designed for systems where agent composition and roles are known ahead of time—like in enterprise AI services, agent-native applications, or dynamic microservices.

As more agents are coralized and connected through MCP, the possibilities compound. The GAIA benchmark shows what's already possible—this is just the beginning. Such strong performance on a benchmark designed for general purpose assistants gives us confidence in future benchmarks on more specific tasks, such as SWE-Bench. We are excited to see the real production systems built with Coral too.

Technical details

Multi-Agent System Architecture

This application demonstrates Coral's organizational capabilities through a sophisticated multi-agent architecture. Built on Coral's foundation, agents form dynamic organizational graphs and collaborate holistically through threaded communication—allowing focused collaboration without overwhelming complexity.

Agent: 🎯 Answer Finding Agent
Description: Submits final answers and ensures proper formatting according to task requirements.
Model: gpt-4.1-mini-2025-04-14
Tools: SendAnswerToolkit

Agent: 🔎 Search Agent
Description: Searches across multiple platforms and engines to find relevant information.
Model: gpt-4.1-mini-2025-04-14
Tools: Multiple Search APIs, JinaBrowsingToolkit, WolframAlpha

Agent: 🌐 Web Agent
Description: Interacts with web pages, navigating interfaces and extracting structured data.
Model: gpt-4.1-mini-2025-04-14
Tools: WebToolkit, Interactive Element Detection

Agent: 🖼️ Image Agent
Description: Specialized in understanding and analyzing visual content across various formats.
Model: gpt-4.1-mini-2025-04-14
Tools: ImageAnalysisToolkit

Agent: 🎥 Video Agent
Description: Processes and analyzes video content with audio integration capabilities.
Model: o4-mini-2025-04-16
Tools: VideoAnalysisToolkit, VideoDownloaderToolkit

Agent: 🤖 Assistant Agent
Description: General-purpose assistant with multiple capabilities for handling diverse tasks.
Model: gpt-4.1-mini-2025-04-14
Tools: JinaBrowsingToolkit, DocumentProcessingToolkit, ExcelToolkit

Agent: 🧮 Problem Solving Agent
Description: Specializes in mathematical problem solving and computational tasks.
Model: o4-mini-2025-04-16
Tools: MathToolkit, SymPyToolkit, CodeExecutionToolkit

Agent: 📋 Planning Agent
Description: Coordinates complex tasks by breaking them down and managing agent activities.
Model: gpt-4.1-mini-2025-04-14
Tools: Progress Monitoring Functions, MCP Communication Toolkit

Agent: 🔍 Critique Agent
Description: Quality assurance agent that challenges assumptions and ensures evidence-backed claims.
Model: gpt-4.1-mini-2025-04-14
Tools: MCP Communication Toolkit
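For readers who prefer to see the roster as data, here is one possible way to encode it in Python. The agent names, model identifiers, and tool names come from the listing above; the `AgentSpec` class and the `AGENTS` list are illustrative names invented for this sketch and are not part of Coral.

```python
from collections import Counter
from dataclasses import dataclass

GPT_MINI = "gpt-4.1-mini-2025-04-14"
O4_MINI = "o4-mini-2025-04-16"


@dataclass(frozen=True)
class AgentSpec:
    """Illustrative record for one agent in the GAIA application."""
    name: str
    model: str
    tools: tuple[str, ...]


AGENTS = [
    AgentSpec("Answer Finding Agent", GPT_MINI, ("SendAnswerToolkit",)),
    AgentSpec("Search Agent", GPT_MINI,
              ("Multiple Search APIs", "JinaBrowsingToolkit", "WolframAlpha")),
    AgentSpec("Web Agent", GPT_MINI, ("WebToolkit", "Interactive Element Detection")),
    AgentSpec("Image Agent", GPT_MINI, ("ImageAnalysisToolkit",)),
    AgentSpec("Video Agent", O4_MINI,
              ("VideoAnalysisToolkit", "VideoDownloaderToolkit")),
    AgentSpec("Assistant Agent", GPT_MINI,
              ("JinaBrowsingToolkit", "DocumentProcessingToolkit", "ExcelToolkit")),
    AgentSpec("Problem Solving Agent", O4_MINI,
              ("MathToolkit", "SymPyToolkit", "CodeExecutionToolkit")),
    AgentSpec("Planning Agent", GPT_MINI,
              ("Progress Monitoring Functions", "MCP Communication Toolkit")),
    AgentSpec("Critique Agent", GPT_MINI, ("MCP Communication Toolkit",)),
]

# Count how many agents run on each model.
models_in_use = Counter(agent.model for agent in AGENTS)
print(models_in_use)
```

Seen this way, every agent runs on a small model: seven on GPT-4.1 mini, and two (video and problem solving) on o4-mini.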
🔗 Seamless Agent Extension

More specialized agents can be seamlessly added to improve the whole system. Any new agent simply needs to support adding the Coral server as an MCP server to automatically join the collaborative network and gain access to threaded communication with all existing agents.
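Conceptually, "adding the Coral server as an MCP server" can be as small as one extra entry in an agent's configuration. The sketch below shows that idea on a plain dictionary. The `mcp_servers` key, the entry shape, and both URLs are assumptions made for illustration; the actual configuration format depends on the agent framework in use and is not specified in this article.

```python
import copy

# Hypothetical address; the article does not give the Coral server's MCP URL.
CORAL_MCP_URL = "http://localhost:5555/mcp"


def with_coral_server(agent_config: dict, url: str = CORAL_MCP_URL) -> dict:
    """Return a copy of an agent's config with Coral added to its MCP servers."""
    updated = copy.deepcopy(agent_config)  # leave the caller's config untouched
    updated.setdefault("mcp_servers", {})["coral"] = {"url": url}
    return updated


# A made-up existing agent that already talks to one other MCP server.
existing = {
    "name": "translation_agent",
    "mcp_servers": {"files": {"url": "http://localhost:7000/mcp"}},
}
joined = with_coral_server(existing)
print(sorted(joined["mcp_servers"]))
```

The agent's other MCP servers are left in place; joining the network adds a connection and replaces nothing.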

🪸 Coral's Organizational Intelligence

  • Dynamic Graph Formation: Agents automatically organize into optimal collaboration patterns based on task requirements. Developers can also create custom organizational graphs to suit specific needs.
  • Threaded Communication: Slack-like conversation threads prevent information overload and enable focused collaboration.
  • Holistic Coordination: By asking questions in threads, agents can access the full system context without being overwhelmed by irrelevant peer communications.
  • Extensible Architecture: New specialized agents can be seamlessly added to the organizational graph as needed.
  • Intelligent Routing: Tasks are autonomously directed to the most appropriate agents based on their capabilities.
  • Collaborative Learning: Agents learn from each other's expertise through structured knowledge-sharing patterns.
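To make the threaded-communication idea concrete, here is a minimal in-memory sketch in which an agent only sees messages from threads it participates in. The `Thread` and `ThreadBoard` classes, their methods, and the agent names are hypothetical and written for this example; they do not represent Coral's actual data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    topic: str
    participants: set[str]
    messages: list[tuple[str, str]] = field(default_factory=list)  # (sender, text)


class ThreadBoard:
    """Toy message board: agents only read threads they belong to."""

    def __init__(self) -> None:
        self._threads: dict[str, Thread] = {}

    def create_thread(self, thread_id: str, topic: str, participants: set[str]) -> None:
        self._threads[thread_id] = Thread(topic, set(participants))

    def post(self, thread_id: str, sender: str, text: str) -> None:
        thread = self._threads[thread_id]
        if sender not in thread.participants:
            raise PermissionError(f"{sender} is not in thread {thread_id}")
        thread.messages.append((sender, text))

    def inbox(self, agent: str) -> list[tuple[str, str, str]]:
        """All (thread_id, sender, text) entries visible to one agent."""
        return [
            (thread_id, sender, text)
            for thread_id, thread in self._threads.items()
            if agent in thread.participants
            for sender, text in thread.messages
        ]


board = ThreadBoard()
board.create_thread("t1", "Find interviewees", {"planning", "search"})
board.create_thread("t2", "Inspect thumbnail", {"planning", "image"})
board.post("t1", "planning", "Who is interviewed in the video?")
board.post("t1", "search", "Wiesner, Selfridge, and Shannon.")
board.post("t2", "planning", "Describe the thumbnail.")

print(board.inbox("search"))  # only thread t1 traffic
```

Here the planning agent sees both threads, while the search and image agents each see only the conversation relevant to them, which is the "focused collaboration" property described in the list above.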
📄 Further Research

Further details about the system will be available in our upcoming arXiv paper.

What's next?

Coral is just getting started. We have big plans to expand our ecosystem and make it easier for developers to build agent-native applications.

We are working on:

  • Remote agents and A2A payments
  • SWE-Bench scores

Get building

  • Integrate your own agents
  • Build multi-agent applications
  • Join the Coral developer ecosystem and help scale the agent economy

Get started with Coral Protocol

Join our community to stay updated: