Coral outperforms SOTA LLMs with small language models via horizontal scaling
Small, well-coordinated agents can outperform Anthropic's and OpenAI's largest models.
Introduction
Today Coral introduces a new way of scaling AI systems: horizontal scaling. In line with Nvidia's recent thesis, our graph-based paradigm is validated today by our results on the GAIA benchmark, where Coral:
- Beats Microsoft's Magentic-UI by an incredible 34%
- Beats the UK Government's AISI Inspect ReAct, the previous top-scoring HAL GAIA system, which relies on Claude 3.5 Sonnet
Vertical scaling
Vertical scaling is the traditional approach to AI: build ever-larger models to handle more complex tasks. This has produced impressive results, but it also comes with limitations: it is more expensive, slower, and more prone to scheming and alignment failures.
How it works
Inspired by CAMEL's OWL, we built an application that uses Coral to instantiate a graph of agents for each task.
Each agent is created dynamically via a single POST request from a GAIA Application to the Coral Server, which handles parallel execution and secure agent communication using the Model Context Protocol (MCP).
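As a rough sketch, that per-task instantiation might look like the following. The endpoint path, payload fields, and agent names here are illustrative assumptions, not Coral's actual API:

```python
import json
import urllib.request

# Hypothetical Coral Server endpoint (illustrative, not the documented path).
CORAL_SERVER = "http://localhost:5555/sessions"

# Describe the agent graph for one GAIA task. All field names are assumptions.
payload = {
    "task": "gaia-level-1-question-42",
    "agents": [
        {"name": "planning", "model": "gpt-4.1-mini"},
        {"name": "search", "model": "gpt-4.1-mini"},
        {"name": "critique", "model": "gpt-4.1-mini"},
    ],
}

# A single POST asks the Coral Server to spin up the whole graph; the server
# then handles parallel execution and MCP-based agent communication.
request = urllib.request.Request(
    CORAL_SERVER,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# response = urllib.request.urlopen(request)  # run against a live server
```

The key design point is that the caller only declares the graph; scheduling and secure inter-agent messaging stay inside the Coral Server.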
Jump to technical details →

Our results on the GAIA Benchmark
The GAIA Benchmark is a multi-layered evaluation suite for advanced AI capabilities.
Read the GAIA benchmark paper →

Princeton HAL
For comparison, we reference two separate GAIA leaderboards. The Princeton HAL leaderboard offers stronger verification by requiring listees to submit their agent code itself rather than just their answers. While the GAIA test-set answers are not publicly available, we consider HAL the most authentic way to holistically evaluate AI systems and paradigms.
GAIA Huggingface
At the time of publishing, the Princeton HAL leaderboard has few competing systems, so we also present our performance against the systems on the GAIA Hugging Face leaderboard.
Princeton HAL Results
GAIA Huggingface (Mini models only) Results
Performance Visualization
Task performance across different categories of the GAIA benchmark, visualized to highlight patterns and insights.
Accuracies

| Level | Accuracy |
|---|---|
| Level 1 | 71.70% |
| Level 2 | 63.95% |
| Level 3 | 23.08% |
| Overall | 60.00% |
GAIA Question - 00d579ea-0889-4fd9-a771-2c8d79835c8d
Assuming scientists in the famous youtube video The Thinking Machine (Artificial Intelligence in the 1960s) were interviewed the same year, what is the name of the scientist predicting the sooner thinking machines or robots? Answer using the format First name Last name
Correct answer | Claude Shannon |
Our answer | Claude Shannon |
- Search "The Thinking Machine (Artificial Intelligence in the 1960s)" and open the YouTube result
- Listen to the video.
- Search for a transcript to confirm, due to struggling to feel confident in my answer.
- Fail to find a transcript.
- Watch again, finding again that Claude Shannon predicted AI in 5-10 years, which is the soonest.
- The video “The Thinking Machine (Artificial Intelligence in the 1960s)” features interviews with three AI pioneers: Jerome Wiesner, Oliver Selfridge, and Claude Shannon. (Certainty: 95%; Source: Search agent results.)
- Among these, the only explicit near-term timeline prediction found is from Claude Shannon, who famously stated in his 1950 paper: “Within about ten years, machines should be capable of translating languages and performing certain intellectual tasks.” (Certainty: 90%; Source: Evidence #2.)
- No evidence was found in the video’s context or secondary sources that Jerome Wiesner or Oliver Selfridge made a shorter-term prediction about thinking machines or robots (e.g., sub-15-year horizon). (Certainty: 80%; Source: absence in searches and review by Web agent.)
- Therefore, Claude Shannon is the scientist from that set who predicted the sooner advent of thinking machines or robots. Overall confidence: 85%.

=== EVIDENCE AND SOURCES ===

[1] SOURCE: Search results show the YouTube video "The Thinking Machine (Artificial Intelligence in the 1960s)" features interviews with AI pioneers Jerome Wiesner, Oliver Selfridge, and Claude Shannon.

[2] QUOTE: Claude Shannon's familiar statement from his 1950 paper: "Within about ten years, machines should be capable of translating languages and performing certain intellectual tasks."
Why Small Models?
Because Coordination > Size
We intentionally use small models on GAIA, proving that intelligent orchestration beats brute-force scale.
| Benchmark Level | Coral Agents (GPT-4.1 mini) |
|---|---|
| Level 1 | 71.7% |
| Level 2 | 63.95% |
| Level 3 | 23.08% |
| Overall Accuracy | 60.0% |
The takeaway: Coral lets agents freely collaborate, allowing systems built from small models to outperform vertically scaled giants.
How far can it scale?
Coral is not a general-purpose chatbot wrapper. It's a protocol designed for systems where agent composition and roles are known ahead of time, such as enterprise AI services, agent-native applications, or dynamic microservices.
As more agents are coralized and connected through MCP, the possibilities compound. The GAIA benchmark shows what's already possible—this is just the beginning. Such strong performance on a benchmark designed for general purpose assistants gives us confidence in future benchmarks on more specific tasks, such as SWE-Bench. We are excited to see the real production systems built with Coral too.
Technical details
Multi-Agent System Architecture
This application demonstrates Coral's organizational capabilities through a multi-agent architecture. Built on Coral's foundation, agents form dynamic organizational graphs and collaborate through threaded communication, allowing focused collaboration without overwhelming complexity.
| Agent | Description | Model | Tools |
|---|---|---|---|
| 🎯 Answer Finding Agent | Submits final answers and ensures proper formatting according to task requirements. | gpt-4.1-mini-2025-04-14 | SendAnswerToolkit |
| 🔎 Search Agent | Searches across multiple platforms and engines to find relevant information. | gpt-4.1-mini-2025-04-14 | Multiple Search APIs, JinaBrowsingToolkit, WolframAlpha |
| 🌐 Web Agent | Interacts with web pages, navigating interfaces and extracting structured data. | gpt-4.1-mini-2025-04-14 | WebToolkit, Interactive Element Detection |
| 🖼️ Image Agent | Specialized in understanding and analyzing visual content across various formats. | gpt-4.1-mini-2025-04-14 | ImageAnalysisToolkit |
| 🎥 Video Agent | Processes and analyzes video content with audio integration capabilities. | o4-mini-2025-04-16 | VideoAnalysisToolkit, VideoDownloaderToolkit |
| 🤖 Assistant Agent | General-purpose assistant with multiple capabilities for handling diverse tasks. | gpt-4.1-mini-2025-04-14 | JinaBrowsingToolkit, DocumentProcessingToolkit, ExcelToolkit |
| 🧮 Problem Solving Agent | Specializes in mathematical problem solving and computational tasks. | o4-mini-2025-04-16 | MathToolkit, SymPyToolkit, CodeExecutionToolkit |
| 📋 Planning Agent | Coordinates complex tasks by breaking them down and managing agent activities. | gpt-4.1-mini-2025-04-14 | Progress Monitoring Functions, MCP Communication Toolkit |
| 🔍 Critique Agent | Quality assurance agent that challenges assumptions and ensures evidence-backed claims. | gpt-4.1-mini-2025-04-14 | MCP Communication Toolkit |
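The threaded communication these agents use can be sketched minimally as follows. The class and method names are illustrative assumptions, not Coral's actual interfaces; the point is that messages route only to a thread's participants rather than broadcasting to every agent:

```python
from dataclasses import dataclass, field

@dataclass
class Thread:
    """A focused conversation scoped to a subset of agents."""
    topic: str
    participants: set
    messages: list = field(default_factory=list)

class ThreadRegistry:
    """Routes messages only to agents in a thread, keeping
    collaboration focused instead of overwhelming every agent."""
    def __init__(self):
        self.threads = {}

    def open(self, topic, participants):
        self.threads[topic] = Thread(topic, set(participants))
        return self.threads[topic]

    def post(self, topic, sender, text):
        thread = self.threads[topic]
        assert sender in thread.participants, "sender must be in the thread"
        thread.messages.append((sender, text))
        # Only the other thread participants would be notified.
        return thread.participants - {sender}

registry = ThreadRegistry()
registry.open("verify-answer", {"critique", "search", "planning"})
notified = registry.post("verify-answer", "planning",
                         "Please verify the Shannon claim.")
```

Here the Planning Agent's message reaches only the Critique and Search Agents, mirroring how threads keep each sub-conversation scoped.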
Seamless Agent Extension
More specialized agents can be seamlessly added to improve the whole system. Any new agent simply needs to support adding the Coral server as an MCP server to automatically join the collaborative network and gain access to threaded communication with all existing agents.
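For illustration only, a new agent's configuration might point at the Coral server like this. The server URL, transport, and config shape are assumptions, not Coral's documented format:

```python
# Hypothetical MCP server entry for a new agent. The "coral" entry tells
# the agent's MCP client where the Coral server lives; once connected,
# the agent can discover and join threads with existing agents.
mcp_servers = {
    "coral": {
        "url": "http://localhost:5555/mcp",  # illustrative address
        "transport": "sse",                  # assumed transport
    }
}

def register(agent_config, servers):
    """Attach MCP server entries to an agent's configuration (copy-safe)."""
    agent_config = dict(agent_config)
    agent_config.setdefault("mcp_servers", {}).update(servers)
    return agent_config

new_agent = register({"name": "audio-agent", "model": "gpt-4.1-mini"},
                     mcp_servers)
```

The design intent is that joining the network is a configuration change, not a code change: any MCP-capable agent can be pointed at the Coral server.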
🪸 Coral's Organizational Intelligence
Further Research
Further details about the system will be available in our forthcoming arXiv paper.
What's next?
Coral is just getting started. We have big plans to expand our ecosystem and make it easier for developers to build agent-native applications.
We are working on:
- Remote agents and A2A payments
- SWE-Bench scores
Get building
- Integrate your own agents
- Build multi-agent applications
- Join the Coral developer ecosystem and help scale the agent economy
Get started with Coral Protocol
Join our community to stay updated: