Coral outperforms SOTA LLMs with small language models via horizontal scaling
Small, well-coordinated agents can outperform Anthropic's and OpenAI's largest models.
Introduction
Today Coral introduces a new way of scaling AI systems: horizontal scaling. In line with Nvidia's recent thesis, our graph-based paradigm is validated today by our results on the GAIA benchmark, where Coral:
- Beats Microsoft's Magentic-UI by an incredible 34%
- Beats the UK Government's AISI Inspect ReAct, the previous top-scoring HAL GAIA system, which relies on Claude 3.5 Sonnet
Vertical scaling
Vertical scaling is the traditional approach to AI: build ever-larger models to handle more complex tasks. This has produced impressive results, but it also comes with limitations: it is more expensive, slower, and more prone to scheming and alignment failures.
How it works
Inspired by CAMEL's OWL, we built an application that uses Coral to instantiate a graph of agents for each task.
Each agent is created dynamically via a single POST request from a GAIA Application to the Coral Server, which handles parallel execution and secure agent communication using the Model Context Protocol (MCP).
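As a rough sketch, that per-task instantiation might look like the following. The endpoint path, payload fields, and agent names here are illustrative assumptions, not Coral's actual API:

```python
import json
import urllib.request

# Hypothetical Coral Server endpoint (illustrative, not the documented path).
CORAL_SERVER = "http://localhost:5555/sessions"

# Describe the agent graph for one GAIA task. All field names are assumptions.
payload = {
    "task": "gaia-level-1-question-42",
    "agents": [
        {"name": "planning", "model": "gpt-4.1-mini"},
        {"name": "search", "model": "gpt-4.1-mini"},
        {"name": "critique", "model": "gpt-4.1-mini"},
    ],
}

# A single POST asks the Coral Server to spin up the whole graph; the server
# then handles parallel execution and MCP-based agent communication.
request = urllib.request.Request(
    CORAL_SERVER,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# response = urllib.request.urlopen(request)  # run against a live server
```

The key design point is that the caller only declares the graph; scheduling and secure inter-agent messaging stay inside the Coral Server.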
Jump to technical details →

Our results on the GAIA Benchmark
The GAIA Benchmark is a multi-layered evaluation suite for advanced AI capabilities.
Read the GAIA benchmark paper →

Princeton HAL
For comparison, we reference two separate GAIA leaderboards. The Princeton HAL leaderboard offers stronger verification by requiring listees to submit their agent code itself rather than just their answers. While the GAIA test-set answers are not publicly available, we consider HAL the most authentic way to holistically evaluate AI systems and paradigms.
GAIA Huggingface
At the time of publishing, the Princeton HAL leaderboard has few competing systems, so we also present our performance against the systems on the GAIA Hugging Face leaderboard.
Princeton HAL Results
GAIA Huggingface (Mini models only) Results
Performance Visualization
Task performance across different categories of the GAIA benchmark, visualized to highlight patterns and insights.
Accuracies

| Level | Accuracy |
|---|---|
| Level 1 | 71.70% |
| Level 2 | 63.95% |
| Level 3 | 23.08% |
| Overall | 60.00% |
GAIA Question - 00d579ea-0889-4fd9-a771-2c8d79835c8d
Assuming scientists in the famous youtube video The Thinking Machine (Artificial Intelligence in the 1960s) were interviewed the same year, what is the name of the scientist predicting the sooner thinking machines or robots? Answer using the format First name Last name
Correct answer | Claude Shannon |
Our answer | Claude Shannon |
- Search "The Thinking Machine (Artificial Intelligence in the 1960s)" and open the YouTube result
- Listen to the video.
- Search for a transcript to confirm, due to struggling to feel confident in my answer.
- Fail to find a transcript.
- Watch again, finding again that Claude Shannon predicted AI in 5-10 years, which is the soonest.
- The video “The Thinking Machine (Artificial Intelligence in the 1960s)” features interviews with three AI pioneers: Jerome Wiesner, Oliver Selfridge, and Claude Shannon. (Certainty: 95%; Source: Search agent results.)
- Among these, the only explicit near-term timeline prediction found is from Claude Shannon, who famously stated in his 1950 paper: “Within about ten years, machines should be capable of translating languages and performing certain intellectual tasks.” (Certainty: 90%; Source: Evidence #2.)
- No evidence was found in the video’s context or secondary sources that Jerome Wiesner or Oliver Selfridge made a shorter-term prediction about thinking machines or robots (e.g., sub-15-year horizon). (Certainty: 80%; Source: absence in searches and review by Web agent.)
- Therefore, Claude Shannon is the scientist from that set who predicted the sooner advent of thinking machines or robots. Overall confidence: 85%.

=== EVIDENCE AND SOURCES ===

[1] SOURCE: Search results show the YouTube video "The Thinking Machine (Artificial Intelligence in the 1960s)" features interviews with AI pioneers Jerome Wiesner, Oliver Selfridge, and Claude Shannon.

[2] QUOTE: Claude Shannon's familiar statement from his 1950 paper: "Within about ten years, machines should be capable of translating languages and performing certain intellectual tasks."
Why Small Models?
Because Coordination > Size
We intentionally use small models on GAIA, proving that intelligent orchestration beats brute-force scale.
| Benchmark Level | Coral Agents (GPT-4.1 mini) |
|---|---|
| Level 1 | 71.7% |
| Level 2 | 63.95% |
| Level 3 | 23.08% |
| Overall Accuracy | 60.0% |
The takeaway: Coral lets agents freely collaborate, allowing systems built from small models to outperform vertically scaled giants.
How far can it scale?
Coral is not a general-purpose chatbot wrapper. It's a protocol designed for systems where agent composition and roles are known ahead of time, such as enterprise AI services, agent-native applications, or dynamic microservices.
As more agents are coralized and connected through MCP, the possibilities compound. The GAIA benchmark shows what's already possible—this is just the beginning. Such strong performance on a benchmark designed for general purpose assistants gives us confidence in future benchmarks on more specific tasks, such as SWE-Bench. We are excited to see the real production systems built with Coral too.
Technical details
Multi-Agent System Architecture
This application demonstrates Coral's organizational capabilities through a multi-agent architecture. Built on Coral's foundation, agents form dynamic organizational graphs and collaborate through threaded communication, allowing focused collaboration without overwhelming complexity.
| Agent | Description | Model | Tools |
|---|---|---|---|
| 🎯 Answer Finding Agent | Submits final answers and ensures proper formatting according to task requirements. | gpt-4.1-mini-2025-04-14 | SendAnswerToolkit |
| 🔎 Search Agent | Searches across multiple platforms and engines to find relevant information. | gpt-4.1-mini-2025-04-14 | Multiple Search APIs, JinaBrowsingToolkit, WolframAlpha |
| 🌐 Web Agent | Interacts with web pages, navigating interfaces and extracting structured data. | gpt-4.1-mini-2025-04-14 | WebToolkit, Interactive Element Detection |
| 🖼️ Image Agent | Specialized in understanding and analyzing visual content across various formats. | gpt-4.1-mini-2025-04-14 | ImageAnalysisToolkit |
| 🎥 Video Agent | Processes and analyzes video content with audio integration capabilities. | o4-mini-2025-04-16 | VideoAnalysisToolkit, VideoDownloaderToolkit |
| 🤖 Assistant Agent | General-purpose assistant with multiple capabilities for handling diverse tasks. | gpt-4.1-mini-2025-04-14 | JinaBrowsingToolkit, DocumentProcessingToolkit, ExcelToolkit |
| 🧮 Problem Solving Agent | Specializes in mathematical problem solving and computational tasks. | o4-mini-2025-04-16 | MathToolkit, SymPyToolkit, CodeExecutionToolkit |
| 📋 Planning Agent | Coordinates complex tasks by breaking them down and managing agent activities. | gpt-4.1-mini-2025-04-14 | Progress Monitoring Functions, MCP Communication Toolkit |
| 🔍 Critique Agent | Quality assurance agent that challenges assumptions and ensures evidence-backed claims. | gpt-4.1-mini-2025-04-14 | MCP Communication Toolkit |
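The threaded communication these agents use can be sketched minimally as follows. The class and method names are illustrative assumptions, not Coral's actual interfaces; the point is that messages route only to a thread's participants rather than broadcasting to every agent:

```python
from dataclasses import dataclass, field

@dataclass
class Thread:
    """A focused conversation scoped to a subset of agents."""
    topic: str
    participants: set
    messages: list = field(default_factory=list)

class ThreadRegistry:
    """Routes messages only to agents in a thread, keeping
    collaboration focused instead of overwhelming every agent."""
    def __init__(self):
        self.threads = {}

    def open(self, topic, participants):
        self.threads[topic] = Thread(topic, set(participants))
        return self.threads[topic]

    def post(self, topic, sender, text):
        thread = self.threads[topic]
        assert sender in thread.participants, "sender must be in the thread"
        thread.messages.append((sender, text))
        # Only the other thread participants would be notified.
        return thread.participants - {sender}

registry = ThreadRegistry()
registry.open("verify-answer", {"critique", "search", "planning"})
notified = registry.post("verify-answer", "planning",
                         "Please verify the Shannon claim.")
```

Here the Planning Agent's message reaches only the Critique and Search Agents, mirroring how threads keep each sub-conversation scoped.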
Seamless Agent Extension
More specialized agents can be seamlessly added to improve the whole system. Any new agent simply needs to support adding the Coral server as an MCP server to automatically join the collaborative network and gain access to threaded communication with all existing agents.
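For illustration only, a new agent's configuration might point at the Coral server like this. The server URL, transport, and config shape are assumptions, not Coral's documented format:

```python
# Hypothetical MCP server entry for a new agent. The "coral" entry tells
# the agent's MCP client where the Coral server lives; once connected,
# the agent can discover and join threads with existing agents.
mcp_servers = {
    "coral": {
        "url": "http://localhost:5555/mcp",  # illustrative address
        "transport": "sse",                  # assumed transport
    }
}

def register(agent_config, servers):
    """Attach MCP server entries to an agent's configuration (copy-safe)."""
    agent_config = dict(agent_config)
    agent_config.setdefault("mcp_servers", {}).update(servers)
    return agent_config

new_agent = register({"name": "audio-agent", "model": "gpt-4.1-mini"},
                     mcp_servers)
```

The design intent is that joining the network is a configuration change, not a code change: any MCP-capable agent can be pointed at the Coral server.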
🪸 Coral's Organizational Intelligence
Further Research
Further details about the system will be available in our forthcoming arXiv paper.
What's next?
Coral is just getting started. We have big plans to expand our ecosystem and make it easier for developers to build agent-native applications.
We are working on:
- Remote agents and A2A payments
- SWE-Bench scores
Get building
- Integrate your own agents
- Build multi-agent applications
- Join the Coral developer ecosystem and help scale the agent economy
Get started with Coral Protocol
Join our community to stay updated: