Open-source model benchmark for custom tool use

TL;DR: Hugging Face published a benchmark evaluating open-source models on custom tool use tasks. Open models still lag behind proprietary ones in complex agent scenarios: on multi-step tasks the reported gap is roughly 50 percentage points (35% for Llama 3.1 70B versus 86% for GPT-4).

What happened?

Hugging Face has launched a new benchmark called "Is it agentic enough?", designed to evaluate how well open-source language models use custom tooling (that is, user-defined tools and APIs) instead of measuring performance on static datasets alone. The benchmark focuses on function calling, tool chaining, and multi-step reasoning, reflecting real-world use in agent applications. According to the official announcement on the Hugging Face blog, the benchmark consists of 1,000 tasks covering 12 different tools, including a calculator, web search, an SQL database, file reading, and a weather API. Each task requires the model to understand the tool, decide when to use it, and execute the correct sequence of calls.
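To make "tool chaining" concrete, here is a minimal Python sketch of a task in which one call depends on the result of the previous one. Everything in it is an assumption for illustration: the tool names, the `$prev` placeholder convention, and the call format are hypothetical and are not taken from the benchmark itself.

```python
import operator
from typing import Any, Callable, Dict, List

# Hypothetical tools: names and behaviour are illustrative, not the benchmark's.
_OPS: Dict[str, Callable[[float, float], float]] = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def calculator(expression: str) -> float:
    """Evaluate a space-separated 'a op b' expression, e.g. '6 * 7'."""
    left, op, right = expression.split()
    return _OPS[op](float(left), float(right))


def weather(city: str) -> str:
    """Stub standing in for a real weather API call."""
    canned = {"Paris": "18C, cloudy"}
    return canned.get(city, "unknown")


TOOLS: Dict[str, Callable[..., Any]] = {"calculator": calculator, "weather": weather}


def run_calls(calls: List[Dict[str, Any]]) -> List[Any]:
    """Run tool calls in order; '$prev' in a string argument is replaced
    by the previous call's result (an assumed convention for chaining)."""
    prev: Any = None
    results: List[Any] = []
    for call in calls:
        args = {
            key: value.replace("$prev", str(prev)) if isinstance(value, str) else value
            for key, value in call["arguments"].items()
        }
        prev = TOOLS[call["name"]](**args)
        results.append(prev)
    return results


# A plan such as a model might emit: the second call depends on the first.
plan = [
    {"name": "calculator", "arguments": {"expression": "6 * 7"}},
    {"name": "calculator", "arguments": {"expression": "$prev + 8"}},
    {"name": "weather", "arguments": {"city": "Paris"}},
]
print(run_calls(plan))  # [42.0, 50.0, '18C, cloudy']
```

In this framing, the model's job is to produce the `plan`; the harness only executes it. A model that loses the intermediate result would emit a second call that cannot be resolved.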

Why is it important?

Until now, most benchmarks (like MMLU or GSM8K) have measured static knowledge or pure reasoning, but not a model's ability to interact with external tools, which is key for AI assistants, automation, and autonomous agents. This new benchmark fills that gap: it lets developers and companies compare open models against proprietary ones (like GPT-4 and Claude) in a realistic tool use scenario. In 2023, benchmarks like ToolBench and API-Bank already explored tool use, but with limited scope (ToolBench only covered 5 tools and API-Bank focused on web service APIs). Hugging Face's benchmark is the first to be integrated into the platform as an official leaderboard, giving it visibility and credibility. It also coincides with the rise of agent frameworks like LangChain, AutoGPT, and CrewAI, which demand models capable of orchestrating tools autonomously. According to GitHub data, LangChain surpassed 100,000 stars in 2024, reflecting growing interest in agent applications.

Key results

According to the Hugging Face blog, the top open models (like Llama 3.1 70B and Qwen2.5 72B) score between 60% and 70% on simple tasks, but drop to between 30% and 40% on tasks requiring multiple steps with dependencies between tools. In contrast, GPT-4 and Claude 3.5 score over 85% on the same tests. This indicates a significant gap in agentic reasoning and context handling. For example, in tasks that involve using a calculator and then running a web search with the result, open models often fail to retain intermediate context. The benchmark also reveals that smaller models (7B-13B) score below 20% on complex tasks, suggesting scale remains an important factor for tool use. Hugging Face published a detailed table with results from 15 models, in which Llama 3.1 70B scores 68% on simple tasks and 35% on complex ones, while GPT-4 achieves 92% and 86% respectively.
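The size of the gap follows directly from the four figures quoted above. The short Python sketch below only does that arithmetic; the dictionary layout and function names are illustrative, and the numbers are the ones reported in this article, not independently verified.

```python
# Scores (in percent) as quoted in this article for two of the 15 models.
scores = {
    "Llama 3.1 70B": {"simple": 68, "complex": 35},
    "GPT-4": {"simple": 92, "complex": 86},
}


def gap(kind: str) -> int:
    """Percentage-point lead of GPT-4 over Llama 3.1 70B for a task kind."""
    return scores["GPT-4"][kind] - scores["Llama 3.1 70B"][kind]


def drop(model: str) -> int:
    """Percentage points a model loses going from simple to complex tasks."""
    return scores[model]["simple"] - scores[model]["complex"]


print(gap("simple"), gap("complex"))         # 24 51
print(drop("Llama 3.1 70B"), drop("GPT-4"))  # 33 6
```

On these figures the gap roughly doubles between simple and complex tasks (24 versus 51 points), because the open model loses 33 points on multi-step work while GPT-4 loses only 6.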

Practical consequences

For startups and companies: If they rely on open models to build agents or automations, they will need to invest in prompt engineering, fine-tuning, or hybrid architectures to compensate for limitations. For example, companies like Replit and Sourcegraph have already adopted open models for code assistants, but could face challenges when scaling to more complex tasks. The benchmark suggests that for critical applications, such as business process automation, proprietary models remain more reliable.
For the open-source community: The benchmark provides a clear roadmap for improvement, pointing to the need for better multi-step reasoning, tool memory, and ability to follow complex instructions. Initiatives like ToolAlpaca (which fine-tunes models on about 3,900 simulated tool use examples) could help close the gap, although models tuned this way still do not match proprietary model performance. Hugging Face has also released a training dataset of 50,000 examples for tool use, available in its GitHub repository.
For the AI market: Proprietary models maintain an advantage in agent scenarios, which could slow open-source adoption in critical automation applications. However, the benchmark also shows that open models like Qwen2.5 72B approach 70% on simple tasks, indicating that for less demanding use cases, open source can be a viable alternative. According to Gartner market analysis, 40% of enterprise applications are expected to incorporate AI agents by 2026, making this benchmark especially timely.

What should readers know?

The benchmark is reproducible and open (code and data available on GitHub), allowing any developer to test their own models. Hugging Face also offers an interactive tool to visualize results by task. However, the benchmark has limitations: it only covers a fixed set of tools (calculator, web search, SQL database, etc.) and does not evaluate safety or robustness against malicious inputs. Additionally, tasks are in English and in a specific format, which may not reflect all real-world use cases. Despite these limitations, the benchmark represents a significant step toward more realistic evaluation of agent capabilities. As the Hugging Face blog states: "The 'Is it agentic enough?' benchmark is a step forward in measuring real agent capabilities, but work remains to close the gap with closed models."

Historical context

The concept of tool use in language models is not new. In March 2023, OpenAI's announcement of plugins for ChatGPT demonstrated the potential for models to interact with external tools. However, traditional benchmarks did not capture this ability. Later in 2023, ToolBench (from Tsinghua University and collaborators) and API-Bank (from Alibaba's DAMO Academy) were among the first to attempt measuring tool use, but with limitations: ToolBench only evaluated 5 tools and API-Bank relied on a small set of APIs. Hugging Face's benchmark improves in coverage (12 tools) and task complexity (it includes dependencies between tools). Moreover, being integrated into the Hugging Face leaderboard allows standardized comparisons and periodic updates. This is crucial at a time when frameworks like LangChain and AutoGPT are popularizing agent development but lack a unified metric to evaluate models. According to a report from analytics firm CB Insights, investment in AI agent startups reached $2.5 billion in 2024, underscoring the relevance of this benchmark.

Practical recommendations

If you are evaluating models for an automation or agent project, consider:

Using the benchmark to compare open models in your specific domain. For example, if your agent needs to query an SQL database and then send an email, the benchmark includes similar tasks that will help you predict performance.
Complementing with fine-tuning on tool use datasets (e.g., ToolAlpaca, Gorilla). The ToolAlpaca dataset, with about 3,900 simulated examples, has reportedly improved the performance of models like Llama 2 by 15% on tool use tasks. Additionally, the Gorilla dataset (from UC Berkeley) focuses on machine learning APIs and could be useful for specific domains.
Considering proprietary models for critical tasks where reliability is paramount. According to benchmark results, GPT-4 and Claude 3.5 score over 85% on complex tasks while the top open models score between 30% and 40%, an advantage of more than 45 percentage points that can be decisive in applications where a mistake costs time or money. For simple tasks, open models like Llama 3.1 70B may suffice, especially when combined with prompting techniques like ReAct or chain-of-thought.
Monitoring benchmark updates, as Hugging Face plans to add new tools and tasks in the future, enabling more comprehensive evaluation.
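As a starting point for the first recommendation, the sketch below scores a model on a handful of domain-specific tasks and reports accuracy separately for simple and complex ones. It is only an illustration under stated assumptions: the task list, the `toy_model` stand-in, and the exact-match scoring rule on tool-name sequences are hypothetical and are not the benchmark's actual harness or metric.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

# Hypothetical domain tasks: each lists the tool sequence a correct agent would call.
TASKS: List[Dict[str, Any]] = [
    {"prompt": "What is 12 * 4?", "difficulty": "simple",
     "expected": ["calculator"]},
    {"prompt": "What's the weather in Oslo?", "difficulty": "simple",
     "expected": ["weather"]},
    {"prompt": "Find overdue invoices and email the customer.", "difficulty": "complex",
     "expected": ["sql_query", "send_email"]},
]


def toy_model(prompt: str) -> List[str]:
    """Stand-in for a real model: returns the tool names it would call."""
    text = prompt.lower()
    if "*" in text:
        return ["calculator"]
    if "weather" in text:
        return ["weather"]
    if "invoice" in text:
        return ["sql_query"]  # deliberately forgets the follow-up email step
    return []


def accuracy_by_difficulty(
    model: Callable[[str], List[str]], tasks: List[Dict[str, Any]]
) -> Dict[str, float]:
    """Share of tasks per difficulty where the predicted tool sequence
    exactly matches the expected one (an assumed scoring rule)."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for task in tasks:
        totals[task["difficulty"]] += 1
        if model(task["prompt"]) == task["expected"]:
            hits[task["difficulty"]] += 1
    return {level: hits[level] / totals[level] for level in totals}


print(accuracy_by_difficulty(toy_model, TASKS))  # {'simple': 1.0, 'complex': 0.0}
```

To try a real model, replace `toy_model` with a function that sends the prompt to it and parses the tool calls it emits. Exact sequence matching is strict by design; a looser rule (for example, also comparing arguments, or accepting equivalent orderings) may suit some domains better.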
