33 metrics to evaluate LLMs and AI agents

TL;DR: InfoWorld has compiled 33 metrics to evaluate LLMs and AI agents, from time to first token to error rates. Standardization allows model comparison and deployment optimization. Companies should select metrics based on their use case.

What happened?

InfoWorld has published a guide compiling 33 metrics for evaluating large language models (LLMs) and AI agents. Among them are time to first token (TTFT), tokens per second, error rate, precision, recall, F1, perplexity, BLEU, ROUGE, METEOR, and cost and latency metrics. The list includes both traditional NLP indicators and new measures adapted to agentic systems. This guide is not a closed benchmark but a reference framework that AI teams can adapt to their needs. The initiative responds to a growing demand for transparency and comparability in a market where models from hundreds of providers proliferate, from giants like OpenAI, Google, Anthropic, and Meta to startups like Mistral AI or Cohere.

Why is it important?

As companies integrate LLMs and agents into critical processes such as customer service, contract analysis, medical report generation, or algorithmic trading, standardized metrics become essential for comparing models, optimizing costs, and ensuring quality. Without them, selection and deployment decisions rely on subjective impressions or on narrow benchmarks like MMLU or HumanEval, which do not cover operational aspects such as latency or cost efficiency. This guide provides a broad framework covering everything from response speed to semantic coherence. Incidents such as a chatbot that hallucinated court precedents in a legal setting underscore the need to measure reliability. InfoWorld's guide arrives at a time when Gartner predicts that by 2026 more than 80% of enterprises will have used generative AI APIs or models, or deployed generative AI applications in production, up from less than 5% in 2023.

Key metrics analyzed

Speed and performance

Time to First Token (TTFT): critical for real-time applications. Users tend to abandon an interaction when the first response takes more than a few seconds, and even small delays matter: widely cited figures from Google and Amazon suggest that a 100 ms delay can reduce conversion by about 1%.
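Time per output token and tokens per second: measure the model's sustained generation speed once the first token has arrived, which matters for long outputs and batch processing. For example, GPT-4 generates roughly 20 tokens/second, while smaller models like Llama 3 8B can exceed 100 tokens/second on optimized hardware; actual figures vary with provider, load, and hardware.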
Throughput (requests per minute): relevant in multi-user systems, where pipeline efficiency can make a difference. Companies like Shopify process millions of AI requests daily, and insufficient throughput at that scale leads to queuing, timeouts, and degraded service.
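
The three speed metrics above can all be derived from timestamps recorded while a response streams in. The following is a minimal sketch, not a reference implementation: the StreamTiming structure and the function names are hypothetical, and it assumes the client records the arrival time of every output token.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamTiming:
    """Timestamps in seconds for one streamed response (hypothetical structure)."""
    request_sent: float
    token_times: List[float]  # arrival time of each output token, in order


def ttft(t: StreamTiming) -> float:
    """Time to first token: delay between sending the request and the first token."""
    return t.token_times[0] - t.request_sent


def time_per_output_token(t: StreamTiming) -> float:
    """Average gap between consecutive tokens, excluding the wait for the first one."""
    n = len(t.token_times)
    if n < 2:
        raise ValueError("need at least two tokens to measure inter-token time")
    return (t.token_times[-1] - t.token_times[0]) / (n - 1)


def tokens_per_second(t: StreamTiming) -> float:
    """Sustained generation speed: the inverse of time per output token."""
    return 1.0 / time_per_output_token(t)


def requests_per_minute(completed_requests: int, window_seconds: float) -> float:
    """Throughput over an observation window, normalized to one minute."""
    return completed_requests * 60.0 / window_seconds


# Illustrative numbers only: first token after 0.5 s, then one token every 0.25 s.
timing = StreamTiming(request_sent=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
print(ttft(timing))                    # 0.5
print(time_per_output_token(timing))   # 0.25
print(tokens_per_second(timing))       # 4.0
print(requests_per_minute(120, 30.0))  # 240.0
```

Keeping TTFT separate from time per output token matters because the two can move independently: a long prompt or a queued request inflates TTFT without changing how fast tokens arrive afterwards.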

Output quality

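Precision, recall, and F1: classic metrics for classification and information extraction tasks. For example, in a spam detection system, precision is the share of messages flagged as spam that really are spam (it falls as false positives rise), recall is the share of actual spam that gets flagged (it falls as false negatives rise), and F1 is the harmonic mean of the two.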
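Perplexity: measures how well the model predicts a sequence; lower values indicate better prediction. A perplexity of 10 means the model is, on average, as uncertain as if it were choosing among 10 equally likely tokens, compared with 50 options at a perplexity of 50. Lower perplexity does not always correlate with perceived quality, however.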
BLEU, ROUGE, METEOR: compare generated text with human references, useful for translation and summarization. BLEU is used mainly in machine translation; ROUGE in summarization; METEOR also accounts for synonyms and word order. However, these metrics have limitations: BLEU penalizes valid paraphrases and creative wording that differ from the reference, and ROUGE may favor literal extracts.
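
To make the first two quality metrics concrete, here is a small sketch using only the Python standard library. The function names and sample data are illustrative assumptions; labels use 1 for spam and 0 for not spam, and perplexity is computed from per-token natural-log probabilities, which is one common way model APIs expose them. BLEU, ROUGE, and METEOR are left out because they are normally taken from established implementations rather than written by hand.

```python
import math
from typing import List, Tuple


def precision_recall_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Binary precision, recall, and F1, where 1 is the positive class (e.g. spam)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


def perplexity(token_log_probs: List[float]) -> float:
    """exp of the average negative log-probability per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# 4 spam and 4 legitimate messages; the classifier misses one spam (false negative)
# and flags one legitimate message (false positive).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)

# A model that assigns probability 0.1 to every observed token has perplexity of about 10.
print(perplexity([math.log(0.1)] * 4))
```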

Robustness and bias

Error rate: includes incorrect answers, hallucinations, and system failures. A study by Vectara found that LLMs hallucinate between 3% and 27% of the time in summarization tasks, depending on the model.
Bias and toxicity metrics: evaluate whether the model reproduces stereotypes or harmful language. Tools like Perspective API score text for toxicity, while datasets like BBQ probe social biases such as racial, gender, and religious bias. OpenAI reportedly found that GPT-4 reduces biases by 60% compared to GPT-3.5, but they still persist.
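
Because the error rate lumps together several kinds of failure, it helps to report the breakdown alongside the total. The sketch below assumes each response has already been labeled, by reviewers or automated checks, with one outcome string; the label names are hypothetical and the sample counts are invented for illustration.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical outcome labels; anything not listed here (e.g. "ok") counts as a success.
FAILURE_LABELS = ("incorrect", "hallucination", "system_failure")


def error_breakdown(outcomes: List[str]) -> Dict[str, float]:
    """Per-category failure rates plus the overall error rate."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no outcomes to evaluate")
    counts = Counter(outcomes)
    rates = {label: counts[label] / total for label in FAILURE_LABELS}
    rates["error_rate"] = sum(counts[label] for label in FAILURE_LABELS) / total
    return rates


# Invented sample: 100 responses, 10 of which failed in some way.
outcomes = ["ok"] * 90 + ["incorrect"] * 4 + ["hallucination"] * 5 + ["system_failure"] * 1
print(error_breakdown(outcomes))
# {'incorrect': 0.04, 'hallucination': 0.05, 'system_failure': 0.01, 'error_rate': 0.1}
```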

Cost and efficiency

Cost per token: fundamental for choosing between proprietary and open-source models. For example, GPT-4 costs ~$0.03 per 1k input tokens, while Llama 3 70B on an API like Together.ai costs ~$0.001. For a company processing one billion tokens per month, the difference is $30,000 vs. $1,000.
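Memory and GPU usage: impact scalability and infrastructure cost. Models like Falcon 180B need several hundred gigabytes of memory (about 360 GB for the weights alone at 16-bit precision, since 180 billion parameters take 2 bytes each), implying multiple 80 GB A100 GPUs. Quantization to lower precision (INT8, INT4) reduces consumption but may affect accuracy.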
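
The arithmetic behind both items is simple enough to check directly. This sketch uses the per-token prices quoted above and a rough weights-only memory estimate; the function names are illustrative, and the memory figure deliberately ignores the KV cache, activations, and framework overhead, so real deployments need more than it reports.

```python
def monthly_cost_usd(tokens_per_month: int, price_per_1k_tokens: float) -> float:
    """Monthly spend given a volume in tokens and a price per 1,000 tokens."""
    return tokens_per_month / 1000 * price_per_1k_tokens


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


tokens = 1_000_000_000  # one billion tokens per month
print(monthly_cost_usd(tokens, 0.03))   # about 30,000 USD at $0.03 per 1k tokens
print(monthly_cost_usd(tokens, 0.001))  # about 1,000 USD at $0.001 per 1k tokens

# Weights-only estimate for a 180-billion-parameter model.
print(weight_memory_gb(180e9, 2))  # 360.0 GB at 16-bit (2 bytes per parameter)
print(weight_memory_gb(180e9, 1))  # 180.0 GB at 8-bit (1 byte per parameter)
```

Halving the bytes per parameter halves the weight memory, which is why quantization can change how many GPUs a model needs.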

Market implications

Standardizing metrics will facilitate objective comparison among providers like OpenAI, Anthropic, Google, and Meta. Startups developing evaluation tools (e.g., LangSmith, Weights & Biases, Arize AI) stand to benefit, as ML teams will need to integrate these metrics into their CI/CD pipelines for language models. Additionally, the guide encourages competition in areas like latency and cost, where open-source models can outperform proprietary ones. For instance, Mistral AI has shown that smaller, more efficient models can match GPT-3.5 on certain tasks at 80% lower cost. Cloud providers like AWS, Azure, and GCP will also be affected, as they are likely to offer managed services that include these metrics as part of their SLAs. In the regulatory sphere, the European Union, with its AI Act, could require bias and robustness metrics for high-risk systems, which would give frameworks like this guide regulatory relevance.

What readers should know

There is no single metric; the choice depends on the use case. For chatbots, prioritize TTFT and error rate; for content generation, BLEU/ROUGE and human evaluation; for financial systems, precision and cost per token. Additionally, metrics should be monitored in production, not just in the lab. InfoWorld's guide is a starting point, but each organization must adapt its dashboard of indicators. It is crucial not to fall into the trap of optimizing a single metric (Goodhart's law: when a measure becomes a target, it stops being a good measure): if only TTFT is optimized, quality is likely to be sacrificed. Finally, human evaluation remains irreplaceable for creative or sensitive tasks. The combination of automated metrics and periodic human review is the recommended practice. As InfoWorld notes, "you can't manage what you don't measure," but measuring correctly requires understanding what matters for each business.
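
One way to guard against single-metric optimization is to check several indicators together and report every one that misses its target. The following sketch is only an illustration of that idea: the metric names, directions, and thresholds are hypothetical values for a chatbot, not recommendations from the guide.

```python
from typing import Dict, List, Tuple

# Hypothetical chatbot targets: metric name -> (direction, threshold).
# "max" means the observed value must not exceed the threshold; "min" is the opposite.
CHATBOT_TARGETS: Dict[str, Tuple[str, float]] = {
    "ttft_seconds": ("max", 1.0),
    "error_rate": ("max", 0.05),
    "human_rating": ("min", 4.0),  # e.g. average reviewer score on a 1-5 scale
}


def failed_metrics(observed: Dict[str, float],
                   targets: Dict[str, Tuple[str, float]]) -> List[str]:
    """Names of all metrics that miss their target or were not measured at all."""
    failures = []
    for name, (direction, threshold) in targets.items():
        value = observed.get(name)
        if value is None:
            failures.append(name)  # a missing measurement counts as a failure
        elif direction == "max" and value > threshold:
            failures.append(name)
        elif direction == "min" and value < threshold:
            failures.append(name)
    return failures


# A deployment tuned only for speed: excellent TTFT, but quality has slipped.
fast_but_wrong = {"ttft_seconds": 0.3, "error_rate": 0.12, "human_rating": 3.1}
print(failed_metrics(fast_but_wrong, CHATBOT_TARGETS))  # ['error_rate', 'human_rating']
```

The same check can run on lab results before release and on production samples afterwards, with the human rating fed in from periodic reviews.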
