AI and cloud: the real race is operational

TL;DR: Artificial intelligence is entering a phase where operations are more critical than models. Failures due to capacity limits, GPU sprawl, and uncontrolled costs echo the early cloud days. Companies must adopt visibility, observability, cost efficiency, and capacity management to succeed.

Over the past two years, media coverage of artificial intelligence has centered on the model race: who has the largest, fastest, or best-scoring model on benchmarks. Companies like OpenAI, Google, and Anthropic have competed fiercely to launch ever more capable models, setting new records on tests like MMLU and HumanEval. However, as AI moves from pilots to the core of products and workflows, a familiar pattern from the early days of cloud computing is emerging: systems are more programmable than ever, but also much harder to run.

According to telemetry data from thousands of production systems collected by observability platforms like Datadog and New Relic, nearly 1 in 20 AI requests fails once applications reach scale. Most of those failures come from operational limits such as rate quotas, concurrency limits, and capacity, not from model errors or poor accuracy. Consulting firm Gartner documented the same phenomenon in 2024 reports, noting that 65% of outages in AI applications are due to infrastructure issues, not algorithmic failures. Token usage has doubled among average users and multiplied for intensive users, increasing costs and straining infrastructure. For example, in enterprise chatbot applications, the cost per query can range from $0.01 to $0.10, so at millions of daily queries, costs escalate quickly.
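To make the cost arithmetic concrete, here is a minimal sketch that multiplies the per-query range quoted above by a daily query volume. The volume of one million queries per day is an assumption chosen for illustration, and `daily_cost` is a hypothetical helper, not part of any tool named in this article.

```python
def daily_cost(queries_per_day: int, cost_per_query: float) -> float:
    """Return the projected daily spend in dollars."""
    return queries_per_day * cost_per_query


# Assumed volume, for illustration only; the per-query range comes from the text.
QUERIES_PER_DAY = 1_000_000

low = daily_cost(QUERIES_PER_DAY, 0.01)
high = daily_cost(QUERIES_PER_DAY, 0.10)

print(f"Daily spend: ${low:,.0f} to ${high:,.0f}")
print(f"Monthly spend (30 days): ${low * 30:,.0f} to ${high * 30:,.0f}")
```

Under these assumptions, the same workload costs roughly $10,000 or $100,000 a day depending on where it falls in the range, which is why per-query cost is worth tracking before traffic grows.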

Why is this important?

This shift in focus has profound implications for companies, startups, and the job market. GPU sprawl has become a real problem: fleets are fragmented across clouds and on-premises clusters, some GPUs sit underutilized while others are saturated, and there is no clear correlation between GPU hours and business value. This echoes the uncontrolled spending and unpredictability of the early cloud days, when companies like Netflix or Dropbox had to reinvent their operations to survive. In Asia-Pacific, especially ASEAN, AI adoption is accelerating but operational maturity is uneven. Singapore is advancing in governance and observability, while Indonesia, Malaysia, and Thailand are deploying rapidly in customer service without consolidated operational practices, accumulating operational and cost debt. According to a 2024 IDC study, AI spending in APAC will grow 25% annually through 2027, but more than 40% of implementations will fail to meet return on investment goals due to poor operational management. This gap between adoption and operability is critical: companies that do not control their costs and failures will lose ground to those that do.

Consequences for companies and users

Organizations that do not adopt the four key operational disciplines (visibility and attribution, observability, cost efficiency, and capacity management) will face service failures, runaway costs, and a loss of trust. For example, a generative AI startup that does not implement prompt caching may see its inference costs multiply tenfold, as has happened in some cases documented on developer forums. For users, this means that the quality of AI applications will increasingly depend on the underlying infrastructure rather than the model itself. A chatbot may have the best model in the world, but if latency is high or the system frequently goes down, the user experience will be poor. Startups competing in AI will need to prioritize platform engineering and cost optimization to survive. In the job market, roles like AI platform engineer, AI reliability engineer, and AI cost analyst are seeing growing demand, with salaries exceeding $150,000 annually in the United States, according to 2024 LinkedIn data. The shortage of professionals with these skills is a bottleneck for many companies.

The four operational disciplines

Visibility and attribution: You cannot operate what you cannot see. Track each request's usage, cost, and business impact. Tools like Helicone or LangSmith allow assigning costs to specific teams or products, avoiding billing surprises.
Observability: This goes beyond monitoring to understanding how the system behaves in production, including latency, error rates, and bottlenecks. Platforms like Datadog offer specialized dashboards for AI that show metrics such as token generation time and cache hit rates.
Cost efficiency: Techniques like prompt caching, context engineering, and model tuning can drastically reduce token and GPU spending. For example, using smaller, specialized models can reduce costs by up to 80% without sacrificing performance, as demonstrated by a 2024 Stanford study.
Capacity management: Plan the allocation of GPUs and other resources to avoid both saturation and underutilization, using auto-scaling and load-balancing policies. Companies like CoreWeave offer on-demand GPU solutions that allow dynamic scaling, but these still require good planning to avoid cost spikes.
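The first three disciplines can be illustrated together in a short sketch: attribute the cost of every request to a team, and report a cache hit rate as an observability metric. Everything here is hypothetical, including the `UsageLedger` class, the `fake_model` stand-in, and the per-token prices. It also uses an exact-match response cache, which is a simpler stand-in for the provider-side prompt caching mentioned above, not an implementation of it.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical prices in dollars per 1,000 tokens; real prices vary by provider and model.
PRICE_PER_1K_INPUT = 0.0005
PRICE_PER_1K_OUTPUT = 0.0015


@dataclass
class UsageLedger:
    """Tracks spend per team and cache effectiveness for one model endpoint."""
    cost_by_team: dict = field(default_factory=lambda: defaultdict(float))
    cache: dict = field(default_factory=dict)
    hits: int = 0
    misses: int = 0

    def complete(self, team: str, prompt: str, call_model) -> str:
        """Return a cached response if one exists; otherwise call the model and bill the team."""
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        text, input_tokens, output_tokens = call_model(prompt)
        self.cost_by_team[team] += (
            input_tokens / 1000 * PRICE_PER_1K_INPUT
            + output_tokens / 1000 * PRICE_PER_1K_OUTPUT
        )
        self.cache[key] = text
        return text

    def cache_hit_rate(self) -> float:
        """Fraction of requests served from the cache."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_model(prompt: str):
    """Stand-in for a real model call: returns (text, input_tokens, output_tokens)."""
    return f"echo: {prompt}", len(prompt.split()), 20


ledger = UsageLedger()
ledger.complete("support", "reset my password", fake_model)
ledger.complete("support", "reset my password", fake_model)  # served from cache, not billed
ledger.complete("sales", "quote for 50 seats", fake_model)

print(dict(ledger.cost_by_team))
print(f"cache hit rate: {ledger.cache_hit_rate():.2f}")
```

A production system would persist the ledger, expire cache entries, and read token counts from the provider's response rather than estimating them, but the shape is the same: every request carries an owner, a cost, and a cache outcome.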

What should readers know?

The model race is not over, but the real battlefield has shifted to operations. Companies that invest in robust AI platforms, with observability and cost optimization tools, will have a competitive advantage. IT professionals and developers must acquire skills in platform engineering, FinOps for AI, and infrastructure management, the skills behind the growing demand for AI platform engineers, AI reliability engineers, and AI cost analysts. According to a 2024 McKinsey report, companies that implement FinOps practices for AI reduce their inference costs by an average of 30-50%. Additionally, system reliability becomes a key differentiator: 99.9% availability can be the difference between retaining and losing customers in critical applications like medical diagnosis or algorithmic trading.
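An availability target is easier to reason about as a downtime budget. The sketch below converts a percentage into allowed minutes of downtime over a period. The function name and the 30-day window are illustrative choices, not taken from any report cited here.

```python
def downtime_budget_minutes(availability_pct: float, days: int) -> float:
    """Minutes of downtime allowed over `days` at the given availability target."""
    return (1 - availability_pct / 100) * days * 24 * 60


for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target, 30)
    print(f"{target}% availability: {budget:.1f} minutes of downtime per 30 days")
```

At 99.9%, the budget is about 43 minutes per 30 days, so a single prolonged capacity incident can consume the whole month's allowance.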

“AI is following the same path as the cloud: first the excitement, then the operational reality. Whoever masters operations will dominate AI.”

In summary, AI operations are the new battlefield. Companies that ignore this reality will be left behind, while those that adopt the four operational disciplines will not only survive but lead the next wave of innovation. The history of the cloud taught us that operational excellence is a growth enabler; now it is AI's turn.
