AI Cost As A First-Class Metric — Our Conversation With David Tepper, CEO And Cofounder Of Pay-i
As generative AI (genAI) moves from experimentation to enterprise-scale deployment, the conversation in most enterprises is shifting from “Can we use AI?” to “Are we using it wisely?” For AI leaders, managing cost is no longer a technical afterthought — it’s a strategic imperative. The economics of AI are uniquely volatile, shaped by dynamic usage patterns, evolving model architectures, and opaque pricing structures. Without a clear cost management strategy, organizations risk undermining the very ROI they seek to achieve.
Some AI enthusiasts may forge ahead with AI, favoring speed and innovation over cost accounting. They might argue that AI cost and even ROI remain hard to pin down. But the reality is that, to unlock sustainable value from genAI investments, leaders must treat cost as a first-class metric — on par with performance, accuracy, and innovation. So I took the case to David Tepper, CEO and cofounder of Pay-i, a provider in the AI and FinOps space, to get his take on AI cost management and what enterprise AI leaders need to know.
Michele Goetz: AI cost is a hot topic as enterprises deploy and scale new AI applications. Can you help them understand the way AI cost is calculated?
David Tepper: I see you’re starting things off with a loaded question! The short answer: It’s complex. Counting input and output tokens works fine when AI utilization consists of single request/response calls to one model with fixed pricing. But the calculation quickly grows in complexity when you’re using multiple models, vendors, and agents; models distributed across geographies; multiple modalities; prepurchased capacity; and enterprise discounts.
- GenAI use: GenAI applications often use a variety of tools, services, and supporting frameworks. They leverage multiple models from multiple providers, all with prices that are changing frequently. As soon as you start using genAI distributed globally, costs change independently by region and locale. Modalities other than text are usually priced completely separately. And the SDKs of major model providers typically don’t return enough information to calculate those prices correctly without engineering effort.
- Prepurchased capacity: Capacity prepurchased from a cloud hyperscaler (in Azure, a “Provisioned Throughput Unit”; in AWS, a “Model Unit of Provisioned Throughput”) or from a model provider (in OpenAI, “Reserved Capacity” or “Scale Units”) introduces a fixed cost for a certain number of tokens per minute and/or requests per minute. This can be the most cost-effective way to use genAI at scale. However, multiple applications may be drawing on the same prepurchased capacity simultaneously, all sending varied requests. Calculating the cost of a single request requires enterprises to separate that traffic and correctly amortize the fixed cost, as sketched in the example after this list.
- Prepurchased compute: You are typically purchasing compute capacity independent of the models you’re using. In other words, you’re paying for X amount of compute time per minute, and you can host different models on top of it. Each of those models will use different amounts of that compute, even if the token counts are identical.
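To make the amortization point concrete, here is a minimal Python sketch. The hourly capacity price, application names, and token counts are hypothetical placeholders; real provisioned-throughput SKUs, prices, and attribution rules vary by provider.

```python
# Minimal sketch: amortizing a fixed prepurchased-capacity cost across the
# applications that share it. All numbers are hypothetical placeholders.

def amortize_capacity_cost(hourly_capacity_cost: float,
                           tokens_by_app: dict[str, int]) -> dict[str, float]:
    """Split a fixed hourly capacity cost across apps by their share of tokens."""
    total_tokens = sum(tokens_by_app.values())
    if total_tokens == 0:
        return {app: 0.0 for app in tokens_by_app}
    return {app: hourly_capacity_cost * tokens / total_tokens
            for app, tokens in tokens_by_app.items()}

# Three applications sharing one provisioned-throughput purchase for an hour:
usage = {"support_bot": 600_000, "doc_summarizer": 300_000, "internal_copilot": 100_000}
print(amortize_capacity_cost(100.0, usage))
# {'support_bot': 60.0, 'doc_summarizer': 30.0, 'internal_copilot': 10.0}
```

In practice the attribution key might be requests, tokens, or a blend of both, and the fixed cost itself may change with commitment term and region.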
Michele Goetz: Pricing and packaging of AI models are transparent on foundation model vendors’ websites. Many even come with calculators. And AI platforms increasingly include cost tracking, model cost comparison, and forecasting to show AI spend by model. Is this enough for enterprises to plan out their AI spend?
David Tepper: Let’s imagine the following. You are part of an enterprise, and you went to one of these static pricing calculators on a model host’s website. Every API request in your organization was using exactly one model from exactly one provider, only using text, and only in a single locale. Ahead of time, you went to every engineer who would use genAI in the company and calculated every request using the mean number of input and output tokens, and the standard deviation from that mean. You’d probably get a pretty accurate cost estimation and forecast.
But we don’t live in that world. Someone wants to use a new model from a different provider. Later, an engineer in some department makes a tweak to the prompts to improve the quality of the responses. A different engineer in a different department wants to call the model several more times as part of a larger workflow. Another adds error handling and retry logic. The model provider updates the model snapshot, and now the typical number of consumed tokens changes. And so on …
GenAI and large language model (LLM) spend differs from traditional cloud spend not only because of variability at runtime but, more impactfully, because the models are extremely sensitive to change. Change a small part of an English-language sentence, and that update to the prompt can drastically change the unit economics of an entire product or feature offering.
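A small, hypothetical calculation illustrates that sensitivity. The token prices, prompt sizes, and retry rate below are made up for illustration; the point is how quickly a prompt tweak, an extra call in a workflow, and retry logic compound.

```python
# Hypothetical per-request cost under small workflow changes.
# Prices and token counts are placeholders, not any vendor's actual rates.

def cost_per_request(input_tokens: int, output_tokens: int,
                     price_in_per_1k: float, price_out_per_1k: float,
                     calls_per_workflow: int = 1, retry_rate: float = 0.0) -> float:
    per_call = (input_tokens / 1000) * price_in_per_1k \
             + (output_tokens / 1000) * price_out_per_1k
    return per_call * calls_per_workflow * (1 + retry_rate)

baseline = cost_per_request(800, 300, 0.005, 0.015)
# A longer prompt, one extra model call per workflow, and 10% retries:
revised = cost_per_request(1400, 300, 0.005, 0.015, calls_per_workflow=2, retry_rate=0.10)
print(f"baseline ${baseline:.4f}, revised ${revised:.4f}, {revised / baseline:.1f}x")
# baseline $0.0085, revised $0.0253, 3.0x
```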
Michele Goetz: New models coming to market, such as DeepSeek R1, promise cost reductions by using fewer resources and even running on CPUs rather than GPUs. Does that mean enterprises will see AI costs decrease?
David Tepper: There are a few things to tease out here. Pay-i has been tracking prices based on the parameter size of the models (not intelligence benchmarks) since 2022. The overall compute cost for inferencing LLMs of a fixed parameter size has been reducing at roughly 6.67% compounded monthly.
However, organizational spend on these models is rising at a far higher rate. Adoption is picking up and solutions are being deployed at scale. And the appetite for what these models can do, and the desire to leverage them for increasingly ambitious tasks, is also a key factor.
When ChatGPT was first released, GPT-3.5 had a maximum context of 4,096 tokens. The latest models are pushing context windows between 1 and 10 million tokens. So even if the price per token has gone down two orders of magnitude, many of today’s most compelling use cases are pushing larger and larger context, and thus the cost per request can even end up higher than it was a few years ago.
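A back-of-the-envelope comparison shows how a much cheaper token can still yield a more expensive request once the context balloons. The prices and request sizes below are purely illustrative.

```python
# Illustrative only: per-token price falls ~100x, but the request grows ~125x.

old_price_per_1k_tokens = 0.02     # hypothetical early-era price
new_price_per_1k_tokens = 0.0002   # two orders of magnitude cheaper (hypothetical)

old_request_tokens = 4_000         # near the old 4,096-token context ceiling
new_request_tokens = 500_000       # a long-context request stuffed with documents

print("old request:", old_request_tokens / 1000 * old_price_per_1k_tokens)  # $0.08
print("new request:", new_request_tokens / 1000 * new_price_per_1k_tokens)  # $0.10
```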
Michele Goetz: How should companies think about measuring the value they receive for their genAI investments? How do you think about measuring things like ROI or time saved by using an AI tool?
David Tepper: This is a burgeoning challenge, and there is no silver-bullet answer. For enterprises, leveraging these newfangled AI tools needs to be a means to a measurable end. A toothpaste company doesn’t get a bump just because it tacks “AI” on the side of the tube. However, many common enterprise practices can be greatly expedited and made more efficient through the use of AI, so there’s a real need for these companies to capture that efficiency.
Software companies may have the luxury of touting publicly that they’re using AI, and the market will reward them with “value.” But this is temporary, more an indication of market confidence that you are not being left behind by the times. Eventually, the spend-to-revenue ratio will need to make sense for software companies, too, but we’re not there yet.
Michele Goetz: Most enterprises are transitioning from AI POCs to pilots and MVPs in 2025. And some enterprises are ready to scale an AI pilot or MVP. What can enterprises expect as AI applications evolve and scale? Are there different approaches to manage AI cost over that journey?
David Tepper: The biggest new challenges that come with scale are throughput and availability. GPUs are in short supply and high demand these days, so if you’re scaling a solution that uses a lot of compute (either high tokens per minute or requests per minute), you will start to hit throttling limits. This is particularly true during burst traffic.
To understand the impact on cost for a single use case in a single geographic region, imagine you purchase reserved capacity that lets you resolve 100 requests per minute for $100 per hour. Most of the time, this capacity is sufficient. However, for a few hours per day, during peak usage, the number of requests per minute jumps up to 150. Your users begin to experience failures due to capacity, and so you need to purchase more capacity.
Let’s look at two examples of possible capacity SKUs. You can purchase spot capacity on an hourly basis for $500 per hour. Or you can purchase a monthly subscription up front that equates to another $100 per hour. Let’s say you math everything out, and spot capacity is cheaper. It’s more expensive per hour, but you don’t need it for that many hours per day after all.
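Sketching that math with hypothetical numbers makes the tradeoff visible. The four-hour peak window below is an assumption for illustration, not a figure from the interview.

```python
# Hypothetical capacity math for the scenario above.

peak_hours_per_day = 4                   # assumption for illustration
days_per_month = 30

spot_hourly = 500.0                      # spot capacity, paid only during peaks
subscription_hourly_equivalent = 100.0   # monthly commitment ~= $100/hour, 24/7

spot_monthly = spot_hourly * peak_hours_per_day * days_per_month              # $60,000
subscription_monthly = subscription_hourly_equivalent * 24 * days_per_month   # $72,000

print(f"spot: ${spot_monthly:,.0f}/month vs. subscription: ${subscription_monthly:,.0f}/month")
```

Under those assumptions, spot capacity wins even at five times the hourly rate, because you only pay for it during the peak.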
Then your primary capacity experiences an outage. It’s not you, it’s the provider. Happens all the time. Scrambling, you rapidly spin up more spot capacity at a huge cost, maybe even from a different provider. “Never again!” you tell yourself, and then you provision twice as much capacity as you need, from different sources, and load-balance between them. Now you no longer need the spot capacity to handle usage spikes; you’ll just spread it across your larger capacity pool.
At the end of the month, you realize that your costs have doubled (you doubled the capacity, after all), without anything changing on the product side. As growth continues, the ongoing calculus gets more complex and punishing. Outages hurt more. And capacity growth to accommodate surges needs to be done at a larger scale, with idle capacity cost increasing.
Companies I’ve spoken with that have large genAI compute requirements often can’t find enough capacity from a single provider in a given region, so they need to load-balance across several models from different sources — and manage prompts differently for each. The final costs are then highly dependent on many different runtime behaviors.
Michele Goetz: We are seeing the rise of AI agents and new reasoning models. How will this impact the future of AI cost, and what should enterprises do to prepare for these changes?
David Tepper: It is already true today that the “cost” of a genAI use case is not a number. It is a distribution, with likelihoods, expected values, and percentiles.
As agents gain “agency,” and start to increase their variability at runtime, this distribution widens. This becomes increasingly true when leveraging reasoning models. Forecasting the token utilization of an agent is akin to trying to forecast the amount of time a human will spend working on a novel problem.
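One way to operationalize “cost as a distribution” is to simulate it. The sketch below uses made-up step counts, token draws, and prices purely to show how expected values and percentiles fall out; it is not Pay-i’s methodology.

```python
# Minimal sketch: per-run agent cost as a distribution, not a single number.
# Step counts, token draws, and prices are made-up placeholders.

import random
import statistics

def simulate_agent_run_cost(price_per_1k_tokens: float = 0.005) -> float:
    steps = random.randint(2, 12)                        # model calls in one agent run
    tokens = sum(max(random.gauss(3_000, 1_000), 0) for _ in range(steps))
    return tokens / 1000 * price_per_1k_tokens

costs = sorted(simulate_agent_run_cost() for _ in range(10_000))
p50, p95 = costs[len(costs) // 2], costs[int(len(costs) * 0.95)]
print(f"expected ${statistics.mean(costs):.3f}, p50 ${p50:.3f}, p95 ${p95:.3f}")
```

Budgeting against the p95 rather than the mean is what keeps the widening distribution from turning into surprise invoices.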
Looking at agent work through that human lens, sometimes our delivery time can be predicted from our prior accomplishments. Sometimes things take unexpectedly more or less time. Sometimes you work for a while and come back with nothing — you hit a roadblock, but your employer still needs to cover your time. Sometimes you’re not available to solve a problem and someone else has to cover. Sometimes you finish the job poorly and it needs to be redone.
If the true promise of AI agents comes to fruition, then we’ll be dealing with many of the same HR and salary issues as we do today but at a pace and scale that the human workers of the world will need both tools and training to manage.
Michele Goetz: Are you saying AI agents are the new workforce? Is AI cost the new salary?
David Tepper: Yes and yes!
Stay tuned for Forrester’s framework for optimizing AI cost, publishing shortly.