The Next Phase Of AI Innovation: Inference Time Scaling
Where to place your bets on Huang's "third scaling law"
Happy Sunday and welcome to Investing in AI. Today I want to write about an idea that goes by many names in the AI community. You may hear it called “test time compute,” “test time scaling,” “inference time compute,” “inference time scaling,” or even something else. But whatever you call it, the picture below from NVIDIA’s last developer conference explains it best. For the purposes of this post, we will call it “inference time compute.”
At the conference, Jensen Huang called inference time compute the “third scaling law.” What he means is that when you want to make AI models, and LLMs in particular, better, there are three things you can do.
Pre-training - using more data and compute to generate a bigger smarter model
Post-training - tweaking a model for better contextual performance using fine-tuning or prompt engineering
Inference time compute - using algorithms to help the model “think” longer to generate better outcomes
Inference time compute is a family of algorithms, one that is growing all the time, with names like “chain of thought,” “tree of thought,” “best of N,” “beam search,” “diverse verifier tree search,” “lookahead search” and more. Below is a visual example of beam search.
Since LLMs operate in probabilities, you can think of beam search as exploring multiple paths through the LLM’s output space at the same time. At each step it extends every candidate path by one more token, then prunes back to the top few candidates (the beam width).
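The expand-then-prune loop above can be sketched in a few lines of Python. This is a minimal illustration, not a real decoder: the hand-coded probability table stands in for an LLM's next-token distribution, and the token names are made up for the example.

```python
import math

def next_token_probs(seq):
    # Toy stand-in for an LLM: given a sequence of tokens so far,
    # return candidate next tokens with their probabilities.
    table = {
        (): [("the", 0.6), ("a", 0.4)],
        ("the",): [("cat", 0.5), ("dog", 0.3), ("car", 0.2)],
        ("a",): [("dog", 0.7), ("cat", 0.3)],
    }
    # Any other sequence simply ends.
    return table.get(seq, [("<eos>", 1.0)])

def beam_search(beam_width=2, max_steps=3):
    # Each beam entry is (cumulative log-probability, token sequence).
    beams = [(0.0, ())]
    for _ in range(max_steps):
        candidates = []
        for logp, seq in beams:
            # Expand: extend each surviving path by one token.
            for tok, p in next_token_probs(seq):
                candidates.append((logp + math.log(p), seq + (tok,)))
        # Prune: keep only the top `beam_width` partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams
```

With a beam width of 1 this degenerates to greedy decoding; widening the beam spends more compute per output token in exchange for exploring more of the probability space.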
Why is inference time compute so interesting? Because as we build AI agents, and these agents have an array of different inputs, each input may require a different inference time compute algorithm to get the best output.
Inference time compute also allows you to trade inference time for cost at the same level of performance. As this post shows, a small language model with more inference time compute behind it can match or beat a much larger LLM.
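The simplest version of this tradeoff is best-of-N sampling: call the small model N times and keep the highest-scoring answer. The sketch below is purely illustrative; the sampler and scorer are hypothetical stubs standing in for an LLM call and a reward model or verifier.

```python
import random

def sample_answer(rng):
    # Hypothetical noisy small model: answers scattered around
    # the true value of 42.
    return 42 + rng.choice([-3, -1, 0, 0, 2, 5])

def score(answer):
    # Hypothetical scorer; higher is better. In practice this would
    # be a reward model or a verification step.
    return -abs(answer - 42)

def best_of_n(n, seed=0):
    # Spend more inference-time compute (larger n) to get a better
    # answer from the same small model.
    rng = random.Random(seed)
    samples = [sample_answer(rng) for _ in range(n)]
    return max(samples, key=score)
```

The knob here is n: each increment costs one more (cheap) model call, and the expected quality of the best sample only goes up, which is exactly the time-for-cost trade described above.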
There are also classes of problems that are hard to solve but whose solutions are easy to verify. Using something open source like Llama or Qwen to generate candidate solutions, and then OpenAI to verify them, can save time and money.
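Integer factoring is a classic example of this asymmetry, and it makes the solve-then-verify pattern easy to sketch. In this toy version both roles are plain functions; in the workflow described above, the solver would be a cheap open source model and the verifier a stronger model (or, where possible, a deterministic checker like this one).

```python
def cheap_factor(n):
    # The "hard" direction: search for a nontrivial factor.
    # This trial-division loop stands in for the cheap solver.
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return (d, n // d)
    return None  # no factor found (n is prime, or the solver failed)

def verify_factorization(n, pair):
    # The "easy" direction: one multiplication checks the answer,
    # regardless of how much work finding it took.
    return pair is not None and pair[0] * pair[1] == n
```

The point of the pattern is that verification is cheap enough to run on every candidate, so you only pay for the expensive component when the cheap solver actually produced something worth checking.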
From an investor perspective, the inference time compute world has a few interesting traits.
The world of possible algorithms is largely unexplored. More inference time compute approaches are being invented weekly.
The space should be very fragmented. There isn’t a winner-take-all algorithm, and I expect to see hundreds of algorithms in common usage that map to different AI workloads, prompts, and use cases.
The community of AI developers could outperform the foundation model labs at this task because, while testing a new training idea is expensive, testing a new inference time compute idea is not. It’s just a matter of creativity and experimentation.
“Inference time training” is predicted to be the next thing, where the feedback loops from your inference time compute setups are fed back into models to fine-tune or re-train them.
If you want to learn more, here is a good video about why this concept is so important and why it will have such a large impact on AI.
If you are an enterprise and want to try various inference time compute algorithms yourself on open source LLMs, at Neurometric we have a platform to make that really easy.
Thanks for reading.