Happy New Year’s Eve and welcome to Investing in AI. I know most newsletters will wrap up the year with their 2025 predictions for AI, and while I’ve done that in the past, today I want to do something different. I want to talk about what it takes to get to AGI, which is probably still a few years away.
I will warn you in advance that this post won’t make sense to some of you, as it requires some background that not everyone will have.
To get from where we are today to AGI, I am going to tie together 5 important ideas:
The trend towards mixture of experts models.
The idea of positional embeddings in transformers.
The explosion of AI hardware.
The idea of neural oscillation.
The ideas of symbolic recursion in AI from Douglas Hofstadter’s Gödel, Escher, Bach: An Eternal Golden Braid.
Before we go down that path, let me explain why I’ve never been a believer in LLM scaling laws and the belief that more data will get us all the way to AGI. It’s pretty simple. Human intelligence evolved before text data. It evolved before digital data of any type. And it did so in a world where all the other animals had access to similar input data from the world. That leads me to believe that intelligence is a computational issue more than a data one. I realize that isn’t the common point of view in the AI ecosystem at the moment, but I stand by it and think this point of view will eventually be vindicated.
The ideas above are important and I want to cover them one by one and explain how these principles, together, extrapolate to what I think will create AGI.
The Trend Towards MoE Models
Mixture of Experts models are a trend in foundation models that became popular about 18 months ago. The idea is that instead of training one giant model to do everything, you train several smaller models, or you fine-tune small sub-networks of a larger model to specialize in one thing. This makes training faster and easier.
At inference time, an input is either routed to the most appropriate model or, in some cases, routed to multiple models that “vote” on the best output. This concept is important because it acknowledges that the combined output of small specialized models may be better than that of one giant model. I believe that to get to AGI we must extrapolate this trend from “mixture of experts” to “mixture of architectures.” I’ll explain more after I cover the rest of the concepts in my initial list, but the toy sketch below shows the routing idea.
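To make the routing idea concrete, here is a minimal sketch in Python. Every name in it (the expert labels, the keyword router, the voting rule) is a hypothetical stand-in: real MoE systems use a small learned gating network inside the model, not hand-written rules.

```python
# Toy sketch of inference-time expert routing, not any production system.
# The "experts" are stand-in functions; a real router is a learned network.
from collections import Counter

EXPERTS = {
    "math":    lambda prompt: f"[math expert answer to: {prompt}]",
    "code":    lambda prompt: f"[code expert answer to: {prompt}]",
    "general": lambda prompt: f"[general expert answer to: {prompt}]",
}

def route(prompt: str) -> str:
    """Pick one expert per input (hand-written here purely for illustration)."""
    if any(word in prompt.lower() for word in ("sum", "integral", "solve")):
        return "math"
    if "def " in prompt or "code" in prompt.lower():
        return "code"
    return "general"

def answer(prompt: str) -> str:
    """Route to the single most appropriate expert."""
    return EXPERTS[route(prompt)](prompt)

def answer_by_vote(prompt: str) -> str:
    """Alternative: query all experts and take the majority output."""
    outputs = [expert(prompt) for expert in EXPERTS.values()]
    return Counter(outputs).most_common(1)[0][0]

print(answer("solve the integral of x^2"))
```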
The Idea of Positional Embeddings in Transformers
Positional embeddings are an interesting innovation that captures more hierarchical information about the underlying data inputs. In previously popular models like RNNs, words were fed into the model sequentially. While this captured word order implicitly, it couldn’t exploit position across the longer contexts that a transformer can with positional embeddings.
You may know that text is converted into tokens, which are used to make embeddings, which then become a series of matrices for processing in a neural network. Positional embeddings are an additional matrix that encodes the position of every token passed into the transformer model. The important thing for this discussion isn’t how this works; it’s that it’s an added level of meta information about the underlying data set. I think this meta information and the hierarchy it creates are important, underutilized concepts in creating AGI.
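For those who do want to see the mechanics, here is a minimal sketch of the sinusoidal positional embeddings from the original “Attention Is All You Need” transformer paper. The sequence length, model width, and random token embeddings below are made up for illustration.

```python
# Sinusoidal positional embeddings: an extra matrix, same shape as the token
# embeddings, where each row encodes a position in the sequence.
import numpy as np

def positional_embeddings(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

seq_len, d_model = 8, 16
token_embeddings = np.random.randn(seq_len, d_model)   # stand-in tokens
model_input = token_embeddings + positional_embeddings(seq_len, d_model)
print(model_input.shape)  # (8, 16): same data, now carrying position info
```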
In fact, I’ve seen several startups creating other meta embeddings to augment the data passed into transformer models and they have had really great success so far.
The Explosion of AI Hardware
It is important to remember that GPUs were not designed to run AI models. That was a hack someone figured out, and NVIDIA embraced it. If you were designing AI compute from the ground up, you wouldn’t choose a GPU. That’s why the last decade has seen an explosion in AI hardware chips from well-funded companies like Cerebras, Groq, SambaNova, Tenstorrent, Lightmatter, Graphcore, Femtosense, Mythic, Sagence, and more.
These companies’ chips all outperform GPUs on the workloads they target. They have raised more than $5B in venture capital but, as best I can tell, still have only about $500M in revenue between them all. New technology adoption moves slower the larger the conceptual jump in a new category, so I expect this to be slow. (It reminds me of watching the early days of NoSQL databases.)
As model performance gains slow and companies concentrate more on price and power tradeoffs for a given level of performance, I expect several of these AI chip companies to see strong adoption. I think different models and use cases will run on different chips, and we will move from Mixture-of-Experts models to Mixture-of-Architectures systems.
Why Neural Oscillation Needs A Compute Equivalent
Neural oscillation is a phenomenon in which large electrical waves pass through the brain. Their suspected function is to synchronize neural activity across different brain regions, allowing for coordinated processing of information. At the moment, there is no equivalent in our AI systems, but I believe that building AGI will require us to come up with something that mirrors neural oscillations.
Symbolic Recursion from Gödel, Escher, Bach
Some of you who are deep in AI have probably read the classic 1979 masterpiece Gödel, Escher, Bach: An Eternal Golden Braid. In the book, Douglas Hofstadter ties together Kurt Gödel’s incompleteness theorems, the art of M.C. Escher, and the music of Johann Sebastian Bach into ideas about recursion, self-reference, intelligence, and AI. The most recent wave of neural network AI has not been kind to the symbolic logic approach, but that is changing. In the past two years I’ve seen multiple companies combining the two ideas into various neuro-symbolic approaches. Sometimes they use neural nets to generate symbolic logic code; sometimes the systems are independent and are used together to get more accurate outputs and reduce hallucinations.
The key idea from the book, if I had to sum it up in one sentence, is this: intelligence is a symbolic system that contains, within itself, a symbol representing the entire system.
How This All Ties Together For AGI
Let me lay out now how I think we will get to AGI using these concepts. First, I think we move from mixture-of-experts models to mixture-of-architectures systems, where AI systems contain different models running on different hardware. A simple use case for why this may start to happen is customer support chatbot costs: a financial institution with different customers at different target support price points could find itself routing different requests to different models running on different hardware (a toy sketch of this follows below). As we develop more models to perform more tasks, the mixture-of-architectures approach will only grow, and it will require more systems-level AI design and software to manage it all.
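Here is a hedged sketch of that routing idea. Every tier, model name, chip family, and price below is hypothetical; the point is only that a routing decision can span both the model and the hardware it runs on.

```python
# Toy mixture-of-architectures router for a customer support chatbot.
# All backends and costs are invented for illustration.
from dataclasses import dataclass

@dataclass
class Backend:
    model: str              # which model handles the request
    hardware: str           # which chip family it runs on
    cost_per_query: float   # assumed dollars per query

TIERS = {
    "premium":  Backend("large-frontier-model",  "GPU cluster",      0.05),
    "standard": Backend("mid-size-model",        "inference ASIC",   0.005),
    "basic":    Backend("small-distilled-model", "edge accelerator", 0.0005),
}

def route_request(customer_tier: str, query: str) -> str:
    backend = TIERS[customer_tier]
    # A real system would dispatch the query over the network to the chosen
    # backend; here we just report the routing decision.
    return (f"Routing {customer_tier!r} query to {backend.model} "
            f"on {backend.hardware} (~${backend.cost_per_query}/query)")

print(route_request("basic", "What is my account balance?"))
```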
Once we live in a world where multiple models are running on multiple types of hardware, we are going to combine the remaining ideas above to provide a systems-level view of what is going on. The way I think this will work is that we will create system-level embeddings of what is happening in the models around the system, and use them for two things. One is to pass them to other models, so that model A knows what model B did one time step ago. The other is to unify them into an embedding of what is happening in the overall system at that point in time, and pass that system-level embedding to the rest of the models. This combines points 4 and 5 above: the system carries a representation of itself within itself (point 5), broadcast in a way that serves a similar purpose to a neural oscillation (point 4).
So imagine now that we have a system with five models labeled A, B, C, D, and E, each doing different things. If model A takes in human voice, it won’t process only the voice input. Additional inputs will include embeddings of what models B, C, D, and E did a millisecond ago, along with a state-of-the-system representation that model A will also digest. A toy version of this is sketched below.
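Here is that loop as a minimal sketch, assuming each model exposes a fixed-size state vector every tick. The vector width, the five model names, and the pooling choice (a simple mean) are all my own assumptions, not a known architecture.

```python
# Toy "system-level embedding": pool every model's last state into one vector
# and feed it, along with peer states, back into each model's next input.
import numpy as np

D = 32  # assumed width of each model's state embedding
models = ["A", "B", "C", "D", "E"]

# States each model emitted one time step ago (random stand-ins here).
previous_states = {name: np.random.randn(D) for name in models}

def system_embedding(states):
    # One vector summarizing the whole system, the rough analogue of a
    # neural oscillation synchronizing activity across brain regions.
    return np.mean(list(states.values()), axis=0)

def build_input_for(model_name, own_input):
    # Model A's input = its raw input (e.g. voice features) + the states of
    # B..E from the last tick + the unified state-of-the-system embedding.
    peer_states = [s for name, s in previous_states.items() if name != model_name]
    return np.concatenate([own_input, *peer_states,
                           system_embedding(previous_states)])

voice_features = np.random.randn(D)                # stand-in raw input for A
print(build_input_for("A", voice_features).shape)  # (192,) = 6 blocks of 32
```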
This approach provides more context, because a model no longer takes just its own normal input; it takes a bunch of system-level contextual input as well.
I know this post is a little complicated. In my head it all makes sense and seems like a logical flow, but I don’t expect everyone who reads it to feel that way. If you have questions, or if areas of this idea need more clarification, please reach out. And as always, thanks for reading.