The term “LLM” is a misnomer.
Sometime last year, I noticed AI-adjacent (or “AI curious”) folks using the term “LLM” in odd ways:
Here’s Tyler Cowen asking Reid Hoffman if LLMs could be applied to recordings of dolphin sounds to decipher their language. A non-standard use (to mean unsupervised machine translation), but generally in the domain of language; okay.
Here’s Dwarkesh asking Demis Hassabis — “AlphaGo … was a pretty expensive system because you … had to run an LLM on each node of the tree”. An LLM? It was a ResNet processing board images.
I suspect some people just mean “a (modern) AI system” when they say “LLM”. But they may be onto something — the term “Large Language Model” is indeed a misnomer.
In the technical sense, a language model is a probabilistic model p(x_1, …, x_T) over sequences of language, which allows all the nice things probabilistic models let one do (sample, marginalize, condition): auto-complete sequences and express uncertainty over those completions.
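In standard notation, those operations fall straight out of the joint distribution: conditioning on a prefix gives a distribution over completions, and marginalizing gives the probability of the prefix itself.

$$
p(x_{t+1}, \dots, x_T \mid x_1, \dots, x_t) = \frac{p(x_1, \dots, x_T)}{p(x_1, \dots, x_t)},
\qquad
p(x_1, \dots, x_t) = \sum_{x_{t+1}, \dots, x_T} p(x_1, \dots, x_T).
$$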
However, the term “LLM” has become a shorthand for auto-regressive symbol-sequence models, built with transformers, trained with SGD and self-supervised learning (yes, quite a mouthful, hence the natural human tendency to use a shorthand).
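“Auto-regressive” just means the joint factorizes as p(x_1, …, x_T) = ∏_t p(x_t | x_{<t}), and generation samples each factor in turn. For concreteness, here is a toy sketch of such a symbol-sequence model, with a character-level bigram count table standing in for the transformer; the corpus and helper names are made up for illustration, and nothing here is specific to natural language.

```python
# Toy autoregressive symbol-sequence model: a character-level bigram model
# fit by counting, then sampled one symbol at a time. Real LLMs replace the
# count table with a transformer, but the factorization is the same.
import random
from collections import Counter, defaultdict

corpus = "the cat sat on the mat. the dog sat on the log."

# "Train": estimate p(next_char | current_char) from bigram counts.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def sample_next(prev: str) -> str:
    """Sample one symbol from the estimated p(x_t | x_{t-1})."""
    chars, weights = zip(*counts[prev].items())
    return random.choices(chars, weights=weights)[0]

def complete(prefix: str, length: int = 40) -> str:
    """Auto-complete a prefix by sampling autoregressively, one symbol at a time."""
    out = list(prefix)
    for _ in range(length):
        out.append(sample_next(out[-1]))
    return "".join(out)

print(complete("the c"))
```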
The “language” part in LLMs, particularly the “natural language” part, appears to be purely incidental. LLMs were discovered by the NLP community because that domain had the right ecological conditions for researchers to stumble into this configuration.
But they are a kind of “universal” learner (where “universal” today is restricted to symbol sequences). And they are likely pointing to an underlying law of nature about the learnability of symbol sequences generated by nature.
People have suggested alternative names — “Autoregressive transformers” (Andrej Karpathy) and “Neural Sequence Models” (Richard Socher).
I like “Symbol Sequence Models” (SSMs), but the more important question is — what kinds of sequences are learnable?
My read of the situation today is that we’re seeing success in domains with discrete or symbolic sequences (language, code, proteins) and not over continuous variables in high dimensions (pixels, low-level motor control). Specific thoughts on scaling pre-training in vision here. Note: this isn’t an argument about a ceiling or an impossibility claim; it’s simply an observation about the status quo.
Will this continue to be the case? Topic for a longer post.