Natural Language Processing — Core Concepts

From tokenization to transformers, here's how NLP actually works — and why 'just teach it grammar' turned out to be completely the wrong approach.

The Wrong Idea That Dominated NLP for 30 Years

When researchers first tackled language in the 1950s, the plan seemed obvious: write out the rules. Codify grammar. Build a dictionary. Chain it all together.

It failed. Spectacularly and repeatedly.

Not because the rules were wrong — they were mostly right. But language is an iceberg. The rules describe the visible tip. Underneath is a churning mass of idiom, ambiguity, implication, and cultural context that no rule system can fully capture. “Break a leg” isn’t an instruction. “I could eat a horse” isn’t a diet update. “Wicked good” means excellent in Boston.

By the 1990s, the field largely pivoted: forget the rules, feed it data. That shift — from hand-coded linguistics to statistical learning — is what eventually produced the NLP systems that run your phone, your inbox, and your search engine today.

What NLP Actually Tries to Do

NLP covers a surprisingly wide range of tasks. They all involve turning unstructured human language into something a computer can act on:

Task	What It Is	Real Example
Tokenization	Split text into units (words, subwords)	“isn’t” → [“is”, “n’t”]
Named Entity Recognition	Identify people, places, companies	“Apple” → [Company]
Sentiment Analysis	Positive/negative/neutral tone	Amazon reviews → star prediction
Machine Translation	Language A → Language B	DeepL, Google Translate
Text Summarization	Long → Short, meaning preserved	News briefings, TL;DRs
Question Answering	Answer questions from a document	Siri, Alexa, ChatGPT
Speech Recognition	Audio → text	Whisper, Google’s voice search

Each of these was once a largely separate subfield with its own specialized tools. Since about 2018, large language models have become good enough to handle most of them with a single model, sometimes better than specialized systems that were purpose-built for one task.

How the Core Pipeline Works

A modern NLP pipeline moves through several stages before producing an output.

1. Preprocessing: Raw text is messy. Lowercase everything. Strip HTML. Handle contractions. Decide whether to keep punctuation (it matters in some tasks). A preprocessing step turns “I LOVE this product!!!!” into something consistent before analysis.
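The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production cleaner: the function name `preprocess`, the `keep_punctuation` flag, and the tiny `CONTRACTIONS` table are all assumptions made up for this example.

```python
import html
import re

# Hypothetical contraction table; a real one would be much longer.
CONTRACTIONS = {"isn't": "is not", "don't": "do not", "i'm": "i am"}

def preprocess(text: str, keep_punctuation: bool = False) -> str:
    """Lowercase, strip HTML, expand contractions, optionally drop punctuation."""
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags
    text = html.unescape(text)                  # decode entities such as &amp;
    text = text.lower()
    text = text.replace("\u2019", "'")          # normalise curly apostrophes
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"([!?.])\1+", r"\1", text)   # collapse "!!!!" into "!"
    if not keep_punctuation:
        text = re.sub(r"[^\w\s]", " ", text)    # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()    # tidy up whitespace

print(preprocess("<b>I LOVE this product!!!!</b>"))
print(preprocess("<b>I LOVE this product!!!!</b>", keep_punctuation=True))
```

Whether to pass `keep_punctuation=True` depends on the task: for sentiment, a trailing “!” can carry signal that is worth keeping.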

2. Tokenization: Computers don’t read words; they read tokens. Modern systems often use subword tokenization, which means “unbelievable” might become [“un”, “believ”, “able”]. This helps with rare words, typos, and multilingual text. GPT-4’s tokenizer has a vocabulary of roughly 100,000 distinct tokens.
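To make the subword idea concrete, here is a toy tokenizer that splits a word by greedily matching the longest known piece. The vocabulary is hand-picked for illustration; real tokenizers learn theirs from data (for example with byte-pair encoding), so treat this as a sketch of the splitting step only.

```python
# Toy vocabulary, hand-picked for this example; real vocabularies are learned.
VOCAB = {"un", "believ", "able", "token", "ization", "s"}

def subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match-first split; unknown spans fall back to characters."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary
        # or only a single character is left.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        tokens.append(word[start:end])
        start = end
    return tokens

print(subword_tokenize("unbelievable", VOCAB))
print(subword_tokenize("tokens", VOCAB))
print(subword_tokenize("unbelievably", VOCAB))  # the unknown ending is split into characters
```

The character fallback is why subword systems cope with rare words and typos: nothing is ever fully out of vocabulary, it just costs more tokens.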

3. Embedding (turning words into numbers): This is the trick that made modern NLP possible. Every token gets mapped to a vector (a list of numbers) where similar meanings end up close together in mathematical space. “King” and “Queen” are near each other. “King” minus “Man” plus “Woman” ends up close to “Queen.”

This isn’t programmed in. It emerges from training on massive amounts of text. That’s the insight that caused a quiet revolution when Google published Word2Vec in 2013.
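The “King minus Man plus Woman” arithmetic can be reproduced with toy vectors. The three-dimensional embeddings below are invented by hand so the example works; real embeddings are learned from text and have hundreds of dimensions, and the analogy only holds approximately.

```python
import math

# Hand-made vectors (roughly: royalty, maleness, femaleness). Not learned.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(vector, exclude=()):
    """Return the word whose embedding is most similar to `vector`."""
    candidates = {w: v for w, v in EMBEDDINGS.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(vector, candidates[w]))

# king - man + woman, computed component by component
target = [k - m + w for k, m, w in zip(
    EMBEDDINGS["king"], EMBEDDINGS["man"], EMBEDDINGS["woman"])]

print(nearest(target, exclude={"king", "man", "woman"}))
```

The input words are excluded from the search, which is the usual convention for analogy lookups; otherwise the nearest vector is often one of the inputs.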

4. Model Processing: The embedded sequence passes through the model, usually a transformer these days, which builds a representation of what the text means. Crucially, this step handles context: the word “bank” is represented differently in “river bank” vs “investment bank” because the surrounding words shape the representation the model builds for it.

5. Output Layer: Depending on the task, the model produces a classification label, a generated token sequence, a number, a bounding box in a document, and so on.
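For a classification task, the output step commonly turns the model’s raw scores into probabilities with a softmax and picks the most likely label. The sketch below assumes a three-way sentiment classifier; the label names and the score values are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

LABELS = ["negative", "neutral", "positive"]  # hypothetical label set
logits = [-1.2, 0.3, 2.1]                     # made-up raw scores from a final layer

probs = softmax(logits)
predicted = LABELS[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

A generative model uses the same idea at a larger scale: the softmax runs over the whole token vocabulary, and the next token is chosen from that distribution.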

The Misconception That Gets Everyone

Most people assume sentiment analysis means checking for positive and negative words. “Love” = good, “hate” = bad, add them up, done.

This breaks immediately in the real world. “This movie isn’t terrible” is positive. “The food was amazing but the service ruined it” is mixed — which rating does it get? “Saved by the amazing soundtrack” is damning the movie while praising part of it.

Real sentiment analysis uses context, negation handling, and aspect-based analysis (how does the reviewer feel about each specific thing they mentioned?). Getting this right is why Amazon, Yelp, and Google still invest heavily in NLP research.
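The gap between word counting and context can be shown with a small experiment. The lexicon, the negator list, and the two-word negation window below are all assumptions chosen for this example; real systems learn these patterns rather than hard-coding them.

```python
import re

# Tiny hypothetical lexicon and negator list.
LEXICON = {"love": 1, "amazing": 1, "good": 1,
           "hate": -1, "terrible": -1, "ruined": -1}
NEGATORS = {"not", "isn't", "wasn't", "never", "no"}

def words(text):
    """Lowercase the text and split it into words, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower().replace("\u2019", "'"))

def naive_score(text):
    """Add up word polarities and ignore all context."""
    return sum(LEXICON.get(w, 0) for w in words(text))

def negation_aware_score(text, window=2):
    """Flip the polarity of words that closely follow a negator."""
    score = 0
    countdown = 0  # how many upcoming words are still in a negator's scope
    for w in words(text):
        if w in NEGATORS:
            countdown = window
            continue
        polarity = LEXICON.get(w, 0)
        score += -polarity if countdown > 0 else polarity
        countdown = max(0, countdown - 1)
    return score

review = "This movie isn't terrible"
print(naive_score(review), negation_aware_score(review))

mixed = "The food was amazing but the service ruined it"
print(naive_score(mixed))
```

The naive scorer rates the first review as negative, while the negation-aware one rates it as positive. The mixed review sums to zero under both, which hides the fact that the food and the service got opposite verdicts; that is the gap aspect-based analysis is meant to close.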

The Turning Point: Transformers (2017)

Until 2017, the dominant NLP models (recurrent networks) processed text sequentially, one token after another. Long sentences were a problem because context from the beginning of the sentence got diluted by the end.

A Google paper called Attention Is All You Need proposed a different architecture: transformers. Instead of reading sequentially, transformers look at every word in relation to every other word simultaneously. The word “it” in “The animal didn’t cross the road because it was too tired” — does “it” refer to the animal or the road? A transformer can figure that out by checking both candidates at once.
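The “every word against every other word” step is scaled dot-product attention. The sketch below is a stripped-down version in plain Python: it uses the input vectors directly as queries, keys, and values, whereas a real transformer first applies learned projections and runs many attention heads. The two-dimensional vectors are hand-picked so that “it” lines up with “animal”; a trained model has to learn that.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product attention with queries = keys = values = inputs."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Score this token against every token (itself included) in one pass.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        all_weights.append(weights)
        # The output is a weighted average of all value vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors))
                        for i in range(d)])
    return outputs, all_weights

tokens = ["animal", "road", "it"]
vectors = [[2.0, 0.0], [0.0, 2.0], [1.5, 0.5]]  # hand-made, not learned

outputs, weights = self_attention(vectors)
it_weights = dict(zip(tokens, weights[2]))
print({t: round(w, 3) for t, w in it_weights.items()})
```

With these vectors, “it” puts more weight on “animal” than on “road”. Because each row of scores is independent of the others, all rows can be computed in parallel, which is a large part of why the architecture scales so well.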

Transformers scaled dramatically better than previous architectures. Within two years, BERT, GPT-2, and XLNet had shattered benchmarks that had stood for years. By 2020, GPT-3 demonstrated that a language model trained on enough text could do tasks nobody explicitly trained it for — translation, code generation, logic puzzles — just by learning language patterns deeply enough.

Common NLP Failure Modes

Distributional shift — a model trained on Wikipedia-style text struggles with Twitter slang or medical notes
Ambiguity — “I saw her duck” can mean “I saw the duck she owns” or “I watched her crouch”; the words are identical, so only context can tell the two readings apart
Hallucination — generative models will confidently produce fluent text that contains fabricated facts
Bias amplification — if the training corpus contains biased language, the model learns and often amplifies it
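Distributional shift, the first failure mode in the list above, can be spotted with a crude check: the share of words in new text that never appeared in the training text. The helper names and the three sample sentences below are invented for this sketch, and an out-of-vocabulary rate is only one rough signal; shift can also show up in text whose individual words are all familiar.

```python
import re

def tokenize(text):
    """Lowercase word tokenizer, good enough for this illustration."""
    return re.findall(r"[a-z0-9']+", text.lower())

def oov_rate(train_text, new_text):
    """Fraction of tokens in new_text that never occur in train_text."""
    vocab = set(tokenize(train_text))
    new_tokens = tokenize(new_text)
    if not new_tokens:
        return 0.0
    unseen = [t for t in new_tokens if t not in vocab]
    return len(unseen) / len(new_tokens)

# Made-up sample sentences.
train = "The film received positive reviews from critics and was a commercial success."
in_domain = "The film was a success and critics were positive."
shifted = "ngl this movie slaps fr, no cap"

print(round(oov_rate(train, in_domain), 2))
print(round(oov_rate(train, shifted), 2))
```

The slang sentence shares no words with the training sentence, while the in-domain one shares almost all of them.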

One Thing to Remember

The field of NLP did a 180: it abandoned hand-coded rules for statistical pattern learning. That shift — combined with the transformer architecture in 2017 — is why language AI went from “barely functional” to “passes the bar exam” in under a decade.
