Natural Language Processing — Core Concepts

The Wrong Idea That Dominated NLP for 30 Years

When researchers first tackled language in the 1950s, the plan seemed obvious: write out the rules. Codify grammar. Build a dictionary. Chain it all together.

It failed. Spectacularly and repeatedly.

Not because the rules were wrong — they were mostly right. But language is an iceberg. The rules describe the visible tip. Underneath is a churning mass of idiom, ambiguity, implication, and cultural context that no rule system can fully capture. “Break a leg” isn’t an instruction. “I could eat a horse” isn’t a diet update. “Wicked good” means excellent in Boston.

By the 1990s, the field largely pivoted: forget the rules, feed it data. That shift — from hand-coded linguistics to statistical learning — is what eventually produced the NLP systems that run your phone, your inbox, and your search engine today.

What NLP Actually Tries to Do

NLP covers a surprisingly wide range of tasks. They all involve turning unstructured human language into something a computer can act on:

Task                     | What It Is                              | Real Example
-------------------------|-----------------------------------------|---------------------------------
Tokenization             | Split text into units (words, subwords) | “isn’t” → [“is”, “n’t”]
Named Entity Recognition | Identify people, places, companies      | “Apple” → [Company]
Sentiment Analysis       | Positive/negative/neutral tone          | Amazon reviews → star prediction
Machine Translation      | Language A → Language B                 | DeepL, Google Translate
Text Summarization       | Long → short, meaning preserved         | News briefings, TL;DRs
Question Answering       | Answer questions from a document        | Siri, Alexa, ChatGPT
Speech Recognition       | Audio → text                            | Whisper, Google’s voice search

Each of these was once a separate field with its own specialized tools. Since about 2018, large pretrained language models have become good enough to handle most of them with a single model, sometimes outperforming systems that were purpose-built for one task.

How the Core Pipeline Works

A modern NLP pipeline moves through several stages before producing an output.

1. Preprocessing. Raw text is messy. Typical cleanup steps: lowercase the text (unless case carries meaning, as it can for names), strip HTML, handle contractions, and decide whether to keep punctuation (it matters in some tasks). A preprocessing step turns “I LOVE this product!!!!” into something consistent before analysis.
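
As a rough illustration of this step, here is a minimal Python sketch. The function name `preprocess` and the specific choices (regex-based tag stripping, collapsing repeated punctuation) are assumptions made for the example, not a standard recipe; real pipelines tune these decisions per task.

```python
import html
import re

def preprocess(text: str) -> str:
    """Normalize raw text: drop HTML tags, unescape entities, lowercase, tidy up."""
    text = re.sub(r"<[^>]+>", " ", text)        # replace HTML tags with a space
    text = html.unescape(text)                  # "&amp;" becomes "&"
    text = text.lower()                         # assumption: case is not needed
    text = re.sub(r"([!?.])\1+", r"\1", text)   # "!!!!" becomes "!"
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.strip()

print(preprocess("<p>I LOVE this product!!!!</p>"))
```

Under these assumptions, the shouted review is reduced to “i love this product!”, which is easier for later stages to treat consistently.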

2. Tokenization. Computers don’t read words; they read tokens. Modern systems often use subword tokenization, which means “unbelievable” might become [“un”, “believ”, “able”]. This helps with rare words, typos, and multilingual text. GPT-4’s tokenizer, for example, has a vocabulary of roughly 100,000 distinct tokens.
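
To make the subword idea concrete, here is a toy sketch that splits a word by greedily taking the longest piece found in a vocabulary. The vocabulary `TOY_VOCAB` is hand-written and hypothetical; production tokenizers (byte-pair encoding, for instance) learn their vocabulary from data and differ in detail.

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match-first split of a word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it appears in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical vocabulary, chosen only for this illustration.
TOY_VOCAB = {"un", "believ", "able", "break"}

print(subword_tokenize("unbelievable", TOY_VOCAB))  # ['un', 'believ', 'able']
print(subword_tokenize("unbreakable", TOY_VOCAB))   # ['un', 'break', 'able']
```

Note how “unbreakable” is handled with pieces the vocabulary already has, even though the full word was never listed. That reuse is what makes subword schemes robust to rare words.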

3. Embedding (turning words into numbers). This is the trick that made modern NLP possible. Every token gets mapped to a vector (a list of numbers) where similar meanings end up close together in mathematical space. “King” and “Queen” are near each other. “King” minus “Man” plus “Woman” ends up close to “Queen.”

This isn’t programmed in. It emerges from training on massive amounts of text. That’s the insight that caused a quiet revolution when Google published Word2Vec in 2013.
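
The arithmetic can be sketched with a handful of made-up vectors. Everything in `TOY_VECTORS` is hand-picked for illustration (three dimensions loosely standing for royalty, gender, and "fruitness"); learned embeddings have hundreds of dimensions and no such tidy labels.

```python
import math

# Hypothetical 3-D embeddings, invented for this example only.
TOY_VECTORS = {
    "king":  [0.9,  0.3, 0.1],
    "queen": [0.9, -0.3, 0.1],
    "man":   [0.1,  0.3, 0.0],
    "woman": [0.1, -0.3, 0.0],
    "apple": [0.0,  0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity: close to 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman", TOY_VECTORS))  # queen
```

Excluding the three query words from the candidates mirrors common practice when evaluating analogies; without it, the nearest vector is often one of the inputs.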

4. Model Processing. The embedded sequence passes through the model, usually a transformer these days, which builds a representation of what the text means. Crucially, this step handles context: the word “bank” is represented differently in “river bank” vs. “investment bank” because the surrounding words shape its vector.

5. Output Layer. Depending on the task, the model produces a classification label, a generated token sequence, a number, a bounding box in a document, and so on.
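
For the classification case, the output layer can be sketched as a softmax over raw scores followed by picking the most probable label. The label set `LABELS` and the example scores are assumptions for illustration; a real model would produce the scores from the processed sequence.

```python
import math

# Hypothetical label set for a sentiment classifier.
LABELS = ["negative", "neutral", "positive"]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[best], probs[best]

label, prob = classify([0.2, 0.5, 2.3])          # made-up scores
print(label, round(prob, 2))
```

Generation tasks use the same idea over the whole token vocabulary instead of three labels, choosing one token at a time.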

The Misconception That Gets Everyone

Most people assume sentiment analysis means checking for positive and negative words. “Love” = good, “hate” = bad, add them up, done.

This breaks immediately in the real world. “This movie isn’t terrible” is positive. “The food was amazing but the service ruined it” is mixed — which rating does it get? “Saved by the amazing soundtrack” is damning the movie while praising part of it.

Real sentiment analysis uses context, negation handling, and aspect-based analysis (how does the reviewer feel about each specific thing they mentioned?). Getting this right is why Amazon, Yelp, and Google still invest heavily in NLP research.
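
The gap between word counting and even crude negation handling shows up in a few lines of Python. The word lists and both scoring functions below are hypothetical and deliberately tiny; production systems learn these patterns from data rather than relying on hand-written lists.

```python
# Hypothetical mini-lexicons for illustration only.
POSITIVE = {"love", "amazing", "great"}
NEGATIVE = {"hate", "terrible", "awful"}
NEGATORS = {"not", "isn't", "never", "no"}

def tokenize(text):
    """Lowercase, normalize curly apostrophes, drop basic punctuation, split."""
    text = text.lower().replace("\u2019", "'")
    for mark in ".,!?":
        text = text.replace(mark, "")
    return text.split()

def naive_score(text):
    """Count positive words minus negative words, ignoring context."""
    return sum(int(w in POSITIVE) - int(w in NEGATIVE) for w in tokenize(text))

def negation_aware_score(text):
    """Same count, but a negator flips the next sentiment word it precedes."""
    score = 0
    flip = False
    for w in tokenize(text):
        if w in NEGATORS:
            flip = True
            continue
        polarity = int(w in POSITIVE) - int(w in NEGATIVE)
        if polarity != 0:
            score += -polarity if flip else polarity
            flip = False
    return score

review = "This movie isn't terrible"
print(naive_score(review))           # -1: counted as negative
print(negation_aware_score(review))  # 1: negation flips "terrible"
```

Even the improved version gives a single number for the whole sentence, so it still cannot represent a mixed review like the food-versus-service example. That is the problem aspect-based analysis addresses.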

The Turning Point: Transformers (2017)

Until 2017, the dominant NLP models were recurrent networks that processed text sequentially, one word at a time. Long sentences were a problem because context from the beginning of the sentence was diluted by the end.

A 2017 Google paper titled “Attention Is All You Need” proposed a different architecture: the transformer. Instead of reading sequentially, a transformer relates every word to every other word in the sequence at once. Take “The animal didn’t cross the road because it was too tired”: does “it” refer to the animal or the road? A transformer can weigh both candidates at the same time and lean toward the one that fits the context.
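
The "every word against every other word" step is scaled dot-product attention. Below is a stripped-down sketch in plain Python. It assumes queries, keys, and values are all the raw input vectors (a real transformer applies learned projections first), and the 2-D vectors in `X` are invented so that “it” resembles “animal”.

```python
import math

def softmax(xs):
    """Turn scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned weights)."""
    d = len(X[0])
    weights = []
    for q in X:
        # Score this token against every token, scaled by sqrt(dimension).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights.append(softmax(scores))
    # Each output is a weighted average of all input vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, X)) for j in range(d)]
        for row in weights
    ]
    return weights, outputs

# Hypothetical token vectors, hand-picked for this example.
TOKENS = ["animal", "road", "it"]
X = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

weights, outputs = self_attention(X)
for token, w in zip(TOKENS, weights[2]):
    print(f"'it' attends to {token!r} with weight {w:.2f}")
```

With these vectors, “it” puts more weight on “animal” than on “road”. In a trained model that preference comes from learned parameters rather than from hand-chosen numbers.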

Transformers scaled dramatically better than previous architectures. Within two years, BERT, GPT-2, and XLNet had shattered benchmarks that had stood for years. By 2020, GPT-3 demonstrated that a language model trained on enough text could do tasks nobody explicitly trained it for — translation, code generation, logic puzzles — just by learning language patterns deeply enough.

Common NLP Failure Modes

  • Distributional shift: a model trained on Wikipedia-style text struggles with Twitter slang or medical notes
  • Ambiguity: “I saw her duck” can mean “I saw the duck she owns” or “I watched her crouch,” and the words alone don’t say which
  • Hallucination: generative models can confidently produce fluent text that contains fabricated facts
  • Bias amplification: if the training corpus contains biased language, the model learns and often amplifies it

One Thing to Remember

The field of NLP did a 180: it abandoned hand-coded rules for statistical pattern learning. That shift — combined with the transformer architecture in 2017 — is why language AI went from “barely functional” to “passes the bar exam” in under a decade.

