What is Lemmatization?
Lemmatization in Natural Language Processing (NLP) is the process of reducing a word to its base or dictionary form, known as a lemma.
🔍 Example:
| Word Form | Lemma |
|-----------|-------|
| running | run |
| better | good |
| was | be |
So, in a sentence like:
“He was running faster than anyone.”
Lemmatization would convert:
• “was” → “be”
• “running” → “run”
🧠 Why Lemmatization Matters
Lemmatization helps NLP systems understand the core meaning of words by stripping away inflections, tenses, or comparative forms. This improves:
- Search accuracy (e.g., searching “run” also finds “ran” or “running”; see the sketch after this list)
- Text classification
- Sentiment analysis
- Information retrieval
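A minimal sketch of the search case, using spaCy (introduced later in this article) and its small English model; the matching-by-lemma logic is the point, not the specific library:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

documents = ["He ran to the store.", "She is running a marathon.", "The paint is dry."]
query = "run"

# A document matches if the query appears among its lemmas,
# so "ran" and "running" both match the query "run".
for text in documents:
    lemmas = {token.lemma_.lower() for token in nlp(text)}
    if query in lemmas:
        print("match:", text)
```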
🔁 Lemmatization vs. Stemming
| Feature | Stemming | Lemmatization |
|---------|----------|---------------|
| Approach | Heuristic (rule-based) | Linguistic (dictionary-based) |
| Output | Crude root form | Real word (lemma) |
| Example | “studies” → “studi” | “studies” → “study” |
| Speed | Faster | Slower (but more accurate) |
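To make the contrast concrete, here is a small sketch using NLTK’s PorterStemmer next to its WordNet lemmatizer (the WordNet data must be downloaded first):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()  # requires: import nltk; nltk.download("wordnet")

print(stemmer.stem("studies"))          # studi  (crude root, not a real word)
print(lemmatizer.lemmatize("studies"))  # study  (real dictionary form)
```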
🛠️ How It Works
Lemmatization uses:
- Part of Speech (POS) tagging
- Lexicons or dictionaries
- Morphological analysis
Example using Python’s NLTK:
```python
from nltk.stem import WordNetLemmatizer

# Requires the WordNet data: import nltk; nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # Output: run
print(lemmatizer.lemmatize("better", pos="a"))   # Output: good
```
Implementing Lemmatization in spaCy and other libraries
🔍 1. spaCy: Industrial-Strength NLP
spaCy’s lemmatizer is fast, accurate, and context-aware, designed for real-world usage.
🔧 How It Works
- Uses a rule-based lemmatizer, backed by lookup tables and exception lists.
- It considers Part of Speech (POS) to select the correct lemma.
- Includes language-specific lemmatizers (e.g., English, German, French).
🧪 Example
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The children were running faster than their parents.")
for token in doc:
    print(f"{token.text} → {token.lemma_} ({token.pos_})")
```
Output
```
The → the (DET)
children → child (NOUN)
were → be (AUX)
running → run (VERB)
faster → fast (ADV)
```
⚙️ Under the Hood
- Rules in lemmatizer.py and lookups.py
- Custom exception handling for irregular forms
- Uses lexical attributes and POS mappings from the en_core_web_* model (you can inspect some of this, as sketched below)
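A small inspection sketch (the exact table names are indicative and can vary by spaCy version):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")

print(lemmatizer.mode)                  # "rule" for the English pipeline
print(list(lemmatizer.lookups.tables))  # e.g. ['lemma_rules', 'lemma_exc', 'lemma_index']
```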
Pros:
- Highly accurate
- Multi-language support
- Integrates tightly with other spaCy features like NER and dependency parsing
🧠 2. NLTK: Academic & Educational Toolkit
NLTK’s lemmatizer is WordNetLemmatizer (the toolkit also ships stemmers, such as the ISRI stemmer for Arabic, but those are stemmers rather than lemmatizers).
WordNetLemmatizer
```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("running", pos="v")  # Output: 'run'
lemmatizer.lemmatize("better", pos="a")   # Output: 'good'
```
How It Works
- Uses WordNet, a lexical database
- Requires correct POS tag to be most effective
- Doesn’t do any sentence-level analysis
🔁 POS Mapping Needed
```python
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag to the WordNet POS constants."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # WordNet's default POS
```
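Putting it together, a sketch of a full tag-then-lemmatize pass (assumes the punkt, averaged_perceptron_tagger, and wordnet NLTK data packages are installed; the exact output can vary by NLTK version):

```python
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
tagged = pos_tag(word_tokenize("The children were running faster"))
lemmas = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged]
print(lemmas)  # approximately: ['The', 'child', 'be', 'run', 'faster']
```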
Pros:
- Simple and fast
- Good for educational purposes or light NLP tasks
Cons:
- Not context-aware
- Can produce inaccurate lemmas without correct POS
💬 3. TextBlob: Friendly API on Top of NLTK
TextBlob wraps NLTK and provides a higher-level API.
```python
from textblob import Word

w = Word("running")
print(w.lemmatize("v"))  # Output: 'run'
```
How It Works
- Internally uses NLTK’s WordNetLemmatizer
- Provides easier syntax but doesn’t add accuracy
- Good for prototyping, sentiment analysis, or light preprocessing (a sentence-level sketch follows below)
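TextBlob can also lemmatize a whole sentence via its word list. A minimal sketch; note that without POS hints every word is treated as a noun:

```python
from textblob import TextBlob

blob = TextBlob("The cats were running")
# WordList.lemmatize() applies the default noun POS to every word,
# so verbs like "were" and "running" pass through unchanged.
print(blob.words.lemmatize())  # ['The', 'cat', 'were', 'running']
```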
🧪 4. Other Libraries (Quick Mentions)
🔡 Gensim
- Not a lemmatizer, but often paired with NLTK for preprocessing
- Tokenizes and filters stop words, but expects lemmatization to happen externally (see the pairing sketch below)
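A hedged sketch of that pairing, using Gensim’s simple_preprocess for tokenization and spaCy for the lemmatization step:

```python
from gensim.utils import simple_preprocess
import spacy

nlp = spacy.load("en_core_web_sm")

text = "The children were running faster than their parents."
tokens = simple_preprocess(text)  # lowercases and tokenizes; no lemmatization
lemmas = [tok.lemma_ for tok in nlp(" ".join(tokens))]
print(lemmas)
```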
🌐 Stanza (by Stanford NLP)
- Deep learning-based
- Uses neural models trained on Universal Dependencies
- Provides lemmatization with higher accuracy on many languages
```python
import stanza

stanza.download('en')
nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos,lemma')
doc = nlp("The children were running.")
for sent in doc.sentences:
    for word in sent.words:
        print(f"{word.text} → {word.lemma}")
```
🔍 Comparison Table
| Feature/Aspect | spaCy | NLTK + WordNet | TextBlob | Stanza |
|----------------|-------|----------------|----------|--------|
| Context-aware | ✅ Yes | ❌ No | ❌ No | ✅ Yes |
| POS-sensitive | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| Language support | ✅ Many | ⚠️ Mostly English | ⚠️ Mostly English | ✅ Many (UD models) |
| Speed | ⚡ Fast | ⚡ Fast | ⚡ Fast | 🐢 Slower (neural) |
| Accuracy | 🎯 High (rule-based) | 🎯 Medium (lexical) | 🎯 Medium | 🎯 Very High (neural) |
| Use Case | Best Tool |
|----------|-----------|
| Real-world applications | spaCy |
| Educational or lexical exploration | NLTK |
| Prototyping with simple syntax | TextBlob |
| High-accuracy, multilingual projects | Stanza |
Building a custom medical lemmatizer
Creating a custom lemmatizer for a niche domain like medical text is a powerful move, especially when general-purpose lemmatizers (such as spaCy’s or WordNet’s) miss domain-specific terms such as “diagnoses” or “hypoglycemic”.
Here’s a step-by-step guide to building a custom medical lemmatizer, using spaCy, domain-specific vocabulary, and optional ML for edge cases.
🧩 1. Why You Need a Custom Lemmatizer in Medical NLP
| Term | General Lemma | Desired Medical Lemma |
|------|---------------|-----------------------|
| diagnoses | diagnose | diagnosis |
| hypoglycemic | hypoglycem | hypoglycemia |
| myocardial | myocard | myocardium |
Off-the-shelf lemmatizers can’t handle:
- Irregular medical inflections
- Abbreviations like “BP” or “CXR”
- Root forms that don’t exist in general English (e.g. “neoplasia” from “neoplastic”)
🏗️ 2. Tools and Frameworks
- spaCy: for base NLP pipeline
- Custom lookup table: for domain-specific lemma mappings
- (Optional) ML classifier: for ambiguous words
- UMLS or SNOMED CT: for canonical forms (if you have access)
⚙️ 3. Set Up a Custom Rule-Based Lemmatizer in spaCy
In spaCy 3, the built-in Lemmatizer is configured through the pipeline rather than constructed directly from a Lookups object, and switching it wholesale into lookup mode would discard its general-English rules. A simple working pattern is to keep the default lemmatizer and layer a small custom component after it that overrides lemmas from your domain lookup table:

```python
import spacy
from spacy.language import Language

# Step 1: Load the base model (its default lemmatizer covers general English)
nlp = spacy.load("en_core_web_sm")

# Step 2: Create a custom lookup table of domain-specific lemmas
custom_lemmas = {
    "diagnoses": "diagnosis",
    "hypoglycemic": "hypoglycemia",
    "myocardial": "myocardium",
    "tachypneic": "tachypnea",
    "dyspneic": "dyspnea",
}

# Step 3: Add a component that overrides lemmas after the default lemmatizer
@Language.component("medical_lemmatizer")
def medical_lemmatizer(doc):
    for token in doc:
        lemma = custom_lemmas.get(token.lower_)
        if lemma is not None:
            token.lemma_ = lemma
    return doc

nlp.add_pipe("medical_lemmatizer", after="lemmatizer")

# Test
doc = nlp("The patient is hypoglycemic and tachypneic.")
for token in doc:
    print(f"{token.text} → {token.lemma_}")
```
Output
```
The → the
patient → patient
is → be
hypoglycemic → hypoglycemia
and → and
tachypneic → tachypnea
. → .
```
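Because the override runs after spaCy’s own lemmatizer, general English still lemmatizes normally (“is” → “be”), while the custom table wins only for the medical terms it knows; unknown domain words simply keep their default lemmas.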
📚 4. Extend It with a Medical Lexicon
Build or scrape a lemma dictionary using:
- UMLS Metathesaurus: Contains synonyms and canonical forms
- SNOMED CT: Ontology for medical terminology
- SciSpacy: Has entity linking and domain-specific models
SciSpacy Example
```python
import scispacy
import spacy
from scispacy.linking import EntityLinker

nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})

doc = nlp("The patient presented with dyspnea and tachypnea.")
for ent in doc.ents:
    # Recent scispacy releases expose the UMLS links as ent._.kb_ents
    # (older releases called this attribute umls_ents).
    print(ent.text, "→", ent._.kb_ents)
```