What is Lemmatization in Natural Language Processing (NLP)?

What is Lemmatization?

Lemmatization in Natural Language Processing (NLP) is the process of reducing a word to its base or dictionary form, known as a lemma.

🔍 Example:

Word Form   Lemma
running     run
better      good
was         be

So, in a sentence like:

“He was running faster than anyone.”

Lemmatization would convert:

• “was” → “be”

• “running” → “run”

🧠 Why Lemmatization Matters

Lemmatization helps NLP systems understand the core meaning of words by stripping away inflections, tenses, or comparative forms. This improves:

  • Search accuracy (e.g., searching “run” also finds “ran” or “running”; see the sketch after this list)
  • Text classification
  • Sentiment analysis
  • Information retrieval
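
To see the search benefit concretely, here is a minimal, self-contained sketch. The lemma_set helper is hypothetical, written just for this example, and it assumes the WordNet data has been downloaded via nltk.download("wordnet"):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemma_set(text):
    # Reduce each token to its verb lemma so "ran"/"running"/"run" all collide
    return {lemmatizer.lemmatize(w.lower(), pos="v") for w in text.split()}

docs = ["She ran a marathon", "He is running daily", "They run together"]
hits = [d for d in docs if lemma_set("run") & lemma_set(d)]
print(hits)  # all three documents match on the shared lemma 'run'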

🔁 Lemmatization vs. Stemming

Feature    Stemming                   Lemmatization
Approach   Heuristic (rule-based)     Linguistic (dictionary-based)
Output     Crude root form            Real word (lemma)
Example    “studies” → “studi”        “studies” → “study”
Speed      Faster                     Slower (but more accurate)
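
The difference shows up immediately in code. A quick sketch using NLTK’s PorterStemmer and WordNetLemmatizer (assumes the WordNet data has been downloaded):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                  # studi  (crude, not a real word)
print(lemmatizer.lemmatize("studies"))          # study  (dictionary form)
print(stemmer.stem("better"))                   # better (stemming can't relate it to "good")
print(lemmatizer.lemmatize("better", pos="a"))  # good   (WordNet knows the exception)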

🛠️ How It Works

Lemmatization uses:

  • Part of Speech (POS) tagging
  • Lexicons or dictionaries
  • Morphological analysis

Example using Python’s NLTK:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download (some NLTK versions also need "omw-1.4")

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # Output: run
print(lemmatizer.lemmatize("better", pos="a"))   # Output: good

Implementing Lemmatization in spaCy and Other Libraries

🔍 1. spaCy: Industrial-Strength NLP

spaCy’s lemmatizer is fast, accurate, and context-aware, designed for real-world usage.

🔧 How It Works

  • Uses a rule-based lemmatizer, backed by lookup tables and exception lists.
  • It considers Part of Speech (POS) to select the correct lemma.
  • Includes language-specific lemmatizers (e.g., English, German, French).

🧪 Example

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The children were running faster than their parents.")
for token in doc:
    print(f"{token.text} → {token.lemma_} ({token.pos_})")

Output

The → the (DET)
children → child (NOUN)
were → be (AUX)
running → run (VERB)
faster → fast (ADV)
…

⚙️ Under the Hood

  • Rules in lemmatizer.py and lookups.py
  • Custom exception handling for irregular forms
  • Uses lexical attributes and POS mappings from the en_core_web_* model (see the inspection sketch below)
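
You can verify these internals at runtime by inspecting the pipeline component. A minimal sketch, assuming spaCy v3 and an installed en_core_web_sm:

import spacy

nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode)            # "rule" for the English pipelines
print(lemmatizer.lookups.tables)  # the rule/exception/index tables it consults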

Pros:

  • Highly accurate
  • Multi-language support
  • Integrates tightly with other spaCy features like NER and dependency parsing

🧠 2. NLTK: Academic & Educational Toolkit

NLTK’s main lemmatizer is the WordNetLemmatizer (the ISRI module it also ships is a stemmer for Arabic, not a lemmatizer).

WordNetLemmatizer

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("running", pos="v")  # Output: 'run'
lemmatizer.lemmatize("better", pos="a")   # Output: 'good'

How It Works

  • Uses WordNet, a lexical database
  • Requires correct POS tag to be most effective
  • Doesn’t do any sentence-level analysis

🔁 POS Mapping Needed

from nltk.corpus import wordnet

# Map Penn Treebank tags (as returned by nltk.pos_tag) to WordNet POS constants
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
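
Putting it together, a minimal sketch that tags a sentence with nltk.pos_tag and feeds the mapped tags into the lemmatizer (resource names may vary slightly across NLTK versions):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("punkt", quiet=True)  # newer NLTK releases may also ask for "punkt_tab"
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
tagged = nltk.pos_tag(nltk.word_tokenize("The children were running."))
print([lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged])
# ['The', 'child', 'be', 'run', '.']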

Pros:

  • Simple and fast
  • Good for educational purposes or light NLP tasks

Cons:

  • Not context-aware
  • Can produce inaccurate lemmas without correct POS

💬 3. TextBlob: Friendly API on Top of NLTK

TextBlob wraps NLTK and provides a higher-level API.

from textblob import Word

# Requires the TextBlob corpora: python -m textblob.download_corpora
w = Word("running")
print(w.lemmatize("v"))  # Output: 'run'

How It Works

  • Internally uses NLTK’s WordNetLemmatizer
  • Provides easier syntax but doesn’t add accuracy
  • Good for prototyping, sentiment analysis, or light preprocessing

🧪 4. Other Libraries (Quick Mentions)

🔡 Gensim

  • Not a lemmatizer, but often paired with NLTK for preprocessing
  • Tokenizes and filters stop words, but expects external lemmatization (see the pairing sketch below)
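
For example, a minimal sketch of that pairing (an assumed workflow, not a Gensim API): Gensim tokenizes and removes stop words, then NLTK supplies the lemmas:

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
tokens = simple_preprocess("The children were running faster than their parents.")
lemmas = [lemmatizer.lemmatize(t) for t in tokens if t not in STOPWORDS]
print(lemmas)  # ['child', 'running', 'faster', 'parent'] (without POS tags, 'running' is treated as a noun)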

🌐 Stanza (by Stanford NLP)

  • Deep learning-based
  • Uses neural models trained on Universal Dependencies
  • Provides lemmatization with higher accuracy on many languages

import stanza

stanza.download('en')  # one-time model download
nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos,lemma')
doc = nlp("The children were running.")
for sent in doc.sentences:
    for word in sent.words:
        print(f"{word.text} → {word.lemma}")

🔍 Comparison Table

Feature/Aspect     spaCy                  NLTK + WordNet       TextBlob            Stanza
Context-aware      ✅ Yes                 ❌ No                ❌ No               ✅ Yes
POS-sensitive      ✅ Yes                 ✅ Yes               ✅ Yes              ✅ Yes
Language support   ✅ Many                ⚠️ Mostly English    ⚠️ Mostly English   ✅ Many (UD models)
Speed              ⚡ Fast                ⚡ Fast              ⚡ Fast             🐢 Slower (neural)
Accuracy           🎯 High (rule-based)   🎯 Medium (lexical)  🎯 Medium           🎯 Very high (neural)

Use Case                               Best Tool
Real-world applications                spaCy
Educational or lexical exploration     NLTK
Prototyping with simple syntax         TextBlob
High-accuracy, multilingual projects   Stanza

Building a Custom Medical Lemmatizer

Creating a custom lemmatizer for a niche domain like medical text pays off quickly, because general-purpose lemmatizers (like spaCy’s or WordNet’s) mishandle domain-specific terms such as “diagnoses” or “hypoglycemic”.

Here’s a step-by-step guide to building a custom medical lemmatizer using spaCy, a domain-specific vocabulary, and optional ML for edge cases.

🧩 1. Why You Need a Custom Lemmatizer in Medical NLP

Term           General Lemma   Desired Medical Lemma
diagnoses      diagnose        diagnosis
hypoglycemic   hypoglycem      hypoglycemia
myocardial     myocard         myocardium

Off-the-shelf lemmatizers can’t handle:

  • Irregular medical inflections
  • Abbreviations like “BP” or “CXR”
  • Root forms that don’t exist in general English (e.g. “neoplasia” from “neoplastic”)

🏗️ 2. Tools and Frameworks

  • spaCy: for base NLP pipeline
  • Custom lookup table: for domain-specific lemma mappings
  • (Optional) ML classifier: for ambiguous words
  • UMLS or SNOMED CT: for canonical forms (if you have access)

⚙️ 3. Set Up a Custom Rule-Based Lemmatizer in spaCy

import spacy
from spacy.language import Language

# Step 1: Load the base model
nlp = spacy.load("en_core_web_sm")

# Step 2: Create the custom lookup table of medical lemma overrides
custom_lemmas = {
    "diagnoses": "diagnosis",
    "hypoglycemic": "hypoglycemia",
    "myocardial": "myocardium",
    "tachypneic": "tachypnea",
    "dyspneic": "dyspnea",
}

# Step 3: Add a custom component after the built-in lemmatizer that
# overwrites its output whenever a token matches the lookup table
@Language.component("medical_lemmatizer")
def medical_lemmatizer(doc):
    for token in doc:
        override = custom_lemmas.get(token.lower_)
        if override is not None:
            token.lemma_ = override
    return doc

nlp.add_pipe("medical_lemmatizer", after="lemmatizer")

# Test
doc = nlp("The patient is hypoglycemic and tachypneic.")
for token in doc:
    print(f"{token.text} → {token.lemma_}")

Output

The → the
patient → patient
is → be
hypoglycemic → hypoglycemia
and → and
tachypneic → tachypnea
. → .

📚 4. Extend It with a Medical Lexicon

Build or scrape a lemma dictionary using:

  • UMLS Metathesaurus: Contains synonyms and canonical forms
  • SNOMED CT: Ontology for medical terminology
  • SciSpacy: Has entity linking and domain-specific models

SciSpacy Example

import spacy
import scispacy  # registers scispacy components
from scispacy.linking import EntityLinker  # registers the "scispacy_linker" factory

# Assumes scispacy and the en_core_sci_sm model are installed
nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})

doc = nlp("The patient presented with dyspnea and tachypnea.")
for ent in doc.ents:
    print(ent.text, "→", ent._.kb_ents)  # (CUI, score) candidate pairs from UMLS
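
To turn linked entities into canonical forms (handy for seeding the custom lemma table above), the linker’s knowledge base can resolve each CUI. A short sketch, assuming the same pipeline as above:

# Resolve each entity's best-scoring CUI to its UMLS canonical name
linker = nlp.get_pipe("scispacy_linker")
for ent in doc.ents:
    if ent._.kb_ents:
        cui, score = ent._.kb_ents[0]
        print(ent.text, "→", linker.kb.cui_to_entity[cui].canonical_name)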
