Stemming In Natural Language Processing

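Stemming is the process of reducing inflected words to their word stem (base form). The stem is the part of a word that is left once endings such as -ed, -ing and -s are removed, and it need not be a dictionary word itself. Stemming is a kind of normalization, applied at the linguistic level.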

For example, the stem of the word "waiting" is "wait".

Stemming
A machine doesn’t understand English grammar and cannot tell that waited, waiting and waits are all forms of the verb wait. It treats each of them as a different word, even though they carry the same meaning. Now think of the number of words you know, and of how much that number grows once you count every tense and every other form of each word.

Stemming groups the frequencies of the different inflections under a single term, in this case wait.

This is why stemming is used to shorten lookups and to normalize sentences.
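The grouping effect described above can be sketched with the standard library alone. The `naive_stem` function below is a hypothetical toy that strips a few common suffixes; it is an assumption made for illustration and is not the Porter algorithm used later in this article.

```python
from collections import Counter


def naive_stem(word):
    """Hypothetical toy stemmer: strips a few common English suffixes."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = ["wait", "waited", "waiting", "waits", "wait"]

# Without stemming, every inflection is counted as its own term.
raw_counts = Counter(tokens)

# With stemming, all inflections are counted under one term.
stem_counts = Counter(naive_stem(t) for t in tokens)

print(raw_counts)
print(stem_counts)
```

Before stemming the counter holds four distinct terms; after stemming it holds a single entry for wait with a count of five.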

There are many stemming algorithms for English, such as Porter, Porter2 and Lovins. Porter stemming is one of the most popular, and it is the one we will use here.
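To get a feel for how two of these algorithms relate, here is a minimal sketch that runs the same words through NLTK's `PorterStemmer` and its `SnowballStemmer("english")`, which implements Porter2. The word list is an assumption chosen for illustration. NLTK does not ship a Lovins stemmer, so it is left out.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
# The Snowball "english" stemmer is the Porter2 algorithm.
snowball = SnowballStemmer("english")

# Illustrative words (an assumption, not taken from the article).
words = ["waiting", "waited", "fairly", "generously"]

# Map each word to its (Porter stem, Porter2 stem) pair.
comparison = {w: (porter.stem(w), snowball.stem(w)) for w in words}

for word, (porter_stem, snowball_stem) in comparison.items():
    print(f"{word:12} porter={porter_stem:10} snowball={snowball_stem}")
```

Both stemmers agree on the forms of wait. They may differ on words such as the adverbs in the list, which is one reason to pick a single stemmer and use it consistently.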

First, we define the stemmer:

  # PorterStemmer implements the Porter algorithm; the tokenizers are used on sentences later
  from nltk.stem import PorterStemmer
  from nltk.tokenize import sent_tokenize, word_tokenize

  stemmer = PorterStemmer()

Now let's choose some words that share a stem:

  example_words = ["wait", "waited", "waiting", "waits"]

We can stem them very easily with a loop:

  # Print the stem of each word
  for w in example_words:
      print(stemmer.stem(w))

Our output:

wait
wait
wait
wait

That was very simple stemming of individual words. Let’s try stemming the words of a sentence.

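  text = "I hate waiting in long lines. They waited at the train station together. The field marshal looks on and waits for letters addressed to him."
  # word_tokenize relies on NLTK's Punkt tokenizer data, which has to be downloaded once via nltk.download
  tokenized = word_tokenize(text)

  # Stem each token in turn
  for word in tokenized:
      print(stemmer.stem(word))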

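Now our result is:

i
hate
wait
in
long
line
.
they
wait
at
the
train
station
togeth
.
the
field
marshal
look
on
and
wait
for
letter
address
to
him
.

Note that word_tokenize splits each full stop into a token of its own, and that NLTK's PorterStemmer lowercases its output by default. A stem such as togeth is not a dictionary word, which is normal for stemming.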
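The loop above prints one stem per line, which makes it hard to see which token produced which stem. A minimal sketch that keeps each token next to its stem is shown below. As an assumption, it swaps `word_tokenize` for a simple regular expression that keeps alphabetic tokens only, so punctuation is skipped and no tokenizer data needs to be downloaded. The shortened `text` is also chosen just for illustration.

```python
import re

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
text = "I hate waiting in long lines. They waited at the train station together."

# Assumption: a plain regex stands in for word_tokenize and drops punctuation.
tokens = re.findall(r"[A-Za-z]+", text)

# Pair every original token with its Porter stem.
pairs = [(token, stemmer.stem(token)) for token in tokens]

for token, stem in pairs:
    print(f"{token} -> {stem}")
```

Keeping the pairs in a list rather than printing directly makes it easy to feed the stems into a later step, such as counting term frequencies.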
