How to identify named entities using NLTK?

Published on Aug. 22, 2023, 12:19 p.m.

To identify named entities using NLTK in Python, you can follow these steps:

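  1. Install the NLTK library if it is not already installed on your system.
pip install nltk
  2. Import NLTK and download the resources used by the tokenizer, the part-of-speech tagger, and the named entity chunker. ne_chunk() needs maxent_ne_chunker and words in addition to punkt and averaged_perceptron_tagger. Newer NLTK releases may ask for differently named resources; if a LookupError appears, download the resource named in its message.
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
  3. Split the text into sentences, tokenize each sentence into words, and apply part-of-speech tagging.
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import pos_tag
text = "Barack Obama was the 44th President of the United States. He served two terms from 2009 to 2017."
sentences = sent_tokenize(text)
for sentence in sentences:
    words = word_tokenize(sentence)
    tagged = pos_tag(words)  # list of (word, POS tag) tuples
  4. Apply named entity recognition to the tagged words using the ne_chunk() function.
from nltk import ne_chunk
for sentence in sentences:
    words = word_tokenize(sentence)
    tagged = pos_tag(words)
    named_entities = ne_chunk(tagged)  # nltk.Tree with one subtree per entity
    print(named_entities)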

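Here, named_entities is an nltk.Tree: each recognized entity is a subtree labeled with its type (for example PERSON, ORGANIZATION, or GPE), and all other tokens remain (word, tag) tuples. Because the variable is reassigned on every pass through the loop, it holds only the last sentence's tree once the loop finishes.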
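To work with the entities rather than the whole tree, you can walk its top-level children and keep only the subtrees. The following is a minimal sketch, not part of NLTK itself: extract_entities is a hypothetical helper name, and the sample tree is built by hand in the shape that ne_chunk() returns, so it runs without downloading any models and is not real model output.

```python
from nltk import Tree


def extract_entities(tree):
    """Return (entity_text, label) pairs from an ne_chunk-style tree."""
    entities = []
    for node in tree:
        # Entity chunks are subtrees; ordinary tokens are (word, tag) tuples.
        if isinstance(node, Tree):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities


# Hand-built sample in the shape ne_chunk() returns (illustrative only).
sample = Tree("S", [
    Tree("PERSON", [("Barack", "NNP"), ("Obama", "NNP")]),
    ("was", "VBD"),
    ("the", "DT"),
    ("44th", "JJ"),
    ("President", "NNP"),
    ("of", "IN"),
    ("the", "DT"),
    Tree("GPE", [("United", "NNP"), ("States", "NNPS")]),
    (".", "."),
])

print(extract_entities(sample))
# [('Barack Obama', 'PERSON'), ('United States', 'GPE')]
```

Inside the loop from step 4, you could call the same helper on each sentence's tree and extend a single list, so that the entities from every sentence are kept instead of only the last one.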

Alternatively, you can use the StanfordNERTagger wrapper in nltk.tag, which calls the Stanford Named Entity Recognizer, to identify named entities.

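from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize, sent_tokenize
# Replace the placeholders with the local paths to the model file and the Stanford NER jar.
st = StanfordNERTagger('<path_to_model>', '<path_to_jar>')
text = "Barack Obama was the 44th President of the United States. He served two terms from 2009 to 2017."
sentences = sent_tokenize(text)
for sentence in sentences:
    words = word_tokenize(sentence)
    tagged = st.tag(words)  # list of (token, label) tuples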

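Either way, the entities are identified one sentence at a time, but the two results have different shapes: named_entities is a tree in which each entity is a labeled subtree, while tagged is a flat list of (token, label) tuples in which tokens outside any entity carry the label O.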
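Since the Stanford tagger labels each token separately, a multi-word name such as "Barack Obama" comes back as two tuples. One way to merge them, sketched below, is to group consecutive tokens that share a label. group_entities is a hypothetical helper, and the sample list is written by hand in the (token, label) shape for illustration; it is not actual tagger output.

```python
from itertools import groupby


def group_entities(tagged, outside="O"):
    """Merge consecutive tokens that share a label into (text, label) pairs."""
    entities = []
    for label, group in groupby(tagged, key=lambda pair: pair[1]):
        # Skip runs of tokens that are outside any entity.
        if label != outside:
            text = " ".join(token for token, _label in group)
            entities.append((text, label))
    return entities


# Hand-written sample in the (token, label) shape (illustrative only).
sample = [
    ("Barack", "PERSON"), ("Obama", "PERSON"), ("was", "O"), ("the", "O"),
    ("44th", "O"), ("President", "O"), ("of", "O"), ("the", "O"),
    ("United", "LOCATION"), ("States", "LOCATION"), (".", "O"),
]

print(group_entities(sample))
# [('Barack Obama', 'PERSON'), ('United States', 'LOCATION')]
```

This grouping is deliberately simple: two different entities of the same type that appear directly next to each other would be merged into one, so treat it as a starting point rather than a complete solution.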

Note that to use the StanfordNERTagger, you need to download the Stanford Named Entity Recognizer jars and models, which can be found on the official Stanford NLP website, and have Java installed, since the tagger runs the jar as a Java program.
