How to identify named entities using NLTK?
Published on Aug. 22, 2023, 12:19 p.m.
To identify named entities using NLTK in Python, you can follow these steps:
- Install the NLTK library if it is not already installed on your system.
pip install nltk
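- Import NLTK and download the resources used in the following steps: the punkt tokenizer models, the averaged_perceptron_tagger part-of-speech model, and, for ne_chunk(), the maxent_ne_chunker model and the words corpus. Newer NLTK releases may expect differently named resources (for example punkt_tab); if you get a LookupError, download the resource named in the error message.
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')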
- Tokenize your text into sentences and words and apply part-of-speech tagging.
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import pos_tag
text = "Barack Obama was the 44th President of the United States. He served two terms from 2009 to 2017."
sentences = sent_tokenize(text)
for sentence in sentences:
    words = word_tokenize(sentence)
    tagged = pos_tag(words)
- Apply named entity recognition using the ne_chunk() function.
from nltk import ne_chunk
for sentence in sentences:
    words = word_tokenize(sentence)
    tagged = pos_tag(words)
    named_entities = ne_chunk(tagged)
    print(named_entities)  # show the chunk tree for this sentence
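Here, named_entities is an nltk.Tree for the current sentence. Each recognized entity is a subtree labeled with its type (for example PERSON, ORGANIZATION, or GPE), and every other token stays a plain (word, tag) tuple.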
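To get the entities out of the tree as plain strings, you can walk its children and keep only the labeled subtrees. The snippet below is a minimal sketch of that idea: extract_entities is a hypothetical helper (not part of NLTK), and the sample tree is built by hand in the shape ne_chunk() returns, so the labels NLTK actually assigns to this sentence may differ.

```python
from nltk import Tree


def extract_entities(tree):
    """Return (entity_text, label) pairs from a tree shaped like ne_chunk() output."""
    entities = []
    for node in tree:
        # Entity chunks are Tree subtrees; ordinary tokens are (word, tag) tuples.
        if isinstance(node, Tree):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((text, node.label()))
    return entities


# Hand-built sample (an assumption for illustration, not real ne_chunk() output).
sample = Tree("S", [
    Tree("PERSON", [("Barack", "NNP"), ("Obama", "NNP")]),
    ("was", "VBD"),
    ("President", "NNP"),
    ("of", "IN"),
    ("the", "DT"),
    Tree("GPE", [("United", "NNP"), ("States", "NNPS")]),
])

print(extract_entities(sample))
```

In the loop above, you could call extract_entities(named_entities) for each sentence and collect the results in one list.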
Alternatively, you can use the StanfordNERTagger class from nltk.tag, a wrapper around the Stanford Named Entity Recognizer, to identify named entities.
from nltk.tag import StanfordNERTagger
st = StanfordNERTagger('<path_to_model>', '<path_to_jar>')
text = "Barack Obama was the 44th President of the United States. He served two terms from 2009 to 2017."
sentences = sent_tokenize(text)
for sentence in sentences:
    words = word_tokenize(sentence)
    tagged = st.tag(words)
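Either way, the result marks up the named entities in the original text: named_entities is a tree whose labeled subtrees are the entities, while tagged is a list of (token, label) tuples in which tokens outside any entity carry the label O.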
Note that to use StanfordNERTagger, you need to download the Stanford Named Entity Recognizer JAR and model files, which can be found on the official Stanford NLP website. You also need Java installed, because the tagger runs the JAR in a separate Java process.
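Because st.tag() labels one token at a time, multi-word entities such as "Barack Obama" come back as separate tuples. One way to merge them, sketched below, is to group consecutive tokens that share a label. group_entities is a hypothetical helper, and the sample list is written by hand in the (token, label) shape described above; the labels you actually get depend on the model you load.

```python
from itertools import groupby


def group_entities(tagged, outside_label="O"):
    """Merge runs of consecutive tokens with the same label into (text, label) pairs."""
    entities = []
    for label, group in groupby(tagged, key=lambda pair: pair[1]):
        # Skip runs of tokens that are outside any entity.
        if label != outside_label:
            text = " ".join(token for token, _ in group)
            entities.append((text, label))
    return entities


# Hand-written sample (an assumption for illustration, not real tagger output).
sample = [
    ("Barack", "PERSON"), ("Obama", "PERSON"), ("was", "O"), ("the", "O"),
    ("44th", "O"), ("President", "O"), ("of", "O"), ("the", "O"),
    ("United", "LOCATION"), ("States", "LOCATION"), (".", "O"),
]

print(group_entities(sample))
```

This simple grouping cannot tell apart two different entities of the same type that sit directly next to each other; they would be merged into one.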