How to tokenize text using NLTK?

Published on Aug. 22, 2023, 12:19 p.m.

NLTK (Natural Language Toolkit) is a popular Python library for natural language processing tasks, including text tokenization. Here is an example of how to tokenize text with NLTK:

First, make sure NLTK is installed by running pip install nltk in your terminal. In a Jupyter or Colab notebook, prefix the command with an exclamation mark: !pip install nltk.

Then, in your Python code, import NLTK and download the tokenizer data it needs:

import nltk
# Punkt models used by word_tokenize(); newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')

Now you can use the word_tokenize() function to split a given text into a list of tokens:

from nltk.tokenize import word_tokenize

text = "This is some example text."
tokens = word_tokenize(text)

print(tokens)

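This will output ['This', 'is', 'some', 'example', 'text', '.'], which is a list of the individual tokens in text. Note that the final period is returned as its own token, so the list contains punctuation marks as well as words.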

NLTK also provides other tokenizers, including sent_tokenize() for splitting text into sentences and RegexpTokenizer for tokenization based on a regular expression. You can consult the NLTK documentation for more details.
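As a rough illustration of those alternatives, here is a minimal sketch using RegexpTokenizer and sent_tokenize() from nltk.tokenize. The pattern r"\w+" and the sample strings are only assumptions chosen for the example; the sentence tokenizer relies on the downloaded Punkt data, so that part is guarded in case the data is missing.

```python
from nltk.tokenize import RegexpTokenizer, sent_tokenize

text = "This is some example text."

# Regular-expression tokenization: keep runs of word characters only,
# which drops punctuation. This tokenizer needs no downloaded data.
regex_tokenizer = RegexpTokenizer(r"\w+")
regex_tokens = regex_tokenizer.tokenize(text)
print(regex_tokens)  # ['This', 'is', 'some', 'example', 'text']

# Sentence tokenization: relies on the Punkt data downloaded earlier.
paragraph = "NLTK is a Python library. It offers several tokenizers."
try:
    sentences = sent_tokenize(paragraph)
    print(sentences)
except LookupError:
    # NLTK raises LookupError when a required resource is not installed.
    print("Punkt data not found; run nltk.download() for the resource named in the error.")
```

Compared with word_tokenize(), the regular-expression version here discards the final period, which may or may not be what you want depending on the task.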
