How to tokenize text using NLTK?
Published on Aug. 22, 2023, 12:19 p.m.
NLTK (Natural Language Toolkit) is a popular Python library for natural language processing tasks, including text tokenization. Here is an example of how to use NLTK to tokenize text:
First, make sure NLTK is installed by running the command !pip install nltk in a notebook cell (or pip install nltk, without the leading !, in a terminal) in your Python environment.
Then, in your Python code, import NLTK and download its required resources:
import nltk
# Tokenizer models used by word_tokenize; newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')
Now you can use the nltk.word_tokenize() function to tokenize a given text into a list of words:
from nltk.tokenize import word_tokenize
text = "This is some example text."
tokens = word_tokenize(text)
print(tokens)
This will output ['This', 'is', 'some', 'example', 'text', '.'], which is a list of the individual tokens in text: each word, plus the final period as a separate token.
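Because the period comes back as its own token, you may want to drop punctuation before further processing. The following is a minimal sketch of one way to do that in plain Python. It assumes the token list shown in the output above and hardcodes it, so it runs without NLTK; the variable name words is only illustrative.

```python
# Token list as printed by the word_tokenize example (hardcoded here for illustration)
tokens = ['This', 'is', 'some', 'example', 'text', '.']

# Keep only tokens that contain at least one letter or digit
words = [t for t in tokens if any(ch.isalnum() for ch in t)]

print(words)
```

This keeps tokens such as contractions or hyphenated words that mix letters and punctuation, and removes tokens made of punctuation only.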
Note that other NLTK tokenizers are available, including sent_tokenize() for splitting text into sentences and RegexpTokenizer for regular-expression-based tokenization. You can consult the NLTK documentation for more details.
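As a rough illustration of regular-expression-based tokenization, here is a short sketch using NLTK's RegexpTokenizer class. The two patterns are assumptions chosen for this example, not recommended defaults: the first matches runs of word characters, and the second treats whitespace as the gap between tokens. This tokenizer does not need the punkt download.

```python
from nltk.tokenize import RegexpTokenizer

text = "This is some example text."

# Pattern describes the tokens themselves: runs of letters, digits, or underscores
word_only = RegexpTokenizer(r"\w+")
word_tokens = word_only.tokenize(text)
print(word_tokens)

# With gaps=True the pattern describes the separators instead of the tokens
whitespace_split = RegexpTokenizer(r"\s+", gaps=True)
space_tokens = whitespace_split.tokenize(text)
print(space_tokens)
```

The first tokenizer drops the final period entirely, while the second keeps it attached to the last word, so the choice of pattern depends on how you want punctuation handled.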