How to remove HTML tags in pandas?
Published on Aug. 22, 2023, 12:18 p.m.
To remove HTML tags from a pandas DataFrame, you can use the BeautifulSoup library's .get_text() method to extract the plain text from the HTML.
Here is an example of how to remove HTML tags from a pandas DataFrame column:
import pandas as pd
from bs4 import BeautifulSoup
# Create a DataFrame with HTML tags in a column
df = pd.DataFrame({'text': ['<p>This is some text</p>', '<span>This is some more text</span>']})
# Apply BeautifulSoup to extract plain text from HTML tags
df['text'] = df['text'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())
# Print the updated DataFrame without HTML tags
print(df)
In this code, we create a pandas DataFrame df whose text column holds an HTML snippet in each row. We then use .apply() to pass each element of the column to BeautifulSoup() together with the html.parser parser. Finally, we call .get_text() to extract the plain text and assign the result back to the text column of df. The updated DataFrame is then printed without the HTML tags.
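One detail the example above does not show is how adjacent elements are joined. By default, .get_text() concatenates the text pieces with nothing between them. The sketch below uses the separator and strip arguments of .get_text() to keep words apart; the helper name html_to_text, the clean column, and the sample strings are made up for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Sample data (hypothetical): adjacent block elements and padded text
df = pd.DataFrame({"text": ["<p>Hello</p><p>World</p>", "<div> padded </div>"]})

def html_to_text(html):
    """Return the text of an HTML snippet, with pieces joined by single spaces."""
    soup = BeautifulSoup(html, "html.parser")
    # separator=" " puts a space between text pieces;
    # strip=True trims whitespace around each piece
    return soup.get_text(separator=" ", strip=True)

df["clean"] = df["text"].apply(html_to_text)
print(df["clean"].tolist())
```

Without the separator, the first row would come out as HelloWorld, so this variant may be preferable when the HTML contains several block elements in a row.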
Note that .apply() parses each cell separately in Python, so this can be a time-consuming operation on a large DataFrame. You may want to apply it only to the columns that actually contain HTML, or to a subset of rows, if appropriate.
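One way to limit the work, sketched below under the assumption that any cell containing a "<" character may hold HTML, is to build a boolean mask and parse only the matching rows. The helper name strip_tags and the sample data are hypothetical. Missing values are skipped by the mask, so they never reach BeautifulSoup.

```python
import pandas as pd
from bs4 import BeautifulSoup

def strip_tags(value):
    """Parse one HTML string and return its plain text."""
    return BeautifulSoup(value, "html.parser").get_text()

# Sample data (hypothetical): a mix of HTML, plain text, and a missing value
df = pd.DataFrame({"text": ["<p>Hello</p>", "plain text", None, "<b>Bold</b> move"]})

# True only for string cells that contain "<"; missing values count as False
mask = df["text"].str.contains("<", na=False, regex=False)

# Parse just the selected rows and write the results back in place
df.loc[mask, "text"] = df.loc[mask, "text"].apply(strip_tags)
print(df)
```

The check is deliberately crude: a plain-text cell such as "a < b" would also be parsed, which is harmless but not free. A stricter condition could be substituted if the data calls for it.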
Additionally, if the HTML contains nested tags that you want to preserve, you can use .contents instead of .get_text(). It returns a list of the direct children of the parsed object, with any nested tags kept intact inside them.
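As a quick illustration on a single string, separate from the DataFrame example that follows, the sketch below compares what .contents holds at two levels with the output of .get_text(). The variable names are illustrative only. It also uses .decode_contents(), a BeautifulSoup Tag method that returns the inner markup as a string, which may be handier than a list when storing results in a DataFrame column.

```python
from bs4 import BeautifulSoup

html = "<p>This is <b>bold</b> text</p>"
soup = BeautifulSoup(html, "html.parser")

# Direct children of the whole document: a single <p> tag
top_level = soup.contents

# Direct children of the <p> tag: text, the <b> tag, then more text
inner = soup.p.contents

# Inner markup of <p> as one string, with the nested <b> tag preserved
inner_html = soup.p.decode_contents()

# Plain text with every tag removed
plain = soup.get_text()

print(len(top_level), len(inner))
print(inner_html)
print(plain)
```

Calling .contents on the top-level object gives only the outermost elements, so reaching the pieces inside a tag means going one level down, as with soup.p here.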
import pandas as pd
from bs4 import BeautifulSoup
# Create a DataFrame with nested HTML tags in a column
df = pd.DataFrame({'text': ['<p>This is <b>bold</b> text</p>', '<span>This is <i>italicized</i> text</span>']})
# Apply BeautifulSoup to extract nested HTML tags
df['text'] = df['text'].apply(lambda x: BeautifulSoup(x, 'html.parser').contents)
# Print the updated DataFrame with nested HTML tags
print(df)
In this code, we create a pandas DataFrame df with a nested HTML tag in the text column of each row. We then apply BeautifulSoup() to each element of the column using .apply() and pass the HTML parser html.parser. Finally, we use .contents
to extract the list of nested elements within