How to remove URLs in pandas?

Published on Aug. 22, 2023, 12:18 p.m.

To remove URLs from a pandas DataFrame column, you can use regular expressions to match and replace URLs with an empty string.

Here is an example of how to remove URLs from a pandas DataFrame column:

import pandas as pd
import re

# Create a DataFrame with URLs in a column
df = pd.DataFrame({'text': ['This is a text with a URL https://www.example.com in it', 'This is another text with a URL https://www.example.org in it']})

# Remove URLs using regular expressions
df['text'] = df['text'].apply(lambda x: re.sub(r'http\S+', '', x))

# Print the updated DataFrame without URLs
print(df)

In this code, we create a pandas DataFrame df with a URL in the text column of each row. We then use .apply() to call re.sub() on each element of the column. The regular expression r'http\S+' matches the literal text http followed by one or more non-whitespace characters (\S+), and re.sub() replaces each match with an empty string ''. The result is assigned back to the text column of df, and the updated DataFrame is printed without the URLs. The spaces on either side of a removed URL stay in place, so the cleaned text has two consecutive spaces where the URL used to be.
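The same substitution can also be written with the vectorized Series.str.replace() method instead of .apply(). The following is a minimal sketch, assuming a reasonably recent pandas version; the variable name cleaned and the extra whitespace clean-up step are illustrative additions, not part of the example above.

```python
import pandas as pd

df = pd.DataFrame({'text': [
    'This is a text with a URL https://www.example.com in it',
    'This is another text with a URL https://www.example.org in it',
]})

# Vectorized equivalent of the apply/re.sub call.
# regex=True is passed explicitly so the pattern is treated as a regular expression.
cleaned = df['text'].str.replace(r'http\S+', '', regex=True)

# Collapse the leftover run of spaces and trim the ends.
cleaned = cleaned.str.replace(r'\s+', ' ', regex=True).str.strip()

print(cleaned.tolist())
```

One practical difference: .str.replace() passes missing values (NaN) through unchanged, whereas calling re.sub() on a NaN inside .apply() raises a TypeError.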

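Note that this regular expression matches anything that starts with http, which covers both http:// and https:// URLs. It does not match URLs with other schemes (such as ftp or mailto) or bare addresses such as www.example.com, so if your data includes those, you will need to extend the regular expression accordingly. Also, because \S+ runs up to the next whitespace character, any punctuation that directly follows a URL (such as a closing period) is removed along with it.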
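As a rough illustration of extending the pattern, the sketch below also covers ftp:// URLs, mailto: links, and bare www. addresses. The pattern and the sample sentences are assumptions made for this example; it is not an exhaustive URL matcher, and it does not remove plain email addresses that lack the mailto: prefix.

```python
import re

import pandas as pd

df = pd.DataFrame({'text': [
    'Docs at ftp://files.example.com/readme.txt here',
    'Visit www.example.com today',
    'Write to mailto:someone@example.com now',
    'Secure HTTPS://EXAMPLE.COM/page ok',
]})

# Hypothetical broader pattern: http(s)/ftp schemes, www. prefixes, mailto: links.
url_pattern = r'\b(?:(?:https?|ftp)://|www\.|mailto:)\S+'

result = (
    df['text']
    # re.IGNORECASE lets the pattern match upper-case schemes such as HTTPS://
    .str.replace(url_pattern, '', flags=re.IGNORECASE, regex=True)
    # Tidy up the whitespace left behind by the removed URLs.
    .str.replace(r'\s+', ' ', regex=True)
    .str.strip()
)

print(result.tolist())
```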

Additionally, if your data includes URLs that are contained within HTML tags, you may want to remove the HTML tags before removing the URLs.
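The sketch below shows why the order matters. The tag-stripping pattern r'<[^>]+>' is a naive regular expression chosen purely for illustration (an assumption, not a method taken from this article); a proper HTML parser is more robust for real markup.

```python
import pandas as pd

df = pd.DataFrame({'text': [
    'See <a href="https://www.example.com">our site</a> or https://www.example.org for details',
]})

# Step 1: strip HTML tags with a naive pattern (illustrative assumption).
no_tags = df['text'].str.replace(r'<[^>]+>', '', regex=True)

# Step 2: remove the remaining plain-text URLs, then tidy whitespace.
result = (
    no_tags
    .str.replace(r'http\S+', '', regex=True)
    .str.replace(r'\s+', ' ', regex=True)
    .str.strip()
)

# For comparison: removing URLs first damages the markup, because \S+
# keeps matching past the closing quote and bracket up to the next space.
urls_first = df['text'].str.replace(r'http\S+', '', regex=True)

print(result.tolist())
print(urls_first.tolist())
```

With tags removed first, the link text "our site" survives. With URLs removed first, the match swallows the word "our" and leaves a broken fragment of the opening tag behind.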
