How to remove repeated characters in pandas?

Published on Aug. 22, 2023, 12:18 p.m.

To remove consecutive repeated characters from the strings in a pandas Series, you can use the Series.str.replace() method with a regular expression that uses a backreference. For example:

import pandas as pd

# Create a sample series
s = pd.Series(['aabbcc', 'ddddd', 'effffee', 'gggghhhh'])

# Remove repeated characters
# regex=True is required in pandas 2.0+, where str.replace defaults to regex=False
s = s.str.replace(r'(\w)\1+', r'\1', regex=True)

print(s)

The regular expression pattern (\w)\1+ matches any word character (a letter, digit, or underscore) that is immediately followed by one or more copies of itself. The parentheses create a capturing group, which can be referred to later using a backreference. Inside the pattern, the backreference \1 matches the same text that the first capturing group matched. In the replacement string, \1 stands for that captured character, so each run of repeated characters is replaced by a single copy.

After running this code, the Series s will contain the following values:

0    abc
1      d
2    efe
3     gh
dtype: object

Notice that each run of repeated characters has been collapsed to a single character in every string in the Series.
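To see what the pattern does outside pandas, here is a minimal sketch using re.sub from Python's standard library. The helper name collapse_repeats is hypothetical and not part of pandas. The sketch also shows that \w does not cover punctuation, while a pattern built on (.) would collapse runs of any character except a newline.

```python
import re

def collapse_repeats(text, pattern=r'(\w)\1+'):
    # Replace each run of a repeated character with the single captured copy (\1)
    return re.sub(pattern, r'\1', text)

print(collapse_repeats('gggghhhh'))            # gh
print(collapse_repeats('wow!!!'))              # wow!!!  ('!' is not matched by \w)
print(collapse_repeats('wow!!!', r'(.)\1+'))   # wow!
```

If punctuation or whitespace runs should be collapsed too, the same (.)\1+ pattern can be passed to Series.str.replace() with regex=True.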

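Note: This method only removes consecutive duplicates, so it will not remove duplicates that are separated by other characters. For example, 'effffee' becomes 'efe', and the second 'e' stays because an 'f' separates it from the first.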
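The limitation can be checked directly. If the goal is instead to keep only the first occurrence of each character, one possible approach (a different technique from the regular expression above, shown here only as a sketch) is to rely on the insertion order of dict keys. The helper name drop_all_duplicates is hypothetical.

```python
import re

def drop_all_duplicates(text):
    # Keep only the first occurrence of each character, preserving order.
    # dict.fromkeys keeps insertion order (guaranteed since Python 3.7).
    return ''.join(dict.fromkeys(text))

# The backreference pattern leaves 'abab' unchanged: nothing repeats back to back
print(re.sub(r'(\w)\1+', r'\1', 'abab'))   # abab

# Dropping every later occurrence gives a different result
print(drop_all_duplicates('abab'))         # ab
print(drop_all_duplicates('effffee'))      # ef
```

On a Series, a helper like this could be applied element by element with s.map(drop_all_duplicates), assuming every value is a string.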
