How to remove repeated characters in pandas?
Published on Aug. 22, 2023, 12:18 p.m.
To remove consecutive repeated characters from a pandas Series of strings, you can use the Series.str.replace() method with a regular expression that uses a backreference. For example:
import pandas as pd
# Create a sample series
s = pd.Series(['aabbcc', 'ddddd', 'effffee', 'gggghhhh'])
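# Collapse each run of a repeated character to a single character.
# regex=True is required in pandas 2.0+, where str.replace treats the pattern as a literal string by default.
s = s.str.replace(r'(\w)\1+', r'\1', regex=True)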
print(s)
The regular expression pattern (\w)\1+ matches any word character (a letter, digit, or underscore) that is immediately followed by one or more copies of itself. The parentheses create a capturing group, which can be referred to later using a backreference. Inside the pattern, the backreference \1 matches the same text that the first group captured; in the replacement string, \1 inserts that single captured character, so each run of repeats collapses to one character.
After running this code, the Series s will contain the following values:
0    abc
1      d
2    efe
3     gh
dtype: object
Notice that each run of repeated characters has been collapsed to a single character in every string in the Series.
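Because \w covers only letters, digits, and underscores, runs of spaces or punctuation are left alone by the pattern above. The following is a minimal sketch, using made-up sample strings, of how swapping \w for . (any character except a newline) changes the result. Whether that broader behavior is wanted depends on your data.

```python
import pandas as pd

# Hypothetical sample data containing repeated punctuation and spaces
s = pd.Series(['aa--bb', 'wow!!!', 'a  b'])

# (\w)\1+ collapses only runs of letters, digits, or underscores
word_only = s.str.replace(r'(\w)\1+', r'\1', regex=True)

# (.)\1+ collapses runs of any character except a newline
any_char = s.str.replace(r'(.)\1+', r'\1', regex=True)

print(word_only.tolist())  # ['a--b', 'wow!!!', 'a  b']
print(any_char.tolist())   # ['a-b', 'wow!', 'a b']
```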
Note: This method only removes consecutive duplicates, so it will not remove duplicates that are separated by other characters. For example, 'effffee' becomes 'efe', which still contains the letter e twice.
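If the goal is instead to keep only the first occurrence of every character, wherever the duplicates appear, a regular expression is not the most direct tool. One possible sketch is shown below. The helper name drop_duplicate_chars and the sample strings are illustrative, and the code assumes the Series contains only strings, with no missing values.

```python
import pandas as pd

# Hypothetical sample data with non-consecutive duplicates
s = pd.Series(['banana', 'mississippi', 'abcabc'])

def drop_duplicate_chars(text):
    # dict.fromkeys keeps one key per character, in first-seen order
    return ''.join(dict.fromkeys(text))

# Apply the helper to each string in the Series
unique_chars = s.apply(drop_duplicate_chars)

print(unique_chars.tolist())  # ['ban', 'misp', 'abc']
```

This relies on Python dictionaries preserving insertion order, which is guaranteed from Python 3.7 onward.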