3.9. Cleaning Columns#

import pandas as pd
import numpy as np
df=pd.DataFrame({
'First Name':['Sahil','Sonia','Sourav','Vishal'],
'Age^':[10,20,30,40],
'Ge%nder%':['M','F','M','M'],
'City':['J','K','L','P'],
'Place of      Work':[True,False,False,True],
 'Place-of-Work':[True,False,False,True]
}
)
df
First Name Age^ Ge%nder% City Place of Work Place-of-Work
0 Sahil 10 M J True True
1 Sonia 20 F K False False
2 Sourav 30 M L False False
3 Vishal 40 M P True True

3.9.1. Remove spaces from names#

df2=df.copy(deep=True)
df2.columns=df2.columns.str.replace(' +','_')
/var/folders/y0/7jpzlk652q19ghz27zs91vt40000gp/T/ipykernel_29728/95460965.py:1: FutureWarning: The default value of regex will change from True to False in a future version.
  df2.columns=df2.columns.str.replace(' +','_')
df2
First_Name Age^ Ge%nder% City Place_of_Work Place-of-Work
0 Sahil 10 M J True True
1 Sonia 20 F K False False
2 Sourav 30 M L False False
3 Vishal 40 M P True True

3.9.2. Remove Special Characters from Names#

df2=df.copy(deep=True)
df2
First Name Age^ Ge%nder% City Place of Work Place-of-Work
0 Sahil 10 M J True True
1 Sonia 20 F K False False
2 Sourav 30 M L False False
3 Vishal 40 M P True True
# df2.columns = df2.columns.str.replace(' +', '_') # + means n numbers

# Rename specific columns explicitly if needed
df2.rename({'Ge%nder%':'Gender'},axis=1,inplace=True)

# If you want to do this step only if column of this name exists,then wrap it in if condition:
if 'Ge%nder%' in df2.columns:
    df2.rename({'Ge%nder%':'Gender'},axis=1,inplace=True)
df2
First Name Age^ Gender City Place of Work Place-of-Work
0 Sahil 10 M J True True
1 Sonia 20 F K False False
2 Sourav 30 M L False False
3 Vishal 40 M P True True
df2.columns=df2.columns.str.replace('[^A-Za-z]+', '_')
/var/folders/y0/7jpzlk652q19ghz27zs91vt40000gp/T/ipykernel_29728/3413768078.py:1: FutureWarning: The default value of regex will change from True to False in a future version.
  df2.columns=df2.columns.str.replace('[^A-Za-z]+', '_')
df2
First_Name Age_ Gender City Place_of_Work Place_of_Work
0 Sahil 10 M J True True
1 Sonia 20 F K False False
2 Sourav 30 M L False False
3 Vishal 40 M P True True
df2.columns = df2.columns.str.strip('_')    # used to remve specific characters from first and last position 
df2
First_Name Age Gender City Place_of_Work Place_of_Work
0 Sahil 10 M J True True
1 Sonia 20 F K False False
2 Sourav 30 M L False False
3 Vishal 40 M P True True

3.9.2.1. Patterns#

  • [^A-Za-z]+ - If you just want to ignore A-Z or a-z alphabets

  • [^A-Za-z\s]+ - If you just want to ignore A-Z or a-z alphabets and spaces

  • [^A-Za-z0-9]+ If you wantto ignore alphabets + numbers

3.9.2.2. Note#

  • All the string functions (like replace) can be chained as well

    • df[‘col’].str.replace(‘%’,’’).str.replace(‘$’,’’)