Cleaning Columns

Contents

3.9. Cleaning Columns#

import pandas as pd
import numpy as np

df=pd.DataFrame({
'First Name':['Sahil','Sonia','Sourav','Vishal'],
'Age^':[10,20,30,40],
'Ge%nder%':['M','F','M','M'],
'City':['J','K','L','P'],
'Place of      Work':[True,False,False,True],
 'Place-of-Work':[True,False,False,True]
}
)
df

	First Name	Age^	Ge%nder%	City	Place of Work	Place-of-Work
0	Sahil	10	M	J	True	True
1	Sonia	20	F	K	False	False
2	Sourav	30	M	L	False	False
3	Vishal	40	M	P	True	True

3.9.1. Remove spaces from names#

df2=df.copy(deep=True)

df2.columns=df2.columns.str.replace(' +','_')

/var/folders/y0/7jpzlk652q19ghz27zs91vt40000gp/T/ipykernel_29728/95460965.py:1: FutureWarning: The default value of regex will change from True to False in a future version.
  df2.columns=df2.columns.str.replace(' +','_')

df2

	First_Name	Age^	Ge%nder%	City	Place_of_Work	Place-of-Work
0	Sahil	10	M	J	True	True
1	Sonia	20	F	K	False	False
2	Sourav	30	M	L	False	False
3	Vishal	40	M	P	True	True

3.9.2. Remove Special Characters from Names#

df2=df.copy(deep=True)

df2

	First Name	Age^	Ge%nder%	City	Place of Work	Place-of-Work
0	Sahil	10	M	J	True	True
1	Sonia	20	F	K	False	False
2	Sourav	30	M	L	False	False
3	Vishal	40	M	P	True	True

# df2.columns = df2.columns.str.replace(' +', '_') # + means n numbers

# Rename specific columns explicitly if needed
df2.rename({'Ge%nder%':'Gender'},axis=1,inplace=True)

# If you want to do this step only if column of this name exists,then wrap it in if condition:
if 'Ge%nder%' in df2.columns:
    df2.rename({'Ge%nder%':'Gender'},axis=1,inplace=True)

df2

	First Name	Age^	Gender	City	Place of Work	Place-of-Work
0	Sahil	10	M	J	True	True
1	Sonia	20	F	K	False	False
2	Sourav	30	M	L	False	False
3	Vishal	40	M	P	True	True

df2.columns=df2.columns.str.replace('[^A-Za-z]+', '_')

/var/folders/y0/7jpzlk652q19ghz27zs91vt40000gp/T/ipykernel_29728/3413768078.py:1: FutureWarning: The default value of regex will change from True to False in a future version.
  df2.columns=df2.columns.str.replace('[^A-Za-z]+', '_')

df2

	First_Name	Age_	Gender	City	Place_of_Work	Place_of_Work
0	Sahil	10	M	J	True	True
1	Sonia	20	F	K	False	False
2	Sourav	30	M	L	False	False
3	Vishal	40	M	P	True	True

df2.columns = df2.columns.str.strip('_')    # used to remve specific characters from first and last position 

df2

	First_Name	Age	Gender	City	Place_of_Work	Place_of_Work
0	Sahil	10	M	J	True	True
1	Sonia	20	F	K	False	False
2	Sourav	30	M	L	False	False
3	Vishal	40	M	P	True	True

3.9.2.1. Patterns#

[^A-Za-z]+ - If you just want to ignore A-Z or a-z alphabets
[^A-Za-z\s]+ - If you just want to ignore A-Z or a-z alphabets and spaces
[^A-Za-z0-9]+ If you wantto ignore alphabets + numbers

3.9.2.2. Note#

All the string functions (like replace) can be chained as well
- df[‘col’].str.replace(‘%’,’’).str.replace(‘$’,’’)