3.9. Cleaning Columns#
import pandas as pd
import numpy as np
df=pd.DataFrame({
'First Name':['Sahil','Sonia','Sourav','Vishal'],
'Age^':[10,20,30,40],
'Ge%nder%':['M','F','M','M'],
'City':['J','K','L','P'],
'Place of Work':[True,False,False,True],
'Place-of-Work':[True,False,False,True]
}
)
df
First Name | Age^ | Ge%nder% | City | Place of Work | Place-of-Work | |
---|---|---|---|---|---|---|
0 | Sahil | 10 | M | J | True | True |
1 | Sonia | 20 | F | K | False | False |
2 | Sourav | 30 | M | L | False | False |
3 | Vishal | 40 | M | P | True | True |
3.9.1. Remove spaces from names#
df2=df.copy(deep=True)
df2.columns=df2.columns.str.replace(' +','_')
/var/folders/y0/7jpzlk652q19ghz27zs91vt40000gp/T/ipykernel_29728/95460965.py:1: FutureWarning: The default value of regex will change from True to False in a future version.
df2.columns=df2.columns.str.replace(' +','_')
df2
First_Name | Age^ | Ge%nder% | City | Place_of_Work | Place-of-Work | |
---|---|---|---|---|---|---|
0 | Sahil | 10 | M | J | True | True |
1 | Sonia | 20 | F | K | False | False |
2 | Sourav | 30 | M | L | False | False |
3 | Vishal | 40 | M | P | True | True |
3.9.2. Remove Special Characters from Names#
df2=df.copy(deep=True)
df2
First Name | Age^ | Ge%nder% | City | Place of Work | Place-of-Work | |
---|---|---|---|---|---|---|
0 | Sahil | 10 | M | J | True | True |
1 | Sonia | 20 | F | K | False | False |
2 | Sourav | 30 | M | L | False | False |
3 | Vishal | 40 | M | P | True | True |
# df2.columns = df2.columns.str.replace(' +', '_') # + means n numbers
# Rename specific columns explicitly if needed
df2.rename({'Ge%nder%':'Gender'},axis=1,inplace=True)
# If you want to do this step only if column of this name exists,then wrap it in if condition:
if 'Ge%nder%' in df2.columns:
df2.rename({'Ge%nder%':'Gender'},axis=1,inplace=True)
df2
First Name | Age^ | Gender | City | Place of Work | Place-of-Work | |
---|---|---|---|---|---|---|
0 | Sahil | 10 | M | J | True | True |
1 | Sonia | 20 | F | K | False | False |
2 | Sourav | 30 | M | L | False | False |
3 | Vishal | 40 | M | P | True | True |
df2.columns=df2.columns.str.replace('[^A-Za-z]+', '_')
/var/folders/y0/7jpzlk652q19ghz27zs91vt40000gp/T/ipykernel_29728/3413768078.py:1: FutureWarning: The default value of regex will change from True to False in a future version.
df2.columns=df2.columns.str.replace('[^A-Za-z]+', '_')
df2
First_Name | Age_ | Gender | City | Place_of_Work | Place_of_Work | |
---|---|---|---|---|---|---|
0 | Sahil | 10 | M | J | True | True |
1 | Sonia | 20 | F | K | False | False |
2 | Sourav | 30 | M | L | False | False |
3 | Vishal | 40 | M | P | True | True |
df2.columns = df2.columns.str.strip('_') # used to remve specific characters from first and last position
df2
First_Name | Age | Gender | City | Place_of_Work | Place_of_Work | |
---|---|---|---|---|---|---|
0 | Sahil | 10 | M | J | True | True |
1 | Sonia | 20 | F | K | False | False |
2 | Sourav | 30 | M | L | False | False |
3 | Vishal | 40 | M | P | True | True |
3.9.2.1. Patterns#
[^A-Za-z]+ - If you just want to ignore A-Z or a-z alphabets
[^A-Za-z\s]+ - If you just want to ignore A-Z or a-z alphabets and spaces
[^A-Za-z0-9]+ If you wantto ignore alphabets + numbers
3.9.2.2. Note#
All the string functions (like replace) can be chained as well
df[‘col’].str.replace(‘%’,’’).str.replace(‘$’,’’)