Fed up with your city's roads, you go around collecting data on potholes
in your area. Due to an unfortunate ☕ coffee spill, you lost bits and pieces of your data.
import numpy as np
import pandas as pd
potholes = pd.DataFrame({
    'length': [5.1, np.nan, 6.2, 4.3, 6.0, 5.1, 6.5, 4.3, np.nan, np.nan],
    'width': [2.8, 5.8, 6.5, 6.1, 5.8, np.nan, 6.3, 6.1, 5.4, 5.0],
    'depth': [2.6, np.nan, 4.2, 0.8, 2.6, np.nan, 3.9, 4.8, 4.0, np.nan],
    'location': pd.Series(['center', 'north edge', np.nan, 'center', 'north edge', 'center', 'west edge',
                           'west edge', np.nan, np.nan], dtype='string')
})
print(potholes)
#    length  width  depth    location
# 0     5.1    2.8    2.6      center
# 1     NaN    5.8    NaN  north edge
# 2     6.2    6.5    4.2        <NA>
# 3     4.3    6.1    0.8      center
# 4     6.0    5.8    2.6  north edge
# 5     5.1    NaN    NaN      center
# 6     6.5    6.3    3.9   west edge
# 7     4.3    6.1    4.8   west edge
# 8     NaN    5.4    4.0        <NA>
# 9     NaN    5.0    NaN        <NA>
Given your DataFrame of pothole measurements, discard rows where more than half the values are NaN.
In the remaining rows, impute each NaN with its column's mean,
unless the column is non-numeric, in which case use the column's mode.
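One possible approach is sketched below. It is not necessarily the intended solution. It assumes that the means and modes are computed after the sparse rows are dropped, which the task statement does not specify, and that a row with exactly half its values missing is kept. The names `kept`, `fill_values`, and `result` are illustrative.

```python
import numpy as np
import pandas as pd

potholes = pd.DataFrame({
    'length': [5.1, np.nan, 6.2, 4.3, 6.0, 5.1, 6.5, 4.3, np.nan, np.nan],
    'width': [2.8, 5.8, 6.5, 6.1, 5.8, np.nan, 6.3, 6.1, 5.4, 5.0],
    'depth': [2.6, np.nan, 4.2, 0.8, 2.6, np.nan, 3.9, 4.8, 4.0, np.nan],
    'location': pd.Series(['center', 'north edge', np.nan, 'center', 'north edge',
                           'center', 'west edge', 'west edge', np.nan, np.nan],
                          dtype='string')
})

# Fraction of missing values per row; keep rows where it is at most one half.
kept = potholes.loc[potholes.isna().mean(axis=1) <= 0.5]

# One fill value per column: mean for numeric columns, mode otherwise.
# Assumption: statistics are taken over the kept rows only.
fill_values = {}
for col in kept.columns:
    if pd.api.types.is_numeric_dtype(kept[col]):
        fill_values[col] = kept[col].mean()
    else:
        # mode() can return several values on ties; take the first.
        fill_values[col] = kept[col].mode().iloc[0]

result = kept.fillna(fill_values)
print(result)
```

With this data only row 9 has more than half of its four values missing, so it is the only row dropped. Rows 1, 5, and 8 are missing exactly half and are kept.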