You operate a dog shelter š¶, and you're building a model to predict the probability of a dog getting adopted based on its age, weight, and breed. You have the following data.
import numpy as np
import pandas as pd
# dog breeds list
breeds = [
'english pointer' , 'english setter' , 'kerry blue terrier' , 'cairn terrier' , 'english cocker spaniel' ,
'gordon setter' , 'airedale terrier' , 'australian terrier' , 'bedlington terrier' , 'border terrier' ,
'bull terrier' , 'fox terrier (smooth)' , 'english toy terrier (black &tan)' , 'swedish vallhund' ,
'belgian shepherd dog' , 'old english sheepdog' , 'griffon nivernais' , 'briquet griffon vendeen' ,
'ariegeois' , 'gascon saintongeois' , 'great gascony blue' , 'poitevin' , 'billy' , 'artois hound' ,
'porcelaine' , 'small blue gascony' , 'blue gascony griffon' , 'grand basset griffon vendeen' ,
'norman artesien basset' , 'blue gascony basset'
]
# random generator
rng = np.random.default_rng( 1 )
# training DataFrame
Ndogs = 10
dogs = pd.DataFrame({
'id' : rng.choice( 1000000 , size = Ndogs, replace = False ),
'age' : rng.uniform( low = 0 , high = 16 , size = Ndogs).round( 0 ),
'weight' : rng.uniform( low = 10 , high = 115 , size = Ndogs).round( 1 ),
'breed' : pd.Categorical(rng.choice(breeds[: 25 ], size = Ndogs, replace = True ), categories = breeds),
'adopted' : rng.choice([ True , False ], size = Ndogs, replace = True )
})
# Insert some NaNs
dogs.iloc[rng.choice(Ndogs, size = int (Ndogs * 0.25 ), replace = False ), 1 ] = np.nan
print (dogs)
# id age weight breed adopted
# 0 311831 NaN 88.8 english pointer False
# 1 473184 9.0 39.4 bull terrier False
# 2 822941 5.0 60.9 english toy terrier (black &tan) False
# 3 34852 13.0 113.0 australian terrier True
# 4 948647 NaN 111.0 kerry blue terrier True
# 5 511817 7.0 86.1 bull terrier False
# 6 144159 2.0 66.8 old english sheepdog False
# 7 755162 6.0 39.1 fox terrier (smooth) False
# 8 950457 3.0 26.9 gascon saintongeois True
# 9 249228 4.0 111.8 border terrier True
Build a compressed sparse column matrix to represent the training features. Be sure to one-hot-encode the dog breeds into indicator columns, one for each possible breed (not just the observed breeds š).
The output matrix should look something like this:
age weight is_english_pointer is_english_setter ...
0 NaN 88.8 1.0 0.0
1 9.0 39.4 0.0 0.0
2 5.0 60.9 0.0 0.0
3 13.0 113.0 0.0 0.0
4 NaN 111.0 0.0 0.0
...
SolutionĀ¶
This content is gated
Subscribe to the product below to gain access