You operate an online business called tindercoach.com where you give people advice on their Tinder profiles ❤️🔥. You have a dictionary of visits
indicating how many times each visitor_id
visited each page
on your site.
import random
import string
from collections import defaultdict
Npages = 10
Nvisitors = 10
Nvisits = 100
random.seed(2357)
visitor_ids = list(set(random.randint(1000, 9999) for i in range(Nvisitors)))
pages = list(set('tindercoach.com/' + ''.join(random.choices(string.ascii_lowercase, k=3)) for i in range(Nvisits)))
visits = defaultdict(int)
for i in range(Nvisits):
key = (random.choice(visitor_ids), random.choice(pages))
visits[key] += 1
print(visits)
# defaultdict(<class 'int'>, {
# (3654, 'tindercoach.com/bgr'): 1,
# (1443, 'tindercoach.com/nky'): 1,
# (3654, 'tindercoach.com/wpb'): 1,
# ...,
# (3181, 'tindercoach.com/jam'): 1,
# (5502, 'tindercoach.com/cjp'): 1,
# (5502, 'tindercoach.com/tjk'): 1
# })
Convert visits
into a Compressed Sparse Column (CSC) matrix where element (i,j) stores the number of times visitor i visited page j.
Then print the sub-matrix showing how many times visitors 1443, 6584, and 7040 visited pages tindercoach.com/chl, tindercoach.com/nky, and tindercoach.com/zmr.
Solution¶
import numpy as np
from scipy import sparse
# build visitor_id:row_index mapping
row_keys = {}
for idx, val in enumerate(visitor_ids):
row_keys[val] = idx
# build page:col_index mapping
col_keys = {}
for idx, val in enumerate(pages):
col_keys[val] = idx
# determine the row & col index for each data point
row_idxs = []
col_idxs = []
for key in visits.keys():
visitor_id = key[0]
page = key[1]
row_idxs.append(row_keys[visitor_id])
col_idxs.append(col_keys[page])
# Build the csc matrix
data = list(visits.values())
mat = sparse.csc_matrix((data, (row_idxs, col_idxs)))
# print subset
print_visitor_ids = [3654, 1443, 3654]
print_pages = ['tindercoach.com/bgr', 'tindercoach.com/nky', 'tindercoach.com/wpb']
print_row_idxs = [row_keys[x] for x in print_visitor_ids]
print_col_idxs = [col_keys[x] for x in print_pages]
print(mat[np.ix_(print_row_idxs, print_col_idxs)].todense())
# [[0 1 0]
# [2 0 1]
# [1 0 0]]
Explanation
Our strategy is to build three lists:
data
: contains the non-zero elements of the matrixrow_idxs
: contains the row index of each non-zero elementcol_idxs
: contains the column index of each non-zero element
Then we can instantiate a CSC matrix with sparse.csc_matrix((data, (row_idxs, col_idxs)))
.
-
Build mappings for
visitor_id:row_index
andpage:col_index
.# build visitor_id:row_index mapping row_keys = {} for idx, val in enumerate(visitor_ids): row_keys[val] = idx print(row_keys) # { # 7040: 0, # 4545: 1, # ..., # 6584: 8, # 5502: 9 # } # build page:col_index mapping col_keys = {} for idx, val in enumerate(pages): col_keys[val] = idx print(col_keys) # { # 'tindercoach.com/gqy': 0, # 'tindercoach.com/yez': 1, # ..., # 'tindercoach.com/eaw': 98, # 'tindercoach.com/eow': 99 # }
These mappings allow us to get the row / column index for any visitor_id / page.
-
Determine the row and column index for each data point.
row_idxs = [] col_idxs = [] for key in visits.keys(): visitor_id = key[0] page = key[1] row_idxs.append(row_keys[visitor_id]) col_idxs.append(col_keys[page])
-
Build the
csc_matrix
.data = list(visits.values()) mat = sparse.csc_matrix((data, (row_idxs, col_idxs)))
-
Print the sub matrix showing visitors 1443, 6584, and 7040 and pages tindercoach.com/chl, tindercoach.com/nky, and tindercoach.com/zmr.
We can fetch the appropriate row and column indices as follows:
print_visitor_ids = [1443, 6584, 7040] print_pages = ['tindercoach.com/chl', 'tindercoach.com/nky', 'tindercoach.com/zmr'] print_row_idxs = [row_keys[x] for x in print_visitor_ids] print_col_idxs = [col_keys[x] for x in print_pages] print(print_row_idxs) # [2, 8, 0] print(print_col_idxs) # [38, 17, 86]
To fetch the sub matrix indexed by these row and column indices, we can do
submat = mat[np.ix_(print_row_idxs, print_col_idxs)] print(submat.todense()) # [[0 1 0] # [2 0 1] # [1 0 0]]
If we simply did
mat[print_row_idxs, print_col_idxs]
, scipy would fetch three elements from the matrix; the elements at positions:(2, 38)
,(8, 17)
, and(0, 86)
. This is consistent with NumPy array indexing behavior, but it's not the behavior we desire.Rather, we want to fetch all combinations of (row, col) indices from our lists
print_row_idxs
andprint_col_idxs
. (9 combinations in total, forming a 3x3 sub matrix.)We use
np.ix_()
to accomplish this.np.ix_(print_row_idxs, print_col_idxs)
constructs an open mesh from the input lists / arrays.np.ix_(print_row_idxs, print_col_idxs) # (array([[2], # [8], # [0]]), array([[38, 17, 86]]))
When these arrays are used to index
mat
, they are essentially broadcasted into the 3x3 i and j index arrays.i index array (after broadcasting) [[2 2 2] [8 8 8] [0 0 0]] j index array (after broadcasting) [[38 17 86] [38 17 86] [38 17 86]]