Jun 05, 2024python numpy scipy sparse-matrix

Tinder Coach

Life's a garden. Dig it.

You operate an online business called tindercoach.com where you give people advice on their Tinder profiles ❤️‍🔥. You have a dictionary of visits indicating how many times each visitor_id visited each page on your site.

import random
import string
from collections import defaultdict
 
Npages = 10
Nvisitors = 10
Nvisits = 100
 
random.seed(2357)
visitor_ids = list(set(random.randint(1000, 9999) for i in range(Nvisitors)))
pages = list(set('tindercoach.com/' + ''.join(random.choices(string.ascii_lowercase, k=3)) for i in range(Nvisits)))
 
visits = defaultdict(int)
for i in range(Nvisits):
    key = (random.choice(visitor_ids), random.choice(pages))
    visits[key] += 1
 
print(visits)
# defaultdict(<class 'int'>, {
#  (3654, 'tindercoach.com/bgr'): 1, 
#  (1443, 'tindercoach.com/nky'): 1, 
#  (3654, 'tindercoach.com/wpb'): 1, 
#  ..., 
#  (3181, 'tindercoach.com/jam'): 1, 
#  (5502, 'tindercoach.com/cjp'): 1, 
#  (5502, 'tindercoach.com/tjk'): 1
# })

Convert visits into a Compressed Sparse Column (CSC) matrix where element (i,j) stores the number of times visitor i visited page j.

Then print the sub-matrix showing how many times visitors 1443, 6584, and 7040 visited pages tindercoach.com/chl, tindercoach.com/nky, and tindercoach.com/zmr.

Try with Google Colab

Solution¶

import numpy as np
from scipy import sparse
 
# build visitor_id:row_index mapping
row_keys = {}
for idx, val in enumerate(visitor_ids):
  row_keys[val] = idx
 
# build page:col_index mapping
col_keys = {}
for idx, val in enumerate(pages):
  col_keys[val] = idx
 
# determine the row & col index for each data point
row_idxs = []
col_idxs = []
for key in visits.keys():
    visitor_id = key[0]
    page = key[1]
    row_idxs.append(row_keys[visitor_id])
    col_idxs.append(col_keys[page])
 
# Build the csc matrix
data = list(visits.values())
mat = sparse.csc_matrix((data, (row_idxs, col_idxs)))
 
# print subset
print_visitor_ids = [3654, 1443, 3654]
print_pages = ['tindercoach.com/bgr', 'tindercoach.com/nky', 'tindercoach.com/wpb']
print_row_idxs = [row_keys[x] for x in print_visitor_ids]
print_col_idxs = [col_keys[x] for x in print_pages]
print(mat[np.ix_(print_row_idxs, print_col_idxs)].todense())
# [[0 1 0]
# [2 0 1]
# [1 0 0]]

Explanation

Our strategy is to build three lists:

data: contains the non-zero elements of the matrix
row_idxs: contains the row index of each non-zero element
col_idxs: contains the column index of each non-zero element

Then we can instantiate a CSC matrix with sparse.csc_matrix((data, (row_idxs, col_idxs))).

Build mappings for visitor_id:row_index and page:col_index.

# build visitor_id:row_index mapping
row_keys = {}
for idx, val in enumerate(visitor_ids):
  row_keys[val] = idx
 
print(row_keys)
# {
#  7040: 0, 
#  4545: 1, 
#  ..., 
#  6584: 8, 
#  5502: 9
# }
 
# build page:col_index mapping
col_keys = {}
for idx, val in enumerate(pages):
  col_keys[val] = idx
 
print(col_keys)
# {
#  'tindercoach.com/gqy': 0, 
#  'tindercoach.com/yez': 1, 
#  ..., 
#  'tindercoach.com/eaw': 98, 
#  'tindercoach.com/eow': 99
# }

These mappings allow us to get the row / column index for any visitor_id / page.

Determine the row and column index for each data point.
```
row_idxs = []
col_idxs = []
for key in visits.keys():
    visitor_id = key[0]                   
    page = key[1]                         
    row_idxs.append(row_keys[visitor_id]) 
    col_idxs.append(col_keys[page])       
```
The keys in visits are tuples of (visitor_id, page)s. Here we fetch the visitor_id from the first position of the current key tuple.

The keys in visits are tuples of (visitor_id, page)s. Here we fetch the page from the second position of the current key tuple.

row_keys[visitor_id] gives us the row index for the current visitor_id. We append this to row_idxs.

col_keys[page] gives us the row index for the current page. We append this to col_idxs.
Build the csc_matrix.
```
data = list(visits.values())  
mat = sparse.csc_matrix((data, (row_idxs, col_idxs)))
```
visits.values() returns the number of times each visitor visited each page.
```
print(visits.values())
# dict_values([1, 1, 1, 1, 2, 1, 1, 2, ...])
```
..but it returns this as a dict_values instance. We convert it to list so it can be properly handled by csc_matrix() in the next step.
Print the sub matrix showing visitors 1443, 6584, and 7040 and pages tindercoach.com/chl, tindercoach.com/nky, and tindercoach.com/zmr.

We can fetch the appropriate row and column indices as follows:
```
print_visitor_ids = [1443, 6584, 7040]
print_pages = ['tindercoach.com/chl', 'tindercoach.com/nky', 'tindercoach.com/zmr']
 
print_row_idxs = [row_keys[x] for x in print_visitor_ids]
print_col_idxs = [col_keys[x] for x in print_pages]
 
print(print_row_idxs)
# [2, 8, 0]
 
print(print_col_idxs)
# [38, 17, 86]
```
To fetch the sub matrix indexed by these row and column indices, we can do
```
submat = mat[np.ix_(print_row_idxs, print_col_idxs)]
 
print(submat.todense())
# [[0 1 0]
#  [2 0 1]
#  [1 0 0]]
```
If we simply did mat[print_row_idxs, print_col_idxs], scipy would fetch three elements from the matrix; the elements at positions: (2, 38), (8, 17), and (0, 86). This is consistent with NumPy array indexing behavior, but it's not the behavior we desire.

Rather, we want to fetch all combinations of (row, col) indices from our lists print_row_idxs and print_col_idxs. (9 combinations in total, forming a 3x3 sub matrix.)

We use np.ix_() to accomplish this. np.ix_(print_row_idxs, print_col_idxs) constructs an open mesh from the input lists / arrays.
np.ix_(print_row_idxs, print_col_idxs) # (array([[2], # [8], # [0]]), array([[38, 17, 86]]))
When these arrays are used to index mat, they are essentially broadcasted into the 3x3 i and j index arrays.
i index array (after broadcasting) [[2 2 2] [8 8 8] [0 0 0]] j index array (after broadcasting) [[38 17 86] [38 17 86] [38 17 86]]