A sparse matrix is a matrix with a bunch of 0s, like this.
Numerous methods exist to store sparse matrices in compressed formats that consume significantly less memory than typical dense1 matrices.
Compressed sparse matrices in Python are most commonly handled with the sparse module from scipy. Accordingly, this problem set tests your knowledge of scipy's sparse matrices.
Do you know NumPy?
scipy is built on top of NumPy. We highly recommend you learn some NumPy before learning scipy sparse matrices.
How to install scipy¶
How to import scipy's sparse module¶
Sparse Matrix Compression Formats¶
Sparse matrices can be stored in various compressed formats, each with their own pros and cons.
Format | Details | Pros | Cons |
---|---|---|---|
Dictionary Of Keys (DOK) | uses dictionary to map (row, column)-pairs to the value of the elements | - fast access to individual elements - good for incremental matrix construction | - slow access to rows, cols |
COOrdinate Format (COO) | stores a list of (row, column, value) tuples | - good for incremental matrix construction | - slow access to elements - slow access rows, cols |
List of Lists (LIL) | stores one list per row, with each entry containing the column index and the value | - good for incremental matrix construction | - slow access to elements - slow access to rows, cols |
Compressed Sparse Row (CSR) | Stores three arrays: - non-zero entries - column indices of non-zero entries - cumulative count of non-zero entries, per row (prepended by 0) | - fast row access and matrix-vector multiplications | - can be bad for incremental matrix construction |
Compressed Sparse Column (CSC) | Stores three arrays: - non-zero entries - row indices of non-zero entries - cumulative count of non-zero entries, per column (prepended by 0) | - fast column access and matrix-vector multiplications | - can be bad for incremental matrix construction |
Compressed Sparse Row (CSR)¶
Suppose you have a sparse matrix like this.
To store this matrix in CSR format requires three arrays:
data
: an array containing the value of each non-zero element
In this example:[3, 1, 1, 2, 1]
indices
: an array containing the column index of each non-zero element
In this example:[2, 4, 0, 3, 4]
indptr
: an array containing the cumulative count of non-zero elements per row of the matrix, prepended by 0 In this example:[0, 1, 2, 4, 4, 5]
You can confirm this in scipy by converting uncompressed
to a csr_matrix
and inspecting its data
, indices
, and indptr
attributes.
Why is this useful?
It's incredibly efficient to reconstruct and operate on rows of this matrix because indptr
provides a direct way to access the non-zero elements of any row. For example consider the following request:
Fetch the non-zero elements in the third row of the matrix (row index 2).
To accomplish this,
-
Look up the number of non-zero elements that occur before row index 2.
-
Look up the number of non-zero elements that occur on or before row index 2.
-
Use those values as slice indices to fetch the corresponding
data
elements. -
Fetching their corresponding column indices is just as easy.
Compressed Sparse Column (CSC)¶
Compressed Sparse Column format is just a reflection of Compressed Sparse Row format. For example, suppose you have the following sparse matrix.
To store this matrix in CSC format requires three arrays:
data
: an array containing the value of each non-zero element
In this example:[1, 3, 2, 1, 1]
indices
: an array containing the row index of each non-zero element
In this example:[2, 0, 2, 1, 4]
indptr
: an array containing the cumulative count of non-zero elements per column of the matrix, prepended by 0 In this example:[0, 1, 1, 2, 3, 5]
You can confirm this in scipy by converting uncompressed
to a csc_matrix
and inspecting its data
, indices
, and indptr
attributes.
Why is this useful?
It's incredibly efficient to reconstruct and operate on columns of this matrix because indptr
provides a direct way to access the non-zero elements of any column. For example consider the following request:
Fetch the non-zero elements in the fifth column of the matrix (column index 4).
To accomplish this,
-
Look up the number of non-zero elements that occur before column index 4.
-
Look up the number of non-zero elements that occur on or before column index 4.
-
Use those values to fetch the corresponding
data
elements. -
Fetching their corresponding row indices is just as easy.
Basic Use¶
Instantiation¶
There are many ways to instantiate a compressed sparse matrix in scipy. Perhaps the most common technique is to use (row, col, val) triplets. For example,
It's also common to convert a NumPy array to a compressed sparse matrix.
This is generally considered bad practice. If your goal is to build a compressed sparse matrix to save memory, building a numpy array at any stage in the process defeats this purpose. Seek a better instantiation method.
Data Access¶
Given a compressed sparse matrix, mat
,
You can access rows and columns just like a NumPy array.
How do I print the matrix?¶
print(mat)
prints the non-zero entries of the matrix.
If you want to actually print the matrix, convert it to a NumPy array.
Beware
This can blow up your computer if the sparse matrix is very large.
Footnotes
-
A dense matrix is one in which most of the elements are non-zero. However, scipy (and others) use the term "dense" to refer to a matrix that's not in compressed format. ↩