pythonnumpy
Ben Gorman

Ben Gorman

Life's a garden. Dig it.

Advanced Array Indexing

Earlier we discussed basic array indexing techniques. Here we discuss advanced techniques and take a deeper dive into how array indexing works.

We start by setting up a 3x2x4 array of integers, foo.

foo = np.arange(3*2*4).reshape((3,2,4))
print(foo)
# [[[ 0  1  2  3]
#   [ 4  5  6  7]]
#  [[ 8  9 10 11]
#   [12 13 14 15]]
#  [[16 17 18 19]
#   [20 21 22 23]]]

Observe the result of foo[:,:,0]

print(foo[:,:,0])
# [[ 0  4]
#  [ 8 12]
#  [16 20]]

Recall the mental model for interpreting an N-dimensional array.

  1. 1-d array: imagine a row of values
  2. 2-d array: imagine a matrix of values
  3. 3-d array: imagine a row of matrices
  4. 4-d array: imagine a matrix of matrices

In this case, we can think of foo as a row of matrices where element (i,j,k) corresponds to the ith matrix, the jth row, and the kth column. So, foo[:,:,0] is requesting "every matrix, every row, column 0" resulting in this sub-array.

# [[ 0  4]
#  [ 8 12]
#  [16 20]]

What's surprising is that we started with a 3-dimensional array, but foo[:,:,0] returned a 2-dimensional array. When we select more than one column, for example like this

print(foo[:,:,[0,1]])
# [[[ 0  1]
#   [ 4  5]]
#  [[ 8  9]
#   [12 13]]
#  [[16 17]
#   [20 21]]]

the result is a 3-dimensional array as you might expect.

Subsetting with same-shape index arrays

Let's see another example, this time using 3x2 index arrays for i, j, and k indexes of foo.

i = [
  [0,2], 
  [2,0], 
  [1,1]
]
 
j = [
  [0,0], 
  [0,0], 
  [1,1]
]
 
k = [
  [0,1], 
  [0,2], 
  [0,3]
]
 
print(foo[i,j,k])
# [[ 0 17]
#  [16  2]
#  [12 15]]

Important

When every dimension is indexed with an array (or list) and each of those arrays is the same shape, the output array will be the same shape as the index arrays.

In this case each of our index arrays is 3x2, so we know our result will also be a 3x2 array.

To understand the values in the result, you could imagine zipping the three index arrays into a matrix of 3-element tuples like this

[[(0,0,0), (2,0,1)],
 [(2,0,0), (0,0,2)],
 [(1,1,0), (1,1,3)]]

Here, each tuple gives the location of the corresponding output element.

Here's another example where we use 2x2x2x2 index arrays to extract values from foo. Notice the result is also 2x2x2x2.

idx = np.zeros(shape=(2,2,2,2), dtype='int64')
result = foo[idx, idx, idx]
result.shape  # (2, 2, 2, 2)

Again, the shape of the output is dependent on the shape of the index arrays, not the shape of the array you're indexing into.

Subsetting with varying-shape index arrays

We've considered the case where every dimension has an index array and each of the index arrays are the same shape. What if every dimension has an index array, but the index arrays have different shapes?

i = [0,1]
j = [0,1]
k = [[0],
     [2],
     [3]]
 
print(foo[i,j,k])
# [[ 0 12]
#  [ 2 14]
#  [ 3 15]]

In this case, NumPy broadcasts the index arrays, in essence forming three 2x3 index arrays.

i = [[0, 1],
     [0, 1],
     [0, 1]]
 
j = [[0, 1],
     [0, 1],
     [0, 1]]
 
k = [[0, 0],
     [2, 2],
     [3, 3]]

Tip

You can do np.broadcast_arrays(i,j,k) to see how the original index arrays broadcast into identically shaped index arrays. See broadcast_arrays().

Subsetting foo with equivalently shaped index arrays is straight-forward based on what we already covered.

Subsetting with sice indexers

Lastly, we consider the case where our array contains slice indexers :. For example,

i = [0,0,2,2]
k = [[0],
     [1],
     [2]]
 
print(foo[i,:,k])
# [[[ 0  4]
#   [ 0  4]
#   [16 20]
#   [16 20]]
#  [[ 1  5]
#   [ 1  5]
#   [17 21]
#   [17 21]]
#  [[ 2  6]
#   [ 2  6]
#   [18 22]
#   [18 22]]]

What's going on here? The trick to understanding these complex scenarios is to build fully expanded, same-shaped index arrays for each dimension as follows:

broadcast all index arrays
for each slicer:
  Copy each index array N times along a new last axis where N is the size of the current slicer's dimension
  Represent the current slicer with an index array

Let's walk through that last example. We start by broadcasting the index arrays into the following (3,4) arrays.

i = [[0, 0, 2, 2],
     [0, 0, 2, 2],
     [0, 0, 2, 2]]
k = [[0, 0, 0, 0],
     [1, 1, 1, 1],
     [2, 2, 2, 2]]

Now we create an index array for j. Start with a (3,4) array of ?s.

j = [[?, ?, ?, ?],
     [?, ?, ?, ?],
     [?, ?, ?, ?]]

Since j spans the second axis of foo which is length-2, replace each ? with the indices [0,1].

j = [[[0,1], [0,1], [0,1], [0,1]],
     [[0,1], [0,1], [0,1], [0,1]],
     [[0,1], [0,1], [0,1], [0,1]]]

j is then promoted from a (3,4) array to a (3,4,2) array.

i and k are then each copied twice along a new last axis to match j's shape. The final (3,4,2) index arrays are

i = [[[0, 0],
     [0, 0],
     [2, 2],
     [2, 2]],
     [[0, 0],
     [0, 0],
     [2, 2],
     [2, 2]],
     [[0, 0],
     [0, 0],
     [2, 2],
     [2, 2]]]
j = [[[0, 1],
     [0, 1],
     [0, 1],
     [0, 1]],
     [[0, 1],
     [0, 1],
     [0, 1],
     [0, 1]],
     [[0, 1],
     [0, 1],
     [0, 1],
     [0, 1]]]
k = [[[0, 0],
     [0, 0],
     [0, 0],
     [0, 0]],
     [[1, 1],
     [1, 1],
     [1, 1],
     [1, 1]],
     [[2, 2],
     [2, 2],
     [2, 2],
     [2, 2]]]

Note

NumPy doesn't actually expand the index arrays in this way as it'd be innefficient, but NumPy's indexing algorithm is logically equivalent to this type of index array expansion.

Note

The slicer we used in this example : is a full slice but we could use fancier slices like ::2 which gets every second element or :3:-1 which gets the first three elements in reverse order.

Trailing Indexes

It's also worth mentioning that you can exclude trailing indexes in which case NumPy assumes you want all the values from those excluded dimensions. For example, foo[[0,1]] is equivalent to foo[[0,1], :, :].

Indexing with a scalar

Recall foo[:,:,0] returns a two-dimensional array even though foo is a three-dimensional array. Why?

Recall the algorithm for determining the output of a subset with slicers and index arrays.

broadcast all index arrays
for each slicer:
  Copy each index array N times along a new last axis where N is the size of the current slicer's dimension
  Represent the current slicer with an index array

We start by broadcasting all the index arrays. In this case there's only one index array - 0 - so we don't have to do any broadcasting, but what's the shape of a scalar like 0? It's actually empty or 0-dimensional. You can see this by calling

np.array(0).shape  # ()

So we have an empty dimensional index, (). Then we concatenate the size of each dimension with a slice, so we end up with a (3,2) result.

If we had wrapped the 0 index in square brackets, the final result would be a (3,2,1) array.

foo[:,:,[0]].shape  # (3, 2, 1)

View vs Copy

It's important to understand when array indexing produces a view and when it produces a copy.

Suppose we have this 2d array, spongebob

spongebob = np.arange(12).reshape(3,-1)
print(spongebob)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

We can pick out the first two columns using array indexers or slicing.

squidward = spongebob[:, [0,1]]
patrick = spongebob[:, :2]

On the surface these techniques seem to produce the same result

np.all(patrick == squidward)  # True

but under the hood, these techniques are subtly different. squidward is a copy of spongebob while patrick is a view of spongebob.

Observe that if we modify squidward, spongebob is unchanged.

squidward[0,0] = 100
print(spongebob)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

By contrast, if we modify patrick, spongebob changes too.

patrick[0,0] = 100
print(spongebob)
# [[100   1   2   3]
#  [  4   5   6   7]
#  [  8   9  10  11]]

Array memory addresses

Another way to observe this is by looking up the memory address of the beginning of each array.

spongebob.__array_interface__  # {'data': (105553169745024, False), ...
squidward.__array_interface__  # {'data': (105553142497328, False), ...
patrick.__array_interface__    # {'data': (105553169745024, False), ...

It's clear that spongebob and patrick live at the same memory address, but the location of squidward is elsewhere.

Warning

The memory addresses listed here are the address of the beginning of each array. If two addresses are different, the arrays don't start at the same place, but portions of the arrays may still overlap!

A better technique to check if arrays share memory is to use NumPy's shares_memory() function.

np.shares_memory(spongebob, squidward)  # False
np.shares_memory(spongebob, patrick)    # True

How to force copy an array

In, general, when you subset an array using nothing but slices, NumPy returns a view of the original array. You can force numpy to copy the data using the array copy() method like this.

spongebob[:, :2].copy()

Note

When you subset an array using at least one index array, numpy will automatically copy the data.