Apr 10, 2024 | python, numpy, pandas

Series

Life's a garden. Dig it.

What's a Series?¶

Series is one of the fundamental data structures in pandas. It's essentially an array with an index. Because it's an array, every value in a Series must be of the same type. You can have a Series of ints, a Series of floats, or a Series of booleans, but you can't have a Series of ints, floats and booleans together.
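As a rough illustration of the "one type per Series" rule, the sketch below builds a Series from a list that mixes ints and a float. The variable names are made up for this example. pandas settles on a single dtype that can hold every value, so the ints are stored as floats.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
mixed = pd.Series([1, 2.5, 3])  # ints and a float in the same list

print(ints.dtype)   # int64
print(mixed.dtype)  # float64, the ints were converted to floats
```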

Series Documentation¶

You'll want to familiarize yourself with pandas' documentation. Here's the documentation for Series. It's the first place you should look when you have questions about a Series or Series method.

Series Creation¶

How to make a Series from a list¶

The easiest way to make a series is from a list.

import pandas as pd
x = pd.Series([5, 10, 15, 20, 25, 30, 35])

If we print the series, we get back something like this

print(x)
# 0     5
# 1    10
# 2    15
# 3    20
# 4    25
# 5    30
# 6    35
# dtype: int64

Notice how it already looks a bit different from a NumPy array. The column of values on the left is the Series index, which you can use to access the Series elements in creative and meaningful ways. More on that later.

Also notice the output includes 'dtype: int64', which tells us the data type of the elements in the Series.

How to check if an object is a Series¶

You can use Python's type() function to check that x is indeed a Series object.

type(x)  # pandas.core.series.Series

How to check the type of data stored in a Series¶

If you want to check the internal data type of the Series elements without printing the whole Series, you can use the Series.dtype attribute.

x.dtype  # int64

How to access the underlying NumPy array¶

Most pandas Series store the underlying data as a NumPy array. You can access the underlying NumPy array via Series.to_numpy().

x.to_numpy()
# array([ 5, 10, 15, 20, 25, 30, 35])

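You might also see people using the Series.values attribute here, but the pandas documentation recommends Series.to_numpy() instead, because the type of object returned by Series.values depends on the Series dtype (it isn't always a NumPy array).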

How to access the first N elements of a Series¶

You can use the highly popular Series.head() method to pick out the first N elements of a Series. For example, x.head(6) returns the first 6 elements of x as a new Series.

x.head(6)
# 0     5
# 1    10
# 2    15
# 3    20
# 4    25
# 5    30
# dtype: int64

How to access the last N elements of a Series¶

You can use Series.tail() to pick out the last N elements of a Series. For example, x.tail(3) returns the last 3 elements of x as a new Series.

x.tail(3)
# 4    25
# 5    30
# 6    35
# dtype: int64

How to make a Series from a dictionary¶

You can make a Series from a Python dictionary, like this

data = {'a' : 0., 'b' : 1., 'c' : 2., 'd': 3.}
y = pd.Series(data)
 
print(y)
# a    0.0
# b    1.0
# c    2.0
# d    3.0
# dtype: float64

In this case, pandas uses the dictionary keys for the series index and the dictionary values for the series values. Again, we'll cover the index and its purpose shortly. For now, just know it's a thing.

How to make a Series of strings¶

If we wanted to make a Series of strings, we could do that too.

z = pd.Series(['frank', 'dee', 'dennis'])

If we print(z), notice the dtype is listed as "object".

print(z)
 
# 0     frank
# 1       dee
# 2    dennis
# dtype: object

Why?
The short answer is, this is not a Series of strings. Rather, this is a Series of pointers. Since strings are objects that vary in size, but arrays (and thus Series) use fixed-size memory blocks to store their data, pandas implements a common trick: store each string elsewhere in memory and put the address of each string in the underlying array. (Memory addresses are fixed-size objects, usually just 64-bit integers.) If you're confused by this, don't worry. It's a tricky concept that'll make more sense later on.

The newer and better approach to creating a Series of strings is to specify dtype='string'.

z = pd.Series(['frank', 'dee', 'dennis'], dtype='string')

Now when we print(z), pandas reports the dtype as 'string'.

print(z)
# 0     frank
# 1       dee
# 2    dennis
# dtype: string

(There's a lot to discuss here, but we'll cover these things later.)

How to make a Series from a NumPy array¶

Perhaps the most powerful way to make a Series from scratch is to make it from a NumPy array.

# import numpy and pandas
import numpy as np
import pandas as pd

If you have a NumPy array like this

x = np.array([10, 20, 30, 40])

you can convert it to a Series just by passing x into pd.Series()

pd.Series(x)
# 0    10
# 1    20
# 2    30
# 3    40
# dtype: int64

Why is this so "powerful"?

Well, suppose you wanted to make a complex Series from scratch like a random sample of values from a normal distribution. The somewhat lame, but practical way to do this is to use NumPy. NumPy has lots of great tools for making arrays from scratch, and converting them into a Series is a piece of cake 🍰.
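Here is a minimal sketch of that idea. It uses NumPy's np.random.default_rng() generator with an arbitrarily chosen seed, which is a choice made for this example rather than something the tutorial prescribes. The values depend on the seed, so no output is shown.

```python
import numpy as np
import pandas as pd

# Seed picked arbitrarily so the sample is reproducible
rng = np.random.default_rng(seed=0)

# Draw 5 values from a standard normal distribution, then wrap them in a Series
samples = pd.Series(rng.normal(loc=0.0, scale=1.0, size=5))

print(samples.dtype)  # float64
print(len(samples))   # 5
```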

Is your NumPy rusty?

Check out our NumPy problem set

Series Basic Indexing¶

Suppose we have the following Series, x.

x = pd.Series([5, 10, 15, 20, 25])
 
print(x)
# 0     5
# 1    10
# 2    15
# 3    20
# 4    25
# dtype: int64

If you wanted to access the ith element of the Series, you might be inclined to use square-bracket indexing notation just like accessing elements from a Python list or a NumPy array.

x[0]  # 5
x[1]  # 10

x[0] returns the 1st element, x[1] returns the 2nd element and so on.

This appears to work like list indexing, but don't be fooled! x[0] actually returns the element(s) of the Series with index label 0. In this example, that element happens to be the first element in the Series, but if we shuffle the index like this

x.index = [3,1,4,0,2]
 
print(x)
# 3     5
# 1    10
# 4    15
# 0    20
# 2    25
# dtype: int64

now x[0] returns 20 instead of 5.

x[0]  # 20

However, if we change the index to ['a','b','c','d','e']

x.index = ['a','b','c','d','e']
 
print(x)
# a     5
# b    10
# c    15
# d    20
# e    25
# dtype: int64

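This time, x[0] does return the first value in the Series, because the index has no integer labels and pandas falls back to position. (Since pandas 2.1, this positional fallback is deprecated and emits a FutureWarning.)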

x[0]  # 5

Caution

The takeaway here is that square-bracket indexing in pandas isn't straightforward. Its behavior changes depending on characteristics of the Series. For this reason, we recommend using more explicit indexing techniques: Series.iloc and Series.loc.
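To make the difference concrete, here is a small sketch that reuses the shuffled index from above and asks for "0" both ways. The variable names are invented for this example.

```python
import pandas as pd

x = pd.Series([5, 10, 15, 20, 25], index=[3, 1, 4, 0, 2])

first_by_position = x.iloc[0]  # the value stored first
value_at_label_0 = x.loc[0]    # the value whose index label is 0

print(first_by_position)  # 5
print(value_at_label_0)   # 20
```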

Indexing by position¶

x = pd.Series([5, 10, 15, 20, 25])
 
print(x)
# 0     5
# 1    10
# 2    15
# 3    20
# 4    25
# dtype: int64

How to access the ith value of a Series¶

Use the Series.iloc property to access the ith value in a Series.

x.iloc[0]  #  5, get the first value in the Series
x.iloc[1]  # 10, get the second value in the Series

Negative Indexing¶

Series.iloc supports negative indexing like Python lists and NumPy arrays.

x.iloc[-1]  # 25 | last element
x.iloc[-2]  # 20 | second-to-last element
x.iloc[-3]  # 15 | third-to-last element

Positional Slicing¶

Series.iloc supports slicing like Python lists and NumPy arrays.

x.iloc[1:4:2]  # get values from position 1 up to (but not including) position 4, stepping by 2
# 1    10
# 3    20
# dtype: int64

Notice the result is a Series object whereas in the previous examples the results were scalars.
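A few more slice forms, sketched with the same x. These follow ordinary Python slice rules, and the variable names are just for illustration.

```python
import pandas as pd

x = pd.Series([5, 10, 15, 20, 25])

first_two = x.iloc[:2]     # positions 0 and 1
last_two = x.iloc[-2:]     # positions 3 and 4
every_other = x.iloc[::2]  # positions 0, 2 and 4

print(first_two.tolist())    # [5, 10]
print(last_two.tolist())     # [20, 25]
print(every_other.tolist())  # [5, 15, 25]
```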

How to select multiple elements by position¶

Series.iloc can receive a list, array, or Series of integers to select multiple values in x.

x.iloc[[0, 2, 3]]             # 5, 15, 20
x.iloc[np.array([0, 2, 3])]   # 5, 15, 20
x.iloc[pd.Series([0, 2, 3])]  # 5, 15, 20

Indexing by label¶

Let's talk about the index. Every Series has an index and its purpose is to provide a label for each element in the Series. When you make a Series from scratch, it automatically gets an index of sequential values starting from 0.

For example, here we make a Series to represent the test grades of six students, and you can see how the index automatically gets created.

grades = pd.Series([82, 94, 77, 89, 91, 54])
 
print(grades)
# 0    82
# 1    94
# 2    77
# 3    89
# 4    91
# 5    54
# dtype: int64

We can change the index pretty easily, just by setting it equal to another array, list, or Series of values with the proper length. The index values don't even need to be integers, and in fact, they're often represented as strings.

grades.index = ['homer', 'maggie', 'grandpa', 'bart', 'lisa', 'marge']
 
print(grades)
# homer      82
# maggie     94
# grandpa    77
# bart       89
# lisa       91
# marge      54
# dtype: int64

How to access the value of a Series with label¶

To fetch the value(s) of a Series with a specific label, use the Series.loc property.

For example, to get bart's grade in the Series above, we can do grades.loc['bart'].

grades.loc['bart']  # 89

Label Slicing¶

Series.loc supports slicing by label. For example, to fetch the grades between homer and grandpa, we could do grades.loc['homer':'grandpa'].

grades.loc['homer':'grandpa']
 
# homer      82
# maggie     94
# grandpa    77
# dtype: int64

Warning

Notice that the slice 'homer':'grandpa' includes homer and grandpa. By contrast, the equivalent positional slice 0:2 would exclude the right endpoint (grandpa).
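The sketch below puts the two slices side by side on the grades Series so the endpoint difference is visible. The variable names are invented for this example.

```python
import pandas as pd

grades = pd.Series(
    [82, 94, 77, 89, 91, 54],
    index=['homer', 'maggie', 'grandpa', 'bart', 'lisa', 'marge']
)

by_label = grades.loc['homer':'grandpa']  # right endpoint included
by_position = grades.iloc[0:2]            # right endpoint excluded

print(list(by_label.index))     # ['homer', 'maggie', 'grandpa']
print(list(by_position.index))  # ['homer', 'maggie']
```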

How to select multiple elements by label¶

Just like Series.iloc[], we can pass a list, array, or Series of labels into Series.loc[] to retrieve multiple elements.

grades.loc[['homer', 'grandpa', 'bart']]
# homer      82
# grandpa    77
# bart       89
# dtype: int64

RangeIndex¶

When you make a Series without specifying its index, pandas automatically gives it a RangeIndex.

x = pd.Series(np.random.normal(size=5))
 
print(x)
# 0    0.651743
# 1    0.311423
# 2    0.103382
# 3   -3.614402
# 4   -0.046355
# dtype: float64
 
print(x.index)
# RangeIndex(start=0, stop=5, step=1)

By contrast, when you explicitly set the index as a list of integers, pandas gives it a regular integer Index (displayed as Int64Index in pandas versions before 2.0).

x = pd.Series(np.random.normal(size=5), index=[0,1,2,3,4])
 
print(x)
# 0   -0.091815
# 1   -0.823428
# 2    1.394426
# 3    1.263174
# 4   -0.421659
# dtype: float64
 
print(x.index)
# Index([0, 1, 2, 3, 4], dtype='int64')

For most situations, the difference is irrelevant. However, note that the RangeIndex is more memory efficient and has faster access times.
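One rough way to see the memory difference is to compare Index.memory_usage() for the two kinds of index. The size n below is arbitrary, and the exact byte counts depend on your platform and pandas version, so none are shown here.

```python
import numpy as np
import pandas as pd

n = 1_000_000  # arbitrary size for the comparison

range_index = pd.RangeIndex(n)      # stores only start, stop and step
int_index = pd.Index(np.arange(n))  # stores every label explicitly

range_bytes = range_index.memory_usage()
int_bytes = int_index.memory_usage()

print(range_bytes)
print(int_bytes)
print(range_bytes < int_bytes)  # True
```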

Modifying Series Data¶

Consider this Series foo.

foo = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

Basic Series Modifications¶

We can change the second element to 200, either by position

foo.iloc[1] = 200

or by label.

foo.loc['b'] = 200

We can set the 1st, 2nd and 3rd elements to 99 by position

foo.iloc[[0, 1, 2]] = 99

or with slicing

foo.iloc[:3] = 99

The same works by label

foo.loc[['a', 'b', 'c']] = 99

or with slicing

foo.loc['a':'c'] = 99

How to update a Series with an array¶

Suppose you have a Series foo and a NumPy array bar

foo = pd.Series([2, 3, 5, 7, 11], index=[2, 4, 6, 8, 10])
bar = np.array([5, 10, 15, 20, 25])

and your goal is to update foo's values with bar. If you overwrite foo, you'll lose its index.

foo = pd.Series(bar)
 
print(foo)
# 0     5
# 1    10
# 2    15
# 3    20
# 4    25
# dtype: int64

Instead, use slicing to overwrite foo's values without overwriting its index.

foo.iloc[:] = bar
 
print(foo)
# 2      5
# 4     10
# 6     15
# 8     20
# 10    25
# dtype: int64

How to update a Series with another Series¶

Suppose you have a Series x and a Series y whose indices are different but share a few common values.

x = pd.Series([10, 20, 30, 40])
y = pd.Series([1, 11, 111, 1111], index=[7,3,2,0])
 
print(x)
# 0    10
# 1    20
# 2    30
# 3    40
# dtype: int64
 
print(y)
# 7       1
# 3      11
# 2     111
# 0    1111
# dtype: int64

Predict the result of x.loc[[0, 1]] = y.

x.loc[[0, 1]] = y
 
print(x)
# 0    1111.0
# 1       NaN
# 2      30.0
# 3      40.0
# dtype: float64

You may be surprised.

Index Alignment

When you assign a Series y to a Series x, pandas uses index alignment to insert values from y into x based on matching index labels.

In the previous example, pandas starts by searching x for the values with index labels 0 and 1. Then it looks for matching labels in y to use to overwrite x. Since x's label 1 doesn't match any elements in y, pandas assigns it the value NaN. And since NaN only exists as a floating point value in NumPy, pandas casts the entire Series from ints to floats.
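One way to picture the alignment step is with Series.reindex(), which looks up a list of labels in y and fills any that are missing with NaN. This is only an illustration of the idea, not a claim about how pandas implements the assignment internally.

```python
import pandas as pd

y = pd.Series([1, 11, 111, 1111], index=[7, 3, 2, 0])

# Look up labels 0 and 1 in y; label 1 is missing, so it becomes NaN
aligned = y.reindex([0, 1])

print(aligned)
# 0    1111.0
# 1       NaN
# dtype: float64
```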

How to update a Series with a NumPy array¶

Given x and y from the previous section,

x = pd.Series([10, 20, 30, 40])
y = pd.Series([1, 11, 111, 1111], index=[7,3,2,0])
 
print(x)
# 0    10
# 1    20
# 2    30
# 3    40
# dtype: int64
 
print(y)
# 7       1
# 3      11
# 2     111
# 0    1111
# dtype: int64

If we do x.loc[[0, 1]] = y.to_numpy() we'll get the error:

ValueError: cannot set using a list-like indexer with a different length than the value

When you assign a NumPy array to a Series, pandas assigns the ith element of the array to the ith value of the Series.

In this case, x.loc[[0, 1]] = y.to_numpy() attempts to assign a 4-element array to a 2-element subseries, hence the error.

If we restrict the NumPy array to its first two elements, the assignment works.

x.loc[[0, 1]] = y.to_numpy()[:2]  
 
print(x)
# 0     1
# 1    11
# 2    30
# 3    40
# dtype: int64

Series Basic Operations¶

It's important to understand how pandas handles basic operations between Series. Here we'll look at addition, although the core concepts apply to other operations such as subtraction, multiplication, etc.

Adding a scalar to a Series¶

When you add a scalar to a Series, pandas uses broadcasting to add the scalar to each element of the Series.

x = pd.Series([1, 2, 3, 4])
x + 1
# 0    2
# 1    3
# 2    4
# 3    5
# dtype: int64

Adding a Series to a Series¶

Series arithmetic is fundamentally different from NumPy arithmetic. When you add two Series x and y, pandas only combines elements with the same index label.

x = pd.Series([1, 2, 3, 4])
y = pd.Series(1)
 
x + y
# 0    2.0
# 1    NaN
# 2    NaN
# 3    NaN
# dtype: float64

In this example, x has index labels 0, 1, 2, 3, and y has index label 0.

print(x)
# 0    1
# 1    2
# 2    3
# 3    4
# dtype: int64
 
print(y)
# 0    1
# dtype: int64

The result of x + y is a Series whose index is the union of x's index labels and y's index labels. In this case, the label 0 is in both Series, so the corresponding elements are added together. However, labels 1, 2, and 3 in x don't have matching elements in y, so pandas sets these to NaN in the result. Since NaN only exists as a floating point constant in NumPy (i.e. you can't have an integer array with NaNs), pandas casts the entire Series from int64 to float64.

Add two Series' elements by position¶

If you want to add two Series' elements by position, convert them to NumPy arrays before adding them. For example,

A = pd.Series([10, 20, 30, 40, 50], index=[4, 3, 2, 1, 0])
B = pd.Series([1, 2, 3, 4, 5])
 
print(A)
# 4    10
# 3    20
# 2    30
# 1    40
# 0    50
# dtype: int64
 
print(B)
# 0    1
# 1    2
# 2    3
# 3    4
# 4    5
# dtype: int64

If we add A + B, pandas uses index alignment to add elements by matching index label.

A + B
 
# 0    51
# 1    42
# 2    33
# 3    24
# 4    15
# dtype: int64

If we add the NumPy arrays underlying each Series, their elements are added by position.

A.to_numpy() + B.to_numpy()
# array([11, 22, 33, 44, 55])

To convert the resulting NumPy array back to a Series, just wrap it with pd.Series().

pd.Series(A.to_numpy() + B.to_numpy())
# 0    11
# 1    22
# 2    33
# 3    44
# 4    55
# dtype: int64

This technique drops A's index labels. If you want to retain A's labels, only convert B to an array.

A + B.to_numpy()
# 4    11
# 3    22
# 2    33
# 1    44
# 0    55
# dtype: int64

Add Series by label, prevent NaNs in the result¶

If you add two Series by index label, you'll often get NaNs in the result where an index label didn't exist in both Series.

x = pd.Series([1, 2, 3, 4])
y = pd.Series([10, 20], index=[1,3])
 
print(x)
# 0    1
# 1    2
# 2    3
# 3    4
# dtype: int64
 
print(y)
# 1    10
# 3    20
# dtype: int64
 
x + y
# 0     NaN
# 1    12.0
# 2     NaN
# 3    24.0
# dtype: float64

If you wish to add y to x by matching label without introducing NaNs in the result, you can use x.loc[y.index] to select elements of x with a matching index label in y, combined with += y.

x.loc[y.index] += y
 
print(x)  
# 0     1
# 1    12
# 2     3
# 3    24
# dtype: int64
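An alternative worth knowing, which this tutorial does not otherwise cover, is Series.add() with fill_value=0. It treats a label that is missing from one side as 0 instead of producing NaN. In this sketch it returns a new Series rather than modifying x in place, and the values come back as floats.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4])
y = pd.Series([10, 20], index=[1, 3])

# Labels 0 and 2 are missing from y, so they are treated as 0 there
total = x.add(y, fill_value=0)

print(total.tolist())  # [1.0, 12.0, 3.0, 24.0]
```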

Boolean Indexing¶

You can use a boolean Series x to subset a different Series y via y.loc[x].

For example, given a Series of integers, foo,

foo = pd.Series([20, 50, 11, 45, 17, 31])
 
print(foo)
# 0    20
# 1    50
# 2    11
# 3    45
# 4    17
# 5    31
# dtype: int64

you can set mask = foo < 20 to build a boolean Series, mask, that identifies whether each element of foo is less than 20.

mask = foo < 20
 
print(mask)
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# 5    False
# dtype: bool

Then you can pass mask into foo.loc[] to select elements of foo which are less than 20.

foo.loc[mask]
# 2    11
# 4    17
# dtype: int64

Boolean Index Alignment

When you index with a boolean Series, pandas aligns it to the target Series by index label, then selects the elements of the target whose aligned value is True.

For example, if we shuffle mask's index (but not mask's values), foo.loc[mask] produces a different result.

mask.index=[0,1,3,2,4,5]
 
print(mask)
# 0    False
# 1    False
# 3     True
# 2    False
# 4     True
# 5    False
# dtype: bool
 
foo.loc[mask]
# 3    45
# 4    17
# dtype: int64

Boolean Indexing by Position¶

If you want to select elements from a Series based on the position of True values from another Series, convert the boolean index Series to a NumPy array.

x = pd.Series([10, 20, 30, 40, 50])
mask = pd.Series([True, True, False, False, False], index=[4,3,2,1,0])
 
# boolean index by label
x.loc[mask]
# 3    40
# 4    50
# dtype: int64
 
# boolean index by position
x.loc[mask.to_numpy()]
# 0    10
# 1    20
# dtype: int64

Combining Boolean Series¶

You can combine two boolean Series to create a third boolean Series. For example, given a Series of person ages

ages = pd.Series(
    data = [42, 43, 14, 18, 1],
    index = ['peter', 'lois', 'chris', 'meg', 'stewie']
)
 
print(ages)
# peter     42
# lois      43
# chris     14
# meg       18
# stewie     1
# dtype: int64

and a series of person genders

genders = pd.Series(
    data = ['female', 'female', 'male', 'male', 'male'],
    index = ['lois', 'meg', 'chris', 'peter', 'stewie'],
    dtype = 'string'
)
 
print(genders)
# lois      female
# meg       female
# chris       male
# peter       male
# stewie      male
# dtype: string

you can create a boolean Series identifying males younger than 18 like this.

mask = (genders == 'male') & (ages < 18)
 
print(mask)
# chris      True
# lois      False
# meg       False
# peter     False
# stewie     True
# dtype: bool

Attention!

When you combine two logical expressions in this way, each expression must be wrapped in parentheses. In this case, genders == 'male' & ages < 18 would raise an error, because & binds more tightly than == and <, so Python would try to evaluate 'male' & ages first.

Logical Operators¶

```
| x     | y     | x & y |
| ----- | ----- | ----- |
| True  | True  | True  |
| True  | False | False |
| False | True  | False |
| False | False | False |
```

```
| x     | y     | x | y |
| ----- | ----- | ----- |
| True  | True  | True  |
| True  | False | True  |
| False | True  | True  |
| False | False | False |
```

```
| x     | y     | x ^ y |
| ----- | ----- | ----- |
| True  | True  | False |
| True  | False | True  |
| False | True  | True  |
| False | False | False |
```

```
| x     | ~x    |
| ----- | ----- |
| True  | False |
| False | True  |
```
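The truth tables above can be reproduced with two small boolean Series. The names a and b are placeholders for this sketch.

```python
import pandas as pd

a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])

print((a & b).tolist())  # [True, False, False, False]  and
print((a | b).tolist())  # [True, True, True, False]    or
print((a ^ b).tolist())  # [False, True, True, False]   exclusive or
print((~a).tolist())     # [False, False, True, True]   not
```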

Missing Values (NaN)¶

You can use NaN to represent missing or invalid values in a Series.

NaN before pandas 1.0.0¶

Prior to pandas version 1.0.0, if you wanted to represent missing or invalid data, you had to use NumPy's special floating point constant, np.nan. If you had a Series of integers

roux = pd.Series([1, 2, 3])
 
print(roux)
# 0    1
# 1    2
# 2    3
# dtype: int64

and you set the second element to np.nan

roux.iloc[1] = np.nan
 
print(roux)
# 0    1.0
# 1    NaN
# 2    3.0
# dtype: float64

the Series would get cast to floats because NaN only exists in NumPy as a floating point constant.

NaN after 1.0.0¶

pandas' release of version 1.0.0 included a nullable integer data type. If you want to make a Series of integers with NaNs, you can specify the Series dtype as "Int64" with a capital "I" as opposed to NumPy's "int64" with a lower case "i".

roux = pd.Series([1, 2, 3], dtype='Int64')
 
print(roux)
# 0    1
# 1    2
# 2    3
# dtype: Int64

Now if you set the second element to NaN, the Series retains its Int64 data type.

roux.iloc[1] = np.nan
 
print(roux)
# 0       1
# 1    <NA>
# 2       3
# dtype: Int64

Note

A better way to insert missing values in modern pandas is to use pd.NA.

roux.iloc[1] = pd.NA

Pandas Nullable Data Types¶

pd.Series([True, pd.NA, False], dtype="boolean")
# 0     True
# 1     <NA>
# 2    False
# dtype: boolean

pd.Series([10, pd.NA, 30], dtype="Int64")
# 0      10
# 1    <NA>
# 2      30
# dtype: Int64

pd.Series([1.2, pd.NA, 3.4], dtype="Float64")
# 0     1.2
# 1    <NA>
# 2     3.4
# dtype: Float64

pd.Series(["dog", pd.NA, "cat"], dtype="string")
# 0     dog
# 1    <NA>
# 2     cat
# dtype: string

NaN Tips and Tricks¶

Given a Series, x, with some NaN values,

x = pd.Series([1, pd.NA, 3, pd.NA], dtype='Int64')
 
print(x)
# 0       1
# 1    <NA>
# 2       3
# 3    <NA>
# dtype: Int64

You can use pd.isna() to check whether each value is NaN.

pd.isna(x)
# 0    False
# 1     True
# 2    False
# 3     True
# dtype: bool

You can use pd.notna() to check whether each value is not NaN.

pd.notna(x)
# 0     True
# 1    False
# 2     True
# 3    False
# dtype: bool

If you want to replace NaN values in a Series with a fill value, you can use the Series.fillna() method.

# replace NaNs with -1
x.fillna(-1)  
# 0     1
# 1    -1
# 2     3
# 3    -1
# dtype: Int64
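As a small sketch building on the checks above, the method forms Series.isna() and Series.notna() can be summed to count missing and present values, since True counts as 1. The variable names are invented for this example.

```python
import pandas as pd

x = pd.Series([1, pd.NA, 3, pd.NA], dtype='Int64')

n_missing = x.isna().sum()   # number of missing values
n_present = x.notna().sum()  # number of non-missing values

print(n_missing)  # 2
print(n_present)  # 2
```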

Boolean Indexing with NaN¶

It's important to understand how NaNs work with boolean indexing.

Suppose you have a Series of integers, goo, and a corresponding Series of booleans, choo, with some NaN values.

goo = pd.Series([10,20,30,40])
choo = pd.Series([True, False, pd.NA, True])

If you attempt to index goo with choo, pandas raises an error.

goo.loc[choo]

"ValueError: Cannot mask with non-boolean array containing NA / NaN values"

Notice that choo has dtype 'object'.

print(choo)
# 0     True
# 1    False
# 2     <NA>
# 3     True
# dtype: object

This happens because pandas relies on NumPy's handling of NaNs by default, and NumPy doesn't "play nicely" with NaN values unless you happen to be working with an array of floats. In this case, dtype='object' is an indication that the underlying NumPy array is really just an array of pointers.

To overcome this issue, we can rebuild choo with dtype = "boolean".

choo = pd.Series([True, False, np.nan, True], dtype = "boolean")
 
print(choo)
# 0     True
# 1    False
# 2     <NA>
# 3     True
# dtype: boolean

Now the boolean index goo.loc[choo] returns a 2-element subSeries as you might expect.

goo.loc[choo]  
# 0    10
# 3    40
# dtype: int64

In this case, the NaN value in choo is essentially ignored.

Note that the negation of NaN is NaN, so goo.loc[~choo] does not return the complement of goo.loc[choo].

goo.loc[~choo]  
# 1    20
# dtype: int64
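If you do want a true complement, one option is to decide up front how missing mask values should be treated. The sketch below assumes a missing value should count as False and uses Series.fillna(False) before indexing. The names keep, selected and rest are made up for this example.

```python
import pandas as pd

goo = pd.Series([10, 20, 30, 40])
choo = pd.Series([True, False, pd.NA, True], dtype="boolean")

# Assumption for this sketch: treat a missing mask value as False
keep = choo.fillna(False)

selected = goo.loc[keep]  # elements where the mask is True
rest = goo.loc[~keep]     # everything else, including the formerly missing spot

print(selected.tolist())  # [10, 40]
print(rest.tolist())      # [20, 30]
```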