What's a Series?¶
Series is one of the fundamental data structures in pandas. It's essentially an array with an index. Because it's an array, every value in a Series must be of the same type. You can have a Series of ints, a Series of floats, or a Series of booleans, but you can't have a Series of ints, floats and booleans together.
Series Documentation¶
You'll want to familiarize yourself with pandas' documentation. Here's the documentation for Series. It's the first place you should look when you have questions about a Series or Series method.
Series Creation¶
How to make a Series from a list¶
The easiest way to make a series is from a list.
If we print the series, we get back something like this
Notice how it already looks a bit different from a NumPy array. The column of values on the left is the Series index, which you can use to access the Series elements in creative and meaningful ways. More on that later.
Also notice the output includes 'dtype int64' which tells us the data type of the elements in the Series.
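As a minimal sketch of the steps above (the values are made up for illustration), building a Series from a list and printing it looks roughly like this:

```python
import pandas as pd

# Build a Series from a plain Python list (illustrative values)
x = pd.Series([5, 10, 15, 20, 25])

# The left column of the printout is the index (0 through 4);
# the last line of the printout reports the dtype of the values.
print(x)
```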
How to check if an object is a Series¶
You can use Python's type() function to check that x is indeed a Series object.
How to check the type of data stored in a Series¶
If you want to check the internal data type of the Series elements without printing the whole Series, you can use the Series.dtype attribute.
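Both checks can be sketched together. Here x is an assumed example Series, not necessarily the one used earlier:

```python
import pandas as pd

x = pd.Series([5, 10, 15, 20, 25])  # illustrative values

print(type(x))                   # the class of the object itself
print(isinstance(x, pd.Series))  # True
print(x.dtype)                   # the type of the stored elements
```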
How to access the underlying NumPy array¶
Most pandas Series store the underlying data as a NumPy array. You can access the underlying NumPy array via Series.to_numpy().
You might also see people using the Series.values attribute here, but this technique is not recommended.
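A short sketch of pulling out the array, again with made-up values:

```python
import numpy as np
import pandas as pd

x = pd.Series([5, 10, 15, 20, 25])  # illustrative values

arr = x.to_numpy()  # the values as a NumPy array, without the index
print(type(arr))
print(arr)
```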
How to access the first N elements of a Series¶
You can use the highly popular Series.head() method to pick out the first N elements of a Series. For example, x.head(6) returns the first 6 elements of x as a new Series.
How to access the last N elements of a Series¶
You can use Series.tail() to pick out the last N elements of a Series. For example, x.tail(3) returns the last 3 elements of x as a new Series.
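Here is a small sketch of head() and tail() on an assumed seven-element Series. Note that the results keep their original index labels:

```python
import pandas as pd

x = pd.Series([5, 10, 15, 20, 25, 30, 35])  # illustrative values

first_six = x.head(6)   # elements at positions 0..5
last_three = x.tail(3)  # elements at positions 4..6

print(first_six)
print(last_three)
```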
How to make a Series from a dictionary¶
You can make a Series from a Python dictionary, like this
In this case, pandas uses the dictionary keys for the series index and the dictionary values for the series values. Again, we'll cover the index and its purpose shortly. For now, just know it's a thing.
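For instance, a sketch with a hypothetical dictionary named data:

```python
import pandas as pd

# Hypothetical dictionary: keys become index labels, values become Series values
data = {'a': 10, 'b': 20, 'c': 30}
y = pd.Series(data)

print(y)
```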
How to make a Series of strings¶
If we wanted to make a Series of strings, we could do that too.
If we print(z), notice the dtype is listed as "object".
Why?
The short answer is, this is not a Series of strings. Rather, this is a Series of pointers. Since strings are objects that vary in size, but arrays (and thus Series) use fixed-size memory blocks to store their data, pandas implements a common trick: store the strings elsewhere in memory and put the address of each string in the underlying array. (Memory addresses are fixed-size objects, usually just 64-bit integers.) If you're confused by this, don't worry; it's a tricky concept that'll make more sense later on.
The newer and better approach to creating a Series of strings is to specify dtype='string'.
Now when we print(z), pandas reports the dtype as 'string'.
(There's a lot to discuss here, but we'll cover these things later.)
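The two approaches can be compared side by side. This is a sketch with made-up words; the default dtype reported by the first print depends on your pandas version (the text above describes "object").

```python
import pandas as pd

words = ['cat', 'dog', 'bird']  # hypothetical data

# Default construction: dtype is inferred by pandas
z = pd.Series(words)
print(z.dtype)

# Explicitly request the dedicated string dtype
z = pd.Series(words, dtype='string')
print(z.dtype)
```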
How to make a Series from a NumPy array¶
Perhaps the most powerful way to make a Series from scratch is to make it from a NumPy array.
If you have a NumPy array, x, you can convert it to a Series just by passing x into pd.Series().
Why is this so "powerful"?
Well, suppose you wanted to make a complex Series from scratch like a random sample of values from a normal distribution. The somewhat lame, but practical way to do this is to use NumPy. NumPy has lots of great tools for making arrays from scratch, and converting them into a Series is a piece of cake 🍰.
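As a rough sketch of both ideas: the first part converts a small hand-made array, and the second draws a random normal sample with NumPy's default_rng generator (the seed and sizes are arbitrary choices for this example).

```python
import numpy as np
import pandas as pd

# Convert an existing NumPy array into a Series
arr = np.array([10, 20, 30])
s = pd.Series(arr)
print(s)

# Build a Series of 5 draws from a standard normal distribution
rng = np.random.default_rng(seed=0)
samples = pd.Series(rng.normal(loc=0.0, scale=1.0, size=5))
print(samples)
```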
Is your NumPy rusty?
Check out our NumPy problem set
Series Basic Indexing¶
Suppose we have the following Series, x.
If you wanted to access the ith element of the Series, you might be inclined to use square-bracket indexing notation just like accessing elements from a Python list or a NumPy array.
x[0] returns the 1st element, x[1] returns the 2nd element and so on.
This appears to work like list indexing, but don't be fooled! x[0] actually returns the element(s) of the Series with index label 0. In this example, that element happens to be the first element in the Series, but if we shuffle the index so that label 0 sits on a different element, x[0] returns 20 instead of 5.
However, if we change the index to ['a','b','c','d','e'], x[0] falls back to a positional lookup and does return the first value in the Series. (Recent pandas releases deprecate this fallback.)
Caution
The takeaway here is that square-bracket indexing in pandas isn't straightforward. Its behavior changes depending on characteristics of the Series. For this reason, we recommend using more explicit indexing techniques: Series.iloc and Series.loc.
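To make the pitfall concrete, here is a sketch with assumed values (5 first, 20 fourth) and a shuffled integer index chosen so that label 0 lands on the value 20:

```python
import pandas as pd

x = pd.Series([5, 10, 15, 20, 25])  # illustrative values
before = x[0]                       # 5: label 0 is also the first element

x.index = [3, 1, 4, 0, 2]           # shuffle the labels, not the values
after = x[0]                        # 20: square brackets looked up *label* 0

print(before, after)
print(x.loc[0])    # 20: explicit lookup by label
print(x.iloc[0])   # 5: explicit lookup by position
```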
Indexing by position¶
How to access the ith value of a Series¶
Use the Series.iloc property to access the ith value in a Series.
Negative Indexing¶
Series.iloc supports negative indexing like Python lists and NumPy arrays.
Positional Slicing¶
Series.iloc supports positional slicing like Python lists and NumPy arrays.
Notice the result is a Series object whereas in the previous examples the results were scalars.
How to select multiple elements by position¶
Series.iloc can receive a list, array, or Series of integers to select multiple values in x.
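The positional techniques in this section can be sketched together on one assumed Series:

```python
import pandas as pd

x = pd.Series([5, 10, 15, 20, 25])  # illustrative values

first = x.iloc[0]           # scalar: value at position 0
last = x.iloc[-1]           # scalar: negative positions count from the end
middle = x.iloc[1:4]        # Series: positions 1, 2, 3 (right endpoint excluded)
picked = x.iloc[[0, 2, 4]]  # Series: an explicit list of positions

print(first, last)
print(middle)
print(picked)
```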
Indexing by label¶
Let's talk about the index. Every Series has an index and its purpose is to provide a label for each element in the Series. When you make a Series from scratch, it automatically gets an index of sequential values starting from 0.
For example, here we make a Series to represent the test grades of five students, and you can see how the index automatically gets created.
We can change the index pretty easily, just by setting it equal to another array, list, or Series of values with the proper length. The index values don't even need to be integers, and in fact, they're often represented as strings.
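Here is a sketch of both steps. The grade values, the ordering of the students, and the names other than homer, bart and grandpa are assumptions made for this example.

```python
import pandas as pd

# Test grades for five students (hypothetical values)
grades = pd.Series([82, 94, 77, 85, 91])
default_index = grades.index  # created automatically: 0, 1, 2, 3, 4
print(default_index)

# Replace the index with string labels of the same length
grades.index = ['homer', 'maggie', 'grandpa', 'bart', 'lisa']
print(grades)
```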
How to access the value of a Series with label¶
To fetch the Series value(s) with some specific label, use the Series.loc indexer.
For example, to get bart's grade in the Series above, we can do grades.loc['bart'].
Label Slicing¶
Series.loc supports slicing by label. For example, to fetch the grades between homer and grandpa, we could do grades.loc['homer':'grandpa'].
Warning
Notice that the slice 'homer':'grandpa' includes homer and grandpa. By contrast, the equivalent positional slice 0:2 would exclude the right endpoint (grandpa).
How to select multiple elements by label¶
Just like Series.iloc[], we can pass a list, array, or Series of labels into Series.loc[] to retrieve multiple elements.
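The label-based lookups can be sketched on a hypothetical grades Series. The values and the label order are assumptions, chosen so that homer is at position 0 and grandpa at position 2.

```python
import pandas as pd

grades = pd.Series(
    [82, 94, 77, 85, 91],
    index=['homer', 'maggie', 'grandpa', 'bart', 'lisa'],
)

barts_grade = grades.loc['bart']           # scalar lookup by label
by_label = grades.loc['homer':'grandpa']   # label slice: both endpoints included
by_position = grades.iloc[0:2]             # positional slice: right endpoint excluded
several = grades.loc[['homer', 'lisa']]    # list of labels

print(barts_grade)
print(by_label)
print(by_position)
print(several)
```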
RangeIndex¶
When you make a Series without specifying its index, pandas automatically gives it a RangeIndex.
By contrast, when you explicitly set the index as a list of integers, pandas gives it an Int64Index (in pandas 2.0 and later, a plain Index with dtype int64).
For most situations, the difference is irrelevant. However, note that the RangeIndex is more memory efficient and has faster access times.
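One way to see the difference is to print the class of each index. This is a sketch; the class name printed for the explicit index depends on your pandas version.

```python
import pandas as pd

a = pd.Series([10, 20, 30])                   # no index given
b = pd.Series([10, 20, 30], index=[0, 1, 2])  # explicit integer labels

print(type(a.index).__name__)   # RangeIndex
print(type(b.index).__name__)   # version-dependent class name
print(a.index.equals(b.index))  # the labels themselves are the same
```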
Modifying Series Data¶
Consider this Series foo.
Basic Series Modifications¶
We can change the second element to 200.
We can set the 1st, 2nd and 3rd elements to 99.
or, equivalently, with slicing
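These modifications can be sketched as follows. The starting values of foo are assumed, and the example uses Series.iloc so that "second element" means position rather than label.

```python
import pandas as pd

foo = pd.Series([2, 3, 5, 7, 11])  # illustrative values

# Change the second element to 200
foo.iloc[1] = 200
after_single = foo.tolist()

# Set the 1st, 2nd and 3rd elements to 99 with a list of positions
foo.iloc[[0, 1, 2]] = 99
after_list = foo.tolist()

# The same update expressed as a positional slice
foo.iloc[:3] = 99
after_slice = foo.tolist()

print(after_single, after_list, after_slice)
```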
How to update a Series with an array¶
Suppose you have a Series foo and a NumPy array bar and your goal is to update foo's values with bar. If you overwrite foo, you'll lose its index.
Instead, use slicing to overwrite foo's values without overwriting its index.
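A sketch of the difference, using made-up values and string labels so the index is easy to spot. The slice here is written with Series.iloc, which is one way to do it.

```python
import numpy as np
import pandas as pd

foo = pd.Series([10, 20, 30], index=['a', 'b', 'c'])  # illustrative data
bar = np.array([5, 6, 7])

# Overwriting the variable builds a brand new Series with a default index
overwritten = pd.Series(bar)
print(overwritten)

# Assigning through a full slice replaces the values but keeps foo's index
foo.iloc[:] = bar
print(foo)
```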
How to update a Series with another Series¶
Suppose you have a Series x and a Series y whose indices are different but share a few common values.
Predict the result of x.loc[[0, 1]] = y.
You may be surprised.
Index Alignment
When you assign a Series y to a Series x, pandas uses index alignment to insert values from y into x based on matching index labels.
In the previous example, pandas starts by searching x for the values with index labels 0 and 1. Then it looks for matching labels in y to use to overwrite x. Since x's label 1 doesn't match any elements in y, pandas assigns it the value NaN. And since NaN only exists as a floating point value in NumPy, pandas casts the entire Series from ints to floats.
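Here is a sketch of the alignment with assumed data. As a deliberate simplification, x starts out as floats so the NaN fits without a dtype change; the reindex call shows, under that assumption, which values alignment picks from y.

```python
import pandas as pd

# x holds floats here (an assumption for this sketch) so NaN can be stored directly
x = pd.Series([10.0, 20.0, 30.0, 40.0])                # labels 0, 1, 2, 3
y = pd.Series([1, 11, 111, 1111], index=[7, 3, 5, 0])  # shares labels 0 and 3 with x

# What alignment pulls out of y for labels 0 and 1: label 1 is missing, so NaN
aligned = y.reindex([0, 1])
print(aligned)

# Assignment matches on labels, not on position
x.loc[[0, 1]] = y
print(x)
```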
How to update a Series with a NumPy array¶
Given x and y from the previous section, if we do x.loc[[0, 1]] = y.to_numpy() we'll get the error:
ValueError: cannot set using a list-like indexer with a different length than the value
When you assign a NumPy array to a Series, pandas assigns the ith element of the array to the ith value of the Series.
In this case, x.loc[[0, 1]] = y.to_numpy() attempts to assign a 4-element array to a 2-element subseries, hence the error.
If we restrict the NumPy array to its first two elements, the assignment works.
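A sketch of the working version, with assumed values for x and y:

```python
import pandas as pd

x = pd.Series([10, 20, 30, 40])                        # labels 0, 1, 2, 3
y = pd.Series([1, 11, 111, 1111], index=[7, 3, 5, 0])  # illustrative data

# A NumPy array carries no labels, so values are assigned by position:
# the first array element goes to label 0, the second to label 1.
x.loc[[0, 1]] = y.to_numpy()[:2]
print(x)
```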
Series Basic Operations¶
It's important to understand how pandas handles basic operations between arrays. Here we'll look at addition, although the core concepts apply to other operations such as subtraction, multiplication, etc.
Adding a scalar to a Series¶
When you add a scalar to a Series, pandas uses broadcasting to add the scalar to each element of the Series.
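For example, with a small made-up Series:

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4])  # illustrative values

# The scalar is broadcast across every element; the index is unchanged
result = x + 1
print(result)
```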
Adding a Series to a Series¶
Series arithmetic is fundamentally different from NumPy arithmetic. When you add two Series x and y, pandas only combines elements with the same index label.
In this example, x has index labels 0, 1, 2, 3, and y has index label 0.
The result of x + y will be a Series whose index is the union of x's index labels and y's index labels. In this case, the label 0 is in both Series, so the corresponding elements are added together. However, labels 1, 2, and 3 in x don't have matching elements in y, so pandas converts these to NaN in the result. Since NaN only exists as a floating point constant in NumPy (i.e. you can't have an integer array with NaNs), pandas casts the entire Series from int64 to float64.
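A sketch of this case, with assumed values (x has labels 0 through 3 and y has only label 0):

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4])  # labels 0, 1, 2, 3
y = pd.Series([10])          # label 0 only

# Only label 0 exists in both Series; the rest become NaN
result = x + y
print(result)
print(result.dtype)  # float64
```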
Add two Series' elements by position¶
If you want to add two Series' elements by position, convert them to NumPy arrays before adding them. For example,
If we add A + B, pandas uses index alignment to add elements by matching index label.
If we add the NumPy arrays underlying each Series, their elements are added by position.
To convert the resulting NumPy array back to a Series, just wrap it with pd.Series().
This technique drops A's index labels. If you want to retain A's labels, only convert B to an array.
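The three variants can be sketched like this. A and B are assumed example Series; A's labels run in reverse so the two kinds of addition give visibly different results.

```python
import pandas as pd

A = pd.Series([10, 20, 30, 40, 50], index=[4, 3, 2, 1, 0])  # illustrative data
B = pd.Series([1, 2, 3, 4, 5])                              # labels 0..4

by_label = A + B                                      # aligned on index labels
by_position = pd.Series(A.to_numpy() + B.to_numpy())  # positional, default index
keep_labels = A + B.to_numpy()                        # positional, keeps A's labels

print(by_label)
print(by_position)
print(keep_labels)
```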
Add Series by label, prevent NaNs in the result¶
If you add two Series by index label, you'll often get NaNs in the result where an index label didn't exist in both Series.
If you wish to add y to x by matching label without introducing NaNs in the result, you can use x.loc[y.index] to select elements of x with a matching index label in y, combined with += y.
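A sketch of the technique with assumed data. It relies on every label in y also existing in x.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4])           # labels 0, 1, 2, 3
y = pd.Series([10, 20], index=[0, 3]) # labels are a subset of x's labels

# Update only the elements of x whose labels appear in y
x.loc[y.index] += y
print(x)
```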
Boolean Indexing¶
You can use a boolean Series x to subset a different Series, y, via y.loc[x].
For example, given a Series of integers, foo, you can set mask = foo < 20 to build a boolean Series, mask, that identifies whether each element of foo is less than 20.
Then you can pass mask into foo.loc[] to select elements of foo which are less than 20.
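For instance, with made-up values for foo:

```python
import pandas as pd

foo = pd.Series([20, 50, 11, 45, 17, 31])  # illustrative values

mask = foo < 20        # boolean Series with the same index as foo
small = foo.loc[mask]  # keeps only the elements where mask is True

print(mask)
print(small)
```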
Boolean Index Alignment
pandas uses index alignment here, too: it selects the elements of the target Series whose index labels match the labels of the True values in the boolean Series.
For example, if we shuffle mask's index (but not mask's values), foo.loc[mask] produces a different result.
Boolean Indexing by Position¶
If you want to select elements from a Series based on the position of True values from another Series, convert the boolean index Series to a NumPy array.
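Both behaviors can be sketched on an assumed foo. Here the mask's labels are reversed so that the True values carry labels 1 and 3 instead of 2 and 4.

```python
import pandas as pd

foo = pd.Series([20, 50, 11, 45, 17, 31])  # illustrative values
mask = foo < 20                            # True at positions 2 and 4

# Same values, reversed labels: True now sits on labels 3 and 1
shuffled = mask.copy()
shuffled.index = [5, 4, 3, 2, 1, 0]

by_label = foo.loc[shuffled]                # aligned on labels
by_position = foo.loc[shuffled.to_numpy()]  # array has no labels: positional

print(by_label)
print(by_position)
```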
Combining Boolean Series¶
You can combine two boolean Series to create a third boolean Series. For example, given a Series of person ages and a Series of person genders, you can create a boolean Series identifying males younger than 18 like this.
Attention!
When you combine two logical expressions in this way, each expression must be wrapped in parentheses. In this case, genders == 'male' & ages < 18 would raise an error, because Python's & operator binds more tightly than == and <.
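A sketch with hypothetical ages and genders data:

```python
import pandas as pd

ages = pd.Series([10, 35, 17, 50])                          # hypothetical data
genders = pd.Series(['female', 'male', 'male', 'male'])     # hypothetical data

# Each comparison is wrapped in parentheses before combining with &
young_males = (genders == 'male') & (ages < 18)
print(young_males)

# | (or) and ~ (not) combine boolean Series in the same element-wise way
male_or_young = (genders == 'male') | (ages < 18)
not_male = ~(genders == 'male')
```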
Logical Operators¶
Missing Values (NaN)¶
You can use NaN to represent missing or invalid values in a Series.
NaN before pandas 1.0.0¶
Prior to pandas version 1.0.0, if you wanted to represent missing or invalid data, you had to use NumPy's special floating point constant, np.nan. If you had a Series of integers and you set the second element to np.nan, the Series would get cast to floats because NaN only exists in NumPy as a floating point constant.
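A sketch of that cast, using assumed values:

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 2, 3])  # illustrative values
before = x.dtype          # an integer dtype

# Storing NaN forces the values to become floats
x.iloc[1] = np.nan
print(before, '->', x.dtype)
print(x)
```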
NaN after 1.0.0¶
pandas' release of version 1.0.0 included a nullable integer data type. If you want to make a Series of integers with NaNs, you can specify the Series dtype as "Int64" with a capital "I", as opposed to NumPy's "int64" with a lower case "i".
Now if you set the second element to NaN, the Series retains its Int64 data type.
Note
A better way to insert NaNs in modern pandas is to use pd.NA.
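A sketch of the nullable integer dtype with assumed values, using pd.NA as the missing-value marker:

```python
import pandas as pd

x = pd.Series([1, 2, 3], dtype='Int64')  # capital "I": nullable integers

# Mark the second element as missing; the dtype stays Int64
x.iloc[1] = pd.NA
print(x)
print(x.dtype)
```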
Pandas Nullable Data Types¶
NaN Tips and Tricks¶
Given a Series, x, with some NaN values,
You can use pd.isna() to check whether each value is NaN.
You can use pd.notna() to check whether each value is not NaN.
If you want to replace NaN values in a Series with a fill value, you can use the Series.fillna() method.
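These three helpers can be sketched on an assumed nullable-integer Series. The fill value of -1 is an arbitrary choice.

```python
import pandas as pd

x = pd.Series([1, pd.NA, 3, pd.NA], dtype='Int64')  # illustrative values

missing = pd.isna(x)   # True where the value is missing
present = pd.notna(x)  # True where the value is present
filled = x.fillna(-1)  # new Series with missing values replaced by -1

print(missing)
print(present)
print(filled)
```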
Boolean Indexing with NaN¶
It's important to understand how NaNs work with boolean indexing.
Suppose you have a Series of integers, goo, and a corresponding Series of booleans, choo, with some NaN values.
If you attempt to index goo with choo, pandas throws an error.
"ValueError: Cannot mask with non-boolean array containing NA / NaN values"
Notice that choo has dtype 'object'.
This happens because pandas relies on NumPy's handling of NaNs by default, and NumPy doesn't "play nicely" with NaN values unless you happen to be working with an array of floats. In this case, dtype='object' is an indication that the underlying NumPy array is really just an array of pointers.
To overcome this issue, we can rebuild choo with dtype = "boolean".
Now the boolean index goo.loc[choo] returns a 2-element subSeries as you might expect.
In this case, the NaN value in choo is essentially ignored.
Note that the negation of NaN is NaN, so goo.loc[~choo] does not return the complement of goo.loc[choo].
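A sketch of the fix, with assumed values for goo and choo. The element at position 2 has a missing flag, so it shows up in neither selection.

```python
import numpy as np
import pandas as pd

goo = pd.Series([10, 20, 30, 40])  # illustrative values

# Rebuild choo with the nullable boolean dtype; NaN is stored as a missing value
choo = pd.Series([True, False, np.nan, True], dtype='boolean')

selected = goo.loc[choo]    # missing flag is treated like False
negated = goo.loc[~choo]    # ~missing is still missing, so 30 is skipped again

print(selected)
print(negated)
```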