What's a DataFrame?¶
A pandas DataFrame is a table of data. More specifically, it's a collection of 1-D arrays of the same length that share a row index and a column index.
A DataFrame is similar to a 2-D NumPy array, but unlike a NumPy array, each column of a DataFrame can store a different type of data. For example, here's a DataFrame with a column of ints, a column of strings, and a column of dates.
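For instance, you might build such a table like this (a sketch; the column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'age': [3, 1, 5],                           # ints
    'name': ['Mittens', 'Biscuit', 'Charlie'],  # strings
    'adopted': pd.to_datetime(['2020-01-15', '2019-06-01', '2021-03-20'])  # dates
})
print(df.dtypes)   # int64, object, datetime64[ns]
```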
DataFrame Creation¶
DataFrame From Dictionary¶
Perhaps the easiest way to make a DataFrame from scratch is to use the DataFrame() constructor, passing in a dictionary of column_name:column_values pairs.
Example:
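A sketch with made-up pet data:

```python
import pandas as pd

pets = pd.DataFrame({
    'name': ['Mittens', 'Biscuit', 'Charlie'],
    'animal': ['cat', 'dog', 'dog'],
    'age': [3, 1, 5]
})
```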
DataFrame From List Of Lists¶
You can build a DataFrame from a list of lists where each inner list represents a row of the DataFrame.
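The same hypothetical table, built row by row:

```python
import pandas as pd

pets = pd.DataFrame(
    [['Mittens', 'cat', 3],
     ['Biscuit', 'dog', 1],
     ['Charlie', 'dog', 5]],
    columns=['name', 'animal', 'age']
)
```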
DataFrame Inspection¶
DataFrame has a number of handy tools for inspection.
DataFrame.info()¶
Reports information about the DataFrame's structure.
DataFrame.shape¶
Returns a tuple with the number of rows and columns in the DataFrame.
DataFrame.axes¶
Returns a list with the DataFrame's row index and column index.
DataFrame.size¶
Returns the total number of elements in the DataFrame.
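Continuing with the hypothetical pets DataFrame from above:

```python
pets.info()    # column dtypes, non-null counts, memory usage
pets.shape     # (3, 3)
pets.axes      # [RangeIndex(start=0, stop=3, step=1), Index(['name', 'animal', 'age'], dtype='object')]
pets.size      # 9
```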
To And From CSV¶
DataFrame.to_csv()¶
To write a DataFrame to CSV, use the DataFrame.to_csv() method.
By default, to_csv() writes the file into your current working directory. You can check your current working directory using the getcwd() function from os.
Alternatively, you can write the DataFrame to a specific file path using the path_or_buf parameter.
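A sketch, continuing with the pets DataFrame from above (the paths are illustrative):

```python
import os

print(os.getcwd())                        # where relative paths are resolved
pets.to_csv('pets.csv')                   # lands in the current working directory
pets.to_csv(path_or_buf='/tmp/pets.csv')  # or write to an explicit path
```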
Row Index Column¶
By default, to_csv() includes the row index in the output. For example, pets.to_csv('pets.csv') above would generate a CSV file that looks like this.
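With the made-up pets data from above:

```
,name,animal,age
0,Mittens,cat,3
1,Biscuit,dog,1
2,Charlie,dog,5
```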
Notice the nameless column at the front. If you want to exclude the row index, set index=False within to_csv().
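```python
pets.to_csv('pets.csv', index=False)   # no row index column in the output
```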
pandas.read_csv()¶
You can create a DataFrame from a CSV file using the read_csv() function, passing in the name of the file.
In the real world, reading CSV files doesn't always go smoothly. Sometimes you'll have to steer read_csv() in the right direction using parameters like the following (sketched after this list):
- sep to specify a value separator, if your file is something other than comma-delimited
- header to tell pandas if your file contains column names
- index_col to indicate which column, if any, should be used as the row index
- usecols to tell pandas "only read a certain subset of columns"
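A sketch; data.tsv and its layout are hypothetical:

```python
import pandas as pd

pets = pd.read_csv('pets.csv')   # the simple case

# tab-separated file, no header row, column 0 as the row index, keep 3 columns
df = pd.read_csv(
    'data.tsv',
    sep='\t',
    header=None,
    index_col=0,
    usecols=[0, 1, 2]
)
```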
Basic Indexing¶
Indexing a DataFrame is very similar to indexing a Series.
Indexing Columns¶
Access column as a Series¶
To access a DataFrame column as a Series, reference its name using square bracket notation or dot notation.
Warning
Dot notation only works if the column name is a valid Python identifier that doesn't collide with an existing DataFrame attribute or method. If the name contains a space, tilde, etc., you must use square bracket notation.
Info
If you'd like to get the result as a one-column DataFrame instead of a Series, use square bracket indexing with the column name wrapped inside a list.
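Continuing with the hypothetical pets DataFrame from above:

```python
pets['age']    # Series, via square brackets
pets.age       # the same Series, via dot notation
pets[['age']]  # a one-column DataFrame (column name wrapped in a list)
```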
Access columns by name¶
You can access one or more columns using square bracket notation, passing in a list of column names.
The result is a DataFrame whose data is a copy of the original DataFrame.
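For example:

```python
pets[['name', 'age']]   # a two-column DataFrame (a copy of the data)
```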
Access columns by position¶
You can also use DataFrame.iloc to select columns by position. For example, df.iloc[:, [0, 1]] returns every row of the DataFrame, but only columns 0 and 1, while df.iloc[[0, 2], [1, 2]] returns rows 0 and 2 with columns 1 and 2.
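With the pets sketch from above:

```python
pets.iloc[:, [0, 1]]       # every row; columns 0 and 1
pets.iloc[[0, 2], [1, 2]]  # rows 0 and 2; columns 1 and 2
```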
Indexing Rows¶
Access rows by position¶
If you want to pick out certain rows of a DataFrame by their position, you can use DataFrame.iloc[], very similar to Series.iloc[]. For example, to get the 1st, 3rd, and 5th rows of df, you could do the following (df here is a made-up six-row DataFrame):
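```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40, 50, 60]})

df.iloc[[0, 2, 4]]   # rows at positions 0, 2, 4
```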
Alternatively, you can use slicing.
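```python
df.iloc[0:5:2]   # start 0, stop 5, step 2 -> the same three rows
```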
If you select an individual row with iloc, the result is a Series, not a DataFrame. To fetch an individual row of a DataFrame as a DataFrame, pass in a list of indices. For example:
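```python
df.iloc[2]     # a Series holding row 2's values
df.iloc[[2]]   # a one-row DataFrame
```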
Access rows by index label¶
You can use DataFrame.loc[] to access rows of a DataFrame by index label. For example, given a DataFrame whose row index is the letters 'a' through 'f', we can select the rows with index labels b and e using df.loc[['b', 'e']].
We can combine row and column indexing to select rows a, c, and f with columns crab and shrimp.
And we can even use slicing to pick out every row between b and e for column crab.
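A sketch of such a DataFrame (values are made up; the column order is chosen to match the positional examples later in this section):

```python
import pandas as pd

df = pd.DataFrame(
    {'shrimp': [10, 20, 30, 40, 50, 60],
     'crab': [1, 2, 3, 4, 5, 6],
     'red fish': [7, 8, 9, 10, 11, 12]},
    index=list('abcdef')
)

df.loc[['b', 'e']]                           # rows by label
df.loc[['a', 'c', 'f'], ['crab', 'shrimp']]  # rows and columns by label
df.loc['b':'e', 'crab']                      # label slice; includes 'e'
```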
Boolean Indexing¶
You can use boolean indexing to select rows of a DataFrame, just as you can with Series.
To get the rows of df where shrimp is less than 40, start by building a boolean Series like this. Notice it has the same index as df; therefore we can pass it into df.loc[] to select the rows where shrimp is less than 40. More commonly you'll see this as a one-liner.
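```python
mask = df['shrimp'] < 40   # boolean Series with the same index as df
df.loc[mask]               # rows a, b, c
df.loc[df['shrimp'] < 40]  # the usual one-liner
```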
Just like Series, you can combine logical conditions to create more intricate filters.
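For instance (the thresholds are made up):

```python
df.loc[(df['shrimp'] < 40) & (df['crab'] > 1)]    # AND; note the parentheses
df.loc[(df['shrimp'] >= 50) | (df['crab'] == 1)]  # OR
```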
Access rows by position, columns by name¶
If you wanted to select the first three rows of df with the columns shrimp and red fish, you might try df.loc[:3, ['shrimp', 'red fish']] or df.iloc[:3, ['shrimp', 'red fish']], but both of these techniques fail:
- df.loc expects label indexers, so the positional indexer :3 causes an error.
- df.iloc expects positional indexers, so the label indexer ['shrimp', 'red fish'] causes an error.
One solution is to convert the column names 'shrimp' and 'red fish' to their corresponding positional indices 0 and 2, and then use iloc as usual.
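```python
df.iloc[:3, [0, 2]]   # first three rows; columns 'shrimp' and 'red fish'
```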
To make this dynamic, we can replace [0, 2] with [df.columns.get_loc(c) for c in ['shrimp', 'red fish']]. Note that df.columns returns the column index of the DataFrame.
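Putting it together:

```python
cols = [df.columns.get_loc(c) for c in ['shrimp', 'red fish']]  # [0, 2]
df.iloc[:3, cols]
```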
Basic Operations¶
Inserting columns¶
You can insert a new column 'c' into an existing DataFrame df via df['c'] = x, where x is either a list, Series, NumPy array, or scalar.
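A sketch with a made-up df:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['fox', 'rabbit', 'rabbit']})

df['c'] = [10, 20, 30]             # list
df['c'] = pd.Series([10, 20, 30])  # Series
df['c'] = 0                        # scalar, broadcast to every row
```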
Note
You can't use dot notation to create a new column, e.g. you can't do df.d = 1.
You can also combine columns to create a new column. For example,
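```python
df['d'] = df['a'] + df['c']   # new column derived from existing columns
```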
Updating values¶
You can create or update column values using boolean indexing. For example, given the df above, we can update a to equal 0 where b is 'rabbit' as follows.
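```python
df.loc[df['b'] == 'rabbit', 'a'] = 0
```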
Removing columns¶
To remove columns from a DataFrame, use DataFrame.drop(), passing a list of column names to the columns argument.
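For example, dropping the columns we just created:

```python
df = df.drop(columns=['c', 'd'])
```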
Changing column names¶
To change a DataFrame's column names, use the DataFrame.rename() method, passing in a dictionary of old_name:new_name pairs. For example, here we change the column name 'age' to 'years'.
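Assuming a df with a column named 'age':

```python
df = df.rename(columns={'age': 'years'})
```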
DataFrame.apply()¶
DataFrame's apply() method lets you apply a function to each row or column in a DataFrame. apply() has two primary arguments:
- func: tells apply() what function to apply
- axis: tells apply() whether to apply the function to each column (axis=0) or each row (axis=1)
For example, given the following DataFrame, calling df.apply(func=np.sum, axis=0) sums the data in each column. Notice the result is a 2-element Series whose index labels are the column names of df. The same operation with axis=1 generates a 3-element Series with the sum of each row.
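A sketch with a hypothetical 3-row, 2-column df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

df.apply(func=np.sum, axis=0)   # column sums: a 2-element Series indexed by 'x', 'y'
df.apply(func=np.sum, axis=1)   # row sums: a 3-element Series
```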
apply() with function arguments¶
Now suppose we have a DataFrame called kids with mixed column types.
Our goal is to determine whether each child should be allowed in a haunted house. To be allowed inside,
- you must be at least 12 years old, or
- you must have adult supervision
apply() works great for tasks like this.
We start by making a function called is_allowed() that inputs a number, age, and a boolean, with_adult, and returns a boolean indicating whether that kid is allowed to enter the haunted house.
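A sketch (the kids data is made up):

```python
import pandas as pd

kids = pd.DataFrame({
    'name': ['Jimmy', 'Sally', 'Max'],
    'age': [10, 13, 9],
    'with_adult': [False, False, True]
})

def is_allowed(age, with_adult):
    # allowed if at least 12 years old, or accompanied by an adult
    return age >= 12 or with_adult
```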
To apply this function to each row, you may be inclined to try kids.apply(is_allowed, axis=1), but this fails because pandas doesn't know which columns to use for the age and with_adult parameters of the is_allowed() function.
It's important to understand that the input to the function is a Series representation of the "current row". For example, the Series representation of the first row of kids looks like this.
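```python
kids.iloc[0]
# name          Jimmy
# age              10
# with_adult    False
# Name: 0, dtype: object
```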
Passing this result into the is_allowed() function fails; the entire row lands in the first parameter, age, and nothing is supplied for with_adult:
- TypeError: is_allowed() missing 1 required positional argument: 'with_adult'
To overcome this, we create a wrapper function for is_allowed() using lambda, as follows.
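```python
allowed = kids.apply(lambda row: is_allowed(row['age'], row['with_adult']), axis=1)
```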
Tacking that onto our kids DataFrame, we can see exactly who's allowed in the haunted house.
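```python
kids['allowed'] = allowed
print(kids)
#     name  age  with_adult  allowed
# 0  Jimmy   10       False    False
# 1  Sally   13       False     True
# 2    Max    9        True     True
```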
- Beat it, Jimmy!
Merging DataFrames¶
The workhorse function for merging DataFrames together is merge(). To motivate its use, suppose we have the following two tables of data:
- pets: one row per pet. The index represents each pet_id.
- visits: one row per visit. The index represents each visit_id.
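A sketch with made-up data, shaped so the key overlaps match the notes below (pet_id 11 only in pets, 31 only in visits, and 42 duplicated in both):

```python
import pandas as pd

pets = pd.DataFrame(
    {'name': ['Rex', 'Mittens', 'Mittens'],
     'animal': ['dog', 'cat', 'cat']},
    index=[11, 42, 42]   # pet ids live in the row index
)

visits = pd.DataFrame(
    {'pet_id': [31, 42, 42, 42],
     'date': pd.to_datetime(['2023-01-05', '2023-02-01', '2023-03-10', '2023-04-22'])},
    index=[1, 2, 3, 4]   # visit ids live in the row index
)
```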
merge() supports four types of joins: left, right, inner, and outer, with the default being inner. The join type determines which keys show up in the result.
Inner Join¶
The result of an inner join only includes keys that exist in both the left and right tables. (A single key can appear multiple times in the result if it appears multiple times in one of the input tables.)
Here we:
- designate pets as the left table
- designate visits as the right table
- set the join type as inner
- set left_index=True because in the left table (pets), pet ids are stored in the row index
- set right_on='pet_id' because in the right table (visits), pet ids are stored in a column called pet_id
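In code:

```python
pd.merge(
    left=pets,
    right=visits,
    how='inner',
    left_index=True,
    right_on='pet_id'
)
```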
Notice:
- pet_id 11 is excluded because it only exists in the pets table, and pet_id 31 is excluded because it only exists in the visits table.
- pet_id 42 occurs six times in the result because it occurred three times in the visits table, each of which matched two occurrences in the pets table.
When your DataFrame's row index has some specific meaning (like ours does), it's good practice to give it a name. Here we rename the row index of pets to pet_id. We'll also rename the row index of visits to visit_id.
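```python
pets.index.name = 'pet_id'
visits.index.name = 'visit_id'
```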
A convenient benefit of naming the pet_id index is that we can rebuild the inner join from before using on='pet_id' instead of setting left_index=True and right_on='pet_id'.
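```python
pd.merge(left=pets, right=visits, how='inner', on='pet_id')
```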
When you use the on
parameter like this, pandas searches both tables for a matching column name and/or row index to merge the tables with. Also notice that this result is slightly different from our previous version - specifically the resulting row index is different and the resulting column order is different. I'll touch on that a bit later.
Left Join¶
The result of a left join retains all and only the keys in the left table. Let's see it in action, again making pets the left table and visits the right table, joining on pet_id.
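```python
pd.merge(left=pets, right=visits, how='left', on='pet_id')
```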
Notice:
- The result includes pet_id 11, which exists in the left table (pets) but not the right table (visits).
- pet_id 42 again occurs six times in the result because its two instances in the left table each matched three instances in the right table.
Right Join¶
A right join is the same as a left join, except the right table's keys are preserved. In this case, pet_id 11 is excluded but pet_id 31 is retained in the result.
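```python
pd.merge(left=pets, right=visits, how='right', on='pet_id')
```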
Outer Join¶
The last join type supported by merge() is an outer join, which includes every key from both tables in the output.
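```python
pd.merge(left=pets, right=visits, how='outer', on='pet_id')
```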
Anti-join¶
Unfortunately merge() doesn't support an anti-join, which answers the question:
Which records from table A don't match any records from table B?
However, cooking up an anti-join is not terribly difficult. Observe the following two techniques (and their differences).
Anti-Join Method 1¶
Suppose we want to see which records in the pets table don't have a matching record in the visits table via pet_id. We can start by doing an outer join with indicator=True.
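```python
m = pd.merge(pets, visits, how='outer', on='pet_id', indicator=True)
```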
The result includes a column named '_merge' that indicates whether the corresponding record came from the left table, the right table, or both tables. Filtering this result where _merge == 'left_only' identifies pets that don't exist in visits. This equates to an anti-join between pets and visits on pet_id.
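```python
m.loc[m['_merge'] == 'left_only']   # pets that never appear in visits
```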
Warning
This technique is simple but memory inefficient, as the outer join creates a large intermediate table.
Anti-Join Method 2¶
The second method is a little trickier, but it's more memory efficient. The basic idea is, for each pet_id in the pets table, we check whether it exists in the visits table using the isin() method. Then we negate the result and use it to subset the pets table.
So if we do pets.index.isin(visits.pet_id), we get back a boolean array indicating whether each pet_id in the pets table also exists in the visits table. Then we can negate that array and use it as a boolean index to subset the pets table.
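```python
mask = pets.index.isin(visits.pet_id)   # True where the pet also appears in visits
pets.loc[~mask]                         # the anti-join
```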
Since the visits table includes duplicate pet_ids, we can slightly improve this by using only its unique pet_ids.
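```python
pets.loc[~pets.index.isin(visits.pet_id.unique())]
```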
Aggregation¶
DataFrame's agg() method lets you aggregate the rows of a DataFrame to calculate summary statistics. For example, given the following DataFrame, we can calculate the sum of each column using df.agg('sum').
Or we could pass in a function callable, like np.sum.
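A sketch with a hypothetical two-column df (reused in the examples below):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

df.agg('sum')    # method name as a string
df.agg(np.sum)   # function callable
```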
Aggregation is intended to be used with a special class of functions that input a Series of values and reduce them down to a single value, or at least fewer values than the input. These include functions like sum(), mean(), min(), and max().
By contrast, transformation functions like sort_values(), fillna(), and round() input a Series of values and return another Series the same length as the input.
When you pass a string into df.agg(), like df.agg('foofunc'), pandas:
- first looks for a DataFrame method called 'foofunc'; if it can't find anything, it
- then checks for a NumPy function called 'foofunc'; if it still can't find 'foofunc', it
- then raises an error.
If you define your own function named foofunc(), you can use it with df.agg(), but you have to pass it in as a callable.
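```python
def foofunc(s):
    # hypothetical custom aggregator: the range of a Series
    return s.max() - s.min()

df.agg(foofunc)      # works: passed as a callable
# df.agg('foofunc')  # error: not a DataFrame method or a NumPy function
```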
The result of df.agg('sum') was a 2-element Series with the sum of each column. To return the result as a DataFrame, wrap the aggregation function inside a list.
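```python
df.agg(['sum'])   # a one-row DataFrame instead of a Series
```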
Using a list, we can aggregate the data using multiple functions. For example, to get the sum and mean of each column,
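```python
df.agg(['sum', 'mean'])
```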
To calculate the sum and mean of column x, and the min and max of column y, we can use a dictionary of column:list-of-functions pairs to tell pandas exactly which functions to apply to each column.
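```python
df.agg({'x': ['sum', 'mean'], 'y': ['min', 'max']})
```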
DataFrame.agg() includes an axis argument. By default it's set to 0, which tells pandas to aggregate the DataFrame's rows. This means an operation like df.agg('min') calculates column mins. By contrast, df.agg('min', axis=1) aggregates the DataFrame's columns, calculating row mins.
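```python
df.agg('min')           # column mins (axis=0 is the default)
df.agg('min', axis=1)   # row mins
```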
DataFrame.agg() also includes *args and **kwargs parameters, which can be used to pass constants into the aggregation function(s). For example, if you define a custom function like nth_value() that returns the nth value of a Series, you could aggregate df, selecting the 3rd value in each column, as follows.
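A sketch; nth_value() is our own hypothetical helper:

```python
def nth_value(s, n):
    # return the value at position n of a Series
    return s.iloc[n]

df.agg(nth_value, 0, 2)   # positionally: axis=0, then n=2 -> the 3rd value of each column
```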
Tip
For simple functions like this, you'll often see people build them on the fly using lambda.
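```python
df.agg(lambda s: s.iloc[2])   # same thing, built on the fly
```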
Group By¶
You can use groupby() to partition a DataFrame or Series into groups and subsequently aggregate or transform the data in each group.
Setup
Every group by operation starts with a call to the groupby() method.
Note
groupby() exists both as a Series method and as a DataFrame method.
Perhaps the most important parameter of groupby() is the by argument, which tells pandas how to split the DataFrame (or Series) into groups. Usually you'll specify one or more column names here. For example, df.groupby(by='A') tells pandas to partition the data based on the unique values in column A. In this case, pandas partitions the data into two groups, like this.
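A sketch; the df below is made up, shaped to match the groupings described in this section:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'foo', 'bar'],
    'B': [False, True, False, True, True],
    'C': [1.2, 0.5, 2.8, 3.1, 4.9],
    'D': [10.0, np.nan, 30.0, 40.0, np.nan]
})

df.groupby(by='A')   # two groups: 'bar' (rows 1, 4) and 'foo' (rows 0, 2, 3)
```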
Alternatively, you could do df.groupby(by=['A', 'B']) to partition the rows based on the set of unique (A, B) pairs. In this case, pandas partitions the data into these three groups.
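```python
groups_ab = df.groupby(by=['A', 'B'])
# three groups: ('bar', True), ('foo', False), ('foo', True)
```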
The result of calling DataFrame.groupby() is a special type of object called a DataFrameGroupBy, and the result of calling Series.groupby() is a SeriesGroupBy object. Both are extensions of a generic GroupBy class, so they behave very similarly.
GroupBy objects have some special features and attributes, like .groups, which returns a dictionary of group-key:list-of-input-rows pairs. For example, if we group df by A and B, groups_ab.groups tells us that rows 0 and 2 correspond to the key ('foo', False).
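```python
groups_ab.groups
# {('bar', True): [1, 4], ('foo', False): [0, 2], ('foo', True): [3]}
```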
Similarly, .ngroup() tells us the numeric group id associated with each row of the input data. Notice that group ids are ordered by the order of their keys. For example, group 0 corresponds to key ('bar', True) because alphabetically 'bar' comes before 'foo', even though the key ('foo', False) occurs first in the original DataFrame.
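```python
groups_ab.ngroup()
# 0    1
# 1    0
# 2    1
# 3    2
# 4    0
# dtype: int64
```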
If you'd like group ids to be ordered by the first occurrence of each key, use groupby(sort=False).
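```python
df.groupby(by=['A', 'B'], sort=False).ngroup()
# now ('foo', False) -> 0, ('bar', True) -> 1, ('foo', True) -> 2
```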
We can pass a Series into the by argument of groupby(), in which case:
- the Series values determine the groups, and
- the Series index determines which rows of the DataFrame map to each group
For example, if we wanted to group the data based on the whole part of the values in column C, we could:
- build a Series with the whole part of the values in column C
- use that Series to partition df into groups
In this case, the values in the Series get partitioned into five groups. Then the index labels of the Series tell pandas how to map the rows of df into those five groups. Index 0 goes to group 1, index 1 goes to group 0, index 2 goes to group 2, and so on.
Note
Typically this would be written as a one-liner.
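```python
whole = df['C'].astype(int)    # whole part of each value: [1, 0, 2, 3, 4]
df.groupby(by=whole).size()    # five groups, keyed 0 through 4

# the usual one-liner
df.groupby(df['C'].astype(int)).size()
```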
GroupBy Aggregate¶
We've already seen how to aggregate an entire DataFrame. Aggregating a GroupBy object is essentially the same thing, where the aggregation applies to each group of data and the results are combined into a DataFrame or Series.
For example, here we group df by column A and aggregate the groups, taking the sum of column C.
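```python
df.groupby('A')['C'].sum()
# A
# bar    5.4
# foo    7.1
# Name: C, dtype: float64
```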
Breakdown
- df.groupby('A') creates a DataFrameGroupBy object.
- Tacking on ['C'] isolates column C, returning a SeriesGroupBy object.
- Appending .sum() takes group sums.
The result is a Series whose index labels are the group keys and whose values are the group sums.
Tip
Keep in mind, DataFrameGroupBy and SeriesGroupBy are both derived from a generic GroupBy class. The documentation for GroupBy identifies sum() as one of many available aggregation methods.
Here's a similar solution, except this time we wrap C inside a list. In this case, pandas does not convert the DataFrameGroupBy object to a SeriesGroupBy object. Hence, the result is a DataFrame of group sums instead of a Series of group sums. This is analogous to selecting a column with df['C'] (which returns a Series) versus df[['C']] (which returns a DataFrame).
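```python
df.groupby('A')[['C']].sum()   # a one-column DataFrame of group sums
```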
We could use this technique to calculate group sums for multiple columns. For example,
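```python
df.groupby('A')[['C', 'D']].sum()
```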
We could replace sum() here with agg() and pass in a method name or function callable.
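```python
df.groupby('A')[['C', 'D']].agg('sum')    # method name
df.groupby('A')[['C', 'D']].agg(np.sum)   # function callable
```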
Basically all the rules for DataFrame aggregation apply here. This means you can even do complex operations, like getting the sum of C and the mean and variance of D for each group in A.
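```python
df.groupby('A').agg({'C': ['sum'], 'D': ['mean', 'var']})
```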
MultiIndex
The resulting DataFrame here has something called a MultiIndex for its columns. Similarly, df.groupby(by=['A', 'B'])[['D']].mean() returns a DataFrame with a row MultiIndex. We'll discuss MultiIndexes later.
If we group df by B and D, taking the min and max of C, you'll notice that pandas ignores and drops the NaNs inside D.
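```python
df.groupby(['B', 'D'])['C'].agg(['min', 'max'])   # rows where D is NaN are dropped
```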
If you want NaNs to be included in the result, set dropna=False within the call to groupby().
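```python
df.groupby(['B', 'D'], dropna=False)['C'].agg(['min', 'max'])
```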
Renaming the output columns¶
One trick to rename the output columns of a groupby aggregate operation is to pass named tuples into the agg() method, where the keywords become the column names.
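A sketch using pd.NamedAgg; the output names c_sum and d_mean are our own:

```python
df.groupby('A').agg(
    c_sum=pd.NamedAgg(column='C', aggfunc='sum'),
    d_mean=pd.NamedAgg(column='D', aggfunc='mean')
)

# equivalent shorthand with plain tuples
df.groupby('A').agg(c_sum=('C', 'sum'), d_mean=('D', 'mean'))
```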
GroupBy Transform¶
In a groupby transform operation, the output object has the same number of rows as the input object. For example, suppose we want to sort the values in column C of df, but only relative to each group in A. We can achieve this with the following groupby transform operation.
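```python
df.groupby('A')['C'].transform(pd.Series.sort_values)
```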
Breakdown
- df.groupby('A')['C'] groups the data by column A and then isolates column C.
- .transform(pd.Series.sort_values) sorts the values in column C, per group.
The result of this operation is a new Series whose index matches the row index of the original DataFrame. This is useful, for example, if you want to tack this on as a new column of df or overwrite the existing values in column C. For example,
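```python
df['C_sorted'] = df.groupby('A')['C'].transform(pd.Series.sort_values)
# or overwrite the original column
df['C'] = df.groupby('A')['C'].transform(pd.Series.sort_values)
```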
The transform() method takes one primary argument: the function you want to apply. In this case, sort_values() isn't a native GroupBy method, so we have to pass it in as a function callable.
As you probably guessed, if we do the same exact thing but wrap 'C' inside a list, we get back a DataFrame with essentially the same data.
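```python
df.groupby('A')[['C']].transform(pd.Series.sort_values)   # DataFrame, not Series
```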
Let's look at another example where again we group by column A, but this time we calculate the mean of columns C and D.
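```python
df.groupby('A')[['C', 'D']].transform('mean')
```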
Breakdown
-
Like before,
df.groupby('A')
creates a DataFrameGroupBy object. -
df.groupby('A')[['C', 'D']]
selects columns C and D giving us a slightly different DataFrameGroupBy object. -
.transform(mean)
calculates the mean of C and D within each group. However, the resulting means are expanded to match the shape of the input groups.Even though mean is an aggregation function, our usage of transform (as opposed to aggregate) tells pandas to expand the means into a DataFrame with the same number of rows as
df
.