You can add a prefix or suffix to each element in a Series of strings by adding the prefix or suffix string directly to the Series with the + operator.
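A minimal sketch (the Series values here are made up):

```python
import pandas as pd

# A hypothetical Series of strings
s = pd.Series(["oak", "elm", "fir"])

prefixed = "tree_" + s   # prepend to every element
suffixed = s + "_wood"   # append to every element
print(prefixed.tolist())  # ['tree_oak', 'tree_elm', 'tree_fir']
print(suffixed.tolist())  # ['oak_wood', 'elm_wood', 'fir_wood']
```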
The workhorse datetime type in pandas is Timestamp, which is really just a wrapper
of NumPy's datetime64 type. Like NumPy, pandas doesn't have a proper date type, so to represent a date like "January 3rd, 2022", you'd typically use the Timestamp 2022-01-03 00:00:00.
Note that, unlike NumPy, pandas Timestamps are always stored with nanosecond units, so you can't create a Timestamp before 1678 or after 2262. If you try, you'll get an error.
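A quick sketch of both cases (note that newer pandas versions may accept out-of-range dates by falling back to coarser units, hence the try/except):

```python
import pandas as pd

ts = pd.Timestamp("2022-01-03")
print(ts)  # 2022-01-03 00:00:00

try:
    pd.Timestamp("1500-01-01")  # outside the nanosecond range
except pd.errors.OutOfBoundsDatetime as err:
    print(err)
```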
The to_datetime() function smartly interprets various datetime formats. For example,
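these three strings all parse to the same Timestamp:

```python
import pandas as pd

a = pd.to_datetime("2020-03-08")
b = pd.to_datetime("3/8/2020")
c = pd.to_datetime("March 8, 2020")
print(a == b == c)  # True
```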
Some formats cannot be automatically interpreted, and in any case it's better to state the datetime format explicitly. To do this, pass a strptime format string via the format parameter.
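For example, "08/03/2020" is ambiguous on its own, but an explicit strptime format pins it down:

```python
import pandas as pd

# %d/%m/%Y means day first, so this is March 8th, not August 3rd
ts = pd.to_datetime("08/03/2020", format="%d/%m/%Y")
print(ts)  # 2020-03-08 00:00:00
```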
If we pass a list of datetime strings into the to_datetime() function, we get back a DatetimeIndex
that we could use as the index of a Series or DataFrame.
If we wrap that with pd.Series(), we get back a Series of datetime64s.
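A sketch of both cases:

```python
import pandas as pd

idx = pd.to_datetime(["2020-03-08", "2020-03-09", "2020-03-10"])
print(type(idx).__name__)  # DatetimeIndex

s = pd.Series(idx)  # a Series of datetime64s
print(s.dtype)
```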
You can use the date_range() function to build a sequence of datetimes based on a regular interval. For example, you can make a DatetimeIndex with 10 values starting on March 8th 2020, iterating by 12 hours like so.
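A sketch of that call:

```python
import pandas as pd

# 10 datetimes starting 2020-03-08, spaced 12 hours apart
idx = pd.date_range(start="2020-03-08", periods=10, freq="12h")
print(idx[:3])
```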
If we did the same thing with tz='America/Denver', notice the output changes slightly because Daylight Saving Time in Denver started at 2AM on March 8th in 2020, meaning the clock in Denver went from 1:59AM to 3:00AM, basically skipping the whole 2 o'clock hour.
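A sketch of the same range localized to Denver:

```python
import pandas as pd

idx = pd.date_range(start="2020-03-08", periods=4, freq="12h",
                    tz="America/Denver")
# March 8th 2020 has no 2 o'clock hour in Denver, so the wall-clock
# times shift after the DST transition
print(idx)
```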
One of pandas' best features is its ability to represent categorical data using the Categorical type, much like the factor data type in R. A Categorical takes on a limited, and usually fixed, number of possible values (i.e. categories).
Setup

Suppose we have a dataset representing four cars. For each car, we have its VIN, color, and size classification:
Note that this is not a Series, but you could make it a Series just by wrapping it in pd.Series(colors).
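A sketch of building the Categorical (the color values are made up):

```python
import pandas as pd

colors = pd.Categorical(["red", "blue", "blue", "black"])
print(colors.categories)  # the unique non-NaN values, alphabetical

s = pd.Series(colors)  # a Series with dtype 'category'
print(s.dtype)
```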
By default, when you build a Categorical, pandas sets the categories to the unique, non-NaN values in the data. To explicitly define the categories, use the categories argument. This is especially useful if your collection includes categories not observed in the data.
The categories parameter also lets you control the order in which categories are displayed, which can be handy for things like plots and reports. Without the categories parameter, pandas displays them in lexical (alphabetical) order. Alternatively, you can define the categories in a specific order. For example, here we specify colors in order from "bright to dark".
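A sketch using the same hypothetical color data, with an unobserved "yellow" category and a bright-to-dark ordering:

```python
import pandas as pd

colors = pd.Categorical(
    ["red", "blue", "blue", "black"],
    # "yellow" never appears in the data; order is bright to dark
    categories=["yellow", "red", "blue", "black"],
)
print(colors.categories)
```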
sizes is similar to colors, but unlike colors, sizes has an inherent order: 'mini' is less than 'standard' is less than 'extended'. To define an ordered Categorical, pass the ordered categories into the categories argument and set ordered=True.
With an ordered Categorical, you can do things like compare sizes < 'extended' and get back a boolean array.
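A sketch (the size values are assumed):

```python
import pandas as pd

sizes = pd.Categorical(
    ["standard", "mini", "extended", "standard"],
    categories=["mini", "standard", "extended"],
    ordered=True,
)
print(sizes < "extended")  # [ True  True False  True]
```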
???+ note
    Categoricals don't have .loc or .iloc accessors. To subset sizes to the values less than 'extended', you must use square bracket notation, like indexing a NumPy array.
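For example, with the same hypothetical size data:

```python
import pandas as pd

sizes = pd.Categorical(
    ["standard", "mini", "extended", "standard"],
    categories=["mini", "standard", "extended"],
    ordered=True,
)
# no .loc or .iloc here; boolean-index like a NumPy array
print(sizes[sizes < "extended"])
```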
A MultiIndex, or hierarchical index, is an index that uses multiple values in its keys.
There are a few different ways to make a MultiIndex from scratch. Perhaps the most intuitive way is to use the pd.MultiIndex.from_tuples() constructor, passing in a list of tuples. We'll do that here to generate a MultiIndex of (store, product) pairs.
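A sketch (the store and product values are made up):

```python
import pandas as pd

midx = pd.MultiIndex.from_tuples(
    [("A", "cereal"), ("A", "milk"), ("B", "cereal"), ("B", "milk")],
    names=["store", "product"],
)
print(midx)
```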
Now let's take a look at a groupby aggregate operation since this is a more common way MultiIndexes arise in practice. We'll start by building a DataFrame with 7 rows and 4 columns.
Then we'll set stew equal to the following groupby aggregate operation.
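A sketch of what this might look like (the frame's column names and values are assumptions):

```python
import pandas as pd

# A hypothetical 7x4 DataFrame
df = pd.DataFrame({
    "A": ["x", "x", "y", "y", "x", "y", "x"],
    "B": ["p", "q", "p", "q", "p", "q", "p"],
    "C": [1, 2, 3, 4, 5, 6, 7],
    "D": [7, 6, 5, 4, 3, 2, 1],
})

# Two grouping keys -> row MultiIndex; two agg funcs -> column MultiIndex
stew = df.groupby(["A", "B"]).agg(["sum", "mean"])
print(stew)
```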
In this case, the row index is a MultiIndex
and so is the column index.
We can use DataFrame.xs() to select all columns representing a sum.
We could select D columns by specifying key='D' and level=0.
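A sketch of both selections, assuming a stew built from a hypothetical frame with columns C and D aggregated by sum and mean:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["x", "x", "y", "y", "x", "y", "x"],
    "B": ["p", "q", "p", "q", "p", "q", "p"],
    "C": [1, 2, 3, 4, 5, 6, 7],
    "D": [7, 6, 5, 4, 3, 2, 1],
})
stew = df.groupby(["A", "B"]).agg(["sum", "mean"])

sums = stew.xs("sum", axis=1, level=1)      # every 'sum' column
d_cols = stew.xs(key="D", axis=1, level=0)  # every 'D' column
print(sums.columns.tolist(), d_cols.columns.tolist())
```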
There are lots of ways to flatten the columns of a MultiIndex into a regular index. Perhaps the easiest is to use the to_flat_index() method. For example,
This returns a generic Index object which behaves similarly to a list of tuples. With a basic list comprehension, we can convert this to a list of strings
... which we can set equal to stew.columns. This flattens stew's column MultiIndex.
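A sketch, again assuming a stew built from a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["x", "x", "y", "y", "x", "y", "x"],
    "B": ["p", "q", "p", "q", "p", "q", "p"],
    "C": [1, 2, 3, 4, 5, 6, 7],
    "D": [7, 6, 5, 4, 3, 2, 1],
})
stew = df.groupby(["A", "B"]).agg(["sum", "mean"])

# Join each (column, func) tuple into a single string
stew.columns = ["_".join(pair) for pair in stew.columns.to_flat_index()]
print(stew.columns.tolist())
```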
There's a good chance you'll want to reshape your data like this
Here, we converted the data from long format to wide format. Some people would say we converted the data from stacked format to unstacked format. For better or worse, there are a ton of ways to do this with pandas. Perhaps the simplest method is to use the pivot() method of a DataFrame...
We can reshape it to wide format using df.pivot(), telling pandas which columns we want to use for the row index, column index, and values.
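A sketch with hypothetical long-format data:

```python
import pandas as pd

df = pd.DataFrame({
    "row": ["row0", "row0", "row1", "row1"],
    "col": ["col0", "col1", "col0", "col1"],
    "val": [1, 2, 3, 4],
})
wide = df.pivot(index="row", columns="col", values="val")
print(wide)
```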
Now suppose df had slightly different data...
This time, df.pivot() generates an error.
The error occurs because there are two instances where multiple values in the input map to the exact same position in the output, and we haven't told pandas how to deal with that. This is where the more general pivot_table() method comes in.
pivot_table() works just like pivot(), but you can specify an aggregation function to tell pandas how to handle multiple values mapping to the same place.
For example, here we convert df from long to wide format using pivot_table() with aggfunc=list.
Notice that 67 and 9 mapped to the same cell in the output because they had the same row and col values in the input. aggfunc tells pandas how to deal with those values, in this case wrapping them into a list. Also note that the combinations (row0, col0) and (row1, col1) didn't exist in df, so pandas fills those entries with NaN by default.
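A sketch reconstructing this example (the values besides 67 and 9 are made up):

```python
import pandas as pd

# 67 and 9 share the same (row, col) pair, and the combinations
# (row0, col0) and (row1, col1) never occur
df = pd.DataFrame({
    "row": ["row0", "row0", "row1", "row1"],
    "col": ["col1", "col1", "col0", "col0"],
    "val": [67, 9, 5, 8],
})

# df.pivot(...) would raise an error here; pivot_table handles it
wide = df.pivot_table(index="row", columns="col", values="val", aggfunc=list)
print(wide)
```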
You're probably not interested in pivoting data just to build lists of values. More interestingly, you can use aggregation functions like mean(), sum(), size(), count(), or a combination of them.
This works just like normal DataFrame aggregation. So, for example, to get the count and sum of the elements mapping to each cell in the output table, use aggfunc=['count', 'sum'].
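A sketch with hypothetical data containing duplicate (row, col) pairs:

```python
import pandas as pd

df = pd.DataFrame({
    "row": ["row0", "row0", "row1", "row1"],
    "col": ["col1", "col1", "col0", "col0"],
    "val": [67, 9, 5, 8],
})

# Two agg funcs -> a column MultiIndex of (func, col) pairs
wide = df.pivot_table(index="row", columns="col", values="val",
                      aggfunc=["count", "sum"])
print(wide)
```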
As you can probably guess, pandas supports this kind of pivot operation with multiple columns for the row index, column index, and values. So if you have input like this
The most important parameter here is value_vars, a list of the columns you want to stack. In fact, that's literally all you need to provide for the function to work (although the output isn't very friendly without the other information).
var_name and value_name tell pandas how to name the new variable and value columns, and id_vars tells pandas which other columns from the original DataFrame to keep in the result.
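These parameters all belong to melt(); here's a sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"id": ["a", "b"], "x": [1, 2], "y": [3, 4]})

tall = df.melt(id_vars=["id"], value_vars=["x", "y"],
               var_name="variable", value_name="value")
print(tall)
```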
If you have a DataFrame of daily stock prices like this
You can use df.stack() to reshape the DataFrame into a Series, literally stacking the values of ABC, DEF, and GHI within each original row index, resulting in a 2-level row MultiIndex.
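A sketch with made-up prices:

```python
import pandas as pd

df = pd.DataFrame(
    {"ABC": [1.0, 1.1], "DEF": [2.0, 2.2], "GHI": [3.0, 3.3]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02"]),
)
stacked = df.stack()  # a Series with a 2-level (date, ticker) MultiIndex
print(stacked)
```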
Things get a little trickier when df has a column MultiIndex. Consider this DataFrame.
stack()'s most important parameter is level. It tells Pandas which level(s) from the column index you want to stack into the row index.
Let's see some examples, keeping in mind that our pets DataFrame has two levels in its column index:

- type, with values dog and cat
- subtype, with values persian, calico, corgie, and lab.
If we call pets.stack(level='type'), you can see how the first level, type, gets extracted from the column index and inserted into the row index, and the data re-aligns itself accordingly.
Pretty much the same thing happens when you call stack(level='subtype').
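A sketch of the pets frame (the numbers are made up) showing both calls:

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [("cat", "persian"), ("cat", "calico"), ("dog", "corgie"), ("dog", "lab")],
    names=["type", "subtype"],
)
pets = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=cols)

by_type = pets.stack(level="type")        # 'type' joins the row index
by_subtype = pets.stack(level="subtype")  # 'subtype' joins the row index
print(by_type)
```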
???+ note "level=-1"
    The level parameter defaults to -1, the innermost (last) level of the column index, so pets.stack() is equivalent to pets.stack(level='subtype').
level can also be a list like ['type', 'subtype'], in which case both column levels are stacked into the row index and the result is a Series, not a DataFrame.