Pandas was built on top of the capabilities of NumPy. As a result, some of its features are just NumPy functions working under the hood: a Pandas Series can be described as a wrapper for a NumPy array, and a DataFrame is basically a collection of Series sharing the same index.
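To make that relationship concrete, here is a minimal sketch. The variable names and sample values are made up for illustration; it only uses `Series.to_numpy()` and `Index.equals()`.

```python
import numpy as np
import pandas as pd

# A Series holds its values in an array and attaches an index to them.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
values = s.to_numpy()  # the values as a plain NumPy array
print(type(values))

# A DataFrame built from two Series: every column shares one index.
df = pd.DataFrame({"x": s, "y": s * 2})
print(df["x"].index.equals(df["y"].index))
```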
Building on top of NumPy like that had some side effects. For example, NumPy doesn't know how to handle missing data for integer, bool, and string data types.
So whenever integer data contained missing values, Pandas converted it to float64 and used np.nan to represent the missing values (boolean and string data fell back to the generic object type). As you would expect, this shortcoming often caused subtle bugs in Pandas's algorithms.
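One way such a bug can creep in is loss of precision: float64 cannot represent every 64-bit integer exactly. The sketch below is only an illustration, and the value `2**53 + 1` is chosen purely because it is the first integer a float64 cannot hold.

```python
import numpy as np
import pandas as pd

# NumPy has no integer "missing" marker, so it falls back to a generic object array.
print(np.array([1, 2, None]).dtype)

# Pandas upcasts to float64 instead, which silently rounds large integers.
big = 2**53 + 1  # not exactly representable as a float64
as_float = pd.Series([big, None])
print(as_float.dtype)
print(int(as_float.iloc[0]) == big)  # the stored value is no longer the original
```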
To address these shortcomings, the Pandas community developed an extension type system, which provides an interface for handling data types that are not supported natively in NumPy, thus extending NumPy's type system. For example, let's consider a Series of integer values with missing data:
>>> import pandas as pd
>>> s1 = pd.Series([1, 2, 3, 4, None])
>>> s1
0 1.0
1 2.0
2 3.0
3 4.0
4 NaN
dtype: float64
Notice how, even though we created a Series of integer values, Pandas automatically converts them to float and uses np.nan (np is short for NumPy) to represent the missing value.
Using one of Pandas's extension types, let's recreate this Series, specifying that the values be stored as integers instead of floats:
>>> s2 = pd.Series([1, 2, 3, 4, None], dtype=pd.Int64Dtype())
>>> s2
0 1
1 2
2 3
3 4
4 <NA>
dtype: Int64
>>> s2.dtype
Int64Dtype()
The dtype is now Int64, just as we specified, so the values stay integers. And Pandas uses its special sentinel value, pd.NA (displayed as <NA>), instead of np.nan to show that a value is missing for an extension type.
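A short sketch of how the sentinel behaves in practice may help here. It rebuilds the same Series and uses only standard Series methods; the name `shifted` is just illustrative.

```python
import pandas as pd

s2 = pd.Series([1, 2, 3, 4, None], dtype=pd.Int64Dtype())

missing = s2.iloc[4]
print(missing is pd.NA)      # the missing slot holds the pd.NA singleton
print(s2.isna().tolist())    # isna() flags it like any other missing value
print(s2.sum())              # reductions skip missing values by default

shifted = s2 + 1             # arithmetic keeps the integer dtype and the NA
print(shifted.dtype)
```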
One nice thing about Pandas's extension type system is its flexibility. Instead of pd.Int64Dtype(), we could just use the shorter string alias "Int64" to specify the data type:
>>> s2 = pd.Series([1, 2, 3, 4, None], dtype="Int64")
# Preserve the capitalization and quotation marks.
>>> s2.dtype
Int64Dtype()
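The capitalization matters because lowercase "int64" refers to NumPy's own integer type, which cannot hold missing values. The sketch below contrasts the two; the exact exception class is an assumption that may differ between Pandas versions, so it catches both likely candidates.

```python
import pandas as pd

data = [1, 2, 3, 4, None]

# The capitalized alias resolves to the nullable extension type.
nullable = pd.Series(data, dtype="Int64")
print(nullable.dtype == pd.Int64Dtype())

# The lowercase alias asks for a plain NumPy int64, which has no room for None.
try:
    pd.Series(data, dtype="int64")
    lowercase_failed = False
except (TypeError, ValueError) as exc:
    lowercase_failed = True
    print(type(exc).__name__)
```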
You can also pass extension types to the Series astype method. This will allow you to specify a different extension type for each Series in a DataFrame.
For example, consider this DataFrame containing a mix of integer, string, boolean, and None values:
>>> df = pd.DataFrame({"A": [1, 2, 3, None], "B": ['one', 'two', 'three', None], "C": [False, True, None, False]})
>>> df
     A      B      C
0  1.0    one  False
1  2.0    two   True
2  3.0  three   None
3  NaN   None  False
First, let's check how Pandas defines each dtype in this DataFrame:
>>> df["A"].dtype
dtype('float64') # Column A is float64.
>>> df["B"].dtype
dtype('O') # The 'O' means 'object.'
>>> df["C"].dtype
dtype('O')
Pandas's extension types let you specify a dtype for each column in the DataFrame, allowing you to convert easily from one dtype to another:
>>> df["A"] = df["A"].astype("Int64")
>>> df["B"] = df["B"].astype("string")
>>> df["C"] = df["C"].astype("boolean")
>>> df
      A      B      C
0     1    one  False
1     2    two   True
2     3  three   <NA>
3  <NA>   <NA>  False
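The same conversion can also be written as one call, since `DataFrame.astype` accepts a dictionary mapping column names to dtypes. This is a sketch of an alternative to the column-by-column version above; the name `converted` is just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, None],
    "B": ["one", "two", "three", None],
    "C": [False, True, None, False],
})

# Map each column name to the extension type it should use.
converted = df.astype({"A": "Int64", "B": "string", "C": "boolean"})
print(converted.dtypes)
```

Unlike the in-place assignments, this leaves the original df untouched and returns a new DataFrame.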
Here's a list of the commonly used extension types in Pandas:
Extension Type | Description |
BooleanDtype | Defines Boolean data. Use "boolean" as the argument. |
Int64Dtype | Defines 64-bit integer values. Use "Int64" as the argument. |
Float64Dtype | Defines 64-bit floating-point values. Use "Float64" as the argument. |
DatetimeTZDtype | Defines datetime data with time zones included. |
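As a quick, hedged illustration of the types in the table, the sketch below builds one small Series per type. The variable names and sample values are invented; the time zone-aware Series is produced with `dt.tz_localize`, which is one of several ways to end up with a DatetimeTZDtype.

```python
import pandas as pd

flags = pd.Series([True, None], dtype="boolean")
counts = pd.Series([1, None], dtype="Int64")
ratios = pd.Series([0.5, None], dtype="Float64")

# Localizing naive timestamps to a time zone yields a DatetimeTZDtype column.
stamps = pd.Series(pd.to_datetime(["2021-01-01", "2021-06-01"])).dt.tz_localize("UTC")

for series in (flags, counts, ratios, stamps):
    print(type(series.dtype).__name__)
```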
The extension type system in Pandas is relatively new and a little advanced for beginners. This article gives only a light introduction to the feature. To read about more advanced techniques, like how to create your own Pandas extension type, browse through the documentation or read this Medium article.