Extension Data Types in Pandas

Pandas was built on top of the capabilities of NumPy. As a result, some of its features are just NumPy functions working under the hood: a Pandas Series can be described as a wrapper for a NumPy array, and a DataFrame is basically a collection of Series sharing the same index.

Building on top of NumPy like that had some side effects. For example, NumPy doesn't know how to handle missing data for integer, bool, and string data types.

So whenever integer data contained missing values, Pandas converted it to float64 and used np.nan to represent the gaps, while Boolean and string data with missing values fell back to the generic object dtype. As you would expect, these silent conversions often caused subtle bugs in Pandas's algorithms.
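One concrete way the float64 fallback can bite is precision: float64 cannot represent every 64-bit integer exactly. The following is a minimal sketch of that effect, not something taken from the Pandas documentation; the variable names and the sample value 2**53 + 1 are chosen purely for illustration.

```python
import pandas as pd

# 2**53 + 1 is the first positive integer that float64 cannot store exactly.
big = 2**53 + 1

# Default behaviour: the missing value forces an upcast to float64.
as_float = pd.Series([big, None])

# Nullable extension type: the values stay 64-bit integers.
as_int = pd.Series([big, None], dtype="Int64")

print(as_float.dtype)           # float64
print(int(as_float[0]) == big)  # False: the value was rounded
print(as_int.dtype)             # Int64
print(as_int[0] == big)         # True: the value is intact
```

With small values the upcast is harmless, which is part of why this kind of bug tends to go unnoticed until large identifiers or counters show up in the data.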

To address these shortcomings, the Pandas community developed an extension type system, which provides an interface for handling data types that are not supported natively in NumPy, thus extending NumPy’s type system. For example, let’s consider a Series of integer values with missing data:

>>> import pandas as pd
>>> s1 = pd.Series([1, 2, 3, 4, None])
>>> s1
0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
dtype: float64

Notice how, even though we created a series of integer values, Series automatically converts them to float and uses np.nan (np is short for NumPy) to represent a missing value.

Using one of Pandas's extension types, let's recreate this Series, specifying that the values be stored as integers instead of floats:

>>> s2 = pd.Series([1, 2, 3, 4, None], dtype=pd.Int64Dtype())
>>> s2
0       1
1       2
2       3
3       4
4    <NA>
dtype: Int64
>>> s2.dtype
Int64Dtype()

The Series now holds integer values, just as we specified, and its dtype is Int64. It also uses Pandas's special sentinel value, pd.NA, instead of np.nan to mark a missing value in an extension type.
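To see how the missing value behaves in practice, here is a short sketch that rebuilds the same Series and inspects it. It only uses standard Series methods (isna, sum) and the pd.NA singleton.

```python
import pandas as pd

s2 = pd.Series([1, 2, 3, 4, None], dtype=pd.Int64Dtype())

# The missing entry is the pd.NA singleton, not np.nan.
print(s2[4] is pd.NA)    # True

# Missing values are still detected by the usual helpers.
print(s2.isna().sum())   # 1

# Reductions skip missing values by default.
print(s2.sum())          # 10

# Arithmetic keeps the nullable integer dtype instead of upcasting to float.
print((s2 + 1).dtype)    # Int64
```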

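One strength of Pandas's extension type system is its flexibility. Instead of pd.Int64Dtype(), we could pass the shorter string alias "Int64" to specify the data type. The capital "I" matters: lowercase "int64" refers to the ordinary NumPy integer type, which cannot hold missing values.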

>>> # Keep the capital "I" and the quotation marks.
>>> s2 = pd.Series([1, 2, 3, 4, None], dtype="Int64")
>>> s2.dtype
Int64Dtype()
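Because the two aliases differ only in capitalization, it can help to check what each one resolves to. The sketch below uses pd.api.types.pandas_dtype to turn each string into a dtype object; the names lower and upper are illustrative only.

```python
import numpy as np
import pandas as pd

lower = pd.api.types.pandas_dtype("int64")  # plain NumPy dtype
upper = pd.api.types.pandas_dtype("Int64")  # Pandas extension dtype

print(repr(lower))  # dtype('int64')
print(repr(upper))  # Int64Dtype()

# Only the capitalized alias is an extension type.
print(isinstance(lower, np.dtype))                          # True
print(isinstance(upper, pd.api.extensions.ExtensionDtype))  # True
```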

You can also pass extension types to the Series astype method. This will allow you to specify a different extension type for each series in a DataFrame.

For example, consider this DataFrame containing a mix of integer, string, Boolean, and None values:

>>> df = pd.DataFrame({"A": [1, 2, 3, None], "B": ['one', 'two', 'three', None], "C": [False, True, None, False]})
>>> df
     A      B      C
0  1.0    one  False
1  2.0    two   True
2  3.0  three   None
3  NaN   None  False

First, let's check how Pandas defines each dtype in this DataFrame:

>>> df["A"].dtype
dtype('float64') # Column A was upcast to float because of the missing value.
>>> df["B"].dtype
dtype('O') # The 'O' means 'object.'
>>> df["C"].dtype
dtype('O')

Pandas's extension types let you specify a dtype for each column in the DataFrame, so you can convert easily from one dtype to another:

>>> df["A"] = df["A"].astype("Int64")
>>> df["B"] = df["B"].astype("string")
>>> df["C"] = df["C"].astype("boolean")
>>> df
      A      B      C
0     1    one  False
1     2    two   True
2     3  three   <NA>
3  <NA>   <NA>  False
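The column-by-column conversion above can also be written as a single call, since DataFrame.astype accepts a dictionary mapping column names to dtypes. This is a sketch of that alternative on the same sample data; the name converted is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, None],
    "B": ["one", "two", "three", None],
    "C": [False, True, None, False],
})

# Map each column name to the extension type it should use.
converted = df.astype({"A": "Int64", "B": "string", "C": "boolean"})

print(converted.dtypes)
print(converted)
```

astype returns a new DataFrame here, so df itself keeps its original dtypes until you reassign it.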

Here's a list of the commonly used extension types in Pandas:

Extension Type     Description
BooleanDtype       Boolean data. Use "boolean" as the argument.
Int64Dtype         64-bit integer values. Use "Int64" as the argument.
Float64Dtype       Floating-point data. Use "Float64" as the argument.
DatetimeTZDtype    Datetime data with time zone information.
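As a quick check of the types listed above, the sketch below builds one small Series per type. The sample values and variable names are made up for illustration, and the time-zone-aware Series is created by localizing timestamps to UTC rather than by passing a dtype string.

```python
import pandas as pd

flags = pd.Series([True, None], dtype="boolean")
counts = pd.Series([1, None], dtype="Int64")
ratios = pd.Series([0.5, None], dtype="Float64")

# Localizing to a time zone produces a DatetimeTZDtype column.
stamps = pd.Series(pd.to_datetime(["2024-01-01 12:00"]).tz_localize("UTC"))

for name, series in [("flags", flags), ("counts", counts),
                     ("ratios", ratios), ("stamps", stamps)]:
    print(name, type(series.dtype).__name__)
```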

The extension type system in Pandas is relatively new and a little advanced for beginners. This article gives only a light introduction to it. To read about more advanced techniques, such as defining your own Pandas extension type, browse through the documentation or read this Medium article.