Pandas: How to Check the dtype of All Columns in a DataFrame
You can use the following methods to check the data type (dtype) of columns in a pandas DataFrame:
Method 1: Check the dtype of one column
Method 2: Check the dtype of all columns
Method 3: Check which columns have a specific dtype
The following examples show how to use each method with the following pandas DataFrame:
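The original DataFrame listing is not available here, so the following is only a minimal sketch of a frame with the four columns the examples refer to (team, points, assists, all_star). The values are made up for illustration.

```python
import pandas as pd

# Illustrative data: the column names follow the examples,
# the values themselves are assumptions.
df = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "points": [18, 22, 19, 14],
    "assists": [5, 7, 7, 9],
    "all_star": [True, False, False, True],
})

print(df)
```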
Example 1: Check the dtype of one column
We can use the following syntax to check the data type of only the points column in the DataFrame:
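A minimal sketch of the single-column check, assuming a small made-up frame with a points column:

```python
import pandas as pd

# Illustrative data (values are assumptions).
df = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "points": [18, 22, 19, 14],
})

# The dtype attribute of a single column (a Series) holds its data type.
points_dtype = df.points.dtype
print(points_dtype)
```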
From the output we can see that the points column has an integer data type.
Example 2: Check the dtype of all columns
We can use the following syntax to check the data type of all columns in the DataFrame:
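A minimal sketch of the all-columns check, again on a small made-up frame:

```python
import pandas as pd

# Illustrative data (values are assumptions).
df = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "points": [18, 22, 19, 14],
    "assists": [5, 7, 7, 9],
    "all_star": [True, False, False, True],
})

# df.dtypes returns a Series mapping each column name to its dtype.
col_types = df.dtypes
print(col_types)
```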
From the output we can see:
- team column: object (this is how pandas stores strings)
- points column: integer
- assists column: integer
- all_star column: boolean
Using this one line of code, we can see the data type of every column in the DataFrame.
Example 3: Check which columns have a specific dtype
We can use the following syntax to check which columns in the DataFrame have the int64 data type:
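One way to do this is to filter the df.dtypes Series with a boolean mask. A minimal sketch on made-up data:

```python
import pandas as pd

# Illustrative data (values are assumptions).
df = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "points": [18, 22, 19, 14],
    "assists": [5, 7, 7, 9],
    "all_star": [True, False, False, True],
})

# Keep only the entries of df.dtypes whose dtype equals int64.
int_cols = df.dtypes[df.dtypes == "int64"]
print(int_cols)
```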
From the output we can see that the points and assists columns have the int64 data type.
We can use similar syntax to check which columns have other data types.
For example, we can use the following syntax to check which columns in the DataFrame have the object data type:
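The same mask works for the object dtype, abbreviated "O". In this sketch the team column is given an explicit object dtype, which is an assumption made so the example behaves the same on pandas versions that may infer a dedicated string dtype for text.

```python
import pandas as pd

# Illustrative data (values are assumptions). dtype=object is set explicitly
# for the text column so the result does not depend on string inference.
df = pd.DataFrame({
    "team": pd.Series(["A", "B", "C", "D"], dtype=object),
    "points": [18, 22, 19, 14],
    "assists": [5, 7, 7, 9],
    "all_star": [True, False, False, True],
})

# "O" is the short code for the object dtype.
object_cols = df.dtypes[df.dtypes == "O"]
print(object_cols)
```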
We can see that only the team column has the data type "O", which stands for object.
How to Check the Data Type in Pandas DataFrame?
A pandas DataFrame is a two-dimensional, size-mutable data structure for heterogeneous tabular data. Each column has its own data type. Two ways to check the data types are the pandas.DataFrame.dtypes attribute and the pandas.DataFrame.select_dtypes method.
Creating a Dataframe to Check DataType in Pandas DataFrame
Consider a dataset of a shopping store having data about Customer Serial Number, Customer Name, Product ID of the purchased item, Product Cost, and Date of Purchase.
How to Check the Data Type in Pandas DataFrame
You may use df.dtypes to check the data type of all columns in a pandas DataFrame.
Alternatively, you may use df['DataFrame Column'].dtypes to check the data type of a particular column in a pandas DataFrame.
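As a minimal sketch of both forms, assuming a small made-up frame with Products and Prices columns:

```python
import pandas as pd

# Illustrative data (values are assumptions).
df = pd.DataFrame({
    "Products": ["AAA", "BBB", "CCC"],
    "Prices": [200, 700, 400],
})

all_types = df.dtypes              # data type of every column
price_type = df["Prices"].dtypes   # data type of one particular column

print(all_types)
print(price_type)
```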
Steps to Check the Data Type in Pandas DataFrame
Step 1: Gather the Data for the DataFrame
To start, gather the data for your DataFrame.
For illustration purposes, let’s use the following data about products and prices:
Products | Prices |
AAA | 200 |
BBB | 700 |
CCC | 400 |
DDD | 1200 |
EEE | 900 |
The goal is to check the data type of the above columns across multiple scenarios.
Step 2: Create the DataFrame
Next, create the actual DataFrame based on the following syntax:
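A sketch of the creation step using the products and prices from the table above. The prices are quoted, so they are stored as strings rather than numbers.

```python
import pandas as pd

# Prices are deliberately quoted, so pandas stores them as text.
data = {
    "Products": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "Prices": ["200", "700", "400", "1200", "900"],
}

df = pd.DataFrame(data)
print(df)
```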
Once you run the code in Python, you’ll get this DataFrame:
Note that initially the values under the ‘Prices’ column were stored as strings by placing quotes around those values.
Step 3: Check the Data Type
You can now check the data type of all columns in the DataFrame by adding df.dtypes to the code:
Here is the complete Python code for our example:
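A sketch of the complete example, with the quoted prices and the df.dtypes check. Which dtype name is printed for text columns may depend on the pandas version, so no exact output is shown.

```python
import pandas as pd

data = {
    "Products": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "Prices": ["200", "700", "400", "1200", "900"],  # quoted: stored as text
}
df = pd.DataFrame(data)

# Check the data type of all columns.
print(df.dtypes)
```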
You’ll notice that the data type for both columns is ‘object’, which represents strings:
Let’s now remove the quotes for all the values under the ‘Prices’ column:
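The same frame with unquoted prices might look like this (a sketch; only the Prices values change):

```python
import pandas as pd

data = {
    "Products": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "Prices": [200, 700, 400, 1200, 900],  # unquoted: stored as integers
}
df = pd.DataFrame(data)

print(df.dtypes)
```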
After the removal of the quotes, the data type for the ‘Prices’ column would become integer:
Checking the Data Type of a Particular Column in Pandas DataFrame
Let’s now check the data type of a particular column (e.g., the ‘Prices’ column) in our DataFrame:
Here is the full syntax for our example:
The data type for the ‘Prices’ column would be integer:
But what if you want to convert the data type from integer to float?
You may then assign the result of df['DataFrame Column'].astype(float) back to the column to perform the conversion.
For instance, let’s convert the ‘Prices’ column from integer to float:
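A minimal sketch of the conversion using Series.astype, applied to the Prices column from the example:

```python
import pandas as pd

data = {
    "Products": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "Prices": [200, 700, 400, 1200, 900],
}
df = pd.DataFrame(data)

# astype returns a converted copy, so assign it back to the column.
df["Prices"] = df["Prices"].astype(float)

print(df["Prices"].dtypes)
```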
Once you run the code, you’ll notice that the data type for the ‘Prices’ column is now float:
How to check the dtype of a column in Python pandas
I need to use different functions to treat numeric columns and string columns. What I am doing now is really dumb:
Is there a more elegant way to do this? E.g.
6 Answers
You can access the data-type of a column with dtype :
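A minimal sketch, assuming a made-up frame with one float column and one text column:

```python
import numpy as np
import pandas as pd

# Illustrative data (names and values are assumptions).
df = pd.DataFrame({"x": [1.5, 2.5], "name": ["a", "b"]})

x_dtype = df["x"].dtype
is_float = x_dtype == np.float64  # compare against a concrete numpy type

print(x_dtype, is_float)
```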
In pandas 0.20.2 you can do:
So your code becomes:
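The answer's code is missing here. What follows is a sketch of how such a dispatch could look with is_numeric_dtype and is_string_dtype from pandas.api.types. The helper names treat_numeric and treat_str and the sample data are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_string_dtype


def treat_numeric(col):
    """Hypothetical handler for numeric columns."""
    return col.mean()


def treat_str(col):
    """Hypothetical handler for string columns."""
    return col.str.upper().tolist()


# Illustrative data (assumption).
df = pd.DataFrame({"price": [1.0, 3.0], "label": ["a", "b"]})

results = {}
for name in df.columns:
    if is_numeric_dtype(df[name]):
        results[name] = treat_numeric(df[name])
    elif is_string_dtype(df[name]):
        results[name] = treat_str(df[name])

print(results)
```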
I know this is a bit of an old thread, but with pandas 0.19.2 you can do:
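The snippet itself is not preserved. One sketch in this spirit selects columns by dtype with select_dtypes and then applies a function to them. The doubling lambda and the data are assumptions.

```python
import pandas as pd

# Illustrative data (assumption).
df = pd.DataFrame({"a": [1.0, 2.0], "b": [1, 2], "c": ["x", "y"]})

# Select only the float64 columns, then apply a function column by column.
float_cols = df.select_dtypes(include=["float64"])
doubled = float_cols.apply(lambda col: col * 2)

print(doubled)
```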
The title of the question is general, but the author's use case stated in the body of the question is specific, so any of the other answers may be used.
But to fully answer the title question, it should be clarified that all of the approaches may fail in some cases and require some rework. I reviewed all of them (and some additional ones), ordered from least to most reliable (in my opinion):
1. Comparing types directly via == (accepted answer).
Despite the fact that this is the accepted answer and has the most upvotes, I think this method should not be used at all, because this approach is discouraged in Python, as mentioned several times here.
But if one still wants to use it, one should be aware of some pandas-specific dtypes like pd.CategoricalDtype, pd.PeriodDtype, or pd.IntervalDtype. Here one has to use an extra type() in order to recognize the dtype correctly:
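A sketch of the type() workaround for a categorical column. The data is made up, and the isinstance and string-name checks are shown only for comparison.

```python
import pandas as pd

# Illustrative categorical column (assumption).
s = pd.Series(["a", "b", "a"], dtype="category")

# The dtype is an instance of pd.CategoricalDtype, so compare its type
# against the class rather than comparing the instance itself.
by_type = type(s.dtype) == pd.CategoricalDtype
by_isinstance = isinstance(s.dtype, pd.CategoricalDtype)
by_name = s.dtype == "category"  # comparison by dtype name

print(by_type, by_isinstance, by_name)
```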
Another caveat here is that the type must be specified precisely:
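For instance, an int64 column matches np.int64 but not np.int32. A minimal sketch, with the dtype set explicitly so the result does not depend on the platform:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2], dtype="int64")

matches_int64 = s.dtype == np.int64  # exact width matches
matches_int32 = s.dtype == np.int32  # different width does not match

print(matches_int64, matches_int32)
```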
2. isinstance() approach.
This method has not been mentioned in the answers so far.
So if comparing types directly is not a good idea, let's try the built-in Python function for this purpose, namely isinstance().
It fails right at the start, because it assumes that we have some objects, but a pd.Series or pd.DataFrame may be used as just an empty container with a predefined dtype and no objects in it:
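A sketch of the empty-container case: the Series has a dtype, but there is no element to pass to isinstance().

```python
import pandas as pd

# An empty Series still carries a dtype.
empty = pd.Series(dtype="bool")
has_elements = len(empty) > 0

# There is no first element to inspect.
try:
    first = empty.iloc[0]
except IndexError:
    first = None

print(empty.dtype, has_elements, first)
```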
But even if one somehow overcomes this issue and checks the type of an individual object, for example the one in the first row, the result will be misleading when a single column holds mixed types of data:
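A sketch of the mixed-type pitfall, using made-up values: the first element is an int, yet the column as a whole has the object dtype.

```python
import pandas as pd

# One column holding an int, a string, and a float (illustrative).
mixed = pd.Series([1, "two", 3.0])

# Inspecting only the first element suggests an integer column...
first_is_int = isinstance(mixed.iloc[0], int)

# ...but the column's dtype is object.
print(first_is_int, mixed.dtype)
```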
And last but not least, this method cannot directly recognize the Category dtype. As stated in the docs:
Returning a single item from categorical data will also return the value, not a categorical of length “1”.
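A minimal sketch of that behavior on a made-up categorical column: a single item comes back as a plain value, so isinstance() on it reveals nothing about the Category dtype.

```python
import pandas as pd

# Illustrative categorical column (assumption).
cat = pd.Series(["low", "high", "low"], dtype="category")

# A single item is the underlying value, not a categorical.
item = cat.iloc[0]

print(type(item), cat.dtype)
```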
So this method is also almost inapplicable.
3. df.dtype.kind approach.
This method may still work with an empty pd.Series or pd.DataFrame but has other problems.
First, it is unable to distinguish some dtypes:
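A sketch of the ambiguity, assuming a made-up frame with a period column, an object column of strings, and a categorical column: all three report the same kind code.

```python
import pandas as pd

# Illustrative frame; column names and values are assumptions.
frame = pd.DataFrame({
    "prd": pd.Series(pd.period_range("2002-03-01", periods=2, freq="D")),
    "str": pd.Series(["s1", "s2"], dtype=object),
    "cat": pd.Series([1, -1], dtype="category"),
})

# dtype.kind is a one-character code; it is read per column, not on the frame.
kinds = {col: frame[col].dtype.kind for col in frame.columns}
print(kinds)
```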
Second, and this is still unclear to me, it even returns None for some dtypes.
4. df.select_dtypes approach.
This is almost what we want. This method is designed inside pandas, so it handles most of the corner cases mentioned earlier: empty DataFrames, and distinguishing numpy from pandas-specific dtypes. It works well with a single dtype, as in .select_dtypes('bool'). It may even be used for selecting groups of columns based on dtype,
as stated in the docs:
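A sketch of group selection with select_dtypes('number') on a made-up frame. Per the discussion in this answer, the timedelta column is expected to be part of the numeric selection while the bool column is not. Only the bool, int, and float behavior is relied on here.

```python
import pandas as pd

# Illustrative frame; column names and values are assumptions.
frame = pd.DataFrame({
    "int64": [-1, 2],
    "float": [-2.5, 3.4],
    "bool": [False, True],
    "td": pd.to_timedelta([1, 2], unit="D"),
    "str": ["s1", "s2"],
})

numeric = frame.select_dtypes("number")   # group of numeric columns
bools = frame.select_dtypes(["bool"])     # bool has to be requested separately

print(list(numeric.columns))
print(list(bools.columns))
```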
One may think that here we see the first unexpected results (they used to be for me: question): TimeDelta is included in the output DataFrame. But as answered, on the contrary, it should be so; one just has to be aware of it. Note that the bool dtype is skipped, which may also be undesired for someone, but this is because bool and number are in different "subtrees" of numpy dtypes. For bool, we may use test.select_dtypes(['bool']) here.
The next restriction of this method is that, for the current version of pandas (0.24.2), the code test.select_dtypes('period') will raise NotImplementedError.
Another thing is that it is unable to distinguish strings from other objects:
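A sketch of the limitation and of one possible post-processing step. The frame is made up, both columns are given an explicit object dtype, and the per-value isinstance check is an assumption about what such post-processing could look like.

```python
import pandas as pd

# Illustrative frame: one object column of strings, one of arbitrary objects.
frame = pd.DataFrame({
    "str": pd.Series(["s1", "s2"], dtype=object),
    "obj": pd.Series([[1, 2], {"k": 1}], dtype=object),
})

# Both columns are selected: select_dtypes cannot tell them apart.
object_cols = frame.select_dtypes("object")

# Possible post-processing: keep columns whose values are all str.
really_str = [
    col for col in object_cols
    if object_cols[col].map(lambda v: isinstance(v, str)).all()
]

print(list(object_cols.columns), really_str)
```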
But first, this is already mentioned in the docs. And second, it is not a problem of this method but rather of the way strings are stored in a DataFrame. Either way, this case requires some post-processing.
5. pd.api.types.is_XXX_dtype approach.
This one is intended to be the most robust and native way to achieve dtype recognition, I suppose (the path of the module where the functions reside speaks for itself). It works almost perfectly, but still has at least one caveat, and string columns still have to be distinguished somehow.
Besides, and this may be subjective, this approach also processes the number dtypes group in a more 'human-understandable' way compared with .select_dtypes('number'):
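A sketch of the per-column check with pd.api.types.is_numeric_dtype on a made-up frame. A bool column is deliberately left out, since how it is classified is worth verifying against your pandas version.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Illustrative frame; column names and values are assumptions.
frame = pd.DataFrame({
    "int64": [-1, 2],
    "float": [-2.5, 3.4],
    "td": pd.to_timedelta([1, 2], unit="D"),
    "str": ["s1", "s2"],
})

# Collect the columns that is_numeric_dtype reports as numeric.
numeric_cols = [col for col in frame.columns if is_numeric_dtype(frame[col])]
print(numeric_cols)
```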
No timedelta and bool is included. Perfect.
My pipeline currently relies on exactly this functionality, plus a bit of manual post-processing.