如何在Pandas中以SQL查询样式选择数据子集？

介绍

在本文中，我将向您展示如何使用Pandas的SQL样式过滤来执行数据分析。公司的大多数数据都存储在需要SQL来检索和操作它的数据库中。例如，有像Oracle，IBM，Microsoft这样的公司都有自己的数据库和自己的SQL实现。

数据科学家必须在其职业生涯的某些阶段处理SQL，因为数据并不总是存储在CSV文件中。我个人更喜欢使用Oracle，因为我公司的大部分数据都存储在Oracle中。

场景– 1假设我们有一项任务，要在以下条件下从电影数据集中查找所有电影。

电影的语言应为英语或西班牙语。
电影的受欢迎程度必须在500到1000之间。
电影的状态必须被释放。
投票数必须大于5000。对于上述情况，SQL语句如下所示。

SELECT
   title AS movie_title
   ,original_language AS movie_language
   ,popularityAS movie_popularity
   ,statusAS movie_status
   ,vote_count AS movie_vote_count movies_data
FROM
   movies_data
WHERE
   original_languageIN ('en', 'es')
AND status=('Released')
AND popularitybetween 500 AND 1000
AND vote_count > 5000;

现在您已经了解了需求的SQL，让我们逐步使用pandas进行此操作。我将向您展示两种方法。

方法1：布尔索引

将movies_data数据集加载到DataFrame。

import pandas as pd movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv" )

为每个条件分配一个变量。

languages = ["en","es"]
condition_on_languages = movies.original_language.isin(languages)
condition_on_status = movies.status == "Released"
condition_on_popularity = movies.popularity.between(500, 1000) c
ondition_on_votecount = movies.vote_count > 5000

将所有条件（布尔数组）组合在一起。

final_conditions = (condition_on_languages & condition_on_status & condition_on_popularity & conditi on_on_votecount )
columns = [ "title" , "original_language" , "status" , "popularity" , "vote_count" ]
# clubbing all together movies.loc[final_conditions, columns]

标题	original_language	状态	人气	投票数
95星际	恩	已发行	724.247784	10867
788死侍	恩	已发行	514.569956	10995

方法2：-.query()方法。

.query()方法是一种SQL where子句样式的数据过滤方法。可以将条件作为字符串传递给此方法，但是，列名不能包含任何空格。

如果列名中有空格，请使用python replace函数将其替换为下划线。

根据我的经验，我发现将query()方法应用于较大的DataFrame时比以前的方法要快。

import pandas as pd movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv" )

构建查询字符串并执行该方法。

请注意，.query方法不适用于跨越多行的三引号引起来的字符串。

final_conditions = (
   "original_language in ['en','es']"
   "and status == 'Released' "
   "and popularity > 500 "
   "and popularity < 1000" "and vote_count > 5000"
)
final_result = movies.query(final_conditions)
final_result

	预算	ID	original_language	original_title	人气	发布日期	收入	运行	ST
95	165000000	157336	恩	星际	724.247784	5/11/2014	675120017	169.0	瑞尔
788	58000000	293660	恩	死侍	514.569956	9/02/2016	783112979	108.0	瑞尔

在我的编码中，通常还有更多值要检查“ in”子句中的值。因此，上面的语法不太适合使用。可以使用at符号（@）引用Python变量。

您还可以以编程方式将值创建为python列表，并将其与（@）一起使用。

movie_languages = ['en','es']
final_conditions = (
   "original_language in @movie_languages "
   "and status == 'Released' " "and popularity > 500 "
   "and popularity < 1000" "and vote_count > 5000" 
) 
final_result = movies.query(final_conditions) 
final_result

	预算	ID	original_language	original_title	人气	发布日期	收入	运行	ST
95	165000000	157336	恩	星际	724.247784	5/11/2014	675120017	169.0	瑞尔
788	58000000	293660	恩	死侍	514.569956	9/02/2016	783112979	108.0	瑞尔

基础教程