Exclusive | A Step-by-Step Guide to Creating Beautiful and Insightful Charts with Python (with Tutorial Code)



Published: 2019-05-26

This page contains 11,077 characters in total; the estimated reading time is 36 minutes.


Author: Fabian Bosler

Translator: 车前子

Proofreader: 吴振东

The article text is about 4,800 characters long; suggested reading time is 15 minutes.

This article shows how to create charts with Python and use them to analyze results visually.


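In last week's article, "Extracting Data from Different Sheets with Python", we learned how to retrieve and unify data from different sources (Google Sheets, CSV, and Excel). This tutorial is independent of that one, so you don't need to worry if you missed it.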


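In today's tutorial, you will learn:

How to preprocess and merge data,

How to explore and analyze data,

How to build good-looking charts to visualize the results.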

This tutorial is for people who:

Work with data on a regular basis,

Have a basic understanding of Python and Pandas.

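Scenario:

Your task is to find ways to improve your sales team's performance. In our hypothetical scenario, potential customers have rather spontaneous demand. When a customer voices a need, your sales team enters an order lead into the system. Your sales rep then arranges a meeting, which takes place shortly before or after the order lead is noticed. Your sales reps have an expense budget that covers the meeting and the accompanying meal. The reps pay these expenses and hand the invoices to the accounting team for processing. Once the potential customer has decided whether to accept your offer, your diligent sales reps record whether the order lead converted into a sale.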

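You can use the following three datasets for the analysis:

order_leads (contains all order leads and their conversion information)

sales_team (contains the companies and the sales reps responsible for them)

invoices (provides information on the invoices and the participants)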

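Loading packages and settings:

import json
import pandas as pd
import numpy as np
# Jupyter magic: render figures inline in the Notebook
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Global plot defaults: larger fonts, white grid, wide figures
sns.set(
    font_scale=1.5,
    style="whitegrid",
    rc={'figure.figsize': (20, 7)}
)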

These are all fairly standard libraries. You may need to run the following command to install seaborn from within your Notebook.

!pip install seaborn

Loading the data:

You can download and merge the sample data from last week's article, or download the files from the repository below and load them into your Notebook.

https://github.com/FBosler/Medium-Data-Exploration

sales_team = pd.read_csv('sales_team.csv')
order_leads = pd.read_csv('order_leads.csv')
invoices = pd.read_csv('invoices.csv')

Figure: First two rows of the sales_team dataset

Figure: First two rows of the order_leads dataset

Figure: First two rows of the invoices dataset

…

Exploring the data:

Development of the overall conversion rate:

Figure: Conversion rate over time

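The conversion rate appears to have declined since early 2017. After checking with the Chief Sales Officer, it turns out that a competitor entered the market around that time. Good to know, but there is nothing we can do about it now.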

# Daily conversion rate: index by date, group by calendar day, average the 0/1 flag
_ = order_leads.set_index(pd.DatetimeIndex(order_leads.Date)).groupby(
    pd.Grouper(freq='D')
)['Converted'].mean()

# Smooth with a 60-day rolling mean before plotting
ax = _.rolling(60).mean().plot(figsize=(20,7), title='Conversion Rate Over Time')

# Show the y-axis tick labels as percentages
vals = ax.get_yticks()
ax.set_yticklabels(['{:,.0f}%'.format(x*100) for x in vals])
sns.despine()

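1. We use the underscore "_" as a temporary variable. I usually do this for throwaway variables that will not be used again.

2. We apply pd.DatetimeIndex to order_leads.Date and set the result as the index.

3. We group the data by day with pd.Grouper(freq='D'). Alternatively, you can change the frequency to W, M, Q, or Y (week, month, quarter, or year).

4. We compute the mean of "Converted" for each day, which is that day's conversion rate.

5. We use .rolling(60) and .mean() to get the 60-day rolling average.

6. Finally, we format the yticklabels so that they show a percent sign.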
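The grouping and smoothing steps above can be tried without the real dataset. The following is a minimal sketch on made-up data: the frame `toy_leads`, its values, and the 3-day window (instead of 60) are assumptions chosen only to keep the example small.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two leads per day over ten days, one converted and one not.
dates = pd.date_range("2017-01-01", periods=10, freq="D").repeat(2)
toy_leads = pd.DataFrame({
    "Date": dates.strftime("%Y-%m-%d"),   # stored as strings, as after reading a CSV
    "Converted": [1, 0] * 10,
})

# Datetime index, one group per calendar day, mean of the 0/1 flag = daily conversion rate.
daily_rate = (
    toy_leads
    .set_index(pd.DatetimeIndex(toy_leads["Date"]))
    .groupby(pd.Grouper(freq="D"))["Converted"]
    .mean()
)

# Rolling mean over 3 days; the first two days have no full window and come out as NaN.
smoothed = daily_rate.rolling(3).mean()

print(daily_rate.head(3))
print(smoothed.head(4))
```

The leading NaN values explain why a 60-day rolling curve only starts about two months after the first date in the data.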

Conversion rate by sales rep:


Conversion rates seem to vary considerably between sales reps, so let's look into this a bit further.

# Attach the responsible sales rep to every order lead
orders_with_sales_team = pd.merge(order_leads, sales_team, on=['Company Id','Company Name'])
# Histogram of per-rep conversion rates
# (sns.distplot is deprecated in newer seaborn releases; sns.histplot is its replacement)
ax = sns.distplot(orders_with_sales_team.groupby('Sales Rep Id')['Converted'].mean(), kde=False)
vals = ax.get_xticks()
ax.set_xticklabels(['{:,.0f}%'.format(x*100) for x in vals])
ax.set_title('Number of sales reps by conversion rate')
sns.despine()

There is not much new here in terms of the functions used. Note, however, how sns.distplot draws the data onto the axes.

Recall from the sales team data that not all sales reps have the same number of accounts, which is bound to affect the results. Let's check.

Figure: Conversion rate by number of assigned accounts

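We can see that the conversion rate seems to be inversely related to the number of accounts assigned to a sales rep. The lower conversion rates make sense: the more accounts a rep has, the less time they can spend on each one.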

def vertical_mean_line(x, **kwargs):
    # Draw a vertical line at the mean and annotate it with mean and standard deviation
    ls = {"0":"-","1":"--"}
    plt.axvline(x.mean(), linestyle=ls[kwargs.get("label","0")],
                color=kwargs.get("color", "r"))
    txkw = dict(size=15, color=kwargs.get("color", "r"))
    tx = "mean: {:.1f}%\n(std: {:.1f}%)".format(x.mean()*100, x.std()*100)
    label_x_pos_adjustment = 0.015
    label_y_pos_adjustment = 20
    plt.text(x.mean() + label_x_pos_adjustment, label_y_pos_adjustment, tx, **txkw)

sns.set(
    font_scale=1.5,
    style="whitegrid"
)

# Per sales rep: conversion rate and number of distinct accounts
_ = orders_with_sales_team.groupby('Sales Rep Id').agg({
    'Converted': 'mean',
    'Company Id': 'nunique'
})
_.columns = ['conversion rate','number of accounts']

# One density plot per number of accounts
g = sns.FacetGrid(_, col="number of accounts", height=4, aspect=0.9, col_wrap=5)
# fill=True requires seaborn >= 0.11; older releases call this argument shade=True
g.map(sns.kdeplot, "conversion rate", fill=True)
g.set(xlim=(0, 0.35))
g.map(vertical_mean_line, "conversion rate")

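Here we first create a function that draws a vertical line in each subplot and annotates it with the mean and standard deviation of the data. We then set some seaborn plotting defaults, such as a larger font via font_scale and whitegrid as the style.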
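The per-rep aggregation that feeds the facet grid can be checked on a handful of rows. This is a sketch on invented data: the frame `toy` and its values are hypothetical and only reuse the column names from the tutorial.

```python
import pandas as pd

# Toy stand-in for orders_with_sales_team (all values are made up).
toy = pd.DataFrame({
    "Sales Rep Id": ["A", "A", "A", "A", "B", "B"],
    "Company Id":   ["c1", "c1", "c2", "c3", "c4", "c4"],
    "Converted":    [1, 0, 0, 0, 1, 1],
})

per_rep = toy.groupby("Sales Rep Id").agg({
    "Converted": "mean",       # share of converted leads per rep
    "Company Id": "nunique",   # number of distinct accounts per rep
})
per_rep.columns = ["conversion rate", "number of accounts"]

print(per_rep)
```

In this toy frame, rep A handles three accounts and converts a quarter of the leads, while rep B handles one account and converts all of them.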

Effect of meals:

Figure: Sample of the meal data

It looks like we have the date and time of each meal, so let's take a quick look at the distribution of the times:

invoices['Date of Meal'] = pd.to_datetime(invoices['Date of Meal'])
invoices['Date of Meal'].dt.time.value_counts().sort_index()

out:
07:00:00    5536
08:00:00    5613
09:00:00    5473
12:00:00    5614
13:00:00    5412
14:00:00    5633
20:00:00    5528
21:00:00    5534
22:00:00    5647

It looks like we can summarize these into meal types:

invoices['Type of Meal'] = pd.cut(
    invoices['Date of Meal'].dt.hour,
    bins=[0,10,15,24],
    labels=['breakfast','lunch','dinner']
)

Note how pd.cut is used to bin a continuous variable. The reasoning is that it probably doesn't matter whether breakfast starts at 8 or at 9.
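To see what the binning does, here is a small sketch that applies the same bins and labels to the nine hours that occur in the output above. The series `hours` is built by hand for illustration.

```python
import pandas as pd

# The hours that appear in the meal data, typed in by hand.
hours = pd.Series([7, 8, 9, 12, 13, 14, 20, 21, 22])

# Intervals are right-inclusive by default: (0, 10], (10, 15], (15, 24].
meal_type = pd.cut(hours, bins=[0, 10, 15, 24], labels=["breakfast", "lunch", "dinner"])

print(meal_type.tolist())
```

Because the first interval is open on the left, an hour of exactly 0 would fall outside all bins and become NaN. That is not a problem for this data, but it is worth knowing when reusing the bins elsewhere.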

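Also note the use of .dt.hour. We can only do this because we converted invoices['Date of Meal'] to datetime. .dt is a so-called accessor; there are three of them: cat, str, and dt. If your data has the right type, you can use these accessors and their methods to operate on it directly, which is both computationally efficient and concise.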
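As a quick illustration of the three accessors, the sketch below uses one column of each matching type. The frame `toy` and its column names are invented for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "when": pd.to_datetime(["2018-01-05 08:00:00", "2018-01-06 21:00:00"]),
    "name": ["  alice ", "BOB"],
    "meal": pd.Categorical(["breakfast", "dinner"]),
})

hours = toy["when"].dt.hour                   # .dt works on datetime columns
clean = toy["name"].str.strip().str.title()   # .str works on string columns
codes = toy["meal"].cat.codes                 # .cat works on categorical columns

print(hours.tolist(), clean.tolist(), codes.tolist())
```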

Unfortunately, we first have to convert the string in invoices['Participants'] into valid JSON so that we can extract the number of participants.

def replace(x):
    # Turn the raw participants string into a valid JSON list:
    # line breaks and spaces between names become commas, single quotes become double quotes
    return x.replace("\n ",",").replace("' '","','").replace("'",'"')

invoices['Participants'] = invoices['Participants'].apply(replace)
invoices['Number Participants'] = invoices['Participants'].apply(lambda x: len(json.loads(x)))
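To make the three replacements easier to follow, here is a sketch on a single string. The sample value `raw` is an assumption about what the column looks like (names in single quotes, separated by spaces or by a line break plus a space), and the helper name `to_json_list` is hypothetical.

```python
import json

def to_json_list(raw):
    # Same three replacements as in the tutorial code above.
    return raw.replace("\n ", ",").replace("' '", "','").replace("'", '"')

# Assumed raw format, made up for this example.
raw = "['Anna Smith' 'Bob Lee'\n 'Carl Roe']"

as_json = to_json_list(raw)
participants = json.loads(as_json)

print(as_json)
print(len(participants))
```

This approach depends on names containing no quote characters of their own. A name with an apostrophe would yield invalid JSON, so the conversion is best checked on the actual data.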

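Now let's merge the data. We first left-join all invoices data to the order_leads data on 'Company Id'. The merge, however, matches every meal of a company to every one of its orders, so meals from long ago are matched to recent orders as well. To limit this, we compute the time difference between meal and order and only consider meals that took place fewer than five days before or after the order.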

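Some orders are still matched to more than one meal. This can happen when there are two orders and two meals around the same time: both order leads then match both meals. To remove these duplicates, we keep only the meal closest in time to each order.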
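The matching logic described here can be traced on a tiny example. The frames `orders` and `meals` below are made up (one company, two orders, three meals) and only reuse the tutorial's column names.

```python
import pandas as pd

orders = pd.DataFrame({
    "Order Id": ["o1", "o2"],
    "Company Id": ["c1", "c1"],
    "Date": pd.to_datetime(["2018-03-10", "2018-03-12"]),
})
meals = pd.DataFrame({
    "Company Id": ["c1", "c1", "c1"],
    "Date of Meal": pd.to_datetime(["2018-03-09", "2018-03-13", "2018-01-01"]),
})

# Left join on the company: every order is paired with every meal (2 x 3 = 6 rows).
merged = pd.merge(orders, meals, how="left", on="Company Id")
merged["Days of meal before order"] = (merged["Date"] - merged["Date of Meal"]).dt.days

# Keep meals fewer than 5 days away, then the closest meal per order.
near = merged[merged["Days of meal before order"].abs() < 5]
near = near.loc[near["Days of meal before order"].abs().sort_values().index]
closest = near.drop_duplicates(subset=["Order Id"])

print(closest[["Order Id", "Date of Meal", "Days of meal before order"]])
```

The meal from January drops out at the filter step. Of the remaining four pairs, each order keeps the meal that is one day away: one day before o1 and one day after o2.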

# combine order_leads with invoice data
orders_with_invoices = pd.merge(order_leads, invoices, how='left', on='Company Id')

# calculate days between order leads and invoices
orders_with_invoices['Days of meal before order'] = (
    pd.to_datetime(orders_with_invoices['Date']) - orders_with_invoices['Date of Meal']
).dt.days

# limit to only meals that are fewer than 5 days away from the order
orders_with_invoices = orders_with_invoices[abs(orders_with_invoices['Days of meal before order']) < 5]

# To make sure that we don't cross-assign meals to multiple orders and therefore create duplicates,
# we first sort our data by absolute distance to the orders
orders_with_invoices = orders_with_invoices.loc[
    abs(orders_with_invoices['Days of meal before order']).sort_values().index
]

# keep the first (i.e. closest to the sales event) meal per order
orders_with_invoices = orders_with_invoices.drop_duplicates(subset=['Order Id'])

# orders that have no meal within the window
orders_without_invoices = order_leads[~order_leads['Order Id'].isin(orders_with_invoices['Order Id'].unique())]

orders_with_meals = pd.concat([orders_with_invoices, orders_without_invoices], sort=True)

Figure: Part of the merged dataset

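I created a bar chart function that already includes some styling. Plotting through this function makes visual inspection much faster. We will use it right away.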

def plot_bars(data, x_col, y_col):
    data = data.reset_index()
    sns.set(
        font_scale=1.5,
        style="whitegrid",
        rc={'figure.figsize': (20, 7)}
    )
    g = sns.barplot(x=x_col, y=y_col, data=data, color='royalblue')

    # Write the value of each bar as a percentage just above it
    for p in g.patches:
        g.annotate(
            format(p.get_height(), '.2%'),
            (p.get_x() + p.get_width() / 2., p.get_height()),
            ha='center',
            va='center',
            xytext=(0, 10),
            textcoords='offset points'
        )

    # Show the y-axis tick labels as percentages
    vals = g.get_yticks()
    g.set_yticklabels(['{:,.0f}%'.format(x*100) for x in vals])

    sns.despine()
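The bar annotations and the tick labels rely on two slightly different format specifications. The sketch below shows both on a single made-up value, `height`.

```python
# A made-up bar height, i.e. a conversion rate expressed as a fraction.
height = 0.1234

# '.2%' multiplies by 100 itself and appends the percent sign.
bar_label = format(height, ".2%")

# '{:,.0f}%' only rounds, so the value has to be multiplied by 100 beforehand.
tick_label = "{:,.0f}%".format(height * 100)

print(bar_label, tick_label)
```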

Effect of the type of meal:


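# Orders without a matched meal get the label 'no meal'; casting to object first
# makes sure the new label is accepted even if the column is categorical
orders_with_meals['Type of Meal'] = orders_with_meals['Type of Meal'].astype(object).fillna('no meal')
_ = orders_with_meals.groupby('Type of Meal').agg({'Converted': 'mean'})
plot_bars(_, x_col='Type of Meal', y_col='Converted')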

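Wow! There is a marked difference in conversion rate between orders with and without an associated meal. Lunch, however, seems to convert slightly worse than dinner or breakfast.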

Effect of timing (i.e., whether the meal took place before or after the order lead):


_ = orders_with_meals.groupby(['Days of meal before order']).agg(
    {'Converted': 'mean'}
)
plot_bars(data=_, x_col='Days of meal before order', y_col='Converted')

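A negative value of 'Days of meal before order' means that the meal took place after the order lead came in. We can see that a meal before the order lead arrives seems to have a positive effect on the conversion rate. Prior knowledge of the order apparently gives our sales reps an edge.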

Combining it all:

We will now use a heatmap to show several dimensions of the data at once. For this, we first create a function.

def draw_heatmap(data, inner_row, inner_col, outer_row, outer_col, values):
    sns.set(font_scale=1)
    # Outer grid: one facet per combination of outer_row and outer_col
    fg = sns.FacetGrid(
        data,
        row=outer_row,
        col=outer_col,
        margin_titles=True
    )

    # One shared color bar to the right of the grid
    position = left, bottom, width, height = 1.4, .2, .1, .6
    cbar_ax = fg.fig.add_axes(position)

    fg.map_dataframe(
        draw_heatmap_facet,
        x_col=inner_col,
        y_col=inner_row,
        values=values,
        cbar_ax=cbar_ax,
        vmin=0,
        vmax=.4
    )

    fg.fig.subplots_adjust(right=1.3)
    plt.show()

def draw_heatmap_facet(*args, **kwargs):
    # Draw the inner heatmap of a single facet
    data = kwargs.pop('data')
    x_col = kwargs.pop('x_col')
    y_col = kwargs.pop('y_col')
    values = kwargs.pop('values')
    d = data.pivot(index=y_col, columns=x_col, values=values)
    annot = round(d,4).values
    cmap = sns.color_palette("RdYlGn",30)
    # cmap = sns.color_palette("PuBu",30) alternative color coding
    sns.heatmap(d, **kwargs, annot=annot, center=0, fmt=".1%", cmap=cmap, linewidths=.5)

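Next, we apply some data wrangling to explore how the meal price relates to the order value, and we bucket the meal timing into Before Order, Around Order, and After Order instead of the days from -4 to 4, which would be more cumbersome to interpret.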
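How pd.qcut turns the day offsets into three timing buckets can be seen in a small sketch. The series `days` below is an idealized assumption with each offset from -4 to 4 occurring exactly once, so the bucket edges in the real data may differ.

```python
import pandas as pd

# Idealized input: every offset from -4 to 4 appears exactly once.
days = pd.Series(range(-4, 5))

# qcut splits at the 1/3 and 2/3 quantiles; the lowest values get the first label.
timing = pd.qcut(days, 3, labels=["After Order", "Around Order", "Before Order"])

print(pd.DataFrame({"days": days, "timing": timing}))
```

Negative offsets (meal after the order lead) end up in After Order, which matches the sign convention of 'Days of meal before order'.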

# Aggregate the data a bit
orders_with_meals['Meal Price / Order Value'] = orders_with_meals['Meal Price']/orders_with_meals['Order Value']
# Five equally sized buckets; the ratio is negated and the labels are reversed
# so that the most expensive meals come first in the category order
orders_with_meals['Meal Price / Order Value'] = pd.qcut(
    orders_with_meals['Meal Price / Order Value']*-1,
    5,
    labels = ['Least Expensive','Less Expensive','Proportional','More Expensive','Most Expensive'][::-1]
)

# Three equally sized timing buckets
orders_with_meals['Timing of Meal'] = pd.qcut(
    orders_with_meals['Days of meal before order'],
    3,
    labels = ['After Order','Around Order','Before Order']
)

# Conversion rate per combination; missing combinations are filled with 0
data = orders_with_meals[orders_with_meals['Type of Meal'] != 'no meal'].groupby(
    ['Timing of Meal','Number Participants','Type of Meal','Meal Price / Order Value']
).agg({'Converted': 'mean'}).unstack().fillna(0).stack().reset_index()

Running the following snippet produces the multi-dimensional heatmap.

draw_heatmap(
    data=data,
    outer_row='Timing of Meal',
    outer_col='Type of Meal',
    inner_row='Meal Price / Order Value',
    inner_col='Number Participants',
    values='Converted'
)
