This article is about 11,077 characters; expected reading time is roughly 36 minutes.
Author: Fabian Bosler
Translator: 车前子
Proofreader: 吴振东
This article shows how to use Python to generate charts and turn the results into a visual analysis:
how to preprocess and merge the data,
how to explore and analyze it,
and how to produce attractive charts that visualize the results.
People who regularly do data-related work
and have a basic understanding of Python and Pandas.
order_leads (contains all order leads and conversion information)
sales_team (contains the companies and the sales reps responsible for them)
invoices (provides invoice and participant information)
```python
import json
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(
    font_scale=1.5,
    rc={'figure.figsize': (20, 7)}
)
```
```python
!pip install seaborn
```
https://github.com/FBosler/Medium-Data-Exploration
```python
sales_team = pd.read_csv('sales_team.csv')
order_leads = pd.read_csv('order_leads.csv')
invoices = pd.read_csv('invoices.csv')
```
The first two rows of the order_leads dataset
```python
_ = order_leads.set_index(pd.DatetimeIndex(order_leads.Date)).groupby(
    pd.Grouper(freq='D')
)['Converted'].mean()

ax = _.rolling(60).mean().plot(figsize=(20, 7), title='Conversion Rate Over Time')

vals = ax.get_yticks()
ax.set_yticklabels(['{:,.0f}%'.format(x*100) for x in vals])
sns.despine()
```
1. We use an underscore `_` as a throwaway variable name; I typically do this for one-off variables that won't be used again.
2. We apply pd.DatetimeIndex to order_leads.Date and set it as the index.
3. We group the data by day with pd.Grouper(freq='D'). Alternatively, you can change the frequency to W, M, Q, or Y (weekly, monthly, quarterly, or yearly).
4. We compute the daily mean of 'Converted', i.e. that day's conversion rate.
5. We use .rolling(60) with .mean() to get the 60-day moving average.
6. Finally, we format the yticklabels so that they show percent signs.
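The daily grouping and 60-day smoothing described above can be sketched on synthetic data (the column names mirror the article's order_leads; the dates and conversion values below are made up for illustration):

```python
import pandas as pd
import numpy as np

# Toy stand-in for order_leads: 120 days, 5 leads per day, random 0/1 conversions.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Date': pd.date_range('2020-01-01', periods=120, freq='D').repeat(5),
    'Converted': rng.integers(0, 2, 600),
})

# One conversion-rate value per day, exactly as in the article's snippet.
daily = toy.set_index(pd.DatetimeIndex(toy['Date'])).groupby(
    pd.Grouper(freq='D')
)['Converted'].mean()

# 60-day moving average; the first 59 entries are NaN until the window fills.
smoothed = daily.rolling(60).mean()

print(len(daily))              # 120 daily values
print(smoothed.notna().sum())  # 61 smoothed values (120 - 59 warm-up NaNs)
```

This is why the plotted curve only starts roughly two months into the data: `.rolling(60)` needs a full window before it emits its first value.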
```python
orders_with_sales_team = pd.merge(order_leads, sales_team, on=['Company Id', 'Company Name'])
ax = sns.distplot(orders_with_sales_team.groupby('Sales Rep Id')['Converted'].mean(), kde=False)
vals = ax.get_xticks()
ax.set_xticklabels(['{:,.0f}%'.format(x*100) for x in vals])
ax.set_title('Number of sales reps by conversion rate')
sns.despine()
```
```python
def vertical_mean_line(x, **kwargs):
    ls = {"0": "-", "1": "--"}
    plt.axvline(x.mean(), linestyle=ls[kwargs.get("label", "0")],
                color=kwargs.get("color", "r"))
    txkw = dict(size=15, color=kwargs.get("color", "r"))
    tx = "mean: {:.1f}%\n(std: {:.1f}%)".format(x.mean()*100, x.std()*100)
    label_x_pos_adjustment = 0.015
    label_y_pos_adjustment = 20
    plt.text(x.mean() + label_x_pos_adjustment, label_y_pos_adjustment, tx, **txkw)

sns.set(font_scale=1.5)

_ = orders_with_sales_team.groupby('Sales Rep Id').agg({
    'Converted': np.mean,
    'Company Id': pd.Series.nunique
})
_.columns = ['conversion rate', 'number of accounts']

g = sns.FacetGrid(_, col="number of accounts", height=4, aspect=0.9, col_wrap=5)
g.map(sns.kdeplot, "conversion rate", shade=True)
g.set(xlim=(0, 0.35))
g.map(vertical_mean_line, "conversion rate")
```
```python
invoices['Date of Meal'] = pd.to_datetime(invoices['Date of Meal'])
invoices['Date of Meal'].dt.time.value_counts().sort_index()
```
```python
invoices['Type of Meal'] = pd.cut(
    invoices['Date of Meal'].dt.hour,
    bins=[0, 10, 15, 24],
    labels=['breakfast', 'lunch', 'dinner']
)
```
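To make the binning concrete, here is a minimal sketch with hypothetical meal hours: pd.cut uses right-closed intervals by default, so the bins above mean (0, 10] → breakfast, (10, 15] → lunch, (15, 24] → dinner.

```python
import pandas as pd

# Hypothetical hours-of-day; the bins and labels mirror the article's pd.cut call.
hours = pd.Series([8, 9, 12, 13, 19, 21])
meal = pd.cut(hours, bins=[0, 10, 15, 24],
              labels=['breakfast', 'lunch', 'dinner'])
print(list(meal))
# ['breakfast', 'breakfast', 'lunch', 'lunch', 'dinner', 'dinner']
```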
```python
def replace(x):
    return x.replace("\n ", ",").replace("' '", "','").replace("'", '"')

invoices['Participants'] = invoices['Participants'].apply(lambda x: replace(x))
invoices['Number Participants'] = invoices['Participants'].apply(lambda x: len(json.loads(x)))
```
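The replace chain turns the raw Participants strings into valid JSON so they can be parsed with json.loads. A small sketch on hypothetical raw values (the names below are made up; the two formats shown are the ones the cleanup targets, names separated by a newline or by a single space):

```python
import json

def clean(x):
    # newline separator -> comma, space separator -> comma, single -> double quotes
    return x.replace("\n ", ",").replace("' '", "','").replace("'", '"')

raw_newline = "['Alice Smith'\n 'Bob Jones']"
raw_space = "['Alice Smith' 'Bob Jones']"

for raw in (raw_newline, raw_space):
    participants = json.loads(clean(raw))
    print(participants, len(participants))
    # ['Alice Smith', 'Bob Jones'] 2
```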
```python
# combine order_leads with invoice data
orders_with_invoices = pd.merge(order_leads, invoices, how='left', on='Company Id')

# calculate days between order leads and invoices
orders_with_invoices['Days of meal before order'] = (
    pd.to_datetime(orders_with_invoices['Date']) - orders_with_invoices['Date of Meal']
).dt.days

# limit to only meals that are within 5 days of the order
orders_with_invoices = orders_with_invoices[abs(orders_with_invoices['Days of meal before order']) < 5]

# To make sure that we don't cross-assign meals to multiple orders and thereby create duplicates,
# we first sort our data by absolute distance to the orders
orders_with_invoices = orders_with_invoices.loc[
    abs(orders_with_invoices['Days of meal before order']).sort_values().index
]

# keep the first (i.e. closest to the sales event) row per order
orders_with_invoices = orders_with_invoices.drop_duplicates(subset=['Order Id'])

orders_without_invoices = order_leads[~order_leads['Order Id'].isin(orders_with_invoices['Order Id'].unique())]

orders_with_meals = pd.concat([orders_with_invoices, orders_without_invoices], sort=True)
```
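The sort-then-drop_duplicates trick is worth pausing on: sorting by the absolute day distance reorders the rows so that, for each Order Id, the meal closest to the order comes first, and drop_duplicates (which keeps the first occurrence by default) then retains exactly that meal. A toy illustration with made-up values:

```python
import pandas as pd

# One order matched to two meals: 3 days after the order and 1 day before it.
df = pd.DataFrame({
    'Order Id': ['A', 'A'],
    'Days of meal before order': [3, -1],
})

# Reorder rows by absolute distance to the order (closest first) ...
df = df.loc[df['Days of meal before order'].abs().sort_values().index]

# ... then keep only the first (closest) row per Order Id.
closest = df.drop_duplicates(subset=['Order Id'])
print(closest['Days of meal before order'].tolist())  # [-1]
```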
```python
def plot_bars(data, x_col, y_col):
    data = data.reset_index()
    sns.set(
        font_scale=1.5,
        rc={'figure.figsize': (20, 7)}
    )
    g = sns.barplot(x=x_col, y=y_col, data=data, color='royalblue')

    for p in g.patches:
        g.annotate(
            format(p.get_height(), '.2%'),
            (p.get_x() + p.get_width() / 2., p.get_height()),
            ha='center',
            va='center',
            xytext=(0, 10),
            textcoords='offset points'
        )

    vals = g.get_yticks()
    g.set_yticklabels(['{:,.0f}%'.format(x*100) for x in vals])

    sns.despine()
```
```python
orders_with_meals['Type of Meal'].fillna('no meal', inplace=True)
_ = orders_with_meals.groupby('Type of Meal').agg({'Converted': np.mean})
plot_bars(_, x_col='Type of Meal', y_col='Converted')
```
```python
_ = orders_with_meals.groupby(['Days of meal before order']).agg(
    {'Converted': np.mean}
)
plot_bars(data=_, x_col='Days of meal before order', y_col='Converted')
```
```python
def draw_heatmap(data, inner_row, inner_col, outer_row, outer_col, values):
    sns.set(font_scale=1)
    fg = sns.FacetGrid(
        data,
        row=outer_row,
        col=outer_col,
        margin_titles=True
    )

    position = left, bottom, width, height = 1.4, .2, .1, .6
    cbar_ax = fg.fig.add_axes(position)

    fg.map_dataframe(
        draw_heatmap_facet,
        x_col=inner_col,
        y_col=inner_row,
        values=values,
        cbar_ax=cbar_ax,
        vmin=0,
        vmax=.4
    )

    fg.fig.subplots_adjust(right=1.3)
    plt.show()

def draw_heatmap_facet(*args, **kwargs):
    data = kwargs.pop('data')
    x_col = kwargs.pop('x_col')
    y_col = kwargs.pop('y_col')
    values = kwargs.pop('values')
    d = data.pivot(index=y_col, columns=x_col, values=values)
    annot = round(d, 4).values
    cmap = sns.color_palette("RdYlGn", 30)
    # cmap = sns.color_palette("PuBu", 30)  # alternative color coding
    sns.heatmap(d, **kwargs, annot=annot, center=0, fmt=".1%", cmap=cmap, linewidth=.5)
```
```python
# Aggregate the data a bit
orders_with_meals['Meal Price / Order Value'] = orders_with_meals['Meal Price'] / orders_with_meals['Order Value']
orders_with_meals['Meal Price / Order Value'] = pd.qcut(
    orders_with_meals['Meal Price / Order Value'] * -1,
    5,
    labels=['Least Expensive', 'Less Expensive', 'Proportional', 'More Expensive', 'Most Expensive'][::-1]
)

orders_with_meals['Timing of Meal'] = pd.qcut(
    orders_with_meals['Days of meal before order'],
    3,
    labels=['After Order', 'Around Order', 'Before Order']
)

data = orders_with_meals[orders_with_meals['Type of Meal'] != 'no meal'].groupby(
    ['Timing of Meal', 'Number Participants', 'Type of Meal', 'Meal Price / Order Value']
).agg({'Converted': np.mean}).unstack().fillna(0).stack().reset_index()
```
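The `* -1` in the qcut call is easy to misread: pd.qcut assigns its first label to the lowest values, so negating the ratios makes the largest meal-price ratios land in the first bin, which (after reversing the label list) gets the 'Most Expensive' label. A small sketch with hypothetical ratios:

```python
import pandas as pd

# Hypothetical meal-price / order-value ratios, already sorted for readability.
ratio = pd.Series([0.02, 0.04, 0.06, 0.08, 0.10,
                   0.12, 0.14, 0.16, 0.18, 0.20])
labels = ['Least Expensive', 'Less Expensive', 'Proportional',
          'More Expensive', 'Most Expensive'][::-1]

# Negating flips the order: the highest ratios become the lowest values,
# so they fall into the first quantile bin and receive labels[0] = 'Most Expensive'.
buckets = pd.qcut(ratio * -1, 5, labels=labels)

print(buckets.iloc[0])   # Least Expensive (ratio 0.02)
print(buckets.iloc[-1])  # Most Expensive (ratio 0.20)
```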
```python
draw_heatmap(
    data=data,
    outer_row='Timing of Meal',
    outer_col='Type of Meal',
    inner_row='Meal Price / Order Value',
    inner_col='Number Participants',
    values='Converted'
)
```
The heatmap is pretty, but at first it is a bit hard to read. Let's walk through it.
The chart summarizes the effect of four different dimensions:
Timing of Meal: After Order, Around Order, Before Order (outer rows)
Type of Meal: breakfast, lunch, dinner (outer columns)
Meal Price / Order Value: Least Expensive, Less Expensive, Proportional, More Expensive, Most Expensive (inner rows)
Number of Participants: 1, 2, 3, 4, 5 (inner columns)
Admittedly, the colors toward the bottom of the chart do look darker/higher, which suggests that:
meals before the order convert at a higher rate;
dinner seems to convert better when only one person attends the meal;
a meal that is more expensive relative to the order value seems to have a positive effect on conversion.
…
1. Don't give your sales reps more than nine accounts (because the conversion rate drops off quickly beyond that).
2. Make sure every order lead comes with a meeting/meal (because this doubles the conversion rate).
3. Dinner is most effective when only one employee is assigned to the customer.
4. Your sales reps should spend roughly 8% to 10% of the order value on the meal.
5. Timing is key; ideally, let your sales reps know as early as possible when a deal is about to close.
Code: https://github.com/FBosler/Medium-Data-Exploration
A note on the heatmap:
To work around a potential formatting issue, you can first uninstall matplotlib (this has to be done in the terminal) and then downgrade it to version 3.1.0 by running:
!pip install matplotlib==3.1.0
If you run into any problems, feel free to contact me.
About the translator
车前子 is a second-year PhD student in epidemiology and biostatistics at Peking University Health Science Center. Having moved from clinical medicine into data mining, she has come to appreciate both the difficulty and the appeal of data analysis, and hopes to contribute to healthcare through data even without practicing as a physician.
翻译组招募信息
工作内容:需要一颗细致的心,将选取好的外文文章翻译成流畅的中文。如果你是数据科学/统计学/计算机类的留学生,或在海外从事相关工作,或对自己外语水平有信心的朋友欢迎加入翻译小组。
你能得到:定期的翻译培训提高志愿者的翻译水平,提高对于数据科学前沿的认知,海外的朋友可以和国内技术应用发展保持联系,THU数据派产学研的背景为志愿者带来好的发展机遇。
其他福利:来自于名企的数据科学工作者,北大清华以及海外等名校学生他们都将成为你在翻译小组的伙伴。
点击文末“阅读原文”加入数据派团队~
点击“阅读原文”拥抱组织