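Discretization converts continuous values into intervals, forming a set of separate "bins". Suppose we have the ages of a group of people and want to sort them into discrete age bins: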
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
We define the groups in advance: 18 to 25, 26 to 35, 36 to 60, and 61 and above.
pandas provides the cut function, which does the binning for us:
In [93]: bins = [18, 25, 35, 60, 100]

In [94]: cats = pd.cut(ages, bins)

In [95]: cats
Out[95]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
The returned cats is a special Categorical object; its output shows which bin each of the 12 ages falls into. cats exposes a number of attributes and methods:
In [96]: cats.codes
Out[96]: array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [97]: cats.categories
Out[97]:
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
              closed='right',
              dtype='interval[int64]')

In [98]: cats.describe  # no parentheses, so this shows the bound method instead of calling it
Out[98]:
<bound method Categorical.describe of [(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]>

In [99]: pd.value_counts(cats)  # number of values in each bin
Out[99]:
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64
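By default each bin is open on the left and closed on the right, so (18, 25] excludes 18 and includes 25. To make the bins closed on the left and open on the right instead, pass right=False.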
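To make the effect of right=False concrete, here is a small sketch that reuses the ages and bins from above and looks at the age 25, which sits exactly on a bin edge. The variable names right_closed and left_closed are only illustrative.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# Default: intervals are (18, 25], (25, 35], ...
right_closed = pd.cut(ages, bins)
# right=False: intervals are [18, 25), [25, 35), ...
left_closed = pd.cut(ages, bins, right=False)

# ages[2] is 25, which lies exactly on the edge between the first two bins.
print(right_closed.codes[2])  # code of the bin that holds 25 by default
print(left_closed.codes[2])   # code of the bin that holds 25 with right=False
print(left_closed.categories)
```

With the default setting, 25 belongs to the first bin; with right=False it moves to the second one.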
You can pass the labels parameter to give each bin a custom name:
In [100]: group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
     ...: pd.cut(ages, bins, labels=group_names)
Out[100]:
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
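If you do not supply the bin edges and instead ask for an integer number of bins, cut splits the range between the minimum and maximum of the data into that many equal-width intervals: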
In [101]: d = np.random.rand(20)

In [102]: d
Out[102]:
array([0.83732945, 0.0850416 , 0.66540597, 0.90479238, 0.99222014,
       0.39409122, 0.91896172, 0.87163655, 0.31374598, 0.27726111,
       0.7716572 , 0.79131961, 0.42805445, 0.29934685, 0.19077374,
       0.79701771, 0.93789892, 0.93536338, 0.32299602, 0.305671  ])

In [103]: pd.cut(d, 4, precision=2)  # limit the displayed edges to two digits of precision
Out[103]:
[(0.77, 0.99], (0.084, 0.31], (0.54, 0.77], (0.77, 0.99], (0.77, 0.99], ..., (0.77, 0.99], (0.77, 0.99], (0.77, 0.99], (0.31, 0.54], (0.084, 0.31]]
Length: 20
Categories (4, interval[float64]): [(0.084, 0.31] < (0.31, 0.54] < (0.54, 0.77] < (0.77, 0.99]]
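To see the edges that cut actually computes, you can ask for them with retbins=True. The sketch below uses a seeded generator (an assumption made here so the run is repeatable; the session above used np.random.rand) and checks the bin widths.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is repeatable (not part of the session above).
rng = np.random.default_rng(0)
d = rng.random(20)

# retbins=True returns the computed bin edges alongside the Categorical.
cats, edges = pd.cut(d, 4, retbins=True)

print(edges)           # 5 edges for 4 bins
print(np.diff(edges))  # width of each bin
```

The lowest edge is nudged slightly below the minimum so that the smallest value still falls inside the first right-closed interval; apart from that adjustment the bins have the same width.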
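With cut, the bins are either the ones you specify or equal-width ones. A second binning function, qcut, splits the data at sample quantiles rather than by the range of the values. For example, the following call puts the same number of elements into each bin: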
In [104]: data = np.random.randn(1000)

In [105]: cats = pd.qcut(data, 4)

In [106]: cats
Out[106]:
[(0.644, 2.83], (-0.0344, 0.644], (-0.0344, 0.644], (-0.734, -0.0344], (-0.734, -0.0344], ..., (-3.327, -0.734], (-0.734, -0.0344], (0.644, 2.83], (-0.734, -0.0344], (-0.0344, 0.644]]
Length: 1000
Categories (4, interval[float64]): [(-3.327, -0.734] < (-0.734, -0.0344] < (-0.0344, 0.644] < (0.644, 2.83]]

In [108]: pd.value_counts(cats)  # every bin holds the same number of elements
Out[108]:
(0.644, 2.83]        250
(-0.0344, 0.644]     250
(-0.734, -0.0344]    250
(-3.327, -0.734]     250
dtype: int64
qcut also accepts a custom list of quantiles between 0 and 1:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
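With custom quantiles the bins no longer hold equal counts; each bin holds the share of the data between two consecutive quantiles. A minimal sketch, assuming a seeded generator in place of np.random.randn so the run is repeatable:

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is repeatable (an assumption for this sketch).
rng = np.random.default_rng(42)
data = rng.standard_normal(1000)

# Edges at the 0%, 10%, 50%, 90% and 100% quantiles.
cats = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.0])

# Count the elements per bin, in bin order.
counts = np.bincount(cats.codes)
print(cats.categories)
print(counts)
```

For 1000 distinct values the bins should hold 10%, 40%, 40% and 10% of the data respectively.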
These discretization functions are especially useful for quantile and group analysis.
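As an illustration of that kind of analysis, the sketch below bins one column into quartiles with qcut and then summarizes a second column per quartile. The DataFrame, its column names x and y, and the quartile labels are all made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two independent normal columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.standard_normal(1000),
    "y": rng.standard_normal(1000),
})

# Assign each row to a quartile of x.
quartile = pd.qcut(df["x"], 4, labels=["Q1", "Q2", "Q3", "Q4"])

# Summarize y within each quartile of x.
summary = df.groupby(quartile, observed=True)["y"].agg(["count", "mean", "min", "max"])
print(summary)
```

Because the bins come from quantiles, each group contains the same number of rows, which makes the per-group statistics directly comparable.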
Grouping