Filling Missing Values


All data is valuable. Most of the time we do not want to discard the original rows, so we fill in the missing values instead.

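fillna is the method for filling missing values. In the simplest case, pass it a single constant and every missing value is replaced by it. (The sessions below assume `import numpy as np`, `import pandas as pd`, and `from numpy import nan as NA`.)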

In [31]: df = pd.DataFrame(np.random.randn(7,3))

In [32]: df
Out[32]:
          0         1         2
0 -0.229682 -0.483246 -0.063835
1  0.716649  1.593639 -1.364550
2 -1.362614  1.628310 -1.617992
3  1.128828 -1.120265 -0.657313
4  1.078143  1.136835 -0.427125
5  0.441696  0.219477  0.695700
6 -0.501183  1.453678 -2.734985

In [33]: df.iloc[:4, 1] = NA

In [35]: df.iloc[:2, 2] = NA

In [36]: df
Out[36]:
          0         1         2
0 -0.229682       NaN       NaN
1  0.716649       NaN       NaN
2 -1.362614       NaN -1.617992
3  1.128828       NaN -0.657313
4  1.078143  1.136835 -0.427125
5  0.441696  0.219477  0.695700
6 -0.501183  1.453678 -2.734985

In [37]: df.fillna(0)
Out[37]:
          0         1         2
0 -0.229682  0.000000  0.000000
1  0.716649  0.000000  0.000000
2 -1.362614  0.000000 -1.617992
3  1.128828  0.000000 -0.657313
4  1.078143  1.136835 -0.427125
5  0.441696  0.219477  0.695700
6 -0.501183  1.453678 -2.734985

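You can also pass a dict keyed by column label to use a different fill value for each column. Columns not listed in the dict are left unchanged.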

In [38]: df.fillna({1:1, 2:2})
Out[38]:
          0         1         2
0 -0.229682  1.000000  2.000000
1  0.716649  1.000000  2.000000
2 -1.362614  1.000000 -1.617992
3  1.128828  1.000000 -0.657313
4  1.078143  1.136835 -0.427125
5  0.441696  0.219477  0.695700
6 -0.501183  1.453678 -2.734985
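The two calls above can be reproduced as a standalone script. This is a minimal sketch that uses a small hand-written frame (illustrative values, not the random data shown above) so the result is deterministic:

```python
import numpy as np
import pandas as pd

# Illustrative data: column 0 is complete, columns 1 and 2 contain gaps.
df = pd.DataFrame({
    0: [1.0, 2.0, 3.0],
    1: [np.nan, np.nan, 6.0],
    2: [np.nan, 8.0, 9.0],
})

# A scalar replaces every missing value in the frame.
filled_scalar = df.fillna(0)

# A dict maps column label -> fill value; column 0 is not listed,
# so it is left as it is.
filled_by_col = df.fillna({1: 1, 2: 2})

print(filled_scalar)
print(filled_by_col)
```

Both calls return new frames, so `df` itself still contains its three missing values afterwards.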

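Like most pandas methods, fillna returns a new object and does not modify the original data. If you do want to modify it in place, use the inplace parameter: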

In [39]: _ = df.fillna(0, inplace=True)

In [40]: df
Out[40]:
          0         1         2
0 -0.229682  0.000000  0.000000
1  0.716649  0.000000  0.000000
2 -1.362614  0.000000 -1.617992
3  1.128828  0.000000 -0.657313
4  1.078143  1.136835 -0.427125
5  0.441696  0.219477  0.695700
6 -0.501183  1.453678 -2.734985
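The difference between the two behaviors can be checked directly. The following is a small sketch with made-up data; the column names `a` and `b` are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0]})

# Default behavior: a new DataFrame is returned, df is untouched.
result = df.fillna(0)
missing_before = int(df.isna().sum().sum())

# inplace=True: df itself is modified.
df.fillna(0, inplace=True)
missing_after = int(df.isna().sum().sum())

print(missing_before, missing_after)
```

Reassigning the result, as in `df = df.fillna(0)`, has the same visible effect and is often easier to read in a chain of operations.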

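You can also fill gaps by propagating neighboring values forward (ffill) or backward (bfill). The session below uses fillna(method=...), which pandas 2.1 and later deprecate in favor of the equivalent df.ffill() and df.bfill():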

In [41]: df = pd.DataFrame(np.random.randn(6,3))

In [42]: df.iloc[2:, 1]=NA

In [43]: df.iloc[4:, 2]=NA

In [44]: df
Out[44]:
          0         1         2
0 -0.858762  0.083342 -0.315598
1 -0.211846  0.076648  1.188298
2 -0.513364       NaN  0.079216
3  0.398399       NaN -0.290225
4 -1.375898       NaN       NaN
5  0.932812       NaN       NaN

In [45]: df.fillna(method='ffill') # fill each gap with the previous valid value
Out[45]:
          0         1         2
0 -0.858762  0.083342 -0.315598
1 -0.211846  0.076648  1.188298
2 -0.513364  0.076648  0.079216
3  0.398399  0.076648 -0.290225
4 -1.375898  0.076648 -0.290225
5  0.932812  0.076648 -0.290225

In [46]: df.fillna(method='ffill',limit=2)  # fill at most 2 consecutive gaps
Out[46]:
          0         1         2
0 -0.858762  0.083342 -0.315598
1 -0.211846  0.076648  1.188298
2 -0.513364  0.076648  0.079216
3  0.398399  0.076648 -0.290225
4 -1.375898       NaN -0.290225
5  0.932812       NaN -0.290225

In [47]: df.fillna(method='bfill')  # backward fill has no effect here: no valid value follows the NaNs
Out[47]:
          0         1         2
0 -0.858762  0.083342 -0.315598
1 -0.211846  0.076648  1.188298
2 -0.513364       NaN  0.079216
3  0.398399       NaN -0.290225
4 -1.375898       NaN       NaN
5  0.932812       NaN       NaN
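The same three cases can be written with the dedicated ffill and bfill methods, which avoid the deprecated method argument. This is a sketch on a short Series with made-up values, chosen so that each behavior is easy to see:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill: each gap takes the last valid value before it.
forward = s.ffill()

# limit=2 fills at most two consecutive gaps, so the third stays NaN.
limited = s.ffill(limit=2)

# Backward fill: each gap takes the next valid value after it.
backward = s.bfill()

# When the gaps sit at the end, there is nothing after them to copy,
# so a backward fill leaves them as NaN.
trailing = pd.Series([1.0, np.nan, np.nan])
unchanged = trailing.bfill()

print(forward.tolist())
print(limited.tolist())
print(backward.tolist())
print(unchanged.tolist())
```

The last case mirrors the session above, where every NaN is at the bottom of its column and the backward fill therefore changes nothing.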

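There are many useful tricks with fillna that are worth collecting and trying out over time. For example, you can fill with the mean of the non-missing values: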

In [48]: s = pd.Series([1, NA, 3.5, NA, 7])

In [49]: s.fillna(s.mean())
Out[49]:
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
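The same idea extends to a DataFrame. Since df.mean() returns a Series indexed by column label, passing it to fillna fills each column with its own mean. A minimal sketch with made-up data (the column names `x` and `y` are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, 10.0, 20.0],
})

# mean() skips NaN by default, giving one mean per column.
col_means = df.mean()

# Each column's gaps are filled with that column's mean.
filled = df.fillna(col_means)

print(filled)
```

Swapping df.mean() for df.median() works the same way and may be a better choice when a column has outliers.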
