可以通过多种方式将Pandas对象联合到一起:
merge方法将两个pandas对象连接在一起,类似SQL的连接操作。默认情况下,它执行的是内连接,也就是两个对象的交集。通过参数how,还可以指定外连接、左连接和右连接。参数on指定在哪个键上连接,参数left_on
和right_on
分别指定左右对象的连接键。
In [23]: df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], ...: 'data1': range(7)}) ...: In [24]: df2 = pd.DataFrame({'key': ['a', 'b', 'd'], ...: 'data2': range(3)}) ...: In [25]: df1 Out[25]: key data1 0 b 0 1 b 1 2 a 2 3 c 3 4 a 4 5 a 5 6 b 6 In [26]: df2 Out[26]: key data2 0 a 0 1 b 1 2 d 2 In [27]: pd.merge(df1,df2) # 默认内链接,并智能地查找连接的键 Out[27]: key data1 data2 0 b 0 1 1 b 1 1 2 b 6 1 3 a 2 0 4 a 4 0 5 a 5 0 In [28]: pd.merge(df1,df2,on='key') # 最好是显式地指定连接的键 Out[28]: key data1 data2 0 b 0 1 1 b 1 1 2 b 6 1 3 a 2 0 4 a 4 0 5 a 5 0 In [30]: pd.merge(df1, df2, how='outer') # 外连接 Out[30]: key data1 data2 0 b 0.0 1.0 1 b 1.0 1.0 2 b 6.0 1.0 3 a 2.0 0.0 4 a 4.0 0.0 5 a 5.0 0.0 6 c 3.0 NaN 7 d NaN 2.0 In [31]: pd.merge(df1, df2, how='left') # 左连接 Out[31]: key data1 data2 0 b 0 1.0 1 b 1 1.0 2 a 2 0.0 3 c 3 NaN 4 a 4 0.0 5 a 5 0.0 6 b 6 1.0 In [32]: pd.merge(df1, df2, how='right') #右连接 Out[32]: key data1 data2 0 b 0.0 1 1 b 1.0 1 2 b 6.0 1 3 a 2.0 0 4 a 4.0 0 5 a 5.0 0 6 d NaN 2 In [33]: df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], ...: 'data1': range(7)}) ...: In [34]: df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'], ...: 'data2': range(3)}) ...: In [35]: pd.merge(df3, df4, left_on='lkey', right_on='rkey') # 指定两边的键 Out[35]: lkey data1 rkey data2 0 b 0 b 1 1 b 1 b 1 2 b 6 b 1 3 a 2 a 0 4 a 4 a 0 5 a 5 a 0
多对多的merge连接是行的笛卡儿积。比如左边如果有3个‘b’行,右边有2个‘b’行,那么结果是3x2,6个‘b’行。
也可以同时指定多个键进行连接:
In [36]: left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'], ...: 'key2': ['one', 'two', 'one'], ...: 'lval': [1, 2, 3]}) In [37]: right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'], ...: 'key2': ['one', 'one', 'one', 'two'], ...: 'rval': [4, 5, 6, 7]}) In [38]: pd.merge(left, right, on=['key1', 'key2'], how='outer') Out[38]: key1 key2 lval rval 0 foo one 1.0 4.0 1 foo one 1.0 5.0 2 foo two 2.0 NaN 3 bar one 3.0 6.0 4 bar two NaN 7.0
merge操作中还有一个重叠列名的问题,比如上面的left和right两个数据,为此,我们可以使用suffixes参数,手动指定为重复的列名添加后缀:
In [41]: pd.merge(left, right, on='key1') Out[41]: key1 key2_x lval key2_y rval 0 foo one 1 one 4 1 foo one 1 one 5 2 foo two 2 one 4 3 foo two 2 one 5 4 bar one 3 one 6 5 bar one 3 two 7 In [42]: pd.merge(left, right, on='key1', suffixes=('_left', '_right')) Out[42]: key1 key2_left lval key2_right rval 0 foo one 1 one 4 1 foo one 1 one 5 2 foo two 2 one 4 3 foo two 2 one 5 4 bar one 3 one 6 5 bar one 3 two 7
有时候,用于merge合并的键可能是某个对象的行索引:
In [43]: left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], ...: 'value': range(6)}) ...: In [44]: right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b']) In [45]: left1 Out[45]: key value 0 a 0 1 b 1 2 a 2 3 a 3 4 b 4 5 c 5 In [46]: right1 Out[46]: group_val a 3.5 b 7.0 In [47]: pd.merge(left1, right1, left_on='key', right_index=True) Out[47]: key value group_val 0 a 0 3.5 2 a 2 3.5 3 a 3 3.5 1 b 1 7.0 4 b 4 7.0 In [48]: pd.merge(left1, right1, left_on='key', right_index=True, how='outer') Out[48]: key value group_val 0 a 0 3.5 2 a 2 3.5 3 a 3 3.5 1 b 1 7.0 4 b 4 7.0 5 c 5 NaN
使用right_index=True
参数显式地指出,右边的对象right1使用它的行索引作为连接的键。
事实上Pandas有一个join方法,可以帮助我们直接用行索引进行连接:
In [49]: left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]], ...: index=['a', 'c', 'e'], ...: columns=['Ohio', 'Nevada']) ...: In [50]: right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]], ...: index=['b', 'c', 'd', 'e'], ...: columns=['Missouri', 'Alabama']) ...: In [51]: left2 Out[51]: Ohio Nevada a 1.0 2.0 c 3.0 4.0 e 5.0 6.0 In [52]: right2 Out[52]: Missouri Alabama b 7.0 8.0 c 9.0 10.0 d 11.0 12.0 e 13.0 14.0 In [53]: pd.merge(left2, right2, how='outer', left_index=True, right_index=True) Out[53]: Ohio Nevada Missouri Alabama a 1.0 2.0 NaN NaN b NaN NaN 7.0 8.0 c 3.0 4.0 9.0 10.0 d NaN NaN 11.0 12.0 e 5.0 6.0 13.0 14.0 In [54]: left2.join(right2, how='outer') # 与上面的操作效果一样 Out[54]: Ohio Nevada Missouri Alabama a 1.0 2.0 NaN NaN b NaN NaN 7.0 8.0 c 3.0 4.0 9.0 10.0 d NaN NaN 11.0 12.0 e 5.0 6.0 13.0 14.0
Series对象name属性的值必须不为None才能使用pd.merge连接,且必须设置参数left_index=True,right_index=True
垂直合并