When working with very large files, it is often better to read in only a small piece of the file, or to iterate through it in smaller chunks.
Before doing that, it helps to make pandas' display settings more compact:
In [45]: pd.options.display.max_rows = 10
With this setting, at most 10 rows of content will be displayed, even for a large file:
In [46]: result = pd.read_csv('d:/ex6.csv')

In [47]: result
Out[47]:
           one       two     three      four key
0     0.467976 -0.038649 -0.295344 -1.824726   L
1    -0.358893  1.404453  0.704965 -0.200638   B
2    -0.501840  0.659254 -0.421691 -0.057688   G
3     0.204886  1.074134  1.388361 -0.982404   R
4     0.354628 -0.133116  0.283763 -0.837063   Q
...        ...       ...       ...       ...  ..
9995  2.311896 -0.417070 -1.409599 -0.515821   L
9996 -0.479893 -0.650419  0.745152 -0.646038   E
9997  0.523331  0.787112  0.486066  1.093156   K
9998 -0.362559  0.598894 -1.843201  0.887292   G
9999 -0.096376 -1.012999 -0.657431 -0.573315   0

[10000 rows x 5 columns]
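As a small aside (not part of the session above), if you only want the compact display temporarily, pandas also provides option_context, which restores the previous setting when the block ends, and reset_option, which undoes a global change. A minimal sketch:

import pandas as pd

# Inside the with-block the display is truncated to 10 rows;
# outside it, the previous setting applies again.
with pd.option_context('display.max_rows', 10):
    print(pd.DataFrame({'x': range(100)}))

pd.reset_option('display.max_rows')   # back to the default globally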
Alternatively, you can use the nrows argument to read only the first n rows from the top of the file:
In [48]: result = pd.read_csv('d:/ex6.csv', nrows=5)

In [49]: result
Out[49]:
        one       two     three      four key
0  0.467976 -0.038649 -0.295344 -1.824726   L
1 -0.358893  1.404453  0.704965 -0.200638   B
2 -0.501840  0.659254 -0.421691 -0.057688   G
3  0.204886  1.074134  1.388361 -0.982404   R
4  0.354628 -0.133116  0.283763 -0.837063   Q
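If you need a slice from the middle of the file rather than the top, nrows can be combined with skiprows. The sketch below (the window boundaries are arbitrary choices for illustration, not from the original example) skips the first 1000 data rows while keeping the header, then reads the next 5 rows:

import pandas as pd

# Row 0 of the file is the header; skip data rows 0..999, then read 5 rows,
# i.e. original data rows 1000..1004.
window = pd.read_csv('d:/ex6.csv',
                     skiprows=range(1, 1001),
                     nrows=5)
print(window)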
Or you can specify chunksize as the number of rows per chunk and read the file in pieces:
In [50]: chunker = pd.read_csv('d:/ex6.csv', chunksize=1000)

In [51]: chunker
Out[51]: <pandas.io.parsers.TextFileReader at 0x2417d6cfb38>
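A related option, not shown in the session above, is iterator=True, which also returns a TextFileReader but lets you pull chunks on demand with get_chunk() instead of using a fixed chunk size. A minimal sketch (the chunk sizes here are arbitrary):

import pandas as pd

reader = pd.read_csv('d:/ex6.csv', iterator=True)
first_100 = reader.get_chunk(100)   # the first 100 rows as a DataFrame
next_50 = reader.get_chunk(50)      # the following 50 rows
print(len(first_100), len(next_50))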
The TextFileReader object returned above is iterable. For example, we can loop over it and aggregate the value counts of the 'key' column across chunks:
In [52]: total = pd.Series([])

In [53]: for piece in chunker:
    ...:     total = total.add(piece['key'].value_counts(), fill_value=0)
    ...:     total = total.sort_values(ascending=False)

In [54]: total
Out[54]:
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
     ...
5    157.0
2    152.0
0    151.0
9    150.0
1    146.0
Length: 36, dtype: float64

In [55]: total[:10]
Out[55]:
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64
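For reference, the same chunked aggregation written as a standalone script might look like the sketch below. Two small adjustments on my part, not part of the original session: the empty Series is given an explicit dtype (newer pandas versions warn otherwise), and the sort is done once after the loop rather than on every chunk.

import pandas as pd

chunker = pd.read_csv('d:/ex6.csv', chunksize=1000)

total = pd.Series([], dtype='float64')
for piece in chunker:
    # value_counts() counts each key within this chunk;
    # add(..., fill_value=0) merges those counts into the running total.
    total = total.add(piece['key'].value_counts(), fill_value=0)

total = total.sort_values(ascending=False)
print(total[:10])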