pandas vs SQL
https://pandas.pydata.org/docs/user_guide/scale.html?highlight=postgresql
Processing large-scale data with pandas
pandas provides data structures for in-memory analytics, which makes using pandas to analyze datasets that are larger than memory somewhat tricky. Even datasets that are a sizable fraction of memory become unwieldy, as some pandas operations need to make intermediate copies. This document provides a few recommendations for scaling your analysis to larger datasets. It’s a complement to Enhancing performance, which focuses on speeding up analysis for datasets that fit in memory.
But first, it’s worth considering not using pandas. pandas isn’t the right tool for all situations. If you’re working with very large datasets and a tool like PostgreSQL fits your needs, then you should probably be using that. Assuming you want or need the expressiveness and power of pandas, let’s carry on.
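Because pandas does its computation in memory, it struggles once the data grows too large, and the pandas documentation suggests several ways around this. The better options are to keep the data in large-scale storage, split it into chunks stored as Parquet files, and process one chunk at a time with pandas, or to use Dask (whose interface is consistent with the pandas API style) for large-scale parallel processing across multiple threads or a cluster.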
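The chunk-by-chunk approach can be sketched roughly as follows. This is a minimal illustration, not the documentation's exact code: the helper name `count_values_across_files`, the `data` directory, and the `name` column are all hypothetical. It assumes each Parquet file fits in memory on its own, and that a Parquet engine such as pyarrow is installed for `pd.read_parquet`.

```python
from collections.abc import Callable, Iterable
from pathlib import Path

import pandas as pd


def count_values_across_files(
    paths: Iterable[Path],
    column: str,
    reader: Callable[..., pd.DataFrame] = pd.read_parquet,
) -> pd.Series:
    """Count occurrences of each value in `column` across many files,
    loading only one file (chunk) into memory at a time.

    Hypothetical helper for illustration; `reader` must accept a path
    and a `columns` keyword, as `pd.read_parquet` does.
    """
    total = pd.Series(dtype="int64")
    for path in paths:
        # Read a single chunk, and only the column that is needed.
        chunk = reader(path, columns=[column])
        # Merge this chunk's counts into the running total; a value
        # missing on either side is treated as 0.
        total = total.add(chunk[column].value_counts(), fill_value=0)
    # The aligned addition produces floats, so convert back to integers.
    return total.astype("int64")


# Hypothetical usage, assuming a directory holding one Parquet file per chunk:
# counts = count_values_across_files(sorted(Path("data").glob("*.parquet")), "name")
```

This only works for operations whose per-chunk results can be combined afterwards, such as counts or sums. Dask's `dask.dataframe` applies the same partition-then-combine idea behind a pandas-like interface and can spread the work over threads or a cluster.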