Reply to pandas vs sql on Tue, 30 Nov 2021 06:27:53 GMT

anneng — Tue, 30 Nov 2021 06:27:53 GMT

https://drops.dagstuhl.de/opus/volltexte/2020/11960/pdf/OASIcs-PLATEAU-2019-6.pdf

Reply to pandas vs sql on Thu, 26 Aug 2021 03:38:39 GMT

anneng — Thu, 26 Aug 2021 03:38:39 GMT

https://pandas.pydata.org/docs/user_guide/scale.html?highlight=postgresql

Processing large-scale data with pandas
pandas provides data structures for in-memory analytics, which makes using pandas to analyze datasets that are larger than memory somewhat tricky. Even datasets that are a sizable fraction of memory become unwieldy, as some pandas operations need to make intermediate copies.

This document provides a few recommendations for scaling your analysis to larger datasets. It’s a complement to Enhancing performance, which focuses on speeding up analysis for datasets that fit in memory.

But first, it’s worth considering not using pandas. pandas isn’t the right tool for all situations. If you’re working with very large datasets and a tool like PostgreSQL fits your needs, then you should probably be using that. Assuming you want or need the expressiveness and power of pandas, let’s carry on.
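Because pandas does its computation in memory, the pandas docs suggest several approaches for when the data gets too large. The better ones: store the large dataset on disk, split it into chunks as parquet files, and process each chunk with pandas; or use Dask (whose interface follows the pandas style) for large-scale parallel processing across multiple threads or across a cluster.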
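
The chunking idea can be sketched roughly as follows. This is a minimal sketch, not the docs' own code: it reads CSV through `pd.read_csv(..., chunksize=...)` instead of parquet so that it needs nothing beyond pandas, and the function name `chunked_value_counts`, the column name `name`, and the tiny in-memory data are all made up for illustration.

```python
import io

import pandas as pd


def chunked_value_counts(source, column, chunksize=1000):
    """Count the values of `column` while holding only one chunk in memory."""
    total = pd.Series(dtype="int64")
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_csv(source, usecols=[column], chunksize=chunksize):
        # fill_value=0 keeps values that appear in only one of the two operands.
        total = total.add(chunk[column].value_counts(), fill_value=0)
    # The additions above produce floats, so cast back to integers.
    return total.astype("int64").sort_index()


# Hypothetical stand-in for a file too large to load in one go.
names = ["a", "b", "a", "c", "a", "b"]
csv_text = "name,x\n" + "\n".join(f"{n},{i}" for i, n in enumerate(names))

counts = chunked_value_counts(io.StringIO(csv_text), "name", chunksize=2)
print(counts)
```

This only works for operations that can be combined across chunks (counts, sums, min/max). Anything that needs all rows at once, such as a global sort or a join across chunks, is where Dask or a database like PostgreSQL is the better fit.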