
Python and SQL combo with FugueSQL: a single SQL interface for | Big Data Science

Python and SQL combo with FugueSQL: a single SQL interface for Pandas, Spark and Dask DataFrames
FugueSQL is an open-source Python library that lets you combine Python code with SQL commands, switching between the two in a Jupyter Notebook or a Python script. It supports distributed computing and provides a unified API for running the same SQL code on Pandas, Dask, and Apache Spark.
Unlike pandasql, which relies on a single SQLite backend and therefore incurs a lot of overhead when transferring data between Pandas and the database, FugueSQL supports multiple local backends: Pandas, DuckDB, and SQLite.
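To make the transfer overhead concrete, here is a minimal sketch of the round trip that a SQLite-backed approach implies: the DataFrame is copied into the database, queried there, and the result is copied back into Pandas. It uses only pandas and the standard-library sqlite3 module; it is an illustration of the pattern, not pandasql's actual implementation, and the sample data is made up.

```python
import sqlite3

import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

conn = sqlite3.connect(":memory:")
try:
    # Step 1: copy every row from Pandas into a SQLite table.
    df.to_sql("df", conn, index=False)
    # Step 2: run the query in SQLite and copy the result back into Pandas.
    result = pd.read_sql_query("SELECT a, b FROM df WHERE a > 1", conn)
finally:
    conn.close()

print(result)
```

Both copies grow with the size of the data, which is the cost a backend that works on the DataFrame directly can avoid.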
When using the Pandas backend, Fugue translates SQL directly into Pandas operations, so no data transfer is needed. DuckDB has excellent Pandas support, so its data transfer overhead is negligible. Pandas and DuckDB are therefore the preferred FugueSQL backends for local data processing. Fugue also supports Spark, Dask, and cuDF (via BlazingSQL) as backends.
FugueSQL code is parsed with ANTLR and mapped to equivalent functions in the Fugue API. FugueSQL has many built-in features, and it can be extended with Python code. Out of the box it supports the most common operations: filling nulls, dropping nulls, renaming columns, altering the schema, and more. Fugue also adds some enhancements to standard SQL so that end-to-end data workflows can be handled gracefully, for example creating intermediate tables through variable assignment.
With Pandas, %%fsql uses the NativeExecutionEngine by default. On Dask, FugueSQL is slightly slower than the native engine, but more complete in terms of implemented SQL keywords. FugueSQL also runs on Spark, mapping %%fsql operations to Spark and Spark SQL operations. This lets you develop distributed applications quickly: build a local prototype with the NativeExecutionEngine, test it, and deploy it to a Spark cluster simply by changing the execution engine.
https://towardsdatascience.com/introducing-fuguesql-sql-for-pandas-spark-and-dask-dataframes-63d461a16b27
https://fugue-tutorials.readthedocs.io/tutorials/fugue_sql/index.html