Using duckdb in a pipeline
I use pandas and polars a lot, and memorizing their syntax is not that bad. But there's another package, duckdb, that lets me use SQL syntax instead, so I don't need to memorize the groupby or pivot functions. (Maybe I still do, but less often?)
I see myself using duckdb in this way:
import duckdb
# create an empty connection and then just query the table
con = duckdb.connect()
con.sql("SELECT * FROM 'taxi_2019_04.parquet' LIMIT 5").show()
# or I can save it to a polars dataframe
df = con.sql("SELECT * FROM 'taxi_2019_04.parquet' LIMIT 5").pl()
# then save it to a csv file
con.sql("SELECT * FROM 'taxi_2019_04.parquet' LIMIT 5").to_csv('test.csv')
Some common actions I do, like checking columns and their specs:
con.sql("DESCRIBE SELECT * FROM 'test.parquet'")
# this is like describe with stats
con.sql("SUMMARIZE SELECT * FROM 'taxi_2019_04.parquet'").show()
# or if it's parquet
con.sql("select * from parquet_schema('test.parquet')")
Some workflows
duckdb is also better at dealing with a mix of input sources: databases, S3, CSV, and parquet files.
Then I can do something like
con = duckdb.connect()
query = """select a.* ,b.t1 ,b.t2
from 'test.csv' as a
left join 'test1.parquet' as b
on a.key1 = b.key1 and a.key2 = b.key2
"""df = con.sql(query).pl()When there’s a lot of testing flow and need to comeback at some intermediate data tables, there can be a persistent table.
# connecting to a file path creates the database if it doesn't exist
con = duckdb.connect('test.db')
con.sql("create view v1 as select * from 'test.csv' limit 10")
con.sql("create view v2 as select * from 'test.csv' limit 3")
con.close()
# then later I can come back to it
con1 = duckdb.connect('test.db')
con1.sql("show tables").show()