Dec 2, 2025

Using duckdb in a pipeline

I use pandas and polars a lot, and memorizing the syntax is not that bad. There’s another package, though: duckdb, which lets me write SQL so that I don’t need to memorize the groupby or pivot functions. (Maybe I still do, but less often?)

I see myself using duckdb this way:

import duckdb

# create an in-memory connection, then query the file directly
con = duckdb.connect()
con.sql("SELECT * FROM 'taxi_2019_04.parquet' LIMIT 5").show()

# or I can save the result to a polars dataframe
df = con.sql("SELECT * FROM 'taxi_2019_04.parquet' LIMIT 5").pl()

# or save it to a csv file
con.sql("SELECT * FROM 'taxi_2019_04.parquet' LIMIT 5").to_csv('test.csv')

Some common things I do, like checking the columns and their types:

con.sql("DESCRIBE SELECT * FROM 'test.parquet'").show()

# this is like describe, but with stats
con.sql("SUMMARIZE SELECT * FROM 'taxi_2019_04.parquet'").show()

# or, if it's a parquet file, look at the parquet schema itself
con.sql("select * from parquet_schema('test.parquet')").show()

Some workflows

duckdb is handy when the input data comes from different places, like a mix of databases, S3, CSV, and parquet files.

Then I can do something like

con = duckdb.connect()

query = """
select
    a.*
    ,b.t1
    ,b.t2

    from 'test.csv' as a

    left join 'test1.parquet' as b

    on a.key1 = b.key1
    and a.key2 = b.key2

"""
df = con.sql(query).pl()

When there’s a lot of testing and I need to come back to some intermediate tables, I can keep them in a persistent database file.

# a file-backed connection, so whatever is created here is saved to disk
con = duckdb.connect('test.db')

# "create or replace" keeps reruns from failing when the view already exists
con.sql("create or replace view v1 as select * from 'test.csv' limit 10")

con.sql("create or replace view v2 as select * from 'test.csv' limit 3")

con.close()

# then later I can come back to it (same file as above)
con1 = duckdb.connect('test.db')
con1.sql("show tables").show()