Loading Parquet in PostgreSQL via DuckDB: Testing queries and exploring the Core


In the realm of data management and analytics, PostgreSQL has long been a popular choice as an open-source relational database management system. It offers robust features for handling structured data. However, when it comes to managing and analyzing large volumes of data efficiently, other technologies like Parquet and DuckDB have made their mark.

In this article, we’ll delve into the process of integrating DuckDB into PostgreSQL to load Parquet files as foreign tables, providing a powerful solution for data analytics.

One of the crucial elements that make this integration possible is DuckDB FDW (Foreign Data Wrapper), an open-source PostgreSQL extension that lets us use the DuckDB engine from within PostgreSQL.

Before we embark on our exploration, let’s briefly introduce the key components involved:

  1. PostgreSQL: PostgreSQL is a powerful, open-source relational database management system known for its reliability and scalability. It excels in handling structured data and is a top choice for many organizations.
  2. Parquet: Parquet is an open-source columnar storage file format that excels in storing and processing large datasets efficiently. It is particularly well-suited for analytical workloads and is a popular choice for data warehousing and data lakes.
  3. DuckDB: DuckDB is another open-source project that focuses on analytical query processing. It is designed for high-performance analytical workloads, making it an ideal candidate for accelerating complex queries.
  4. DuckDB FDW: Foreign data wrapper extension to connect PostgreSQL to DuckDB databases.

The integration of DuckDB into PostgreSQL allows you to load Parquet files as foreign tables. This brings the power of both DuckDB’s query processing capabilities and Parquet’s efficient storage format to PostgreSQL.

In the fantastic article by Paul Ramsey, we can learn about a similar integration, in that case using the parquet-fdw plugin to load Parquet into the database.

In this article, we will utilize the DuckDB FDW to load data via the DuckDB engine and harness the excellent data access performance that this in-process SQL OLAP database offers.

To test the integration, we need input data with a substantial number of records so we can evaluate performance. We will use a subset of the well-known ‘NYC Taxi Data’ dataset, consisting of three files named “taxi_2019_XX.parquet”, which are available for download from the following link.

Let’s proceed with the process!

Step 1: Build & install DuckDB FDW into PostgreSQL

We begin by installing DuckDB on our system and the PostgreSQL extension. Detailed installation instructions are available on the related repo.

Step 2: Configure PostgreSQL

In this step, we configure PostgreSQL to enable the loading of Parquet data as foreign tables. We create a server for DuckDB, specifying connection details to our DuckDB instance.

CREATE EXTENSION duckdb_fdw;
CREATE SERVER duckdb_svr
FOREIGN DATA WRAPPER duckdb_fdw OPTIONS (database ':memory:');

The special value :memory: creates an in-memory database where no data is persisted to disk (i.e., all data is lost when the process exits). This setting is right for our use case, since we only want to read Parquet files.
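
If you wanted the DuckDB catalog to persist between sessions, the same database option can point to a file on disk instead. A minimal sketch (the server name and path below are hypothetical examples):

-- Hypothetical: a file-backed DuckDB database instead of ':memory:'
CREATE SERVER duckdb_file_svr
FOREIGN DATA WRAPPER duckdb_fdw OPTIONS (database '/home/xyz/duckdb/analytics.duckdb');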

Step 3: Create a Foreign Table

With the configuration in place, we create a foreign table in PostgreSQL that represents the Parquet dataset. This foreign table establishes a seamless connection to DuckDB, providing direct access to the data within the Parquet files.

CREATE FOREIGN TABLE public.taxi_table (
  vendor_id text,
  pickup_at timestamp,
  dropoff_at timestamp,
  passenger_count int,
  trip_distance double precision,
  rate_code_id text,
  store_and_fwd_flag text,
  pickup_location_id int,
  dropoff_location_id int,
  payment_type text,
  fare_amount double precision,
  extra double precision,
  mta_tax double precision,
  tip_amount double precision,
  tolls_amount double precision,
  improvement_surcharge double precision,
  total_amount double precision,
  congestion_surcharge double precision
)
SERVER duckdb_svr
OPTIONS (
  table 'read_parquet("/home/xyz/Downloads/taxi-data/*.parquet")'
);
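
If you are unsure which column names and types to declare in the foreign table, DuckDB itself can tell you. A quick sketch, run from the DuckDB CLI against the same files (the path matches the one used above):

-- DuckDB CLI: show the schema DuckDB infers from the Parquet files
DESCRIBE SELECT * FROM read_parquet('/home/xyz/Downloads/taxi-data/*.parquet');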

Step 4: Query the Data

Now that we have set up the foreign table, we can run SQL queries against it, benefiting from DuckDB’s outstanding analytical query processing performance.

SELECT count(*) FROM public.taxi_table;

Wow! 21 million records in less than 400 MB of Parquet files. Impressive data compression.


We’re going to check the SQL execution plan to see what PostgreSQL is doing.

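A minimal sketch of that check, using the foreign table created above (the VERBOSE option should reveal the query the FDW ships to DuckDB):

EXPLAIN (VERBOSE, COSTS OFF)
SELECT count(*) FROM public.taxi_table;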

Great, PostgreSQL completely delegates the execution to the extension, and therefore to DuckDB. Response time is fantastic, well done DuckDB!

Let’s see what’s in the table.

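Something like this is enough for a quick peek (the column selection here is arbitrary):

SELECT vendor_id, pickup_at, dropoff_at, passenger_count, trip_distance
FROM public.taxi_table
LIMIT 10;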

Let’s try something more complex…

SELECT vendor_id, passenger_count, count(*)
FROM public.taxi_table
GROUP BY vendor_id, passenger_count
ORDER BY vendor_id, passenger_count;
SELECT *
FROM public.taxi_table
WHERE vendor_id = '1'
  AND pickup_at >= '2019-06-01 00:00:00'
  AND dropoff_at <= '2019-06-02 00:00:00';

Well, around 4 seconds to retrieve 90,000 records is not bad, considering that the data is not partitioned (for example, by the ‘vendor_id’ column), that I’m retrieving all fields from the table (not the best practice; you should always fetch only what you need, as sketched below), and that the files sit on my old laptop with a 9-year-old SATA disk that has seen better days ;-)
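
As a sketch of that advice (the column choice is arbitrary), the same filter with a narrower projection would look like this:

SELECT vendor_id, pickup_at, dropoff_at, trip_distance, total_amount
FROM public.taxi_table
WHERE vendor_id = '1'
  AND pickup_at >= '2019-06-01 00:00:00'
  AND dropoff_at <= '2019-06-02 00:00:00';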

Let’s check the SQL execution plan of the SELECT * query above.


Alright, everything is delegated to DuckDB; PostgreSQL just waits for the results.

Let’s try an aggregate function in the SELECT clause:

SELECT vendor_id,
       min(passenger_count) AS min_pgc,
       max(passenger_count) AS max_pgc,
       count(*)
FROM public.taxi_table
GROUP BY vendor_id
ORDER BY vendor_id;

Everything’s in order.

Step 5: When things are not so pretty

Let’s see what PostgreSQL does with this query:

SELECT vendor_id,
       passenger_count,
       count(*) AS nr,
       avg(trip_distance) AS avg_dist,
       array_agg(trip_distance) AS distances
FROM public.taxi_table
WHERE pickup_at >= '2019-06-01 00:00:00'
  AND dropoff_at <= '2019-06-01 00:02:00'
GROUP BY vendor_id, passenger_count
ORDER BY vendor_id, passenger_count;

Perfect results, but let’s check the execution plan.

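Here is a sketch of that check (again, VERBOSE should expose what is sent to DuckDB):

EXPLAIN (VERBOSE, COSTS OFF)
SELECT vendor_id, passenger_count, count(*) AS nr,
       avg(trip_distance) AS avg_dist,
       array_agg(trip_distance) AS distances
FROM public.taxi_table
WHERE pickup_at >= '2019-06-01 00:00:00'
  AND dropoff_at <= '2019-06-01 00:02:00'
GROUP BY vendor_id, passenger_count
ORDER BY vendor_id, passenger_count;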

PostgreSQL has broken the query down into two stages: the first, sent to DuckDB, applies only the WHERE clause without grouping; the second is executed by the PostgreSQL engine, which applies the grouping to the subset of records returned by the first stage:

  • First stage in the SQL execution plan (for DuckDB):

-> Foreign Scan:
   SELECT "vendor_id", "passenger_count", "trip_distance"
   FROM read_parquet("/data/taxi-data/*.parquet")
   WHERE "pickup_at" >= '2019-06-01 00:00:00'
     AND "dropoff_at" <= '2019-06-01 00:02:00'
   ORDER BY "vendor_id" ASC NULLS LAST, "passenger_count" ASC NULLS LAST

  • Second stage (for PostgreSQL):

-> GroupAggregate:
   Output: vendor_id, passenger_count, count(*), avg(trip_distance), array_agg(trip_distance)
   Group Key: taxi_table.vendor_id, taxi_table.passenger_count

It doesn’t seem too serious, but consider a scenario where the grouped query returns very few records, while the ungrouped rows (what DuckDB returns so that PostgreSQL can perform the final grouping) number in the thousands. We would be wasting resources, moving a lot of data between stages, and degrading response time, instead of leveraging Parquet’s columnar layout and DuckDB’s performance, right?

Let’s take a look at the plugin’s source code to see what’s happening. The extension only supports a subset of aggregate functions for pushdown, and “array_agg” is not among them (only “sum”, “min”, “max”, “avg” and “count” are handled). So PostgreSQL can’t delegate the grouping to DuckDB, because the plugin doesn’t support that function. What we’re seeing makes sense.

But why not change this? Adding support for that function to the plugin would be a significant improvement in many scenarios, wouldn’t it? Let’s go for it.

I have implemented the code (available in this pull request), mostly adding support for list types in the plugin, since functions like “array_agg” return arrays of values. Let’s see what happens with the query again.


Great, PostgreSQL now recognizes that the “array_agg” function can also be processed by the extension, so it delegates all the work to the remote server and thus to DuckDB. Everything is executed in a single step, and the crucial part is that the grouping is applied while reading from the Parquet data source.
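
If you want to double-check the arrays themselves, a simple sanity check (just a sketch) is to verify that each group’s array has exactly as many elements as the group has rows:

SELECT vendor_id, passenger_count,
       count(*) AS nr,
       cardinality(array_agg(trip_distance)) AS n_distances
FROM public.taxi_table
WHERE pickup_at >= '2019-06-01 00:00:00'
  AND dropoff_at <= '2019-06-01 00:02:00'
GROUP BY vendor_id, passenger_count;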

Yes, I know, this particular query barely improves in performance, because the amount of data I’m retrieving is minimal; I simply used it as an example for educational purposes. I’m sure you’ll find a use case that benefits from it and shows the improvement :-)

It’s been a fun journey for me, and I hope it has been interesting for you as well. If the code is deemed valuable, I hope it will be merged into the repository.

In conclusion, by integrating DuckDB FDW in PostgreSQL to load Parquet files as foreign tables, we unlock a powerful solution for data analytics. This combination of PostgreSQL’s reliability, Parquet’s storage efficiency, and DuckDB’s query processing speed elevates data management and analysis to a new level. It’s an ideal choice for organizations looking to enhance their data analytics while maintaining their existing PostgreSQL infrastructure.

  • Thanks to the people and organizations who develop amazing open source stuff; PostgreSQL, DuckDB and Parquet are indeed essential components in the world of data tooling, and they’ve made a significant impact on data management and analysis.
  • Thanks to the team behind duckdb_fdw for developing this great piece of software; this integration is very useful!

Open source communities thrive on recognition and support from users like you!

