Loading Parquet in PostgreSQL via DuckDB: Testing queries and exploring the Core


In the realm of data management and analytics, PostgreSQL has long been a popular choice as an open-source relational database management system. It offers robust features for handling structured data. However, when it comes to managing and analyzing large volumes of data efficiently, other technologies like Parquet and DuckDB have made their mark.

In this article, we’ll delve into the process of integrating DuckDB into PostgreSQL to load Parquet files as foreign tables, providing a powerful solution for data analytics.

One of the crucial elements that make this integration possible is DuckDB FDW (Foreign Data Wrapper), an open-source PostgreSQL extension that lets us use the DuckDB engine from within PostgreSQL.

Before we embark on our exploration, let’s briefly introduce the key components involved:

  1. PostgreSQL: PostgreSQL is a powerful, open-source relational database management system known for its reliability and scalability. It excels in handling structured data and is a top choice for many organizations.
  2. Parquet: Parquet is an open-source columnar storage file format that excels in storing and processing large datasets efficiently. It is particularly well-suited for analytical workloads and is a popular choice for data warehousing and data lakes.
  3. DuckDB: DuckDB is another open-source project that focuses on analytical query processing. It is designed for high-performance analytical workloads, making it an ideal candidate for accelerating complex queries.
  4. DuckDB FDW: Foreign data wrapper extension to connect PostgreSQL to DuckDB databases.

The integration of DuckDB into PostgreSQL allows you to load Parquet files as foreign tables. This brings the power of both DuckDB’s query processing capabilities and Parquet’s efficient storage format to PostgreSQL.

In the fantastic article by Paul Ramsey, we can learn about a similar integration, in that case using the parquet-fdw plugin to load Parquet into the database.

In this article, we will utilize the DuckDB FDW to load data via the DuckDB engine and harness the excellent data access performance that this in-process SQL OLAP database offers.

To test the integration, we need input data with a substantial number of records so we can evaluate performance. We will use a subset of the well-known ‘NYC Taxi Data’ dataset, consisting of three files named “taxi_2019_XX.parquet”, which are available for download from the following link.

Let’s proceed with the process!

Step 1: Build & install DuckDB FDW into PostgreSQL

We begin by installing DuckDB on our system and the PostgreSQL extension. Detailed installation instructions are available on the related repo.

Step 2: Configure PostgreSQL

In this step, we configure PostgreSQL to enable the loading of Parquet data as foreign tables. We create a server for DuckDB, specifying connection details to our DuckDB instance.

CREATE EXTENSION duckdb_fdw;
CREATE SERVER duckdb_svr
FOREIGN DATA WRAPPER duckdb_fdw OPTIONS (database ':memory:');

The special value :memory: creates an in-memory database where no data is persisted to disk (i.e., all data is lost when the process exits). This setting is right for our use case, since we only want to read Parquet files.
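
If you wanted the DuckDB catalog to persist between sessions, the same database option can point to a file on disk instead. A minimal sketch (the server name and path below are hypothetical examples):

-- Hypothetical: a file-backed DuckDB database instead of ':memory:'
CREATE SERVER duckdb_file_svr
FOREIGN DATA WRAPPER duckdb_fdw OPTIONS (database '/home/xyz/duckdb/analytics.duckdb');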

Step 3: Create a Foreign Table

With the configuration in place, we create a foreign table in PostgreSQL that represents the Parquet dataset. This foreign table establishes a seamless connection to DuckDB, providing direct access to the data within the Parquet files.

CREATE FOREIGN TABLE public.taxi_table (
  vendor_id text,
  pickup_at timestamp,
  dropoff_at timestamp,
  passenger_count int,
  trip_distance double precision,
  rate_code_id text,
  store_and_fwd_flag text,
  pickup_location_id int,
  dropoff_location_id int,
  payment_type text,
  fare_amount double precision,
  extra double precision,
  mta_tax double precision,
  tip_amount double precision,
  tolls_amount double precision,
  improvement_surcharge double precision,
  total_amount double precision,
  congestion_surcharge double precision
)
SERVER duckdb_svr
OPTIONS (
  table 'read_parquet("/home/xyz/Downloads/taxi-data/*.parquet")'
);
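
If you are unsure which column names and types to declare in the foreign table, DuckDB itself can tell you. A quick sketch, run from the DuckDB CLI against the same files (the path matches the one used above):

-- DuckDB CLI: show the schema DuckDB infers from the Parquet files
DESCRIBE SELECT * FROM read_parquet('/home/xyz/Downloads/taxi-data/*.parquet');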

Step 4: Query the Data

Now that we have set up the foreign table, we can run SQL queries against it, benefiting from DuckDB’s outstanding analytical query processing performance.

SELECT count(*) FROM public.taxi_table;

Wow! 21 million records in less than 400 MB of Parquet files. Impressive data compression.


We’re going to check the SQL execution plan to see what PostgreSQL is doing.

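A minimal sketch of that check, using the foreign table created above (the VERBOSE option should reveal the query the FDW ships to DuckDB):

EXPLAIN (VERBOSE, COSTS OFF)
SELECT count(*) FROM public.taxi_table;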

Great, PostgreSQL completely delegates the execution to the extension, and therefore to DuckDB. Response time is fantastic, well done DuckDB!

Let’s see what’s in the table.

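Something like this is enough for a quick peek (the column selection here is arbitrary):

SELECT vendor_id, pickup_at, dropoff_at, passenger_count, trip_distance
FROM public.taxi_table
LIMIT 10;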

Let’s try something more complex…

SELECT vendor_id, passenger_count, count(*)
FROM public.taxi_table
GROUP BY vendor_id, passenger_count
ORDER BY vendor_id, passenger_count;
SELECT *
FROM public.taxi_table
WHERE vendor_id = '1'
  AND pickup_at >= '2019-06-01 00:00:00'
  AND dropoff_at <= '2019-06-02 00:00:00';

Well, around 4 seconds to retrieve 90,000 records is not bad, considering that the data is not partitioned (for example, by the ‘vendor_id’ column), that I’m retrieving all fields from the table (not the best practice; you should always fetch only what you need, as sketched below), and that the files sit on my old laptop with a 9-year-old SATA disk that has seen better days ;-)
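
As a sketch of that advice (the column choice is arbitrary), the same filter with a narrower projection would look like this:

SELECT vendor_id, pickup_at, dropoff_at, trip_distance, total_amount
FROM public.taxi_table
WHERE vendor_id = '1'
  AND pickup_at >= '2019-06-01 00:00:00'
  AND dropoff_at <= '2019-06-02 00:00:00';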

Let’s check the SQL execution plan of the SELECT * query above.


Alright, everything is delegated to DuckDB; PostgreSQL just waits for the results.

Let’s try an aggregate function in the SELECT clause:

SELECT vendor_id,
       min(passenger_count) AS min_pgc,
       max(passenger_count) AS max_pgc,
       count(*)
FROM public.taxi_table
GROUP BY vendor_id
ORDER BY vendor_id;

Everything’s in order.

Step 5: When things are not so pretty

Let’s see what PostgreSQL does with this query:

SELECT vendor_id,
       passenger_count,
       count(*) AS nr,
       avg(trip_distance) AS avg_dist,
       array_agg(trip_distance) AS distances
FROM public.taxi_table
WHERE pickup_at >= '2019-06-01 00:00:00'
  AND dropoff_at <= '2019-06-01 00:02:00'
GROUP BY vendor_id, passenger_count
ORDER BY vendor_id, passenger_count;

Perfect results, but let’s check the execution plan.

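Here is a sketch of that check (again, VERBOSE should expose what is sent to DuckDB):

EXPLAIN (VERBOSE, COSTS OFF)
SELECT vendor_id, passenger_count, count(*) AS nr,
       avg(trip_distance) AS avg_dist,
       array_agg(trip_distance) AS distances
FROM public.taxi_table
WHERE pickup_at >= '2019-06-01 00:00:00'
  AND dropoff_at <= '2019-06-01 00:02:00'
GROUP BY vendor_id, passenger_count
ORDER BY vendor_id, passenger_count;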

PostgreSQL has broken the query down into two stages: the first, sent to DuckDB, applies only the WHERE clause without grouping; the second is executed by the PostgreSQL engine, which applies the grouping to the subset of records returned by the first stage:

  • First stage in the SQL execution plan (for DuckDB):

-> Foreign Scan:
   SELECT "vendor_id", "passenger_count", "trip_distance"
   FROM read_parquet("/data/taxi-data/*.parquet")
   WHERE "pickup_at" >= '2019-06-01 00:00:00'
     AND "dropoff_at" <= '2019-06-01 00:02:00'
   ORDER BY "vendor_id" ASC NULLS LAST, "passenger_count" ASC NULLS LAST

  • Second stage (for PostgreSQL):

-> GroupAggregate:
   Output: vendor_id, passenger_count, count(*), avg(trip_distance), array_agg(trip_distance)
   Group Key: taxi_table.vendor_id, taxi_table.passenger_count

It doesn’t seem too serious, but consider a scenario where the grouped query returns very few records, while the ungrouped rows (what DuckDB returns so that PostgreSQL can perform the final grouping) number in the thousands. We would be wasting resources, moving a lot of data between stages, and degrading response time, instead of leveraging Parquet’s columnar layout and DuckDB’s performance, right?

Let’s take a look at the plugin’s source code to see what’s happening. The extension only supports a subset of aggregate functions for pushdown, and “array_agg” is not among them (only “sum”, “min”, “max”, “avg” and “count” are handled). So PostgreSQL can’t delegate the grouping to DuckDB, because the plugin doesn’t support that function. What we’re seeing makes sense.

But why not change this? Adding support for that function to the plugin would be a significant improvement in many scenarios, wouldn’t it? Let’s go for it.

I have implemented the code (available in this pull request), mostly adding support for list types in the plugin, since functions like “array_agg” return arrays of values. Let’s see what happens with the query again.


Great, PostgreSQL now recognizes that the “array_agg” function can also be processed by the extension, so it delegates all the work to the remote server and thus to DuckDB. Everything is executed in a single step, and the crucial part is that the grouping is applied while reading from the Parquet data source.
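
If you want to double-check the arrays themselves, a simple sanity check (just a sketch) is to verify that each group’s array has exactly as many elements as the group has rows:

SELECT vendor_id, passenger_count,
       count(*) AS nr,
       cardinality(array_agg(trip_distance)) AS n_distances
FROM public.taxi_table
WHERE pickup_at >= '2019-06-01 00:00:00'
  AND dropoff_at <= '2019-06-01 00:02:00'
GROUP BY vendor_id, passenger_count;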

Yes, I know, this particular query barely improves in performance, because the amount of data I’m retrieving is minimal; I simply used it as an example for educational purposes. I’m sure you’ll find a use case that benefits from it and shows the improvement :-)

It’s been a fun journey for me, and I hope it has been interesting for you as well. If the code is deemed valuable, I hope it will be merged into the repository.

In conclusion, by integrating DuckDB FDW in PostgreSQL to load Parquet files as foreign tables, we unlock a powerful solution for data analytics. This combination of PostgreSQL’s reliability, Parquet’s storage efficiency, and DuckDB’s query processing speed elevates data management and analysis to a new level. It’s an ideal choice for organizations looking to enhance their data analytics while maintaining their existing PostgreSQL infrastructure.

  • Thanks to the people and organizations who develop amazing open source stuff; PostgreSQL, DuckDB and Parquet are indeed essential components in the world of data tooling, and they’ve made a significant impact on data management and analysis.
  • Thanks to the team behind duckdb_fdw for developing this great piece of software; this integration is very useful!

Open source communities thrive on recognition and support from users like you!

