Merge Delta Lake tables with Azure Functions (delta-rs + polars)

Arthur Steijn
5 min read · Jan 29, 2024


Imagine a scenario where you want to create a Delta table, but you don’t have access to a Spark cluster. Or maybe your table receives a lot of small, frequent updates. Or, in my case, you just want to explore a different way of creating and updating Delta tables.

This blog is an example implementation of the delta-rs and polars libraries on an Azure Function App. The solution was designed from an ELT point of view. You might implement this as part of an Azure Data Factory pipeline, where you first copy data from a source system to a data lake storage container. Alternatively, you could implement a Blob Storage trigger in the Function App or within Azure Data Factory. I chose Data Factory because it is the go-to tool for ETL orchestration. Also, we need the primary keys to merge the table! We call the function and pass the primary keys via the HTTP trigger, along with the other parameters needed to create Delta tables on our data lake.

Data Factory example

The Data Factory implementation is pretty straightforward. This setup was developed against the Microsoft AdventureWorks sample database.

The Copy Data activity copies data from an Azure SQL Database and writes it as a CSV file to an Azure Storage Account.

At the same time, the Lookup activity retrieves the primary keys for the table being copied from the same source database.

After this, we call the Azure Function’s HTTP endpoint and pass the source path in the data lake where the file was written, the target path where we want to create the Delta table, and the primary keys.
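For illustration, calling the function from outside Data Factory could look like the sketch below. The endpoint name, parameter names (source_path, target_path, primary_keys), and paths are assumptions based on my setup; check the repo for the actual request contract.

```python
import requests

# Hypothetical request body; parameter and endpoint names are assumptions,
# see the GitHub repo for the actual contract.
payload = {
    "source_path": "landing/SalesLT/SalesOrderDetail.csv",
    "target_path": "delta/SalesLT/SalesOrderDetail",
    "primary_keys": ["SalesOrderID", "SalesOrderDetailID"],
}

response = requests.post(
    "https://<function-app>.azurewebsites.net/api/merge_delta",
    params={"code": "<function-key>"},  # function-level auth key
    json=payload,
    timeout=600,
)
print(response.json())  # merge metrics returned by the function
```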

The first time the function runs, a new Delta table is created. After that, each run updates the existing Delta table.
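A minimal sketch of this create-or-merge flow, using the deltalake and polars Python packages (the function signature and parameter names are my simplification here; the actual implementation is in the repo, and ADLS credentials go in storage_options):

```python
import polars as pl
from deltalake import DeltaTable, write_deltalake
from deltalake.exceptions import TableNotFoundError

def create_or_merge(source_path: str, target_path: str,
                    primary_keys: list[str], storage_options: dict) -> dict:
    # Read the CSV file that the Copy Data activity landed on the data lake.
    df = pl.read_csv(source_path, storage_options=storage_options)

    try:
        dt = DeltaTable(target_path, storage_options=storage_options)
    except TableNotFoundError:
        # First run: the Delta table does not exist yet, so create it.
        write_deltalake(target_path, df.to_arrow(), storage_options=storage_options)
        return {"action": "create", "num_rows": df.height}

    # Subsequent runs: merge the new data on the primary keys.
    predicate = " AND ".join(f"t.{k} = s.{k}" for k in primary_keys)
    return (
        dt.merge(source=df.to_arrow(), predicate=predicate,
                 source_alias="s", target_alias="t")
        .when_matched_update_all()
        .when_not_matched_insert_all()
        .execute()
    )
```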

The function returns the result of the deltalake merge statement, which contains information on the action taken and the number of records affected. Also, take a look at the execution time in milliseconds!
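For reference, the metrics dictionary that delta-rs returns from execute() looks roughly like this; the values below are made-up illustrations, and the exact keys vary by delta-rs version:

```python
# Illustrative output only, not real measurements.
{
    "num_source_rows": 100000,
    "num_target_rows_inserted": 1500,
    "num_target_rows_updated": 98500,
    "num_target_rows_copied": 4200,
    "num_output_rows": 104200,
    "execution_time_ms": 1834,
}
```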

Performance

Now the fun part: let’s see if we can break this!

The Function App runs on the Consumption plan; for more information, see the quotas and limits. If you want to integrate this function into a virtual network, you would have to upgrade to at least the Premium plan.

Increasing the load

To test the performance, I enlarged the SalesOrderHeader and SalesOrderDetail tables in the AdventureWorks demo database. I did this with the help of this script, though it needed some small adjustments to work on an Azure SQL database.

I gradually increased the number of rows in the SalesOrderDetail table and ran the Data Factory pipeline to find the breaking point of the Function.

I did not clean and recreate the target Delta table between runs; I only increased the number of rows in the database and merged them into the Delta table.

*I performed this last merge from my local machine, to see whether the underlying hardware was the problem. My guess is that the resources of the Function App service plan were no longer sufficient, because locally I was able to perform the merge.

Overall, I am really impressed with the results. I would never have expected to be able to merge 1 million rows within such a decent amount of time.

When designing this solution, I assumed it would only work for smaller, frequent updates: say, 10k to 100k new records merged every 15 minutes.

For Each Table in SalesLT

After increasing the number of rows for the SalesOrderDetailEnlarged table, I created a For Each loop in Data Factory and triggered the processing of all tables in the SalesLT schema in parallel.

The funny thing is that the merge that failed in the previous step succeeded today; reason unknown. There was one other table that caused problems when reading it into a data frame, something related to the i64 datatype.
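If you run into a similar type-inference problem, one workaround is to pin the offending column’s dtype when reading the CSV; the column name below is purely hypothetical:

```python
import polars as pl

# Force a problematic column to a known dtype instead of relying on inference.
# "OrderQty" is just an example; use whichever column fails for you.
df = pl.read_csv(
    "SalesOrderDetail.csv",
    schema_overrides={"OrderQty": pl.Int64},  # older polars versions: dtypes=
    infer_schema_length=10000,  # scan more rows before inferring the schema
)
```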

Below is the result of the for-each-table pipeline. All tables were copied and merged within 1.5 minutes, including the 6.2 million records of SalesOrderDetailEnlarged!

Thanks a lot for reading. Building and writing this takes a lot of time, so please leave your comments and questions below!

Also, I would really like to thank the maintainers and contributors of the delta-rs and polars libraries. Keep up the great work!

GitHub Repo with the Function App source code

This post describes a way to implement the delta-rs library on an Azure Function App to create and merge Delta tables without the use of a Spark cluster, and a way to leverage this Function App from an ETL perspective with Azure Data Factory.

The source code for the Function App, along with example parameters, can be found in this GitHub repo.

Reference documentation

Create a Python function using Visual Studio Code — Azure Functions | Microsoft Learn

delta-io/delta-rs: A native Rust library for Delta Lake, with bindings into Python (github.com)

Polars — Dataframes for the new era

Originally published at http://sidequests.blog on January 29, 2024.
