Loading your Cosmos DB data into a Synapse Spark Notebook
What is, in my experience, the best way to get data out of Azure Cosmos DB? In this blog, I will provide examples of how to set up a connection to Azure Cosmos DB in Azure Synapse Notebooks and explain the difference between the analytical store and the operational store.
In my current project, I’m working with Azure Cosmos DB in Synapse Notebooks. Within Spark Notebooks we load data from SQL Server and Cosmos DB. After loading, the records are compared for changes and upserted to the Cosmos DB container.
Cosmos DB Authentication
In short, there are two options here: Managed Identity or the Account Key.
I would always recommend using Managed Identity over the Account Key where possible. Once nothing depends on the Account Key, you can disable key-based authentication for your entire Cosmos DB account!
If you do use the Account Key, make sure you rotate the keys periodically.
Unfortunately, at the time of writing, Managed Identity authentication from Synapse Spark Notebooks is not supported.
Managed Identity authentication does work for copy activities in a regular Synapse Pipeline. When using a Synapse Managed Virtual Network and Private Endpoints, no specific RBAC role is needed on the Azure Cosmos DB account to read data. (I am not sure whether this is a bug or by design.)
Synapse Link for Azure Cosmos DB
When extracting data from Cosmos DB, as from any other operational system, we want to prevent the extraction from degrading database performance. Synapse Link gives us the option to create an analytical store: a column store optimized for analytical queries.
Read more on how to enable Synapse Link here:
Azure Synapse Link for Azure Cosmos DB, benefits, and when to use it | Microsoft Learn
Python Spark Examples
These examples show the difference between a direct connection and a connection via a Linked Service. Please note the difference in the ‘.format()’ function between reading and writing: OLAP refers to the Analytical store, which is optimized for reading large amounts of data in bulk; OLTP refers to the Transactional store.
It is not possible to write to the Analytical store directly!
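To make that constraint explicit in notebook code, you could add a small guard that fails fast before a write is attempted. This helper is purely illustrative and not part of the connector; the connector itself would reject such a write at runtime.

```python
def checked_write_format(fmt: str) -> str:
    """Reject the read-only Analytical store format before a write.

    Illustrative helper only: cosmos.olap can only be read from,
    so writes must use cosmos.oltp (the Transactional store).
    """
    if fmt == "cosmos.olap":
        raise ValueError(
            "cosmos.olap is the read-only Analytical store; "
            "use cosmos.oltp to write to the Transactional store")
    return fmt

# Usage in a write chain (hypothetical dataframe name):
# yourdataframe.write.format(checked_write_format("cosmos.oltp"))...
```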
# Read Cosmos DB Analytical store directly via endpoint and account key
cdb_analyticalstore_directly_df = spark.read\
.format("cosmos.olap")\
.option("spark.cosmos.accountEndpoint", cosmosdb_endpoint)\
.option("spark.cosmos.accountKey", cosmosdb_accountkey)\
.option("spark.cosmos.database", cosmosdb_databasename)\
.option("spark.cosmos.container", cosmosdb_container_id)\
.load()
# Read Cosmos DB Analytical store via Synapse Linked Service
cdb_analyticalstore_linkedservice_df = spark.read\
.format("cosmos.olap")\
.option("spark.synapse.linkedService", synapse_linked_service_name)\
.option("spark.cosmos.database", cosmosdb_databasename)\
.option("spark.cosmos.container", cosmosdb_container_id)\
.load()
# Write to Cosmos DB Transactional store via Linked Service
yourdataframe.write.format("cosmos.oltp")\
.option("spark.synapse.linkedService", synapse_linked_service_name)\
.option("spark.cosmos.database", cosmosdb_databasename)\
.option("spark.cosmos.container", cosmosdb_container_id)\
.mode('append')\
.save()
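The same database and container options recur in every read and write above. A small helper that assembles them can reduce copy-paste mistakes and make the linked-service versus account-key choice explicit. This is a sketch of my own; the function name and structure are not part of any Synapse or Cosmos DB API.

```python
def cosmos_options(database, container, linked_service=None,
                   endpoint=None, account_key=None):
    """Build the option dict for a Synapse Cosmos DB read or write.

    Provide either a linked service name, or an endpoint plus account key.
    Illustrative helper; option keys match the examples above.
    """
    opts = {
        "spark.cosmos.database": database,
        "spark.cosmos.container": container,
    }
    if linked_service:
        opts["spark.synapse.linkedService"] = linked_service
    elif endpoint and account_key:
        opts["spark.cosmos.accountEndpoint"] = endpoint
        opts["spark.cosmos.accountKey"] = account_key
    else:
        raise ValueError("Provide a linked service, or an endpoint "
                         "and account key")
    return opts

# Usage with the read example above (names hypothetical):
# df = spark.read.format("cosmos.olap")\
#     .options(**cosmos_options(cosmosdb_databasename, cosmosdb_container_id,
#                               linked_service=synapse_linked_service_name))\
#     .load()
```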
Synapse Serverless SQL Pool provides a great way to query an Azure Cosmos DB Analytical store as well.
Microsoft Documentation
The link below describes this setup as well. Please note that, in my opinion, the article is missing instructions on how to set up your linked service to Cosmos DB, and the code samples lack a database reference (if it is not included in your linked service).
Originally published at http://sidequests.blog on February 20, 2023.