The movement of data from one place to another is one of the oldest and often, most difficult information technology activities. Add a transformation element, the classic ETL task, and things can become complex.
On Azure, there are primarily two, platform native ways to perform data movement. One method is to use Azure Data Factory, a purpose-built service designed to address the need for a reliable, relatively simple ETL service and Azure Synapse Analytics which offers a pipeline component as a part of the overall suite.
Let’s start with Azure Data Factory (or ADF) which is described by Microsoft as:
“[a] cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores. You can build complex ETL processes that transform data visually with data flows or by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database.”
Below, let’s explore the two ADF deployment scenarios from a cost perspective.
ADF Data Factory Pipeline
In the above, an ADF data factory pipeline is used (the other option is a SQL Server Integration Services pipeline, which we’ll cover next). This is a service that does not utilize a SQL runtime. The cost profile is very low (in the example shown above, the monthly runtime is 3.55 USD per month) and the service does not generate cost when no pipelines are running.
The second ADF pipeline option utilizes SQL Server Integration:
Note that the use of SQL Integration Services (necessary in scenarios requiring SQL compatible data movement using SSIS) incurs a much larger monthly runtime cost. With Azure Hybrid Benefit the monthly runtime as configured is 328.29 USD and without that benefit, almost twice that amount. This increase is caused by the use of a constantly running VM as process host.
It’s important to carefully choose a service based on requirements. Unnecessary costs can quickly mount if you select a more robust pipeline than needed.
Now let’s take a look at Azure Synapse Pipelines.
Azure Synapse Analytics
Microsoft describes Azure Synapse Analytics like this:
“Azure Synapse is an enterprise analytics service that accelerates time to insight across data warehouses and big data systems. Azure Synapse brings together the best of SQL technologies used in enterprise data warehousing, Spark technologies used for big data, Data Explorer for log and time series analytics, Pipelines for data integration and ETL/ELT, and deep integration with other Azure services such as Power BI, CosmosDB, and AzureML.“
Here’s an overview of Azure Synapses from a cost perspective:
Unlike ADF which is specifically offered as a pipeline, ETL service, Azure Synapse pipelines are a component of a larger suite. Considering the often quite high monthly runtime cost of a Synapse environment (typically in the thousands). Here’s a focused view of the pipeline element’s pricing:
Although it’s possible to create a reasonably low cost pipeline scenario (as shown above) it’s important to note that utilizing a data flow – which is standard – significantly increases the monthly runtime cost because of the use of a VM (the baseline configuration utilizes a 8Vcore VM that modifies the runtime to approximately 1,594 USD per month).
Azure Data Factory is a purpose built, specialized platform ETL service that, depending on your requirements, can be very cost effective. Azure Synapse Analytics is a sprawling data warehouse, analytics and machine learning suite which includes a data movement pipeline service as part of its tooling. If you don’t need Synapse, and can’t justify the cost, ADF is a solid choice. Synapse pipelines are well featured but, in my view, should be considered a part of your overall use of Synapse and not an element you use alone.