Ryan's Manuals

Azure Data Factory


2023-01-03

A data engineering platform


What is a Data Factory?

During my time at a client, I made extensive use of Azure Data Factory to move and transform data. This tool, backed by Spark clusters, is highly useful for moving terabytes of data at once or on a schedule.

See all my other Azure notes in the Azure Manual.

Key Concepts:

  1. Pipelines
  2. Activities
  3. Datasets
  4. Linked Services
  5. Data Flows
  6. Integration Runtimes

The relationship between data sets, activities, and pipelines. (Image: microsoft.com)
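
As a rough sketch of how these pieces fit together (every name below is invented for illustration), a pipeline's JSON references its activities, an activity references datasets, and a dataset references the linked service that holds the actual connection:

{
  "name": "pl_copy_customers",
  "properties": {
    "activities": [
      {
        "name": "CopyCustomersToSql",
        "type": "Copy",
        "inputs": [ { "referenceName": "ds_blob_customers_csv", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "ds_sql_customers", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "AzureSqlSink" }
        }
      }
    ]
  }
}

The input dataset then points at a linked service:

{
  "name": "ds_blob_customers_csv",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": { "referenceName": "ls_blob_storage", "type": "LinkedServiceReference" },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "raw",
        "fileName": "customers.csv"
      }
    }
  }
}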

Weird, Wild, and Wacky Limitations

As this technology is built on a foundation of other technologies (Synapse, Spark, etc.), there are a handful of strange limitations that must be considered when using a data factory in your solution architecture. Most of these are not clearly stated at the top of their respective documentation pages; they are usually (intentionally?) pushed to the very end.

Pro Tips

Lessons learned when building and deploying data factory pipelines.

  1. When designing tables for your environments, do not append _dev or _v4 to your table names, stored procedures, or anything else. When copying your templates to production, all of these small strings will need to be corrected.
  2. Make good use of stored procedures to track and trace the data processing and movement within your pipeline. This adds transparency about what has been completed if a pipeline fails.
  3. The default timeouts are usually far too high. Aim to stall for a maximum of ten minutes (beyond typical execution time) before failure, and retry only a few times (see the sketch below).
  4. When viewing a list of pipeline runs
  5. Expect failures in production when first testing pipelines at scale, due to throughput limits on some Azure resources - databases, storage accounts, etc.

If you are ever typing ‘dev’, STOP AND THINK. If your pipelines will be moved between environments, use Global Parameters to hold things like table schemas, blob container names, or other environment-specific values. The only good place to type DEV is in your linked service definitions or global parameters.
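
A hedged sketch of both ideas (the global parameter SchemaName, the parameterised dataset ds_sql_generic, and all other names here are assumptions for illustration): the activity reads its target schema from a global parameter rather than a hard-coded ‘dev’ string, and carries a much tighter timeout/retry policy than the defaults:

{
  "name": "CopyLandingToWarehouse",
  "type": "Copy",
  "policy": {
    "timeout": "0.00:10:00",
    "retry": 2,
    "retryIntervalInSeconds": 60
  },
  "inputs": [ { "referenceName": "ds_blob_landing", "type": "DatasetReference" } ],
  "outputs": [
    {
      "referenceName": "ds_sql_generic",
      "type": "DatasetReference",
      "parameters": {
        "SchemaName": "@pipeline().globalParameters.SchemaName",
        "TableName": "customers"
      }
    }
  ],
  "typeProperties": {
    "source": { "type": "DelimitedTextSource" },
    "sink": { "type": "AzureSqlSink" }
  }
}

Set SchemaName per environment (dev, test, prod) in the factory’s global parameters, and the pipeline JSON itself never has to change between environments.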

Deployment

Data Movement (Copy) Activity

The Copy Activity is great for moving massive amounts of data from a source to a destination with zero transformation and minor filtering on dates. Using it outside of this very specific use case will result in high costs; such work is generally better implemented with an external system.
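
A minimal sketch of that narrow use case (dataset names and the WindowStart parameter are assumptions): a Copy activity pulling only recent rows from a SQL source into Parquet, with the date filter pushed into the source query:

{
  "name": "CopyNewOrders",
  "type": "Copy",
  "inputs": [ { "referenceName": "ds_sql_orders", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "ds_lake_orders_parquet", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": {
        "value": "SELECT * FROM dbo.orders WHERE order_date >= '@{formatDateTime(pipeline().parameters.WindowStart, 'yyyy-MM-dd')}'",
        "type": "Expression"
      }
    },
    "sink": { "type": "ParquetSink" }
  }
}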

Sources/sinks are too numerous to list, but the key ones I know how to use are as follows:

Control Flow Activities

Here is the full list, taken and modified from these docs.

Execute Pipeline

The Execute Pipeline Activity enables the launching of a new pipeline from an initial pipeline. The master pipeline executes a slave pipeline and can optionally wait for it to complete before continuing (the Wait on completion setting).
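
A minimal sketch (the pipeline name and LoadDate parameter are assumptions), showing the waitOnCompletion flag that controls whether the calling pipeline blocks until the child finishes:

{
  "name": "RunChildLoad",
  "type": "ExecutePipeline",
  "typeProperties": {
    "pipeline": { "referenceName": "pl_child_load", "type": "PipelineReference" },
    "waitOnCompletion": true,
    "parameters": {
      "LoadDate": "@pipeline().parameters.LoadDate"
    }
  }
}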

Get Metadata

This is for checking filesystems and querying file metadata. There is a specific exists output field that you can use.
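
A minimal sketch (the dataset name is an assumption) requesting the exists field, which a downstream If Condition can read as @activity('CheckInputFile').output.exists:

{
  "name": "CheckInputFile",
  "type": "GetMetadata",
  "typeProperties": {
    "dataset": { "referenceName": "ds_blob_input_file", "type": "DatasetReference" },
    "fieldList": [ "exists", "lastModified", "size" ]
  }
}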

Lookup

Switch

Just like a lisp cond or C switch.
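
A minimal sketch (the Region parameter and case values are assumptions; the per-case activity lists are left empty for brevity): the on expression is evaluated once and matched against each case, falling through to defaultActivities otherwise:

{
  "name": "RouteByRegion",
  "type": "Switch",
  "typeProperties": {
    "on": {
      "value": "@pipeline().parameters.Region",
      "type": "Expression"
    },
    "cases": [
      { "value": "eu", "activities": [] },
      { "value": "us", "activities": [] }
    ],
    "defaultActivities": []
  }
}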

ForEach

Script

I mostly use this for small SQL operations.
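
A minimal sketch of one of those small SQL operations (the linked service and table names are assumptions); the query results come back on the activity output, e.g. @activity('GetWatermark').output.resultSets[0].rows:

{
  "name": "GetWatermark",
  "type": "Script",
  "linkedServiceName": { "referenceName": "ls_azure_sql", "type": "LinkedServiceReference" },
  "typeProperties": {
    "scripts": [
      {
        "type": "Query",
        "text": "SELECT MAX(load_date) AS last_load FROM abcd.load_log;"
      }
    ]
  }
}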

You may get the error “Argument {0} is null or empty” if your return values do not

Data Transformation Activities

Here is a subset taken directly from these docs.

Making Up for Missing Features

Cartesian Product

Let’s say you had two arrays:

["Bob", "Bill", "Joe"]
[189, 873, 291]

What if you required the product of these two sets for a computation?

Azure Data Factory has no built-in way to produce a Cartesian product of the data above within a single pipeline. It must be sent to an external activity - a data flow, a function, or a set of pipelines. Here I used an Azure SQL stored procedure to do the ‘heavy lifting’ and passed the two arrays to the procedure as JSON encoded in strings.

CREATE PROCEDURE [abcd].[pr_get_global_cartesian_product]
  @JsonArrayA NVARCHAR(MAX),
  @JsonArrayB NVARCHAR(MAX)
AS
BEGIN
  SET NOCOUNT ON;

  DECLARE @A TABLE (val1 NVARCHAR(100));
  DECLARE @B TABLE (val2 NVARCHAR(100));

  IF ISJSON(@JsonArrayA) = 0 OR ISJSON(@JsonArrayB) = 0
  BEGIN
    RAISERROR('One or both input parameters are not valid JSON arrays.', 16, 1);
    RETURN;
  END;

  -- Parse the JSON arrays into table variables
  INSERT INTO @A (val1)
  SELECT value FROM OPENJSON(@JsonArrayA);

  INSERT INTO @B (val2)
  SELECT value FROM OPENJSON(@JsonArrayB);

  -- Return the Cartesian product
  SELECT a.val1 as Name, b.val2 as Number
  FROM @A a
  CROSS JOIN @B b
  ORDER BY a.val1;
END;

-- Example execution:
EXECUTE [abcd].[pr_get_global_cartesian_product]
  @JsonArrayA = '["Bob", "Bill", "Joe"]',
  @JsonArrayB = '["189", "873", "291"]';

This will return the data in roughly this shape in ADF:

[
  { "Name": "Bob", "Number": 189 },
  { "Name": "Bob", "Number": 873 },
  { "Name": "Bob", "Number": 291 },
  { "Name": "Bill", "Number": 189 },
  { "Name": "Bill", "Number": 873 },
  { "Name": "Bill", "Number": 291 },
  { "Name": "Joe", "Number": 189 },
  { "Name": "Joe", "Number": 873 },
  { "Name": "Joe", "Number": 291 }
]
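
For completeness, the procedure can be invoked from a Lookup activity with firstRowOnly set to false, passing the two arrays as JSON-encoded strings. The variable and dataset names below are assumptions, not from the original pipeline:

{
  "name": "GetCartesianProduct",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderStoredProcedureName": "[abcd].[pr_get_global_cartesian_product]",
      "storedProcedureParameters": {
        "JsonArrayA": { "type": "String", "value": "@string(variables('Names'))" },
        "JsonArrayB": { "type": "String", "value": "@string(variables('Numbers'))" }
      }
    },
    "dataset": { "referenceName": "ds_azure_sql_generic", "type": "DatasetReference" },
    "firstRowOnly": false
  }
}

The rows then appear downstream as @activity('GetCartesianProduct').output.value.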

CI/CD

Monitoring, Azure Alerts, Emails

It’s Microsoft, Beware

Here Be Dragons: As with any Microsoft product, there are many bizarre and strange side effects and bugs that will interrupt your workday and make you want to slam your keyboard over your knee. You will be doomed to ceaselessly stumble into nonsensical, undocumented, semi-automatic footguns at an alarming rate. Support will gaslight you into thinking you are the problem until you provide incontrovertible evidence that yet another band-aid must be stuck onto a product made entirely of popsicle sticks, band-aids, and white glue (plus some great open source projects somewhere in that mix).

This doesn’t even get into the seemingly thousands of strange legacy limitations inherent in the Microsoft ecosystem. It’s tough to know about all of them, even as a subject matter expert. Skim all the docs you can ahead of time to learn about as many of the limitations as possible during your design phase.

“Go rewrite it and cry” really should be the byline of most Microsoft cloud editing software. Triply check your work is saved before every action, or it will be lost before you know it. As an adult, to cry is really to loudly shout curses at the ceiling and stare miserably at your keyboard, but that’s a little long to type three times. Save your stuff. OR ELSE.




