Azure Data Factory

2023-01-03

A data engineering platform


What is a Data Factory?

During my time at a client, I made extensive use of Azure Data Factory to move and transform data. This tool, backed by Spark clusters, is highly useful for moving terabytes of data at once or on a schedule.

See all my other Azure notes in the Azure Manual.

Key Concepts:

  1. Pipelines
  2. Activities
  3. Datasets
  4. Linked Services
  5. Data Flows
  6. Integration Runtimes

The relationship between datasets, activities, and pipelines (diagram from microsoft.com).
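To make the diagram concrete, here is a minimal sketch of that chain as Python dicts mirroring the JSON ADF stores: a dataset points at a linked service, and a pipeline's activities reference datasets. All of the names (MyBlobLinkedService, SourceBlobDataset, DailyLoad, and so on) are hypothetical.

```python
# Hypothetical names throughout; only the reference chain matters.
dataset = {
    "name": "SourceBlobDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {                       # dataset -> linked service
            "referenceName": "MyBlobLinkedService",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "raw",
                "fileName": "extract.csv",
            }
        },
    },
}

pipeline = {
    "name": "DailyLoad",
    "properties": {
        "activities": [
            {
                "name": "CopyExtract",
                "type": "Copy",
                "inputs": [                           # activity -> dataset
                    {"referenceName": "SourceBlobDataset", "type": "DatasetReference"}
                ],
                "outputs": [
                    {"referenceName": "SinkSqlDataset", "type": "DatasetReference"}
                ],
                # typeProperties (source/sink) omitted here; see the Copy
                # Activity section below.
            }
        ]
    },
}
```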

Weird ADF Limitations

Because ADF is built on a foundation of other technologies (Synapse, Spark, etc.), there are a handful of strange limitations that must be considered when applying a data factory in your solution architecture.

Pro Tips

Lessons learned when building and deploying data factory pipelines.

  1. When designing tables for your environments, do not append _dev or _v4 to your table names, stored procedures, or anything else. When copying your templates to production, all of these small strings will need to be corrected.
  2. Make good use of stored procedures to track and trace the data processing and movement within your pipeline. This adds to the transparency of what has been completed if a pipeline fails.
  3. The default timeouts are usually far too high. Set each activity's timeout so it fails within roughly ten minutes beyond its typical execution time, and retry only a few times (see the policy sketch after this list).
  4. When viewing a list of pipeline runs
  5. Expect to have failures in production when first testing pipelines at scale due to throughput limits on some Azure resources - databases, storage accounts, etc.
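For tip 3, the knobs live in each activity's policy block. A minimal sketch of what I mean, assuming the ten-minute rule of thumb above (the values are illustrative, not ADF defaults):

```python
# Hypothetical activity policy: fail fast and retry only a couple of times
# instead of accepting the very long default timeout.
activity_policy = {
    "timeout": "0.00:10:00",        # d.hh:mm:ss -> give up after ~10 minutes
    "retry": 2,                     # repeat only a few times
    "retryIntervalInSeconds": 60,   # wait a minute between attempts
}

copy_activity = {
    "name": "CopyExtract",          # hypothetical activity
    "type": "Copy",
    "policy": activity_policy,      # policy is attached per activity
    # inputs/outputs/typeProperties omitted
}
```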

If you are ever typing ‘dev’, STOP AND THINK. If your pipelines will be moved between environments, use Global Parameters to hold things like table schemas, blob container names, or other environment-specific variables. The only good place to type DEV is in your linked service definitions or global parameters.
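A sketch of what that looks like, assuming hypothetical parameter and schema names: declare one global parameter per environment-specific value at the factory level, then reference it from expressions instead of typing the string in every pipeline.

```python
# Hypothetical global parameters, overridden per environment (dev/test/prod)
# at the factory level. This is the one place the 'dev' suffix lives.
global_parameters = {
    "tableSchema": {"type": "String", "value": "sales_dev"},
    "rawContainer": {"type": "String", "value": "raw-dev"},
}

# Inside an activity, reference the parameter with the expression language
# rather than hard-coding the environment name:
table_name_expression = (
    "@concat(pipeline().globalParameters.tableSchema, '.DailyExtract')"
)
```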

Deployment

Data Movement (Copy) Activity

Literally just the Copy Activity.
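As a sketch, a Copy activity pairs a typed source with a typed sink, each tied to a dataset; the dataset names below are hypothetical, and I am assuming a delimited-text blob source landing in an Azure SQL table.

```python
# Hypothetical Copy activity: delimited text in blob storage -> Azure SQL.
copy_activity = {
    "name": "CopyBlobToSql",
    "type": "Copy",
    "inputs": [{"referenceName": "SourceBlobDataset", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "SinkSqlDataset", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "DelimitedTextSource"},   # matches the input dataset type
        "sink": {"type": "AzureSqlSink"},            # matches the output dataset type
    },
}
```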

Sources/sinks are too numerous to list, but the key ones I know how to use are as follows:

Control Flow Activities

Here is the full list, taken and modified from these docs.

Execute Pipeline

The Execute Pipeline Activity enables the launching of a new pipeline from an initial pipeline. The parent (master) pipeline invokes the child pipeline and can optionally wait for it to complete before continuing.
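A sketch of the activity body, assuming a hypothetical child pipeline called ChildLoad and a runDate parameter handed down from the parent:

```python
# Hypothetical Execute Pipeline activity: invoke ChildLoad and wait for it.
execute_pipeline = {
    "name": "RunChildLoad",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": {"referenceName": "ChildLoad", "type": "PipelineReference"},
        "waitOnCompletion": True,   # block until the child pipeline finishes
        "parameters": {
            "runDate": "@pipeline().parameters.runDate"   # pass a value down
        },
    },
}
```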

Get Metadata

This is for checking filesystems and querying file metadata. There is a specific exists output that you can use.
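A sketch of requesting the exists field and branching on it afterwards; the dataset and activity names are hypothetical:

```python
# Hypothetical Get Metadata activity: ask whether an input file exists.
get_metadata = {
    "name": "CheckInputFile",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {"referenceName": "InputFileDataset", "type": "DatasetReference"},
        "fieldList": ["exists", "lastModified"],   # request the 'exists' output
    },
}

# A downstream If Condition can then branch on the result:
if_expression = "@activity('CheckInputFile').output.exists"
```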

Lookup

Switch

Just like a Lisp cond or a C switch.
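A sketch, switching on a hypothetical region parameter; the case bodies are left empty for brevity:

```python
# Hypothetical Switch activity: route execution based on a pipeline parameter.
switch_activity = {
    "name": "RouteByRegion",
    "type": "Switch",
    "typeProperties": {
        "on": {
            "value": "@pipeline().parameters.region",   # value being switched on
            "type": "Expression",
        },
        "cases": [
            {"value": "emea", "activities": []},   # activities omitted for brevity
            {"value": "apac", "activities": []},
        ],
        "defaultActivities": [],                   # fall-through branch
    },
}
```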

ForEach

Script

I mostly use this for small SQL operations.

You may get the error “Argument {0} is null or empty” if your return values do not exist or come back empty.
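A sketch of a Script activity running one small query, plus the kind of expression that reads its output downstream; the linked service, table, and activity names are hypothetical, and the resultSets/rows path is how Script output is generally addressed:

```python
# Hypothetical Script activity: run one small query against Azure SQL.
script_activity = {
    "name": "GetWatermark",
    "type": "Script",
    "linkedServiceName": {
        "referenceName": "AzureSqlLinkedService",
        "type": "LinkedServiceReference",
    },
    "typeProperties": {
        "scripts": [
            {
                "type": "Query",   # 'Query' returns rows; 'NonQuery' does not
                "text": "SELECT MAX(LoadDate) AS LastLoad FROM etl.Watermark",
            }
        ]
    },
}

# Downstream expression pulling the single returned value. If the query
# returns nothing, expressions like this are what surface the
# "Argument {0} is null or empty" style of failure.
last_load_expression = (
    "@activity('GetWatermark').output.resultSets[0].rows[0].LastLoad"
)
```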

Data Transformation Activities

Here is a subset taken directly from these docs.

Monitoring, Azure Alerts, Emails

It’s Microsoft, Beware

Here Be Dragons: As with any Microsoft product, there are many bizarre and strange side effects and bugs that will interrupt your workday and make you want to slam your keyboard over your knee.

This doesn’t even get into the seemingly thousands of strange legacy limitations inherent in the Microsoft ecosystem. It’s tough to know about all of them, even as a subject matter expert.

“Go rewrite it and cry” really should be the byline of most Microsoft cloud editing software. Triply check your work is saved before every action, or it will be lost before you know it. As an adult, to cry is really to loudly shout curses at the ceiling and stare miserably at your keyboard, but that’s a little long to type three times. Save your stuff. OR ELSE.


