During my time at a client, I made extensive use of Azure Data
Factory to move and transform data. This tool, backed by Spark
clusters, is highly useful for moving terabytes of data at once or
on a schedule.
As this technology is built on a foundation of other technologies (Synapse, Spark, etc.), there are a handful of strange limitations that must be considered when applying Data Factory in your solution architectures:
The Lookup Activity supports a max of 5,000 rows or 4 MB of data returned.
The GetMetadata Activity supports a max of 5,000 items or 4 MB of data returned.
The ForEach Activity supports a max of 100,000 output items.
You can’t nest ForEach activities.
Errors quite often don’t mean what they say.
Carefully investigate and list the limitations of your sources and sinks.
Perform a lot of initial testing to validate your technology choices.
Many Azure systems and tools have shocking and strange limitations. The only thing you can do is perform a large volume of research: describe your use cases to LLMs and read lots of documentation, as there are too many small limits for even a seasoned subject matter expert to recall at once.
Default timeouts are set to 12 hours, absurdly high for most activities.
Lessons learned when building and deploying Data Factory pipelines.
When designing tables for your environments, do not append _dev or _v4 to your table names, stored procedures, or anything else. When copying your templates to production, all these small strings will need to be corrected.
Make good use of stored procedures to track and trace the data processing and movement within your pipeline. This makes it far more transparent what has already completed if a pipeline fails.
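As a rough sketch of what that can look like in the pipeline’s JSON code view (the stored procedure name, its parameters, and the linked service name below are hypothetical, not from this project):

{
    "name": "Log pipeline start",
    "description": "Record this run in a tracking table before any data movement happens.",
    "type": "SqlServerStoredProcedure",
    "linkedServiceName": {
        "referenceName": "ls_sql_metadata",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "storedProcedureName": "[dbo].[usp_LogPipelineRun]",
        "storedProcedureParameters": {
            "PipelineName": { "value": "@pipeline().Pipeline", "type": "String" },
            "RunId": { "value": "@pipeline().RunId", "type": "String" },
            "Status": { "value": "Started", "type": "String" }
        }
    }
}

A matching call with a status of Succeeded or Failed at the end of the pipeline (wired to the success and failure outputs) gives you a queryable history of what ran and how far it got.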
The default timeouts are usually way too high. Aim to time out no more than about ten minutes beyond the typical execution time, and retry only a few times.
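Concretely, execution activities (Copy, Lookup, stored procedure calls, and so on) carry a policy block in their JSON where this lives. A sketch with a ten-minute ceiling and two retries; the numbers are illustrative and should be tuned to your typical run time:

"policy": {
    "timeout": "0.00:10:00",
    "retry": 2,
    "retryIntervalInSeconds": 60,
    "secureOutput": false,
    "secureInput": false
}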
When viewing a list of pipeline runs
Expect to have failures in production when first testing pipelines at scale, due to throughput limits on some Azure resources (databases, storage accounts, etc.).
If you are ever typing ‘dev’, STOP AND THINK. If your pipelines will be moved between environments, use Global Parameters to hold things like table schemas, blob container names, or other environment-specific values. The only good place to type DEV is in your linked service definitions or global parameters.
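For example, a dataset reference whose parameters come from global parameters rather than a hard-coded container or schema name (the dataset and parameter names here are hypothetical):

{
    "referenceName": "ds_curated_parquet",
    "type": "DatasetReference",
    "parameters": {
        "containerName": "@pipeline().globalParameters.CuratedContainerName",
        "schemaName": "@pipeline().globalParameters.SqlSchemaName"
    }
}

The global parameter values are then set per factory (and can be overridden in your deployment templates), so the pipeline itself moves between environments unchanged.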
Insight: with the file path type “list of files” you can provide a list of files with full paths in the filesystem to copy, but cannot provide a destination path. What insane tooling. The docs for this are very disappointing.
“Flatten hierarchy” will give your files weird, long names. Use “preserve hierarchy”.
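A sketch of the relevant Copy activity settings, assuming ADLS Gen2 on both ends and a hypothetical file-list blob; the destination folder comes from the sink dataset, and the files’ relative paths are kept under it when hierarchy is preserved:

"typeProperties": {
    "source": {
        "type": "BinarySource",
        "storeSettings": {
            "type": "AzureBlobFSReadSettings",
            "fileListPath": "metadata/files-to-copy.txt"
        }
    },
    "sink": {
        "type": "BinarySink",
        "storeSettings": {
            "type": "AzureBlobFSWriteSettings",
            "copyBehavior": "PreserveHierarchy"
        }
    }
}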
Sources and sinks are too numerous to list, but the key activities I know how to use are as follows:
Execute Pipeline: allows a Data Factory or Synapse pipeline to invoke another pipeline.
Filter: applies a filter expression to an input array.
ForEach: defines a repeating control flow in your pipeline. It iterates over a collection and executes the specified activities in a loop, similar to the foreach looping structure in programming languages.
Get Metadata: retrieves metadata of any data in a Data Factory or Synapse pipeline.
If Condition: branches based on a condition that evaluates to true or false, providing the same functionality an if statement provides in programming languages. It runs one set of activities when the condition is true and another set when it is false.
Lookup: reads or looks up a record, table name, or value from any external source. The output can be referenced by succeeding activities.
Set Variable: sets the value of an existing variable.
Until: implements a do-until loop, similar to the do-until looping structure in programming languages. It executes a set of activities in a loop until the condition associated with the activity evaluates to true. You can specify a timeout value for the Until activity.
Validation: ensures a pipeline only continues execution if a reference dataset exists, meets specified criteria, or a timeout has been reached.
Wait: makes the pipeline wait for the specified time before continuing with execution of subsequent activities.
Web: calls a custom REST endpoint from a pipeline. You can pass datasets and linked services to be consumed and accessed by the activity.
Webhook: calls an endpoint and passes a callback URL. The pipeline run waits for the callback to be invoked before proceeding to the next activity.
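All of these live in a pipeline’s activities array and are chained together with dependsOn conditions. A minimal skeleton of what the code view of a trivial pipeline looks like (names are placeholders):

{
    "name": "pl_example",
    "properties": {
        "activities": [
            {
                "name": "Wait for source file",
                "type": "Validation",
                "typeProperties": {
                    "dataset": { "referenceName": "ds_source_orders", "type": "DatasetReference" },
                    "timeout": "0.00:10:00"
                }
            },
            {
                "name": "Pause before load",
                "type": "Wait",
                "dependsOn": [
                    { "activity": "Wait for source file", "dependencyConditions": [ "Succeeded" ] }
                ],
                "typeProperties": { "waitTimeInSeconds": 30 }
            }
        ]
    }
}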
The Execute Pipeline Activity enables the launching of a new pipeline from an initial pipeline: the parent pipeline invokes a child pipeline and can optionally wait for it to complete before continuing.
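In the code view that looks roughly like the following (pipeline and parameter names are placeholders); waitOnCompletion controls whether the parent blocks until the child finishes:

{
    "name": "Run load for one table",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": { "referenceName": "pl_load_table", "type": "PipelineReference" },
        "parameters": { "TableName": "@pipeline().parameters.TableName" },
        "waitOnCompletion": true
    }
}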
The Lookup Activity enables the querying of data from a linked SQL service for further handling. It supports a max of 5,000 rows or 4 MB of data, whichever comes first. How useless! How incredibly fucking useless!!! How do they expect us to do big data with this shit?!
Solution: perhaps use ORDER BY Id with OFFSET/FETCH (or LIMIT, where the source supports it) to do batches of 1,000.
This may not be so bad, as it allows somewhat parallel execution of groups of up to 5,000 items, meaning the maximum time a pipeline takes to run can be dramatically reduced and a predictable maximum completion time can be estimated.
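One way to express that paging, sketched as a Lookup whose query is built from a batch number supplied by a surrounding ForEach (see the sketch in the next section). Table, column, and dataset names are hypothetical, and Azure SQL wants OFFSET/FETCH rather than LIMIT:

{
    "name": "Lookup one batch",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": {
                "value": "@concat('SELECT Id, FilePath FROM dbo.FilesToProcess ORDER BY Id OFFSET ', string(mul(item(), 5000)), ' ROWS FETCH NEXT 5000 ROWS ONLY')",
                "type": "Expression"
            }
        },
        "dataset": { "referenceName": "ds_sql_metadata", "type": "DatasetReference" },
        "firstRowOnly": false
    }
}

Keep an eye on row width: 5,000 narrow rows is fine, but wide rows can hit the 4 MB cap first, in which case shrink the batch size (1,000 is a safe start).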
You can execute in parallel with a maximum parallelism of… 50? Default 20!? Sigh. The batchCount property controls this, and isSequential must be set to false for it to work, which is the default.
Limitations (to work around any of these, use nested pipelines):
ForEach has a maximum of 100,000 items.
You can’t nest ForEach activities.
You can’t safely use Set Variable inside a parallel ForEach, since variables are global to the pipeline.
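A sketch of the surrounding ForEach for the batching approach above: twenty batch numbers, run in parallel, staying well under the 100,000-item cap (the numbers are illustrative):

{
    "name": "For each batch",
    "type": "ForEach",
    "typeProperties": {
        "items": { "value": "@range(0, 20)", "type": "Expression" },
        "isSequential": false,
        "batchCount": 20,
        "activities": [ ]
    }
}

The paged Lookup from the previous sketch (plus whatever consumes its output) goes inside activities; 20 batches of 5,000 rows covers up to 100,000 source rows, and batchCount can be raised as far as 50 if the sink can take it.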
Here Be Dragons: As with any Microsoft product, there are many
bizarre and strange side effects and bugs that will interrupt your
workday and make you want to slam your keyboard over your knee.
Rename a file in Azure Data Studio? Lose the contents of the related
buffer and those crucial queries you had written. Go rewrite them and
cry.
Close your browser window? Go rewrite those pipelines and cry.
[Product Name] crashes? Go rewrite your work and cry.
This doesn’t even get into the seemingly thousands of strange legacy limitations inherent in the Microsoft ecosystem. It’s tough to know about all of them, even as a subject matter expert.
“Go rewrite it and cry” really should be the byline of most Microsoft cloud editing software. Triple-check that your work is saved before every action, or it will be lost before you know it. As an adult, to cry is really to loudly shout curses at the ceiling and stare miserably at your keyboard, but that’s a little long to type three times. Save your stuff. OR ELSE.