During my time at a client, I made extensive use of Azure Data
Factory to move and transform data. This tool, backed by Spark
clusters, is highly useful for moving terabytes of data at once or
on a schedule.
As this technology is built on a foundation of other technologies (Synapse, Spark, etc.), there are a handful of strange limitations that must be considered when applying a data factory in your solution architectures. Most of these are not clearly stated at the top of
their respective documentation pages, usually being (intentionally?)
pushed to the very end.
- The Lookup Activity supports a max of 5000 rows or 4 MB of data returned
- The GetMetadata Activity supports a max of 5000 items or 4 MB of data returned
- The ForEach Activity supports a max of 100,000 output items
- You can’t nest ForEach activities
- The Function Activity has an unchangeable timeout of 230 seconds¹
- Errors quite often don’t mean what they say
- Carefully investigate and list the limitations of your sources and sinks
- Perform a lot of initial testing to validate your choices in technology
- Many Azure systems and tools have shocking and strange limitations. The only thing you can do is perform a large volume of research: describe your use cases to LLMs and read lots of documentation, as there are too many small limits for even a seasoned subject matter expert to recall at once.
- Default timeouts are set to 12 hours, absurdly high for most activities
Lessons learned when building and deploying Data Factory pipelines:
When designing tables for your environments, do not append _dev or _v4 to your table names, stored procedures, or anything else. When copying your templates to production, all these small strings will need to be corrected.
Make good use of stored procedures to track and trace the data processing and movement within your pipeline. This adds transparency about what has already been completed if a pipeline fails.
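As a minimal sketch of the idea (the table and procedure names here are hypothetical, not from the client project), a Stored Procedure activity at the start and end of each stage can write to a small log table:

-- Hypothetical log table for tracing pipeline progress
CREATE TABLE dbo.pipeline_run_log (
    log_id        INT IDENTITY(1, 1) PRIMARY KEY,
    pipeline_name NVARCHAR(200) NOT NULL,
    stage_name    NVARCHAR(200) NOT NULL,
    status        NVARCHAR(50)  NOT NULL,  -- e.g. 'started', 'succeeded', 'failed'
    row_count     INT           NULL,
    logged_at     DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

-- Called from a Stored Procedure activity before and after each step
CREATE PROCEDURE dbo.pr_log_pipeline_stage
    @PipelineName NVARCHAR(200),
    @StageName    NVARCHAR(200),
    @Status       NVARCHAR(50),
    @RowCount     INT = NULL
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.pipeline_run_log (pipeline_name, stage_name, status, row_count)
    VALUES (@PipelineName, @StageName, @Status, @RowCount);
END;

The pipeline name can be passed in from the pipeline itself (the @pipeline().Pipeline system variable), so one procedure can serve every pipeline.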
The default timeouts are usually way too high. Aim to stall for a maximum of ten minutes (beyond typical execution time) before failing, and retry only a few times.
When viewing a list of pipeline runs
Expect to have failures in production when first testing
pipelines at scale due to throughput limits on some Azure
resources - databases, storage accounts, etc.
If you are ever typing ‘dev’, STOP AND THINK. If your pipelines will be moved between environments, use Global Parameters to hold things like table schemas, blob container names, or other environment-specific values. The only good place to type DEV is in your linked service definitions or global parameters.
The Copy Activity is great for moving massive amounts of data from a source to a destination with zero transformation and minor filtering on dates. Using it outside of this very specific use case will result in high costs, and such work is generally better implemented with an external system.
Insight: With the file path type “list of files” you can provide a list of files with full paths in the filesystem to copy, but cannot provide a destination path. What insane tooling. The docs for this are very disappointing.
“Flatten hierarchy” will give your files weird, long names. Use “preserve hierarchy”.
Sources/sinks are too numerous to list, but the key ones I know how to use are as follows:
Execute Pipeline: Execute Pipeline activity allows a Data Factory or Synapse pipeline to invoke another pipeline.
Filter: Apply a filter expression to an input array.
For Each: ForEach Activity defines a repeating control flow in your pipeline. This activity is used to iterate over a collection and executes specified activities in a loop. The loop implementation of this activity is similar to the Foreach looping structure in programming languages.
Get Metadata: GetMetadata activity can be used to retrieve metadata of any data in a Data Factory or Synapse pipeline.
If Condition Activity: The If Condition can be used to branch based on a condition that evaluates to true or false. The If Condition activity provides the same functionality that an if statement provides in programming languages. It evaluates a set of activities when the condition evaluates to true and another set of activities when the condition evaluates to false.
Lookup Activity: Lookup Activity can be used to read or look up a record/table name/value from any external source. This output can further be referenced by succeeding activities.
Set Variable: Set the value of an existing variable.
Until Activity: Implements a Do-Until loop that is similar to the Do-Until looping structure in programming languages. It executes a set of activities in a loop until the condition associated with the activity evaluates to true. You can specify a timeout value for the Until activity.
Validation Activity: Ensure a pipeline only continues execution if a reference dataset exists, meets specified criteria, or a timeout has been reached.
Wait Activity: When you use a Wait activity in a pipeline, the pipeline waits for the specified time before continuing with execution of subsequent activities.
Web Activity: Web Activity can be used to call a custom REST endpoint from a pipeline. You can pass datasets and linked services to be consumed and accessed by the activity.
Webhook Activity: Using the webhook activity, call an endpoint, and pass a callback URL. The pipeline run waits for the callback to be invoked before proceeding to the next activity.
The Execute Pipeline Activity enables the launching of a new pipeline from an initial pipeline. The master pipeline will execute a slave pipeline, optionally waiting for it to complete before continuing.
The Lookup Activity enables the querying of data from a linked SQL service for further handling. It supports a max of 5000 rows or 4 MB of data, whichever comes first. How useless! How incredibly fucking useless!!! How do they expect us to do big data with this shit?!
Solution: Perhaps use ORDER BY Id with OFFSET/FETCH (or LIMIT, depending on the source) to pull batches of 1000; a sketch follows below.
This may not be so bad as it can guarantee somewhat parallel
execution of groups of up to 5000 items - meaning the maximum time
a pipeline takes to run can be dramatically reduced and a
predictable maximum completion time can be estimated.
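A rough T-SQL sketch of that batching query, assuming a hypothetical source table dbo.source_rows with Id and Payload columns (the offset would be supplied per iteration, for example by a ForEach over the batch numbers):

-- One Lookup iteration: pull a bounded slice of the source table.
-- @Offset steps by @BatchSize per iteration (0, 1000, 2000, ...).
DECLARE @Offset    INT = 0;
DECLARE @BatchSize INT = 1000;  -- stays well under the 5000-row / 4 MB Lookup limit

SELECT Id, Payload
FROM dbo.source_rows
ORDER BY Id
OFFSET @Offset ROWS
FETCH NEXT @BatchSize ROWS ONLY;

Each Lookup then returns at most 1000 rows, keeping every batch safely inside the activity’s limits.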
You can execute in parallel with a maximum parallelism of … 50? Default 20!? Sigh. The batchCount property controls this, and isSequential must be set to false for this to work (false is the default).
Limitations (to fix any of these, use nested pipelines):
- ForEach has a maximum of 100,000 items
- You can’t nest ForEach activities
- You can’t use Set Variable to manage global pipeline variables
What if you required the product of these two sets for a computation? There is literally no way in Azure Data Factory to produce a Cartesian product with the data above within a single pipeline. It must be sent to an external activity - a data flow, function, or set of pipelines. Here I used an Azure SQL stored procedure to do the ‘heavy lifting’ and passed the two arrays to the procedure as JSON-encoded strings.
CREATE PROCEDURE [abcd].[pr_get_global_cartesian_product]
    @JsonArrayA NVARCHAR(MAX),
    @JsonArrayB NVARCHAR(MAX)
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @A TABLE (val1 NVARCHAR(100));
    DECLARE @B TABLE (val2 NVARCHAR(100));

    IF ISJSON(@JsonArrayA) = 0 OR ISJSON(@JsonArrayB) = 0
    BEGIN
        RAISERROR('One or both input parameters are not valid JSON arrays.', 16, 1);
        RETURN;
    END;

    -- Parse the JSON arrays into table variables
    INSERT INTO @A (val1) SELECT value FROM OPENJSON(@JsonArrayA);
    INSERT INTO @B (val2) SELECT value FROM OPENJSON(@JsonArrayB);

    -- Return the Cartesian product
    SELECT a.val1 AS Name, b.val2 AS Number
    FROM @A a
    CROSS JOIN @B b
    ORDER BY a.val1;
END;

-- Example execution:
EXECUTE [abcd].[pr_get_global_cartesian_product]
    @JsonArrayA = '["Bob", "Bill", "Joe"]',
    @JsonArrayB = '["189", "873", "291"]';
This will return the data in roughly this shape in the ADF activity output:
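For the example inputs above, that is nine Name/Number rows (the order of numbers within each name isn’t guaranteed):

Name   Number
Bill   189
Bill   873
Bill   291
Bob    189
Bob    873
Bob    291
Joe    189
Joe    873
Joe    291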
Here Be Dragons: As with any Microsoft product, there are many
bizarre and strange side effects and bugs that will interrupt your
workday and make you want to slam your keyboard over your knee. You
will be doomed to ceaselessly stumble into nonsensical undocumented
semi-automatic footguns at an alarming rate. Support will gaslight you
into thinking you are the problem until you provide incontrovertible
evidence that another band-aid must be stuck on their product entirely
made from popsicle sticks, band-aids, and white glue (plus great open source projects somewhere in that mix).
Rename a file in Azure Data Studio? Lose the contents of the related
buffer and those crucial queries you had written. Go rewrite them and
cry.
Close your browser window? Go rewrite those pipelines and cry.
[Product Name] crashes? Go rewrite your work and cry.
Thought you could run that external thing? No. Barely documented timeout.¹ Redesign your thing and go cry about it.
This doesn’t even get into the seemingly thousands of strange legacy
limitations inherent in the Microsoft ecosystem. It’s tough to know
about all of them, even as a subject matter expert. Skim all the
docs you can ahead of time to learn about as many of the limitations
as possible during your design phase.
“Go rewrite it and cry” really should be the tagline of most Microsoft cloud editing software. Triple-check that your work is saved before every action, or it will be lost before you know it. As an adult, to cry is really to loudly shout curses at the ceiling and stare miserably at your keyboard, but that’s a little long to type three times. Save your stuff. OR ELSE.
¹ “Azure Function activity in Azure Data Factory: Timeout and long-running functions”, microsoft.com