Durable Functions Upgrade Strategies

One of the challenges of using Durable Functions is how to handle upgrading a workflow to a new version without breaking any in-progress orchestrations.

Example 1 - inserting a new step

For example, suppose you've created a media processing workflow for uploaded videos. Step [a] performs a virus scan, step [b] performs a transcode, and step [c] extracts a thumbnail image. So far nice and simple.

[a]-->[b]-->[c]

But what if a new requirement comes along and I need to perform two transcodes at different bitrates, to create both a high and a low resolution copy?

I could simply add the new step [d] onto the end of the workflow. This is a relatively safe change and any workflows that were started in the previous version should just be able to carry on with the new orchestrator and will successfully call the new step.

[a]-->[b]-->[c]-->[d]

But maybe I want to add the new step in parallel with the original transcode to create a workflow like this:

     +-->[d]---+
     |         |
[a]--+-->[b]---+-->[c]

Now we are going to confuse Durable Functions if we upgrade while an orchestration is in progress. The order of operations has fundamentally changed, so when the orchestrator is replayed, the event-sourced history of the in-progress orchestration no longer maps onto the new orchestrator function.
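To see why the parallel version breaks, here's a minimal sketch in Python of how replaying against a recorded history works. It is a toy model, not the Durable Functions API: the replay function, the NonDeterminismError exception and the generator-style orchestrators are all names I've made up for illustration, and the parallel fan-out is flattened into a simple sequence in which [d] is requested before [b].

```python
class NonDeterminismError(Exception):
    """Raised when replayed code asks for a step the history did not record."""


def replay(orchestrator, history):
    """Replay an orchestrator (a generator yielding activity names) against
    a recorded history of (activity_name, result) pairs."""
    gen = orchestrator()
    position = 0
    try:
        requested = next(gen)
        while True:
            if position >= len(history):
                # History exhausted: this is where real work would resume.
                return f"resume at {requested}"
            recorded_name, recorded_result = history[position]
            if recorded_name != requested:
                raise NonDeterminismError(
                    f"history has {recorded_name!r} at step {position}, "
                    f"but code asked for {requested!r}"
                )
            position += 1
            # Feed the recorded result back in instead of re-running the step.
            requested = gen.send(recorded_result)
    except StopIteration:
        return "completed"


def orchestrator_v1():
    yield "a"
    yield "b"
    yield "c"


def orchestrator_appended():
    # New step [d] added onto the end of the workflow.
    yield "a"
    yield "b"
    yield "c"
    yield "d"


def orchestrator_parallel():
    # Simplification: the fan-out is modelled as [d] being requested before [b].
    yield "a"
    yield "d"
    yield "b"
    yield "c"
```

Replaying an instance that has already completed [a] and [b] against orchestrator_v1 or orchestrator_appended simply resumes at [c]. Replaying the same history against orchestrator_parallel raises NonDeterminismError, because the history holds [b] where the new code asks for [d].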

Example 2 - modifying an activity

There are other, more subtle ways in which we might make breaking changes, even if we don't touch our orchestrator function.

Perhaps we modify activity [a] to write data into a database that a new version of activity [c] needs to make use of. If we upgraded mid-orchestration, it is possible for the workflow to have run v1 of activity [a] and v2 of activity [c], which would result in an error.

This means you need a clearly articulated strategy for upgrades to durable workflows, one that all developers understand.
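The mixed-version failure from Example 2 can be sketched in a few lines of Python. Everything here is hypothetical: the database dictionary stands in for a real database, and the activity function names are ones I've invented to show v1 of [a] being followed by v2 of [c].

```python
database = {}  # stands in for the shared database


def activity_a_v1(video):
    # v1 of [a] performs the scan but writes nothing to the database.
    return f"scanned:{video}"


def activity_a_v2(video):
    # v2 of [a] also records the scan result for later steps.
    database[video] = {"scan": "clean"}
    return f"scanned:{video}"


def activity_c_v2(video):
    # v2 of [c] assumes v2 of [a] already stored a record for this video.
    return f"thumbnail for {video} ({database[video]['scan']})"
```

An orchestration that ran activity_a_v1 before the upgrade and activity_c_v2 after it fails with a KeyError, even though the orchestrator function itself never changed.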

Importance of idempotency

Before looking at four possible approaches to handling upgrades to workflows, it's important to realise that one of the biggest keys to success is making sure you write "idempotent" code wherever possible.

If a method is "idempotent" then running it twice has the same effect as running it once.

A classic example is charging a credit card in an ecommerce workflow. If I place an order, and something goes wrong midway through handling that order, the order processing pipeline might need to be restarted. But I don't want my credit card to get charged twice for the same order, and the vendor doesn't want to ship the same order twice.

Achieving idempotency usually involves being able to check "have I already done this?" Of course, that adds complexity, so for some activities you might decide that it doesn't really hurt if it happens twice. Maybe if you send a status update email twice it's not a big deal.
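Here's a minimal sketch in Python of that "have I already done this?" check for the credit card example. The PaymentGateway class and its in-memory charges dictionary are assumptions made for illustration; a real implementation would keep this record somewhere durable, or rely on an idempotency key supported by the payment provider.

```python
class PaymentGateway:
    """Toy payment gateway that refuses to charge the same order twice."""

    def __init__(self):
        # order_id -> amount; stands in for a durable record of charges
        self.charges = {}

    def charge(self, order_id, amount):
        # "Have I already done this?" check, keyed on the order id.
        if order_id in self.charges:
            return "already charged"
        self.charges[order_id] = amount
        return "charged"
```

Calling charge twice with the same order id records only one charge, so the activity is safe to run again if the pipeline is restarted.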

If you have a workflow where each activity is either idempotent or safe to run multiple times, then you're in a much better position to support upgrading to new versions of your workflow code.

Let's consider a few different strategies for handling upgrading to a new version of a Durable Orchestration.

Strategy 0 - Don't make breaking changes!

The first thing to say is that it is sometimes possible to make changes to a workflow that will not break in-progress orchestrations. Knowing which changes are breaking and which aren't will help you identify the modifications that can be made safely. Adding a step onto the end of the workflow, as in Example 1, is one such safe change.

Strategy 1 - Upgrade with no workflows in progress

The simplest approach when you do have breaking changes is to ensure that no workflows are currently in progress when you upgrade to a new version of your orchestration.

How easy that is depends on how frequently your workflows are called and how long they take to complete. If you trigger your orchestrations via a queue message, that gives you an easy way to disable starting new orchestrations temporarily, allowing all in-progress ones to finish. Then, after upgrading, re-enable the queue processor to start working through the queued workflows.
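The pause, drain, upgrade, resume sequence can be sketched like this in Python. It is a toy model rather than real Azure code: the Intake class, its enabled flag and safe_to_upgrade are hypothetical names, and the status strings are used here only as plain labels for which instances still count as in progress.

```python
from collections import deque

# Statuses treated as "still in progress" in this sketch.
IN_PROGRESS = {"Pending", "Running", "ContinuedAsNew"}


class Intake:
    """Toy queue processor that can be paused while an upgrade happens."""

    def __init__(self):
        self.queue = deque()
        self.enabled = True
        self.instances = {}  # instance_id -> status

    def submit(self, message):
        # Messages keep arriving even while processing is paused.
        self.queue.append(message)

    def process(self):
        """Start one orchestration per queued message, unless paused."""
        started = []
        while self.enabled and self.queue:
            message = self.queue.popleft()
            self.instances[message] = "Running"
            started.append(message)
        return started

    def safe_to_upgrade(self):
        # Safe only once intake is paused and nothing is still in progress.
        return not self.enabled and not any(
            status in IN_PROGRESS for status in self.instances.values()
        )
```

The idea is to set enabled to False, wait until safe_to_upgrade returns True, deploy the new version, and then set enabled back to True so the messages that built up in the queue start against the new orchestrator.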

Strategy 2 - Just let them fail

The second approach for breaking changes is simply to allow in-progress orchestrations to fail. This might sound like a crazy idea at first, but if you've taken the trouble to ensure that the activities in your workflow are idempotent, then you can simply detect failed orchestrations and resubmit them.

You can even forcibly stop all in-flight instances using the technique described here. Obviously you'll also need a way to track which ones need to be resubmitted after the upgrade.
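As a rough sketch of the tracking side, the Python function below picks out instances that failed or were terminated and resubmits their inputs exactly once. The instance dictionaries, the start_new callback and the already_resubmitted set are all assumptions for illustration; in practice the list of instances would come from querying your orchestration status, and the set would need to be stored somewhere durable.

```python
def resubmit_failed(instances, start_new, already_resubmitted):
    """Resubmit the input of every failed or terminated instance once.

    instances: iterable of dicts with "id", "status" and "input" keys
    start_new: callable that starts a fresh orchestration for an input
    already_resubmitted: set of instance ids handled on earlier runs
    """
    resubmitted = []
    for instance in instances:
        if instance["status"] not in {"Failed", "Terminated"}:
            continue
        if instance["id"] in already_resubmitted:
            # Already handled, so don't start a duplicate orchestration.
            continue
        start_new(instance["input"])
        already_resubmitted.add(instance["id"])
        resubmitted.append(instance["id"])
    return resubmitted
```

Because the function remembers which instances it has handled, running it a second time after the upgrade doesn't start duplicates. It still relies on the activities being idempotent, since a resubmitted workflow repeats steps that the failed instance had already completed.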

Strategy 3 - Separate task hubs

The versioning approach recommended in the official Azure Functions documentation is referred to as "side by side deployments". There are a few variations on how exactly you implement this, but the main way suggested is to deploy an entirely separate Function App containing the new version of your workflow.

That Function App could use its own storage account, or a different "task hub" within the same storage account to keep the Durable Functions orchestration state separate.

The trouble with doing this is that often a Function App contains more than just orchestrators and activity functions. For example, if there is a "starter" function that is triggered by an HTTP request or a queue message, then the calling code now needs to know how to direct new requests to the updated Function App.
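One way the calling code might handle this is with a small routing table that records which deployment is currently active. This is only a sketch in Python: the URLs, the task hub names, the /api/start route and the starter_url helper are all hypothetical.

```python
# Hypothetical side by side deployments, each with its own task hub.
DEPLOYMENTS = {
    "v1": {"app_url": "https://media-v1.example.com", "task_hub": "MediaHubV1"},
    "v2": {"app_url": "https://media-v2.example.com", "task_hub": "MediaHubV2"},
}

# New requests go to this version; v1 stays deployed to finish its in-flight work.
ACTIVE_VERSION = "v2"


def starter_url(version=None):
    """Return the starter endpoint for a version (the active one by default)."""
    deployment = DEPLOYMENTS[version or ACTIVE_VERSION]
    return f"{deployment['app_url']}/api/start"
```

With this in place, switching new traffic over to the upgraded Function App is a one-line change to ACTIVE_VERSION, while the old deployment keeps working through its own task hub.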

Strategy 4 - Separate orchestrators and activities

The final strategy is to create alternative orchestrator and activity functions within the same Function App. For example, you could create an OrchestratorV2 function, leaving the original orchestrator unchanged to finish off any in-flight orchestrations.

With this approach any starter functions you have can simply be updated to point to the new orchestrator function, and you can eventually retire the code for the original orchestrator once all in-progress workflows have completed.

There's actually another page in the Durable Functions documentation that shows an example of how this setup can be achieved. It claims that every function needs to be branched (e.g. you create a V1 and a V2 of every activity and orchestrator function).

I'm not sure that is necessary. It's perfectly fine for the same activity function to be used in more than one orchestration. But I suppose by branching everything, it makes the code a bit easier to reason about. And hopefully it's not too long before you can retire the V1 orchestrator and activity functions.
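Here's a minimal sketch in Python of two orchestrators living side by side and sharing the same activities. It is plain Python rather than the Durable Functions API: the ORCHESTRATORS registry, CURRENT_ORCHESTRATOR, start_new and run are names invented for illustration, and the parallel transcodes in V2 are simply run one after the other.

```python
def scan(video):
    return f"scanned {video}"


def transcode(video, bitrate):
    return f"{video}@{bitrate}"


def thumbnail(video):
    return f"thumb of {video}"


def orchestrator_v1(video):
    # Original workflow: [a] --> [b] --> [c]
    return [scan(video), transcode(video, "high"), thumbnail(video)]


def orchestrator_v2(video):
    # New workflow with the extra transcode, reusing the same activities.
    return [
        scan(video),
        transcode(video, "high"),
        transcode(video, "low"),
        thumbnail(video),
    ]


ORCHESTRATORS = {
    "OrchestratorV1": orchestrator_v1,
    "OrchestratorV2": orchestrator_v2,
}

# The starter function only needs to change this one name.
CURRENT_ORCHESTRATOR = "OrchestratorV2"

instances = {}  # instance_id -> which orchestrator it started with, and its input


def start_new(instance_id, video):
    instances[instance_id] = {"orchestrator": CURRENT_ORCHESTRATOR, "input": video}


def run(instance_id):
    # Each instance always runs the orchestrator it was started with.
    instance = instances[instance_id]
    return ORCHESTRATORS[instance["orchestrator"]](instance["input"])
```

An instance recorded against OrchestratorV1 keeps running the three-step workflow, while anything started after the switch gets the four-step one. Once no instances reference OrchestratorV1 any more, its entry can be deleted.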

Summary

If you're using Durable Functions, you do need to think about how you want to version your workflows. Fortunately, if you follow the good practices of avoiding breaking changes and writing idempotent activity functions, many pitfalls can be avoided. And even when you do need to make breaking changes, there are a variety of strategies you can adopt to ensure all in-progress orchestrations complete successfully.

Comments

March 3. 2020 09:15

The queue option provides greater resiliency. If the queue is at the forefront, it provides a safe shield against any predictable downtime due to changes in the system. If the stakeholders require a very high SLO (Service Level Objective), a queue/distributed strategy is a solid option. Keep inputs in your queue while extending the workflow.
It also allows you to shift to a more robust strategy later on: side by side deployments or workflow versioning. Additionally, a queue (Service Bus or storage queue) is a separate Azure service with its own SLA; you can even impress stakeholders with metrics such as composite SLA, RTO, RPO (1) and promise no business loss due to changes in the workflow. "No business loss" sounds like poetry to stakeholders.
(1) https://docs.microsoft.com/...

Roland Civet