Error Handling with Durable Functions

I wrote recently about why you should use Azure Durable Functions to implement your serverless workflows rather than just manually chaining together a bunch of functions with queues. There was great news recently that Durable Functions is now in "release candidate", and in this post I want to explore in a bit more detail how it can greatly improve your error handling within workflows.

Unhandled Exceptions

First of all, a quick reminder about how Durable Functions works. You create an "orchestrator function", which defines your workflow, and then create multiple "activity functions", one for each step in your workflow. The orchestrator can call these activities either in sequence or in parallel.

If an unhandled exception is thrown by an activity function, it will propagate up to the orchestrator function. This is brilliant as it means the orchestrator can make intelligent decisions on what should happen to the workflow based on an activity failing. This might involve triggering a cleanup activity, or retrying, or maybe the workflow can carry on regardless.

Of course, if the orchestrator function doesn't catch these exceptions itself, then the orchestration will terminate. However, even in this case, we'll get some useful information from the Durable Functions runtime. If we query an orchestration that has failed using the Durable Functions REST API, we'll see a runtimeStatus of Failed, and in the output we'll get the error message along with the name of the activity function the exception occurred in.

So in this example, my Activity2 activity function threw an unhandled exception that was also unhandled by the orchestrator function, resulting in the orchestration ending. Here's the output from the Durable Functions REST API showing the orchestration status:

{
    runtimeStatus: "Failed",
    input: "hello",
    output: "Orchestrator function 'ExceptionHandlingExample' failed: The activity function 'Activity2' failed: \"Failure in Activity 2\". See the function execution logs for additional details.",
    createdTime: "2018-04-30T11:48:28Z",
    lastUpdatedTime: "2018-04-30T11:48:31Z"
}
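The same status can also be read from code rather than over HTTP. Here's a rough sketch, assuming the Durable Functions 1.x C# client (DurableOrchestrationClient and its GetStatusAsync method); the helper class and method names are just made up for illustration, and you'd need to get hold of the instance id yourself (for example, from the value returned when you started the orchestration).

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

// Hypothetical helper, not part of the Durable Functions library.
public static class OrchestrationStatusHelper
{
    // Returns the failure output of an orchestration,
    // or null if the instance is unknown or has not failed.
    public static async Task<string> GetFailureOutputAsync(
        DurableOrchestrationClient client, string instanceId)
    {
        DurableOrchestrationStatus status = await client.GetStatusAsync(instanceId);
        if (status == null)
        {
            return null; // no orchestration found with this instance id
        }
        if (status.RuntimeStatus != OrchestrationRuntimeStatus.Failed)
        {
            return null; // still running, completed, or terminated
        }
        // Output holds the same message shown in the REST API response
        return status.Output?.ToString();
    }
}
```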

Catching Exceptions in Activity Functions

Of course, you don't need to let exceptions propagate from activity functions all the way through to the orchestrator. In some cases it might make sense to catch your exceptions in the activity function.

One example is if the activity function needs to perform some cleanup of its own in the case of failure - perhaps deleting a file from blob storage. But it might also be to simply send some more useful information back to the orchestrator so it can decide what to do next.

Here's an example activity function that returns an anonymous object with a Success flag plus some additional information depending on whether the function succeeded or not. Obviously you could return a strongly typed custom DTO instead. The orchestrator function can check the Success flag and use it to make a decision on whether the workflow can continue or not.

[FunctionName("Activity2")]
public static async Task<object> Activity2(
    [ActivityTrigger] string input,
    TraceWriter log)
{
    try
    {
        var myOutputData = await DoSomething(input);
        return new 
        {
            Success = true,
            Result = myOutputData
        };
    }
    catch (Exception e)
    {
        // optionally do some cleanup work ...
        DoCleanup();
        return new 
        {
            Success = false,
            ErrorMessage = e.Message
        };
    }
}
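For the strongly typed alternative, something along these lines should work. This is only a sketch: the ActivityResult class below is my guess at a DTO whose property names mirror the anonymous object above, I'm assuming the Result is a string, and the orchestrator name is made up for the example.

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

// Assumed DTO: property names match the anonymous object returned by Activity2
public class ActivityResult
{
    public bool Success { get; set; }
    public string Result { get; set; }       // assumption: the output data is a string
    public string ErrorMessage { get; set; }
}

public static class SuccessFlagExample
{
    [FunctionName("SuccessFlagOrchestrator")] // hypothetical orchestrator name
    public static async Task<string> SuccessFlagOrchestrator(
        [OrchestrationTrigger] DurableOrchestrationContext ctx)
    {
        var input = ctx.GetInput<string>();
        var a2 = await ctx.CallActivityAsync<ActivityResult>("Activity2", input);
        if (!a2.Success)
        {
            // Activity2 has already cleaned up after itself,
            // so we just end the workflow early with an explanation
            return $"Stopped after Activity2: {a2.ErrorMessage}";
        }
        // only carry on to the next step if Activity2 succeeded
        return await ctx.CallActivityAsync<string>("Activity3", a2.Result);
    }
}
```

The nice thing about this approach is that the orchestration never sees an exception from Activity2 at all, so the decision about whether to continue is an ordinary if statement.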

Catching Exceptions in Orchestrator Functions

The great thing about orchestrator functions being able to handle exceptions thrown from activity functions is that it allows you to centralize the error handling for the workflow as a whole. In the catch block you can call a cleanup activity function, and then either re-throw the exception to fail the orchestration, or you might prefer to let the orchestration complete "successfully", and just report the problem via some other mechanism.

Here's an example orchestrator function that has one cleanup activity, which it runs regardless of which of the three activity functions failed.

[FunctionName("ExceptionHandlingOrchestrator")]
public static async Task<string> ExceptionHandlingOrchestrator(
    [OrchestrationTrigger] DurableOrchestrationContext ctx,
    TraceWriter log)
{
    var inputData = ctx.GetInput<string>();
    try
    {
        var a1 = await ctx.CallActivityAsync<string>("Activity1", inputData);
        var a2 = await ctx.CallActivityAsync<ActivityResult>("Activity2", a1);
        var a3 = await ctx.CallActivityAsync<string>("Activity3", a2);
        return a3;
    }
    catch (Exception)
    {
        await ctx.CallActivityAsync<string>("CleanupActivity", inputData);
        // optionally rethrow the exception to fail the orchestration
        throw;
    }
}
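If you'd rather let the orchestration complete "successfully" and report the problem some other way, the catch block can return a value instead of rethrowing. Here's a minimal sketch of that variation. Note that "ReportProblemActivity" is a hypothetical activity function that you would need to write yourself (it might send an email or write to a queue), and the orchestrator name is also invented for this example.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class CompleteWithErrorsExample
{
    [FunctionName("CompleteWithErrorsOrchestrator")] // hypothetical orchestrator name
    public static async Task<string> CompleteWithErrorsOrchestrator(
        [OrchestrationTrigger] DurableOrchestrationContext ctx)
    {
        var inputData = ctx.GetInput<string>();
        try
        {
            var a1 = await ctx.CallActivityAsync<string>("Activity1", inputData);
            return a1;
        }
        catch (Exception ex)
        {
            await ctx.CallActivityAsync<string>("CleanupActivity", inputData);
            // hypothetical activity that reports the failure via some other channel
            await ctx.CallActivityAsync("ReportProblemActivity", ex.Message);
            // no rethrow, so the orchestration finishes as completed rather than failed
            return "Completed with errors";
        }
    }
}
```

With this version the runtimeStatus will not be Failed, so anything monitoring the orchestration status will need to look at the output (or your own reporting mechanism) to spot the problem.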

Retrying Activities

Another brilliant thing about using Durable Functions for your workflows is that it includes support for retries. Again, at first glance that might not seem like something that's too difficult to implement with regular Azure Functions. You could just write a retry loop in your function code.

But what if you want to delay between retries? That's much more of a pain, as you pay for the total duration your Azure Functions run for, so you don't want to waste time sleeping. And Azure Functions in the consumption plan are limited to 5 minutes of execution time by default anyway. So you end up needing to send yourself a future scheduled message. That's something I have implemented in Azure Functions in the past (see my randomly scheduled tweets example), but it's a bit cumbersome.

Thankfully, with Durable Functions, we can simply specify when we call an activity (or a sub-orchestration) that we want to retry a certain number of times, and customise the back-off strategy, thanks to the CallActivityWithRetryAsync method and the RetryOptions class.

In this simple example, we'll retry Activity1 up to a maximum of 4 attempts, with a five-second delay before retrying.

var a1 = await ctx.CallActivityWithRetryAsync<string>("Activity1", 
               new RetryOptions(TimeSpan.FromSeconds(5),4), inputData);
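To customise the back-off strategy, RetryOptions has some additional properties you can set. The following is a sketch rather than a recommendation: the specific values are arbitrary, "ChildOrchestrator" is a hypothetical sub-orchestration, and the orchestrator name is made up. It assumes the 1.x RetryOptions properties BackoffCoefficient and MaxRetryInterval, plus the CallSubOrchestratorWithRetryAsync method.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class BackoffExample
{
    [FunctionName("BackoffOrchestrator")] // hypothetical orchestrator name
    public static async Task<string> BackoffOrchestrator(
        [OrchestrationTrigger] DurableOrchestrationContext ctx)
    {
        var inputData = ctx.GetInput<string>();

        // first retry after 5 seconds, up to 4 attempts in total
        var retryOptions = new RetryOptions(TimeSpan.FromSeconds(5), 4)
        {
            // increase the delay between each retry (exponential back-off)
            BackoffCoefficient = 2.0,
            // but never wait longer than a minute between attempts
            MaxRetryInterval = TimeSpan.FromMinutes(1)
        };

        var a1 = await ctx.CallActivityWithRetryAsync<string>(
            "Activity1", retryOptions, inputData);

        // the same options can be used for a sub-orchestration
        return await ctx.CallSubOrchestratorWithRetryAsync<string>(
            "ChildOrchestrator", retryOptions, a1);
    }
}
```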

Even better, we can intelligently decide which exceptions we want to retry. This is important as in cloud-deployed applications some exceptions will be due to "transient" problems that might be resolved by simply retrying, but others are not worth retrying.

When an activity function throws an exception, it will appear in the orchestrator as a FunctionFailedException, but the inner exception will contain the exception thrown from the activity function. However, currently the type of that inner exception seems to be just System.Exception rather than the actual type (e.g. InvalidOperationException) that was thrown, so if you're making retry decisions based on this exception, you might have to just use its Message, although the actual exception type can be seen if you call ToString.

Here's a very simple example of only retrying if the inner exception message exactly matches a specific string:

var a1 = await ctx.CallActivityWithRetryAsync<string>("Activity1", 
    new RetryOptions(TimeSpan.FromSeconds(5),4)
    {
        // null-conditional access avoids a NullReferenceException if there is no inner exception
        Handle = ex => ex.InnerException?.Message == "oops"
    }, 
    inputData);
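An exact string match is quite brittle, so in practice you might prefer to move the decision into a small helper method. Here's one possible sketch. The RetryPolicy class and its list of "transient" markers are purely illustrative assumptions on my part, so you'd need to replace them with whatever messages your own activities actually produce.

```csharp
using System;

// Hypothetical helper, not part of the Durable Functions library.
public static class RetryPolicy
{
    // Assumed examples of text that indicates a transient failure
    private static readonly string[] TransientMarkers =
    {
        "timeout",
        "temporarily unavailable"
    };

    // Returns true if the failure looks transient and is worth retrying.
    public static bool ShouldRetry(Exception ex)
    {
        // prefer the inner exception message (the one thrown by the activity),
        // falling back to the outer message, and cope with nulls
        var message = ex?.InnerException?.Message ?? ex?.Message ?? string.Empty;
        foreach (var marker in TransientMarkers)
        {
            if (message.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return true;
            }
        }
        return false;
    }
}
```

Since Handle is just a function that takes an exception and returns a bool, you can assign the method directly with Handle = RetryPolicy.ShouldRetry when creating the RetryOptions.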

Summary

Durable Functions not only makes it much easier to define your workflows, but also to handle the errors that occur within them. Whether you want to respond to exceptions by retrying with backoff, or by performing a cleanup operation, or even by continuing regardless, Durable Functions makes it much easier to implement than trying to do the same thing with regular Azure Functions chained together by queue messages.

Comments

February 13. 2019 10:30

I have two questions about error handling in Durable Functions:
- How can you log errors when you are doing a retry? I guess the Handle method could be used to do that, but I don't know if it would work and that's not its purpose.
- How can you see in Application Insights which function has retried? An exception occurred in one of my functions, and apart from trying to guess from its timestamp I am unable to link it to one of my durable functions.

Alexandre

February 13. 2019 12:20

You can inject an ILogger into all durable functions, so from both activities and orchestrators you can log whatever you want. Also CallActivityWithRetryAsync takes a RetryOptions which has a Handle callback so you could put custom logging into that. I'm not sure about the App Insights question - I need to experiment a bit more with that myself. But you can always use the Durable Functions history API to get the full event source output for a particular orchestration, which would tell you everything that happened including retries.

Mark Heath

February 13. 2019 12:40

I was a little reluctant to use the Handle callback to do some logging as it's not the purpose of this function and I don't need it otherwise (I want to retry anyway), but okay, why not.
I completely forgot about the Durable Functions History API :), I will check what I can get from there. I just need to figure out what my orchestrationId is (the function that starts my workflow is timer triggered and not http) but that should be okay. Thanks!

December 3. 2019 20:55

nice article. most useful!

Ramon Giovane

June 6. 2021 09:29

Thanks for the article and the PS tutorial!
I see the retry handle seems to be too simplistic. I stumbled across a situation where we need a retry on a Cosmos storage activity. The DocumentClientException has a RetryAfter property which suggests how long we should wait (e.g. for a Too Many Requests status code, or service temporarily unavailable), but I don't see how I could use it in a simple way. Handle just returns a bool, but could return something more sophisticated like the retry-after timespan. In general the retry wait time lacks a way of waiting differently depending on the exception, unless I'm missing something?

Wojciech Jamrozik