May 30, 2014 · Disaster Recovery Failover High-Availability Traffic Manager

My talk at DevCon 2014: Failover for Microsoft Azure (Web Sites)

Yesterday I gave a talk at DevCon 2014 on Microsoft Azure's native support for high availability and disaster recovery. We talked about what you get out of the box for high availability and which tools, services and patterns you can use for disaster recovery.

The talk was recorded and is available here. There's just one large video for all the sessions in the main hall (Cortana), so you'll need to use the menu on the right to navigate to my session.

The slides are available on my OneDrive, or you can view them with PowerPoint Online.

And for those of you that want to try failover with the NuReview application I showed, the source is available on GitHub: https://github.com/sandrinodimattia/NuReview

Disaster Recovery for the NuReview application

During the session I talked about how I added disaster recovery (with degraded functionality) to the NuReview application. Here's how you can try this yourself.

Normal Deployment

The first step is to deploy the application without disaster recovery. Start by creating a new storage account in a specific region (West Europe, for example) with Read-Access Geo Replication enabled (even though we won't be using it in this first step). After that, deploy the application to a Web Site in the same region and scale it to Standard mode. On the Configuration tab of the Web Site you can then add the following AppSettings:

After that the NuReview application should be fully functional and you'll be able to write new reviews:

And now it's time to break the application. An easy way to do this is by changing the connection string to your storage account. As a result, the application will no longer work and we could say that our business is down:

You can now undo this to make sure the site is working again.

Deployment that can survive failure of a complete region

The next step is to deploy the same application to a different region for disaster recovery. Because we enabled Read-Access Geo Replication, the read-only replica of our storage account lives in the paired region within the same geo-political area, so that is where the secondary deployment should go. Since our primary deployment is in West Europe, we'll use North Europe for the secondary deployment (make sure you also scale it to Standard).

Start by deploying your application there and check that everything works correctly. Now if you look through the code you'll see that the application behaves a little differently when the setting Failover is set to 1 or true:

public void SubmitReview(string name, string packageId, string body, int score)
{
    // In failover mode we only have the read-only replica, so reject writes.
    if (Helper.IsSiteInFailoverMode)
    {
        throw new Exception("We're sorry, but writing new reviews is currently not possible");
    }

    // Persist to table storage.
    GetTable().Execute(TableOperation.Insert(new Review
    {
        Name = name,
        Body = body,
        PackageId = packageId,
        CreatedOn = DateTimeOffset.UtcNow,
        // Inverted ticks: table storage sorts keys ascending, so newer days come first.
        PartitionKey = "0" + (DateTimeOffset.MaxValue.Ticks - DateTimeOffset.UtcNow.Date.Ticks),
        // Same trick within a day; the GUID keeps the row key unique.
        RowKey = "0" + (DateTimeOffset.MaxValue.Ticks - DateTimeOffset.UtcNow.Ticks) + "+" + Guid.NewGuid(),
        Score = score
    }));
}

This is mainly because we'll be doing DR with degraded functionality and because in failover mode we'll be accessing our storage account in secondary mode (the read-only replica).
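To make the "1 or true" rule concrete, here is a minimal sketch of how such a flag could be interpreted. The class and method names are hypothetical and are not taken from the NuReview source; treating the value as case-insensitive and ignoring surrounding whitespace are assumptions of this sketch as well.

```csharp
using System;

// Hypothetical helper, not the NuReview implementation.
public static class FailoverSettings
{
    // Returns true when the raw setting value is "1" or "true".
    // Assumption: case and surrounding whitespace do not matter.
    public static bool IsFailoverEnabled(string rawValue)
    {
        if (string.IsNullOrWhiteSpace(rawValue))
        {
            // A missing or empty setting means normal (non-failover) mode.
            return false;
        }

        var value = rawValue.Trim();
        return value == "1"
            || value.Equals("true", StringComparison.OrdinalIgnoreCase);
    }
}
```

In a web application the raw value would typically come from something like ConfigurationManager.AppSettings["Failover"], so that the AppSettings configured in the portal decide whether a deployment runs in failover mode.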

Users will be able to read the existing reviews, but writing new reviews will be impossible. And you'll also see some changes in the Layout page that notify the user that the site is not fully functional:

You can now add a new setting under AppSettings (Configuration tab) of the newly deployed website:

The last thing to do is to set up Traffic Manager, which will monitor our primary deployment. As soon as something goes wrong, Traffic Manager will make sure that all new DNS queries for our application resolve to the secondary deployment. Here's how you do it:

  1. Create a new Traffic Manager policy
  2. Add both deployments as endpoints
  3. Change the TTL to 30 seconds to make sure you quickly see the failover if the DNS is cached
  4. In the monitoring settings, change the path to /monitoring. This will hit our custom monitoring code that will check if all features we depend on (like storage) are working correctly
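The idea behind the /monitoring path in step 4 can be sketched as follows. This is not the NuReview monitoring code: the class name, the method name, and the way dependency checks are passed in are assumptions made for illustration. The sketch relies on the fact that Traffic Manager treats an HTTP 200 response from the monitored path as healthy.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of a health check, not the NuReview implementation.
public static class MonitoringEndpoint
{
    // Runs every dependency check and returns the HTTP status code
    // the monitoring path should answer with: 200 when all checks pass,
    // 503 as soon as one check fails or throws.
    public static int GetStatusCode(IEnumerable<Func<bool>> checks)
    {
        foreach (var check in checks)
        {
            try
            {
                if (!check())
                {
                    return 503;
                }
            }
            catch (Exception)
            {
                // A dependency that throws (storage unreachable, for example)
                // counts as unhealthy.
                return 503;
            }
        }

        return 200;
    }
}
```

A check for storage could, for instance, try to read a single entity from the reviews table and return true on success. The controller action behind /monitoring would then respond with the status code returned above.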

If you hit the URL provided by Traffic Manager (like nureview.trafficmanager.net) you'll see that it's showing the fully functional site again. Now head back to the portal and break the primary deployment (you can do this again by changing the connection string or by stopping the site). If you refresh the page you might see that the site is no longer working. In that case you're only experiencing local downtime, because your machine still has the old DNS answer cached. The platform itself is still online, because Traffic Manager will have failed over to the secondary deployment.

Wait 30 seconds or try this from a different device and you should see that Traffic Manager failed over to the secondary deployment.

If you want to test falling back, just make sure your primary deployment is working again (start the site or fix the connection string) and you'll see that after a few seconds Traffic Manager will fall back to the primary deployment.

Spazibo!
