This week I had a chance to work with a client who wanted to start their journey into Business Continuity Planning using the Azure Site Recovery tool. This was their first time planning for recovery, so they wanted to start with a few non-production machines, watch for any performance impact on those machines, and end with a failover test. Along the way, I ran into some issues that I thought would be helpful for others.
The scope of this project started with replicating three test VMs from our primary site to the secondary site. We found that the Azure services and processes had to be set up in a specific order to ensure a successful DR failover for our servers.
We wanted more granular control over the resources involved in the migration, like the storage account, resource groups, availability sets, and the VNET. The other option would have been a quick deployment that allows Azure to suggest names for these resources and create them when you enable the replication. This client had a global footprint, which means the DR target location would be in production for a while in the event of a real failover. We needed to treat this like a production deployment.
We started by creating the virtual network and subnets. Our network team got to work, creating the network to meet two important client requirements. First, in the event of a failover, we couldn’t be making panicked changes to production. Second, when we did a practice failover, we needed to be able to test the machine on the network.
It was important for us to note that this DR target was going to be connected to the global peering network. Because of this, the IP address range had to be unique, which meant that our servers would get new IPs during a failover.
Another note: you can’t select a subnet in the portal configuration when setting up Azure Site Recovery for a VM. So, if you have a complicated subnet scheme, it just assigns the VM to the first subnet in the group.
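Since the DR range has to be unique across everything it peers with, it can help to check a proposed range before handing it to the network team. Here is a minimal sketch using Python’s standard ipaddress module. The function name and every address range in it are made up for illustration; they are not the client’s real ranges.

```python
import ipaddress


def find_overlaps(candidate, existing):
    """Return the ranges in `existing` that overlap the candidate CIDR range."""
    cand = ipaddress.ip_network(candidate)
    return [cidr for cidr in existing if cand.overlaps(ipaddress.ip_network(cidr))]


# Hypothetical address spaces already in use on the peered networks.
peered_ranges = ["10.0.0.0/16", "10.1.0.0/16", "172.16.0.0/20"]

# A proposed DR range that collides with an existing one.
print(find_overlaps("10.1.4.0/24", peered_ranges))

# A proposed DR range that is free to use.
print(find_overlaps("10.50.0.0/16", peered_ranges))
```

An empty list means the proposed range does not collide with anything in the list you gave it. It says nothing about ranges you forgot to include, so the list of peered networks needs to be complete.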
The next item was to create our target resource groups. We chose to name them after the source resource groups with “-DR” appended, giving something like “Web-FrontEnd-Prod-DR” as the final name. We then created the availability sets for the servers we were testing, following the same naming convention as the resource groups to keep everything grouped together. When creating the availability set, make sure you choose the setting that matches the disks attached to the VM(s) going into it, managed disk or classic disk. With the wrong setting, you will not be able to select the availability set during configuration.
Lastly, we created the storage account that the VMs would go into. When working with servers that use managed disk, be aware that you will pay for the managed disk on both the primary site and the DR site. If you are using classic storage, your costs will be billed like traditional standard storage.
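If you have more than a handful of resources to name, the convention above is easy to script so nobody mistypes a suffix. This is only a sketch of the naming rule: the helper name is mine, and the availability set name in the example is hypothetical.

```python
def dr_name(name, suffix="-DR"):
    """Append the DR suffix to a resource name unless it already has it."""
    return name if name.endswith(suffix) else name + suffix


# The resource group name comes from the example above; the
# availability set name is invented for illustration.
for source in ["Web-FrontEnd-Prod", "Web-FrontEnd-AS"]:
    print(source, "->", dr_name(source))
```

Skipping names that already end in the suffix keeps the helper safe to run twice over the same list.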
Now that the housekeeping is done, it’s time to replicate these machines. We went through the normal steps outlined in the Azure Docs Site, which does a great job of going over the process.
This is where I ran into my next hurdle, one that impacts Cloud Solution Providers (CSPs) and Managed Service Partners (MSPs). When enabling the replication job, I immediately got an error message stating, “Failed to create service principal for automation account <account name>” and the job failed. I had done everything right in the configuration, and I was in the “Owner” role, which has the permissions to do what is needed. What gives? After a little head scratching, I was reminded of a similar issue I had when setting up Azure Patch Management for another CSP client. The ‘daily’ account that I use for portal access is considered a “Guest” in the Azure Active Directory tenant of the Azure subscription.
My account does not exist in the AD tenant, but I was allowed access because Azure AD is aware of my account, and someone with privilege invited me as a guest into their Azure AD tenant and assigned a role to me. What a great way to reduce account sprawl, or as I call it, YAUNAP (yet another username and password). In the case of Azure Patch Management, I had to use a Member account of the subscription to create the job because my guest account didn’t have permission to create the automation account.
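A quick way to spot this ahead of time is to look at the userType value on your account in the client’s tenant, which Azure AD reports as either “Member” or “Guest”. The sketch below assumes you have already pulled your user record from the directory into a Python dictionary. The helper name and the sample records are hypothetical, and the check only tells you which kind of account you are signed in with, not what that account is allowed to do.

```python
def is_guest(user):
    """Return True when a directory user record is marked as a Guest."""
    # userType can be missing or empty on some records, so default to "".
    return (user.get("userType") or "").lower() == "guest"


# Hypothetical records, shaped like what a directory lookup might return.
daily_account = {
    "userPrincipalName": "me_partner.example#EXT#@client.onmicrosoft.com",
    "userType": "Guest",
}
member_account = {
    "userPrincipalName": "admin@client.onmicrosoft.com",
    "userType": "Member",
}

for account in (daily_account, member_account):
    kind = "guest" if is_guest(account) else "not a guest"
    print(account["userPrincipalName"], "is", kind)
```

If your daily account comes back as a guest, plan on having a member account available before you enable the replication job.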
I went to confirm my suspicion and found the smoking gun. When you create the replication job for ASR, it creates a Run As account to handle the replication work. I went to the Automation Account blade in the client’s portal and then to Run As accounts. There I was greeted with my answer: I didn’t have permission. Nothing along the way indicated this would be an issue, and it took a lot of digging around to find the problem.
As with the patch problem before, I was able to use a member account to enable the job and had no issues from there. The permission problem is limited to the creation of the Run As account and does not affect its management.
Azure Site Recovery is a powerful tool that every Azure customer should be taking advantage of. To run effectively, make sure you understand your infrastructure and how you want it to replicate. Remember, you don’t want to be making decisions during a failover. Proper testing, accurate replication, and making sure you’re signed into the right account can make all the difference in disaster recovery.
Jeremy Brewer is an Azure Architect at Beyond Impact. With years of experience in Azure, Jeremy is an extremely valuable resource for clients looking to get the most out of Azure.