Contained AG Not Synchronizing After Planned Failover: How ADR Saved Us

Mechanical robotic mule carrying multiple servers and data storage units on a mountain path

We had a SQL Server 2022 contained AG that consistently took exactly 10 minutes to recover after every planned manual failover.

Not 9 minutes. Not 11. 10 minutes on the dot.

Both replicas would sit in a Not Synchronizing state, every database showed Recovering, and our fastest path back to being online again was often rebooting both replicas, or just wait it out.

We eventually found a fix to get the primary replica up and available after failover and it involved enabling Accelerated Database Recovery (ADR) on the contained master and msdb databases inside the AG.

I haven’t found Microsoft documentation covering this scenario, so I wanted to document what we observed.

The Bigger Story

Technical Frame-up

Contained Availability Groups are a pretty nifty feature introduced in SQL Server 2022. The functionality of a contained AG is meant to help ease administrative overhead of needing to keep logins and jobs in sync between AG replicas. It takes care of that for you by “containing” a copy of the master and msdb databases within the AG. There are still local master and msdb databases on each replica. You’re able to leverage these contained copies by creating jobs and logins in the context of the AG listener. Anything you create in the local master and msdb databases when connecting directly to either of the replicas will only exist on those individual replicas (kind of like how a traditional AG would function).
When you connect to the listener of a contained AG, you won’t see the contained master and msdb databases. However, if you connect to the replicas themselves, you will. They show up at the same folder level as user databases and have the names: AGName_master & AGName_msdb.

Environment Setup and Context

The architecture of our contained AG is SQL Server 2022, Enterprise, on latest CU at the time. This AG functions as the main hub of all SQL Agent jobs for our organization. A lot of nightly and intraday processing flows through this server. With that in mind, it was determined that it made more sense to utilize a contained AG to keep jobs synchronized between replicas and not have to utilize and maintain a process to keep them synced up. The only caveat to this… there are about 2k jobs maintained in the context of the contained AG.

Now, Microsoft doesn’t have any formal or official documentation on theoretical or practical limits of how many jobs you can have living in the context of a contained AG (at least that I can find at the time of writing this), but this feels like an odd edge case.

Our Symptoms

The symptoms we faced were pretty consistent. We’d do our pre-failover checks:

  • Databases healthy
  • Everything synchronizing
  • No long running transactions
  • Make sure replicas are in synchronous commit

Then, boom manual failover time.

The AG would failover within a few seconds, but after failover, all of the databases on both replicas would go into a Recovering state and not be synchronizing. Now, before we implemented our solution, both replicas in the AG would be in this state for exactly 10 minutes. Every time. Not 10 minutes and 10 seconds. Not 9 minutes and 34 seconds. 10 minutes on the dot. Our options that have gotten us through were what I mentioned above. Either wait, or reboot the OS on both replicas, and that would usually get us rolling again.

Reaching Out For Help

For some slight context, this situation was going on for a while before I joined the team here, so they had been working with Microsoft Support for quite a while on this before I got hired. It was the usual process with MS Support. Send logs. Wait. Send more logs. Wait. Try things and check configs. Send even more logs. Nothing was working. Microsoft called out at one point that in the logs my colleague had sent them were findings of transactions in a rollback process. This was odd to hear because there were no long running and very few transactions even running at the time of the planned failovers. We got feedback from MS Support to check for the usual culprits I had already mentioned above. At this point we and even Microsoft it seemed were grasping at straws to find a cause and solution. My team even tried turning off the SQL Agent before the failover, checking VLF counts on databases, checking and confirming job history retention, msdb size (both local and contained versions). Anything you could suggest we tried looking at.

The Search For A Long Term Solution

When my team first mentioned that they were dealing with this problem, I had thought to myself and eventually suggested that we might be hitting some sort of limitation with roughly 2k jobs living in the context of the contained AG. I didn’t know why and couldn’t explain the internals or technical reasoning, but 2k jobs sounded like a lot to me and seemed like a good enough of a reason to cause us problems.

After watching a live demo of a planned manual failover, and observing the database behavior of all databases going into a Recovering state, I thought about Accelerated Database Recovery (ADR). I don’t want to get into a whole writeup about ADR and what it does, but feel free to follow the link and do your own research. High level though, ADR is a database scoped config that drastically and fundamentally changes recovery behavior and can dramatically reduce recovery and rollback times.

Since the biggest symptom we saw after failing over was databases going into prolonged database recovery on both replicas, the first approach was to turn ADR on for the user databases. I had made this change in the Dev and Test environments and we then scheduled a Change Control to push this to Production. When we made the change in Production, we then initiated another planned manual failover. Still no dice. 10 minutes of recovery and not synchronizing after failover.

I went back to the drawing board and started thinking what if we turn on ADR for the contained master and msdb databases as well? This thought popped into my mind because in the SQL Logs after failover, we saw a message that looked like this:

“7377 transactions rolled forward in database AGName_msdb”

I thought this was incredibly odd as we couldn’t seem to find any sort of related transactions running in this database during these failover windows, plus, MS Support kept calling out many cases of rollback processes without flagging the redo side. But thinking more on how AGs work in regards to how redo latency results in longer recovery times post failover, in addition to the thought of having 2k jobs constantly hammering msdb with history writes and status updates, that redo backlog was going to be there no matter how quiet things looked before we pulled the trigger on the failover. Now, in normal circumstances, you can’t turn ADR on for system databases. You can in 2025 for tempdb now, but that is neither here nor there. Anyways, the contained master and msdb databases are technically looked at as user databases since they’re in the contained AG (at least that is the logic I used when wondering if turning on ADR would actually work or give me an error). The contained msdb database is what carries all of the metadata and SQL Agent Job dependencies across replicas that make it possible to have those jobs available post failover.

With this in mind, and knowing that we have 2k SQL Agent jobs in the contained AG, Microsoft Support’s notes, the redo traffic and that the contained master and msdb databases are technically looked at as user databases, I suggested to my team that we attempt to turn on ADR for AGName_master and AGName_msdb and initiate another planned manual failover.

Putting It To The Test

We enabled ADR on the two contained system databases, kicked off the planned manual failover… BOOM! Primary replica came online in an instant (within a few seconds). The secondary replica still went into that recovery / not synchronizing state for a bit (consistently lasted around 30 seconds to a minute) but didn’t stay that way for very long. We called this a huge win. Our critical AG and primary replica was no longer hosed after a failover for 10 minutes, and we no longer have to resort to rebooting the OS every time for a quick recovery strategy. We tested this with multiple planned failovers and got consistent results. Here’s a code snippet of what was used to enable ADR on the contained system databases for reference.

Always test before you run anything in Production

ALTER DATABASE [AGName_msdb] SET ACCELERATED_DATABASE_RECOVERY = ON WITH ROLLBACK IMMEDIATE;
ALTER DATABASE [AGName_master] SET ACCELERATED_DATABASE_RECOVERY = ON WITH ROLLBACK IMMEDIATE;

I’d like to add that we made sure to go through the proper testing and deployment lifecycle before we implemented this, so I advise you test carefully and involve Microsoft Support if considering similar changes as this is undocumented behavior.

Why We Think This Worked

To be completely honest, I suspect there may be scaling friction or an edge case around large contained msdb metadata footprints. Even though when we would perform a planned manual failover very few, if any of the jobs were running at the time, the contained msdb database having to keep track of 2k jobs upon failover must have been enough to cause consistent and chronic problems. The SQL Server community seems to have very little documented around practical limitations of Contained Availability Groups at scale. We may have found one. If you’re architecting a Contained AG and planning to manage a large volume of jobs through it, ADR on your contained system databases may be worth evaluating early in your lower environments.

2 responses to “Contained AG Not Synchronizing After Planned Failover: How ADR Saved Us”

  1. Brent Ozar Avatar

    Awesome storytelling on what must have been an ugly problem to solve. Good work!

    Liked by 1 person

    1. TheUnfilteredDBA Avatar

      Thank you, Brent!

      Like

Leave a reply to Brent Ozar Cancel reply