Weekend Patching, Business As Usual
Today I was patching a bunch of servers at a client, as Microsoft released their most recent security patch for SQL Server that addresses an exploit / vulnerability found after Patch Tuesday. Read more on it here. I went through the normal patching routine for an AG: working through all of the secondary replicas, patching one at a time, and making sure each was catching up in the AG dashboard after every reboot as I moved along.
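For what it's worth, "catching up" here means watching the log send and redo queues drain on each secondary. Here's a rough sketch of the kind of check I run alongside the dashboard; run it on the primary, since the DMV only reports the remote replicas' details there, and note that the queue sizes are in KB:

-- Per-database catch-up check across all replicas (run on the primary).
SELECT
    ar.replica_server_name,
    DB_NAME(drs.database_id) AS database_name,
    drs.synchronization_state_desc,
    drs.synchronization_health_desc,
    drs.log_send_queue_size,   -- KB of log not yet sent to this replica
    drs.redo_queue_size,       -- KB of log received but not yet redone
    drs.redo_rate              -- KB/sec of redo throughput
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
    ON drs.replica_id = ar.replica_id
ORDER BY ar.replica_server_name, database_name;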
Noticed Something Unusual
While I was working through the secondary replicas, I noticed that one database in particular seemed to have a longer crash recovery process than the others. After each reboot it would sit in that Not Synchronizing / In Recovery state for a bit, but I could see it making progress through its analysis, redo, and undo phases relatively quickly; it would then catch up in the AG and I'd move on.
Where Things Started To Go South
To paint a better picture of the order of events, while also maintaining anonymity for the client's architecture: I had just finished patching the final secondary of the AG. Let's call it Replica-2. After I patched and rebooted Replica-2, I made sure that everything was green on the AG dashboard, then failed over from the current primary, Replica-1, to Replica-2. That's when things went sideways. On Replica-2 (the new primary), that one database I mentioned before, the one that took a while to come out of its crash recovery process on the other secondary replicas, was in Not Synchronizing / In Recovery, as was its copy on every secondary replica in the AG (this is a 6-replica AG, by the way). Not to mention, on the new secondary (Replica-1), none of the databases were synchronizing.
Someone Call The Doctor
Now I shifted into "fix-it" mode; my goal was to get the databases healthy as quickly as I possibly could. My order of priority was:
- Bring the database in Not Synchronizing / In Recovery on the new primary (Replica-2) back to a healthy state.
- Get all databases on the former primary (Replica-1) to be back in a synchronized state with the primary.
- Get that single database in Not Synchronizing / In Recovery back to a normal healthy state on all secondary replicas.
Task 1: Crash Recovery
With the first task, there wasn't much I could do other than let the database go through its crash recovery process on the primary. The curious thing to me, however, was why that database was back in crash recovery on the new primary if everything had been showing green and synchronized prior to the failover. I could only conclude that something had been going on in that database before the failover, and that things may not have been as "caught up" as they appeared after Replica-2's reboot. After some retroactive digging, I found that prior to the failover from Replica-1 to Replica-2, there had been an UPDATE statement on Replica-1 running for almost 2 hours in that same database, the one that kept going into crash recovery on the other replicas after each reboot. It had two open transactions and had done over 99 million writes. Facepalm…
This was definitely the cause. With such a long, write-heavy transaction in flight on Replica-1 prior to the failover, of course the database had to go through a long and grueling crash recovery, with analysis, redo, and undo, after I failed over. The database eventually came online on Replica-2 (the new primary) once crash recovery was done. Interestingly, the database was accessible during phase 3 (undo) of crash recovery; I was monitoring via the SQL Server error log and was surprised by this.
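If you'd rather not sit there refreshing the error log viewer, here's a minimal sketch of how I keep an eye on crash recovery progress. Nothing here is specific to this client, and the percent_complete column isn't always populated for recovery, so treat it as a rough gauge:

-- Crash recovery progress messages ("Recovery of database ... X% complete. Phase N of 3.")
EXEC sp_readerrorlog 0, 1, N'Recovery of database';

-- Database recovery also shows up as a DB STARTUP request with progress columns.
SELECT r.session_id,
       DB_NAME(r.database_id) AS database_name,
       r.command,
       r.percent_complete,
       r.estimated_completion_time / 60000 AS est_minutes_remaining,
       r.wait_type
FROM sys.dm_exec_requests r
WHERE r.command = N'DB STARTUP';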
Task 2: Get Databases Synchronized Again
The databases on Replica-1 were not re-establishing any proper connections with the primary after the failover. After some poking around and troubleshooting, I ended up having to reboot the OS on the secondary, as a SQL Server service restart was just hanging. Once I did that, the databases were happy again, except for the one that had the long-running transaction in it prior to the failover. That one was still in crash recovery and making VERY slow progress through phase 2.
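For context, the thing I kept checking while Replica-1 wouldn't reconnect was the replica-level state rather than the per-database one. A minimal sketch (the primary sees all replicas; a secondary only reports on itself):

-- Replica-level health: is each replica connected, and what role does it think it has?
SELECT ar.replica_server_name,
       ars.role_desc,
       ars.connected_state_desc,
       ars.operational_state_desc,
       ars.synchronization_health_desc,
       ars.last_connect_error_description
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar
    ON ars.replica_id = ar.replica_id;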
Task 3: Address All Other Secondary Replicas
While I was troubleshooting the two tasks above, the copies of that database with the long-running transaction on the other secondary replicas eventually made it out of crash recovery and caught up.
Still Not Out Of The Woods
If you noticed, in the Task 2 section I did not mention that the database with the long-running transaction ever made it out of crash recovery on the new secondary replica. It's weird… While monitoring the SQL logs of both Replica-1 (the current secondary) and Replica-2 (the current primary), I could see the trouble database was still going through crash recovery on both replicas. On Replica-2 it was in phase 3 (undo) but accessible, while on Replica-1 it was still in phase 2 (redo) and still in the Not Synchronizing / In Recovery state. Once the database on Replica-2 hit about 90% on phase 3, a message was logged in its SQL log stating that a connection had been successfully established with the secondary database, and a similar message appeared on Replica-1. After this, the crash recovery process on Replica-1 made no progress for 35-45 minutes.

Now, any research on crash recovery will tell you to just be patient and wait for the database to finish going through its process. However, with no progress in such a long time, I was getting concerned that things were hung. Mind you, all other replicas, including the primary, were green; it was just this one. This next step I took to fix the issue might make some of you cringe depending on what you would do, but I took the database out of the AG and just re-seeded it across. It was about 280GB in size, so not too big, and I turned on trace flag 9567 on the primary replica to make automatic seeding go faster via compression. Doing this got everything in the AG dashboard reading green and healthy again, so I was happy with the decision I made.
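In case you're curious what "take it out and re-seed it" looked like, here's roughly the shape of it. The AG and database names are placeholders, and I'm assuming the automatic seeding prerequisites (SEEDING_MODE = AUTOMATIC on the replicas and GRANT CREATE ANY DATABASE on the secondaries) are already in place, which they were in this environment:

-- On the primary: turn on compression for automatic seeding (TF 9567), then remove
-- the database from the AG.
DBCC TRACEON (9567, -1);
ALTER AVAILABILITY GROUP [YourAG] REMOVE DATABASE [TroubleDB];

-- On each secondary: the old copy is left in RESTORING; drop it so seeding can
-- stream a fresh copy across.
-- DROP DATABASE [TroubleDB];

-- Back on the primary: add the database back and let automatic seeding do the rest.
ALTER AVAILABILITY GROUP [YourAG] ADD DATABASE [TroubleDB];

-- Turn the trace flag off again once seeding finishes.
DBCC TRACEOFF (9567, -1);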
Accelerated Database Recovery
Before I patched Replica-1, I went ahead and turned on Accelerated Database Recovery (ADR) for the database I've been talking about, and for one other database as well. Feel free to read up on it via the link above if you're not familiar. ADR is great for a situation like the one I went through today. Knowing this environment well, and having had poor experiences with similar issues and long recovery times after failover, I made the choice to implement it here. After I patched Replica-1 and failed back over, this was the first time I wasn't greeted with long crash recovery times for the two databases I turned it on for. We'll see how it performs the next time I have to do a failover here.
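The switch itself is a one-liner run on the primary (you'll need SQL Server 2019 or later). The ROLLBACK IMMEDIATE part is there because changing the setting wants exclusive access to the database, which was fine during this maintenance window. The database name is a placeholder:

-- Enable Accelerated Database Recovery; ROLLBACK IMMEDIATE kicks out any open
-- sessions so the change can take effect.
ALTER DATABASE [TroubleDB]
    SET ACCELERATED_DATABASE_RECOVERY = ON
    WITH ROLLBACK IMMEDIATE;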
Final Takeaways
- AG dashboards aren’t always telling the full story.
- Before a manual failover, check for long-running transactions using sp_WhoIsActive or the query provided below.
- Long-running transactions + AG failover = possible long and painful crash recovery.
- Research and possibly enable Accelerated Database Recovery on databases prone to long-running transactions if you want to reduce post-failover pain. Again, this isn't a magic bullet; there are tradeoffs that should be researched and tested before you implement it in your environment.
- Compression with TF 9567 speeds up automatic seeding during recovery or re-seeding efforts.
SELECT
s.session_id,
r.status,
r.command,
r.database_id,
DB_NAME(r.database_id) AS database_name,
r.start_time AS transaction_start_time,
DATEDIFF(SECOND, r.start_time, GETDATE()) AS transaction_duration_seconds,
r.wait_type,
r.wait_time,
r.blocking_session_id,
t.text AS sql_text
FROM sys.dm_exec_requests r
JOIN sys.dm_exec_sessions s
ON r.session_id = s.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.command IN ('INSERT', 'UPDATE', 'DELETE', 'MERGE', 'SELECT INTO', 'BULK INSERT')
AND r.transaction_id IS NOT NULL
ORDER BY transaction_duration_seconds DESC;
This will tell you if you have any long-running transactions going on, and what they're doing, before a failover. Remember: long-running transactions + failover = a possibly long and painful crash recovery process, which can make your customers, devs, and even your fiancée upset with you if you aren't careful and then spend another hour and a half writing this blog after a full day of troubleshooting your client's AG.
