Finding the Mystery behind Always ON Availability Groups Failover

Hi Friends,

Today I am going to share an incident that happened in our environment.  From past 2 weeks we were seeing unexpected fail over of Availability Groups on few Servers. I started investigating the logs (both SQL and Cluster) however I was not able to find the reason.

I was initially clueless however I started checking  the failover times and Windows event logs and finally my guess was right.

By the way I am not saying this has root cause on your systems but at least on my servers it made sense.

To obtain the Fail over times for an Availability Group I made use of my own article 


This will help you  in getting  the fail over times whatever that are present inside the ALWAYS ON extended event logs


As you can see the events happened mostly either at 17:30 or 20:30. This indicates that there is something happening behind the scenes so I started checking system Logs and I found something related.


But I was not able to find useful messages in cluster log at the same time except the below where the node is getting evicted from the cluster 

00003408.00001150::2020/07/04-17:31:13.343 ERR   [NODE] Node 2: Connection to Node 3 is broken. Reason GracefulClose(1226)' because of 'channel to remote endpoint fe80::99db:67df:33e6:1bbb%15:~3343~ is closed'

00003408.00003fb0::2020/07/04-17:31:14.080 ERR   [NODE] Node 2: Connection to Node 1 is broken. Reason GracefulClose(1226)' because of 'channel to remote endpoint fe80::7148:e274:8019:ca65%15:~59505~ is closed'


Eventually in system logs I could see the below messages

The Volume Shadow Copy service entered the running state
The VMware Snapshot Provider service entered the running state.

This indicated that file system backup(veeam in my case as the server is hosted in Vmware) is happening on the Database server  and the server on a whole is getting affected.

Veeam Backup & Replication is a proprietary backup app developed by Veeam for virtual environments built on VMware vSphere and Microsoft Hyper-V hypervisors. The software provides backup, restore and replication functionality for virtual machines, physical servers and workstations as well as cloud-based workload.

you can run the below command and see at what time these events are happening 

Get-EventLog -LogName "System" | ?{$_.Message -like "*The Volume Shadow Copy service entered the stopped state.*"} | Out-GridView



When verified with Backup team they said they have not excluded the drives where database files are hosted and they are backing up all the drives.

When you find the solution everything looks simple otherwise it is nightmare.

Thanks for Reading and catch you all in Next Post.

Comments

bikram said…
Nice post as usual... It gives a better clarity for other DBAs also...
Thiyagarajan said…
We also faced similar issue. We should use veeam agent to push the backup.
Thank you for sharing your experience with us.