Always ON Connection Time-out Errors

Hi Mates,

Today I want to share my Analysis with respect to connection Time out Errors in ALWAYS ON. 

In my previous post I mentioned about how to go through Windows Cluster Logs

https://www.chiranjeevivamsydba.com/2020/10/my-availability-groups-got-affected-by.html

Always ON is such a topic where we need to examine different logs and need to analyze them properly before coming to any conclusions.

Okay, Let me go through the issue. I got an incident from Application Team indicating they experienced some issues yesterday. When I enquired about the time they said it is at early Morning 3 AM.

At first, I checked if there were any Maintenance jobs running at that time as usually these are times when they run. However there were nothing running at that time.

My replicas are in a Multi-Subnet with 3 nodes in Each Data Center

When checked the Error Logs I could see below messages inside them and all of them were with 3 replicas in another DC.

2020-10-23 03:02:56.780 spid207s     A connection timeout has occurred on a previously established connection to availability replica 'XXXXXXX' 

2020-10-23 03:02:56.870 spid295s     A connection timeout has occurred on a previously established connection to availability replica 'XXXXXXX' 

2020-10-23 03:02:56.910 spid295s     A connection timeout has occurred on a previously established connection to availability replica 'XXXXXX' 

2020-10-23 03:02:56.780 spid207s Always On Availability Groups connection with secondary database terminated for primary database 'ReportServerTempDB' on the availability replica 'XXXXXX' with Replica ID: {12769114-d890-4e2d-b4fa-7c346244be1e}. 

2020-10-23 03:02:56.780 spid299s     Always On Availability Groups connection with secondary database terminated for primary database 'ReportServer144' on the availability replica 'XXXX' with Replica ID: {12769114-d890-4e2d-b4fa-7c346244be1e}. 

So Now how will we troubleshoot this. 

1) Error Logs: Those are the Messages what I can see

2) Always ON Health Sessions(Extended Events): They too have the same messages of what I could see inside Error Logs. 



3) SQL Diagnostic Extended Event files: These are available only on Primary Replica and I use them a lot for Investigation Purposes. 





There are 2 things I would like to discuss here firstly at the highlighted part under component state we see it as Warning. Also it says we have something related to Blocking. I got an impression that Blocking was the culprit however the database is not on ReportServer or ReportServerTEMPDB which are the ones that got affected. 


4) System Health Session (Extended Events):  This is kind of redundant to that of SQL Diagnostic in this case even it says the same information to that of above.


5) Windows Cluster Logs: We didn't have any Failover hence there is not much here apart from the below message.  (I understood a lot what are the messages that gets logged for Connection time out Errors but I can't post here as it deviates from the topic)

00001d1c.00002af8::2020/10/23-03:02:56.370 INFO  [RES] SQL Server Availability Group: [hadrag] SQL Server component 'query_processing' health state has been changed from 'clean' to 'warning' at 2020-10-23 03:02:56.360


Frankly Speaking none of the logs helped me to understand what is the real contributor for these Connection time outs. Also there were no snapshots(VSS) or Antivirus related stuff running at that time so we can rule out them too.

I then thought of reproducing the Errors on my Test machine and I did that using Networking Tools.









As you can see I simulated the stuff and at the same time I could see the message

A connection timeout has occurred while attempting to establish a connection to availability replica 'XXXXXX' with id [B4DF270A-D29D-451A-B2BD-4727323D7234]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance.

So How will I troubleshoot this? My Answer is try to run a light weight ping test from the source (primary replica) to Secondary servers and see if you are seeing any time outs

Below is the command what I made use of 

ping.exe -t 192.168.58.68|Foreach{"{0} - {1}" -f (Get-Date),$_} > D:\backups\192.168.58.68.txt

As you can see from the above image I got the output in to Notepad file. I would only recommend this if you are seeing connection time out messages quite frequently.

Hope you have Learned Something with this.

Comments

Unknown said…
Its nice resolution
Anonymous said…
Sorry, but didn't understand WHAT was the solution -
We also encounter these timeouts, and can't figure out what's the cause and how to eliminate these errors.
Okay, sometimes there will be packet drops and we will not have any information unless we capture something like what I mentioned at the very end of the post. Once you know there are packet drops happening and at the same time if you see the termination messages for the availability groups inside error logs then we can definitely have a concrete proof before approaching network team.