Memory Issue in Always On

Hi Friends,
I am very happy to share this scenario with you all, as it would have been difficult to find out what went wrong without proper data collection.

Before explaining the issue, I would first like to give the details of the environment.

This is a multi-subnet cluster with 3 nodes in one data center and the other 3 nodes in a different data center (DR).

Configuration:
Primary Data Center: NODE 1: 192 GB, NODE 2: 192 GB, NODE 3: 64 GB
DR Data Center:      NODE 1: 192 GB, NODE 2: 192 GB, NODE 3: 64 GB

We have synchronous-commit mode in place for NODE 1 and NODE 2 in the primary data center, and all the remaining replicas operate in asynchronous-commit mode.
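If you want to confirm this from the instance itself, a simple sketch like the one below lists the commit mode configured for each replica; the query is my own, the DMVs are standard.

```sql
-- Quick check of the commit mode configured for each replica
SELECT
    ag.name                    AS ag_name,
    ar.replica_server_name,
    ar.availability_mode_desc,  -- SYNCHRONOUS_COMMIT / ASYNCHRONOUS_COMMIT
    ar.failover_mode_desc
FROM sys.availability_replicas AS ar
JOIN sys.availability_groups   AS ag
    ON ag.group_id = ar.group_id;
```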

Then we have a default instance and a named instance, where the default instance should always run on NODE 1 and the named instance should run on NODE 2.

As the workloads are heavy, we have allocated memory to the instances in the manner shown below:

Data Center   Node     Default Instance (2012)   Named Instance (2017)
Primary DC    NODE 1   160 GB                    8 GB
Primary DC    NODE 2   12 GB                     125 GB
Primary DC    NODE 3   no default instance       54 GB
DR DC         NODE 1   145 GB                    9 GB
DR DC         NODE 2   16 GB                     145 GB
DR DC         NODE 3   no default instance       54 GB


Whenever an index maintenance operation gets triggered, or any huge transaction runs on the database, the log file grows enormously, as you know, and this really caused a bottleneck.

Initially we were clueless as to why the log was not able to catch up on the secondaries; however, we have data collection in place where we capture some useful metrics into a table every 1 minute.
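For context, below is a minimal sketch of that kind of per-minute collection. The table name dbo.AG_LogSendStats and the scheduling (for example an Agent job on the primary) are my own placeholders; the metrics themselves come from sys.dm_hadr_database_replica_states, which reports the queue sizes in KB and the rates in KB per second.

```sql
-- Minimal per-minute collector for AG log transport metrics
-- (dbo.AG_LogSendStats is a placeholder name; schedule the INSERT via an Agent job)
IF OBJECT_ID('dbo.AG_LogSendStats') IS NULL
    CREATE TABLE dbo.AG_LogSendStats
    (
        capture_time       DATETIME2(0) NOT NULL DEFAULT SYSDATETIME(),
        replica_server     SYSNAME      NOT NULL,
        database_name      SYSNAME      NOT NULL,
        log_send_queue_kb  BIGINT       NULL,
        log_send_rate_kb_s BIGINT       NULL,
        redo_queue_kb      BIGINT       NULL,
        redo_rate_kb_s     BIGINT       NULL
    );

INSERT INTO dbo.AG_LogSendStats
    (replica_server, database_name, log_send_queue_kb,
     log_send_rate_kb_s, redo_queue_kb, redo_rate_kb_s)
SELECT
    ar.replica_server_name,
    DB_NAME(drs.database_id),
    drs.log_send_queue_size,   -- KB waiting to be sent to the secondary
    drs.log_send_rate,         -- KB/sec actually being shipped
    drs.redo_queue_size,       -- KB waiting to be redone on the secondary
    drs.redo_rate              -- KB/sec being redone
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas           AS ar
    ON ar.replica_id = drs.replica_id;
```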

Below is a snapshot showing how log transmission is happening in GB/sec (Log Send Rate).




At the same time, we captured the growth of the log file too; a sketch of how that can be collected is below, and the snip of the captured growth follows.
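This sketch is not the exact collector we use, just the idea: sys.master_files reports file size in 8 KB pages, hence the division by 128 to get MB.

```sql
-- Current transaction log file size per database (size is in 8 KB pages)
SELECT
    DB_NAME(mf.database_id) AS database_name,
    mf.name                 AS log_file_name,
    mf.size / 128.0         AS log_size_mb,
    SYSDATETIME()           AS capture_time
FROM sys.master_files AS mf
WHERE mf.type_desc = 'LOG';
```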




So the question is: why is the replication falling behind? As we had the data in place, I analyzed the information and understood that the named instance on NODE 1 in the primary data center is not able to keep up with the rate at which the named instance on NODE 2 (where that availability group's primary replica runs, as mentioned above) is sending log. I realized that this could be because of the low memory (8 GB) that has been set on it.

But now the question is: why is it not happening on NODE 1 in the DR data center, which has a similarly low memory allocation (9 GB) in place?

Well, the first thought that came to my mind is that it is because of the asynchronous operation, but I was not really sure, so I went ahead and set up perfmon counters related to memory.
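The counters in question are Page life expectancy, Free list stalls/sec and Lazy writes/sec under the Buffer Manager object. If you would rather sample them from inside SQL Server instead of perfmon, a sketch like the one below works; note that Free list stalls/sec and Lazy writes/sec are cumulative in sys.dm_os_performance_counters, so you need the delta between two samples to get a per-second rate, while Page life expectancy can be read directly. On a named instance the object_name prefix is MSSQL$<InstanceName>: instead of SQLServer:, hence the LIKE filter.

```sql
-- Memory-pressure counters as exposed inside SQL Server
SELECT
    counter_name,
    cntr_value,
    SYSDATETIME() AS capture_time
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
  AND counter_name IN (N'Page life expectancy',
                       N'Free list stalls/sec',
                       N'Lazy writes/sec');
```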

I went on to reproduce the issue by running an ALTER INDEX ... REBUILD on a large table (a sketch of the command follows), and below is how the perfmon capture looked.
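The rebuild itself was nothing special; a placeholder version of the repro (the table name is mine, not the real one) would look like this:

```sql
-- Placeholder repro: rebuild all indexes on a large table to generate
-- a burst of fully logged changes that must be shipped to the secondaries
ALTER INDEX ALL ON dbo.BigTable
REBUILD WITH (SORT_IN_TEMPDB = ON);
```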




Here is how the data looks:

Node (DC)      Free list stalls/sec   PLE      Lazy writes/sec
NODE 1 (PR)    9.2                    103      1265
NODE 1 (DR)    4.5                    119144   -
NODE 2 (DR)    0                      26816    0


If you know the good values: Free list stalls/sec should not be greater than 2, PLE should be at least around (buffer pool memory in GB / 4) x 300, and Lazy writes/sec should not be greater than 20. Treating the 8 GB allocation on the NODE 1 named instance as its buffer pool, that rule works out to (8 / 4) x 300 = 600 seconds, while the observed PLE was only 103.
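As a rough helper (my own sketch; it uses the Buffer Manager "Database pages" counter as an approximation of the buffer pool size), the rule of thumb can be compared against the live PLE directly:

```sql
-- Compare current PLE with the (buffer pool GB / 4) * 300 rule of thumb
SELECT
    ple.cntr_value                           AS current_ple_sec,
    bp.cntr_value / 131072.0                 AS buffer_pool_gb,        -- 8 KB pages -> GB
    (bp.cntr_value / 131072.0) / 4.0 * 300.0 AS suggested_min_ple_sec
FROM sys.dm_os_performance_counters AS ple
JOIN sys.dm_os_performance_counters AS bp
    ON bp.object_name = ple.object_name
WHERE ple.object_name LIKE '%Buffer Manager%'
  AND ple.counter_name = N'Page life expectancy'
  AND bp.counter_name  = N'Database pages';
```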

Based on these values, we can say that memory has not been set properly, and in this case the synchronous operation would indeed struggle a lot, as the commit operation on the primary depends on the speed at which the log gets hardened on the secondary.
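One more check that lines up with this conclusion (my own addition, not part of the original capture): on the primary, the HADR_SYNC_COMMIT wait type accumulates the time commits spend waiting for the synchronous secondary to harden the log, so high values here point at the same bottleneck.

```sql
-- Time commits on the primary have spent waiting for the sync secondary
SELECT wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'HADR_SYNC_COMMIT';
```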


This is my analysis; please share your thoughts on the same.







                        
                    
