In 15 Node cluster, majority of the virtual machines failed over(restarted) due to network fluctuations while performing network maintenance activity

Environment:

OS: Windows Server 2012

Model: IBM Flex System 8721 (Chasis) Hyper-v Servers are in 2 Chassis

Network : 2 (Public & private), 2 different Switches

Immediate Observations:

During network core switch activity, there was a network disturbance for both public & private network interfaces for 30 to 60 sec’s and it is fluctuated in duration of 2 hours
Observed the event id’s 1127(network fail),1135(Removal of cluster membership),1177(quorum lost) & 5120(CSV disconnection)

hyper-v-event-1127

hyper-v-event-csv-disconnection

hyper-v-event-1135

hyper-v-event-1177

Immediate Action’s performed

As heartbeat is not available & other interface not interrupting , we have increased the heartbeat value by increasing the Same Subnet Threshold from default 5 sec to 28 Sec. This value is kept by referring the below blog
As per Microsoft recommendation we should not have samesubnetthreshold value > 20 sec. Since we observed VLAN flapping is taking 25 sec, we have set the value to 28 to check , we have changed samesubnetThreshold to 28 which is heartbeat value. However we can’t set routehistorylength value more than 40 due to limitation.

hyper-vclus

SameSubnetDelay = 1000 ( means 1000 millsec i.e, 1 sec)

SameSubnetThreshold = 5 (Means , 5 heartbeats)

So, By default total heart beat in a cluster is = Delay * Threshold = 1 sec * 5 HB i.e. cluster can tolerate of 5 HB in 5 sec

Delay – This defines the frequency at which cluster heartbeats are sent between nodes. The delay is the number of seconds before the next heartbeat is sent.

Threshold – This defines the number of heartbeats which are missed before the cluster takes recovery action

For example setting SameSubnetDelay to send a heartbeat every 2 seconds and setting the SameSubnetThreshold to 10 heartbeats missed before taking recovery, means that the cluster can have a total network tolerance of 20 seconds(2sec *10 HB) before recovery action is taken

hyper-v-disclaimer

Changing of above 28 HB’s would not help us our requirement as network fluctuations are more than 30 sec hence we have sought Microsoft support for any other best practice and got below recommendations

The samesubnedelay and samesubnetthreashold values are very specific to heartbeat settings between cluster nodes/hosts. These changes will help delay the heartbeat checks between the nodes/hosts.
The above changes will not control the way the SMB multi-channel that is used by the Cluster Shared Volumes (CSV). The moment the TCP connection is dropped during the network maintenance activity, the SMB channel will have an impact.
As the SMB connection drops, it will affect the VMs hosted on the CSV volumes. They will not be able to get the metadata for the CSV over the SMB channel.
Due to problems over CSV network (SMB channel), you would see event ID 5120 for CSV volumes. This will impact the VM availability.

So based on the above points, if there is a network outage beyond 10-20 seconds, there will be impact to the cluster and there is no way to avoid this impact on the VM resources. It is recommended to ensure the VMs are moved to the nodes where will be no network impact or to be brought offline gracefully before the network maintenance activity.

Also, MS responded as below for the query which we asked to check and it may not viable in CSV environment as SMB connection drops impacting CSV

Removal and adding of the VM from HA is time consuming and not an easy option and Microsoft do not suggest this option in CSV environment.

Ref: https://blogs.msdn.microsoft.com/clustering/2012/11/21/tuning-failover-cluster-network-thresholds/

In 15 Node cluster, majority of the virtual machines failed over(restarted) due to network fluctuations while performing network maintenance activity

Latest News