Understanding Quorum in Windows Server Failover Cluster (WSFC)

Quorum acts as the definitive repository for a cluster’s configuration information; in simple terms, it can be called the cluster configuration database. When network problems occur, they can interfere with communication between cluster nodes.

Although the quorum is just a configuration database, it has two very important jobs. First, it tells the cluster which node should be active; second, it intervenes when communication fails between nodes.

What’s a Failover Cluster Quorum & why is it required?

A Failover Cluster Quorum configuration specifies the number of failures that a cluster can sustain and keep working. Once that threshold is exceeded, the cluster stops working. The most common failures in a cluster are nodes that stop working or nodes that can no longer communicate.

Normally, each node within a cluster can communicate with every other node in the cluster over a dedicated network connection. If this network connection were to fail, though, the cluster would be split into two pieces (a split brain), each containing one or more functional nodes that cannot communicate with the nodes on the other side of the communications failure.

When this type of communications failure occurs, the cluster is said to have been partitioned. The problem is that both partitions have the same goal: to keep the application (resources) running. The application can’t run on multiple servers simultaneously, though, so there must be a way of determining which partition gets to run the application (resource). This is where the quorum comes in. The partition that “owns” the quorum can continue running the application (resource); the other partition is removed from the cluster.

In other words,

Quorum is designed to handle the scenario where there is a problem with communication between sets of cluster nodes, so that two servers do not try to simultaneously host a resource group and write to the same disk at the same time. This is known as a “split brain”, and we want to prevent it to avoid any potential disk corruption caused by having two simultaneous group owners.

With this concept of quorum, the cluster forces the cluster service to stop in one of the subsets of nodes, ensuring that there is only one true owner of a particular resource group.

Imagine that quorum doesn’t exist and you have a two-node cluster. Now there is a network problem and the two nodes can’t communicate. With no quorum, what prevents both nodes from operating independently and taking ownership of the disks on each side? This situation is called split-brain. Quorum exists to avoid split-brain and prevent disk corruption.

To prevent the issues that are caused by a split in the cluster, the cluster software requires that any set of nodes running as a cluster must use a voting algorithm to determine whether, at a given time, that set has quorum.

The quorum is based on a voting algorithm. Each node in the cluster has a vote. The cluster keeps working while more than half of the voters are online; this is the quorum (or the majority of votes). When there are too many failures and not enough online voters to constitute a quorum, the cluster stops working.

How to Think of Quorum

It is more correct to think of quorum always being calculated from the perspective of each node on its own. If any given node does not believe that quorum is satisfied, it will voluntarily remove itself from the cluster. In so doing, it will take some action on the resources that it is running so that they can be taken over by other nodes.

In the case of Hyper-V, it will perform the configured Cluster-Controlled offline action specific to each virtual machine. If other nodes still have quorum, they’ll be able to take control of those virtual machines.
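
For example, the Cluster-Controlled offline action that a clustered virtual machine will use can be inspected from PowerShell. This is only a sketch: the resource name “Virtual Machine VM01” is hypothetical, and the OfflineAction parameter name and its value mapping are assumptions you should verify in your environment.

Get-ClusterResource "Virtual Machine VM01" | Get-ClusterParameter -Name OfflineAction
# Assumed mapping of values: 0 = save the VM, 1 = shut down the guest, 2+ = forced shutdown / turn off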

As for quorum, there are multiple ways it can be configured. Each successive version of Windows/Hyper-V Server has introduced new options. We’ll take a tour through the previous versions and end with the new features of 2012 R2.

Differences between quorum models in Windows Server 2003, 2008, 2008 R2, 2012 and 2012 R2

Windows Server 2003:

Local Quorum: This cluster model is for clusters that consist of only one node, also referred to as a single-node cluster. It is typically used for:

  • Deploying dynamic file shares on a single cluster node, to ease home directory deployment and administration.
  • Testing.
  • Development.

Standard Quorum: A standard quorum uses a quorum log file that is located on a disk hosted on a shared storage interconnect accessible by all members of the cluster. In this configuration, the quorum disk must be online for the cluster to be online.

In Windows Server 2003, Microsoft introduced a new type of quorum called the Majority Node Set (MNS) quorum. The thing that really sets an MNS quorum apart from a standard quorum is the fact that each node has its own, locally stored copy of the quorum database.

When each node has its own copy of the database, geographically dispersed clusters become much more practical.

Windows Server 2008 – Windows Server 2008 R2:

Note that in Windows Server 2003 clusters, either each node has a vote and the cluster stays up as long as a majority of nodes are online (MNS), or a shared disk alone decides whether the cluster stays up (standard quorum). Combining node votes with a disk “witness” is not available until after Windows Server 2003.

By Windows Server 2008 and 2008 R2, a number of new quorum options were available: “Node Majority”, “Node and Disk Majority”, “Node and File Share Majority”, and “No Majority: Disk Only”.

In Windows Server 2008 & 2008 R2 cluster, quorum vote is static, every item in the cluster is considered to have a vote. If the quorum mode is configured as witness-only, then there is only one vote, and if it is lost, then the entire cluster and all nodes in it will stop.

In node majority, each node gets a vote. So, for a four-node cluster using node majority plus a disk witness, there are a total of five votes. The calculation for how many votes constitute a quorum is simple: 50% + 1. Because of that +1 need, you always want your cluster to have an odd number of votes.
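
As a quick illustration of that rule, here is a minimal PowerShell sketch (the vote count below is just an assumed example):

$totalVotes  = 5                                   # e.g. four nodes plus a disk witness
$votesNeeded = [math]::Floor($totalVotes / 2) + 1  # "50% + 1", i.e. a strict majority
"Votes required to maintain quorum: $votesNeeded"  # 3 in this example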

Different Quorum Configurations (Starting from 2008)

Below you can find the four possible quorum configurations (taken from TechNet):

  • Node Majority (recommended for clusters with an odd number of nodes)
    • Can sustain failures of half the nodes (rounding up) minus one. For example, a seven node cluster can sustain three node failures.
  • Node and Disk Majority (recommended for clusters with an even number of nodes).
    • Can sustain failures of half the nodes (rounding up) if the disk witness remains online. For example, a six node cluster in which the disk witness is online could sustain three node failures.
    • Can sustain failures of half the nodes (rounding up) minus one if the disk witness goes offline or fails. For example, a six node cluster with a failed disk witness could sustain two (3-1=2) node failures.
  • Node and File Share Majority (for clusters with special configurations)
    • Works in a similar way to Node and Disk Majority, but instead of a disk witness, this cluster uses a file share witness.
    • Note that if you use Node and File Share Majority, at least one of the available cluster nodes must contain a current copy of the cluster configuration before you can start the cluster. Otherwise, you must force the starting of the cluster through a particular node. For more information, see “Additional considerations” in Start or Stop the Cluster Service on a Cluster Node.
  • No Majority: Disk Only (not recommended)
    • Can sustain failures of all nodes except one (if the disk is online). However, this configuration is not recommended because the disk might be a single point of failure.

Windows Server 2012:

The quorum models are the same as in Windows Server 2008 R2; however, there are enhancements to node vote assignment through a new concept introduced in 2012 called “Dynamic Quorum Configuration”.

The following features in Windows Server 2012 enhance the management and functionality of the cluster quorum:

Configure Cluster Quorum Wizard. Simplifies quorum configuration and integrates well with new features and existing quorum functionality.

Vote assignment. Allows specifying which nodes have votes in determining quorum (by default, all nodes have a vote).
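
For example, a node’s vote can be removed or restored through its NodeWeight property (the node name below is hypothetical):

(Get-ClusterNode -Name "Node3").NodeWeight = 0   # remove this node's quorum vote
(Get-ClusterNode -Name "Node3").NodeWeight = 1   # give the vote back (the default)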

Dynamic quorum. Gives the administrator the ability to automatically manage the quorum vote assignment for a node, based on the state of the node. When a node shuts down or crashes, the node loses its quorum vote. When a node successfully rejoins the cluster, it regains its quorum vote.

By dynamically adjusting the assignment of quorum votes, the cluster can increase or decrease the number of quorum votes that are required to keep running. This enables the cluster to maintain availability during sequential node failures or shutdowns.

Dynamic quorum has been enabled by default since Windows Server 2012. Its implementation is a huge improvement, as it makes it possible to keep a cluster running even if the number of nodes remaining in the cluster is less than 50%.

Dynamic quorum adjusts node votes dynamically so that the majority of votes is not lost, allowing the cluster to keep running even with a single node (known as “last man standing”).

The vote assignment for all cluster nodes can be verified by using the Validate Cluster Quorum validation test.
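
The same check can be run from PowerShell with Test-Cluster. The test display name used below is an assumption; list the available tests first to confirm the exact name on your version:

Test-Cluster -List                                      # shows the available validation tests
Test-Cluster -Include "Validate Quorum Configuration"   # assumed display name of the quorum test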

The impact of dynamic quorum is interesting. If you run the wizard to automatically configure quorum for a two-node cluster in 2012, it will recommend that you choose node majority only, whereas 2008 R2 would have wanted node majority plus a witness. This is because if either node exits the cluster voluntarily, dynamic quorum will still understand that everything is OK.

Note that if you are down to two remaining nodes, the cluster has only a 50% chance of surviving a further failure, i.e., it may or may not stay up. This drawback has been resolved in Windows Server 2012 R2; see the Dynamic Witness part.

Windows Server 2012 R2:

In 2012, only the node votes were dynamic, whereas in 2012 R2 the witness vote is also made dynamic; this is called “Dynamic Witness”. The big change with this is that it is now always recommended to configure a cluster with a witness, regardless of the number of nodes in the cluster. The cluster can decide on the fly whether that witness should have a vote.

The witness vote is also dynamically adjusted based on the number of voting nodes in current cluster membership. If there are an odd number of votes, the quorum witness does not have a vote. If there is an even number of votes, the quorum witness has a vote.

The quorum witness vote is also dynamically adjusted based on the state of the witness resource. If the witness resource is offline or failed, the cluster sets the witness vote to “0.”

In Windows Server 2012 R2, you can now view the assigned quorum vote and the current quorum vote for each cluster node in the Failover Cluster Manager user interface (UI).

Quorum User Interface Improvement:

The assigned node vote values can be seen in the UI starting with Windows Server 2012 R2.

Dynamic Quorum Enhancements:

1. Force Quorum Resiliency:

If there is a partitioned cluster in Windows Server 2012, after connectivity is restored, you must manually restart any partitioned nodes that are not part of the forced quorum subset with the /pq switch to prevent quorum. Ideally, you should do this as quickly as possible.

In Windows Server 2012 R2, both sides have a view of cluster membership and they will automatically reconcile when connectivity is restored. The side that you started with force quorum is deemed authoritative and the partitioned nodes automatically restart with the /pq switch to prevent quorum.
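
For reference, the manual steps on Windows Server 2012 look roughly like this in PowerShell (node names are hypothetical; -ForceQuorum and -PreventQuorum correspond to the /fq and /pq switches mentioned above):

Start-ClusterNode -Name "NodeA" -ForceQuorum     # start the side you deem authoritative (/fq)
Start-ClusterNode -Name "NodeD" -PreventQuorum   # restart a node from the other partition without it forming quorum (/pq)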

2. Dynamic Witness:

In Windows Server 2012 R2, the cluster is configured to use dynamic quorum by default. In addition, the witness vote is dynamically adjusted based on the number of voting nodes in the current cluster membership.

If there are an odd number of votes, the quorum witness does not have a vote. If there is an even number of votes, the quorum witness has a vote. The quorum witness vote is also dynamically adjusted based on the state of the witness resource. If the witness resource is offline or failed, the cluster sets the witness vote to “0.”

Dynamic witness significantly reduces the risk that the cluster will go down because of witness failure.

The cluster decides whether to use the witness vote based on the number of voting nodes that are available in the cluster. 

In Windows Server 2012, if there is a 50% split where neither site has quorum, both sides will go down.

In Windows Server 2012 R2, you can assign the LowerQuorumPriorityNodeID cluster common property to a cluster node in the secondary site so that the primary site stays running. Set this property on only one node in the site.

To set the property, start Windows PowerShell as an administrator, and then enter the following command, where “1” is the example node ID for a node in the site that you consider less critical:

To get Node ID -> Get-ClusterNode | ft

To set the LowerQuorumPriorityNodeID -> (Get-Cluster).LowerQuorumPriorityNodeID = 1

To check the status of Dynamic Quorum and Dynamic Weight

Get-ClusterNode -Name * | ft NodeName,DynamicWeight,NodeWeight -AutoSize

Get-Cluster | ft Name, DynamicQuorum, WitnessDynamicWeight (WitnessDynamicWeight reflects the Dynamic Witness vote)

To disable Dynamic Quorum -> (Get-Cluster).DynamicQuorum = 0

For the NodeWeight and DynamicWeight properties, a value of 0 indicates that the node does not have a quorum vote, and a value of 1 indicates that it does.

Example:

Out of 4 nodes, if 3 nodes are down and the disk witness is available -> the cluster continues to function thanks to dynamic quorum, assuming all resources are running on the surviving node.

Out of 4 nodes, if 3 nodes and the disk witness fail -> the cluster will not continue running and the cluster service will terminate, because quorum can no longer be achieved. The cluster will not dynamically adjust the votes below three in a multi-node cluster with a witness, so at least two active votes are needed to keep functioning.

3. Tie-breaker for 50% Node Split:

Starting with Windows Server 2012 R2, a cluster can dynamically adjust a running node’s vote to keep the total number of votes at an odd number. This functionality works seamlessly with dynamic witness. To maintain an odd number of votes, a cluster will first adjust the quorum witness vote through dynamic witness. However, if a quorum witness is not available, the cluster can adjust a node’s vote. For example:

  1. You have a six node cluster with a file share witness. The cluster stretches across two sites with three nodes in each site. The cluster has a total of seven votes.
  2. The file share witness fails. Because the cluster uses dynamic witness, the cluster automatically removes the witness vote. The cluster now has a total of six votes.
  3. To maintain an odd number of votes, the cluster randomly picks a node to remove its quorum vote. One site now has two votes, and the other site has three.
  4. A network issue disrupts communication between the two sites. Therefore, the cluster is evenly split into two sets of three nodes each. The partition in the site with two votes goes down. The partition in the site with three votes continues to function.

Dynamic quorum and dynamic witness are different features; dynamic witness was introduced in Windows Server 2012 R2. Dynamic quorum adjusts the nodes’ votes so that the majority of votes is not lost and the cluster can keep running even with a single node (last man standing), while dynamic witness adjusts the vote of the quorum witness.

There are two types of witness, and only one can be configured in your cluster at a time: either a File Share Witness (FSW) or a Disk Witness. If your cluster is built on a single subnet, you may configure either type of witness. But if your cluster crosses subnets, it’s recommended to configure an FSW, because the witness is a voting element and should be reachable by all nodes.

An FSW is a share on a server, and it is recommended that this be a separate server, possibly in a different data center from the cluster nodes. This allows any cluster node to reach the file share server in the event of a site-to-site network failure.
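
As a hedged sketch, a file share witness can be configured with Set-ClusterQuorum on Windows Server 2012 R2 (the share path below is a placeholder):

# Point the cluster at a file share witness hosted on a server outside the cluster.
Set-ClusterQuorum -FileShareWitness "\\WitnessServer\ClusterWitness"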

Dynamic quorum has been enhanced in Windows Server 2012 R2 with the introduction of the dynamic witness. This feature determines whether the quorum witness has a vote. There are two cases:

  • If there is an even number of voting nodes in the cluster with dynamic quorum enabled, the quorum witness has a vote.
  • If there is an odd number of voting nodes in the cluster with dynamic quorum enabled, the quorum witness does not have a vote.

This change also greatly simplifies quorum witness configuration. You no longer have to determine whether to configure a quorum witness because the recommendation in Windows Server 2012 R2 is to always configure a “Quorum Witness”. The cluster automatically determines when to use it.

In short, after referring to multiple blogs, here is what I recommend when you configure quorum in a failover cluster:

Prior to Windows Server 2012 R2, always keep an odd number of votes

    • In case of an even number of nodes, implement a witness
    • In case of an odd number of nodes, do not implement a witness

Since Windows Server 2012 R2, always implement a quorum witness

    • Dynamic Quorum manages the votes assigned to the nodes
    • Dynamic Witness manages the vote assigned to the Quorum Witness

In case of a stretched cluster, implement the witness in a third site or use Microsoft Azure.
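
After applying these recommendations, the resulting quorum configuration can be reviewed with a minimal check:

Get-ClusterQuorum | Format-List *    # shows the quorum type and the witness resource in use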

Reference:

Behavior of Dynamic Witness on Windows Server 2012 R2 Failover Clustering
Differences between quorum models in Windows 2003, 2008, 2008 R2, 2012 and 2012 R2
Quorum in Microsoft Failover Clusters
Understand Failover Cluster Quorum
What’s New in Failover Clustering in Windows Server 2012 R2