All VMs failing over (restarting) unexpectedly from one node to another in a 15-node Hyper-V cluster after a storage firmware upgrade

Environment:
OS: Windows Server 2012
Model: IBM Flex System 8721 (Chassis)
Hyper-V servers: spread across 2 chassis
Storage: IBM SVC7000 over FC
Multipathing: MPIO with IBM DSM

Immediate Observations:
  VMs were impacted after the storage firmware upgrade activity.
  2-4 hours after the activity completed, event IDs 5120 (STATUS_IO_TIMEOUT) and 5142 were logged for all CSVs at different times.
  Event IDs 129 and 153 were logged continuously on all Hyper-V hosts from the time the storage activity started.

[Screenshots: event IDs 5120, 5142, 129, 153 and 5]

Immediate actions performed:

We planned to reboot all Hyper-V servers one by one. We started with the coordinator nodes, that is, the Hyper-V hosts owning the CSV disks, to release the locks and bring the VM failovers under control quickly.
After each Hyper-V host was rebooted, we moved the CSV disks to that host. Once 3 or 4 Hyper-V servers had been restarted, the VM failovers were under control. However, a few VMs still could not be moved or failed over manually because of locks.
Therefore, as a good practice, we restarted all remaining Hyper-V servers so that the storage paths would be re-established cleanly.

After resolving the issue, we started looking for the root cause of the multipathing failure.

Our analysis, based on event IDs 129, 153, 5120 and 5142, is below.

Each cluster node has direct access to a CSV LUN as well as redirected access over the network, through the node that is the coordinator (owner) of the CSV resource. A 5120 error indicates a failure of redirected I/O, and a 5142 indicates a failure of both redirected and direct I/O.

Event ID 129 warnings are logged to the system event log with the storage adapter (HBA) driver's name as the source. Windows' STORPORT.SYS driver logs this message when it detects that a request has timed out. The HBA driver's name is used in the event because it is the miniport associated with storport.

The most common causes of Event ID 129 errors are unresponsive LUNs or dropped requests. Dropped requests can be caused by faulty routers or other hardware problems on the SAN. If you see Event ID 129 errors in your event logs, start by investigating the storage and the fibre network.

An event 153 is similar to an event 129. The difference is who times out the request: a 129 is logged when the storport driver itself times out a request to the disk, while a 153 is logged when the storport miniport driver times out a request.

The miniport driver may also be referred to as an adapter driver or HBA driver. This driver is typically written by the hardware vendor.
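To make the reasoning above easier to follow, here is a minimal Python sketch that groups events by the layer that logged them and checks whether the storage-stack timeouts (129, 153) started before the CSV errors (5120, 5142). The record format `(timestamp, host, event_id)` and the function names are assumptions for illustration only; they are not part of any Windows or cluster tooling.

```python
from datetime import datetime

# Layer that logs each event ID, as described in the text above.
EVENT_LAYER = {
    129: "storport",
    153: "miniport (HBA driver)",
    5120: "CSV",
    5142: "CSV",
}


def first_seen_by_layer(records):
    """Return {layer: earliest timestamp} for the known event IDs.

    records is an iterable of (timestamp, host, event_id) tuples;
    this tuple layout is a hypothetical format chosen for the sketch.
    """
    first = {}
    for timestamp, _host, event_id in records:
        layer = EVENT_LAYER.get(event_id)
        if layer is None:
            continue  # ignore event IDs that are not part of this analysis
        if layer not in first or timestamp < first[layer]:
            first[layer] = timestamp
    return first


def storage_errors_preceded_csv(records):
    """True if a 129/153 timeout was logged before the first CSV error."""
    first = first_seen_by_layer(records)
    csv_time = first.get("CSV")
    storage_times = [t for layer, t in first.items() if layer != "CSV"]
    if csv_time is None or not storage_times:
        return False
    return min(storage_times) < csv_time
```

Fed with the events exported from each host, a `True` result would match what we saw: the storage timeouts began with the storage activity, and the CSV errors followed later.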

In the end we concluded that there was a connectivity problem in the storage driver stack, somewhere between MPIO (IBM DSM) and the HBA driver, and we engaged the storage vendor for a deeper analysis from the storage side.

The storage team told us that, before the upgrade activity, they had found read/write abnormalities on the volumes, namely very high read/write latency, and that they had fixed this before the upgrade.

From that statement and a few of the blog posts listed under Ref, we understood that a CSV volume in the Draining state pends all new I/Os and any failed I/Os. Since the storage vendor confirmed abnormal read/write latency on the volumes, that latency would have delayed I/O completion on the CSV volumes, which then went into a paused state with I/O timeout errors.

For reference, this is how storport detects a timeout. There is one timer per logical unit, and it is initialized to -1. When the first request is sent to the miniport, the timer is set to the timeout value in the SRB.

The timer is decremented once per second. When a request completes, the timer is refreshed with the timeout value of the head request in the pending queue. So, as long as requests complete the timer will never go to zero. If the timer does go to zero, it means the device has stopped responding. That is when the STORPORT driver logs the Event ID 129 error. STORPORT then has to take corrective action by trying to reset the unit.
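The timer behaviour described above can be modelled in a few lines of Python. This is only a toy simulation of the description, not storport's actual implementation; the class name `LunTimer`, its methods, and any timeout value passed in are assumptions made for illustration.

```python
from collections import deque

IDLE = -1  # timer value while no request is outstanding


class LunTimer:
    """Toy model of the per-logical-unit timeout timer."""

    def __init__(self):
        self.timer = IDLE
        self.pending = deque()  # SRB timeout values in seconds, oldest first

    def send(self, srb_timeout):
        """Queue a request; the first outstanding request arms the timer."""
        if not self.pending:
            self.timer = srb_timeout
        self.pending.append(srb_timeout)

    def complete(self):
        """Complete the oldest request and refresh the timer."""
        self.pending.popleft()
        # Refresh from the head of the pending queue, or go idle if empty.
        self.timer = self.pending[0] if self.pending else IDLE

    def tick(self):
        """Advance one second.

        Returns True at the moment the timer reaches zero, which is the
        point where Event ID 129 would be logged and the unit reset.
        """
        if self.timer <= 0:
            return False  # idle, or already expired
        self.timer -= 1
        return self.timer == 0


def seconds_until_timeout(lun, max_seconds):
    """Tick until the timer expires; return the second it fired, or None."""
    for second in range(1, max_seconds + 1):
        if lun.tick():
            return second
    return None
```

In this model, a LUN that keeps completing requests has its timer refreshed on every completion and never reaches zero, while a LUN that stops responding after a single `send(t)` expires exactly `t` ticks later.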

It is also recommended to upgrade the HBA driver, as the installed version is outdated, and to update CSVFLT.sys and CSVFS.sys by following KB3013767.

Ref:

https://blogs.msdn.microsoft.com/ntdebugging/2011/05/06/understanding-storage-timeouts-and-event-129-errors/

https://blogs.msdn.microsoft.com/clustering/2014/12/08/troubleshooting-cluster-shared-volume-auto-pauses-event-5120/

https://blogs.msdn.microsoft.com/clustering/2014/02/26/event-id-5120-in-system-event-log/
