Virtualization - Cloud

Category: Hyper-V

In a 15-node cluster, the majority of virtual machines failed over (restarted) due to network fluctuations during a network maintenance activity

Environment:
OS: Windows Server 2012
Model: IBM Flex System 8721 (Chassis); Hyper-V servers are spread across 2 chassis
Network: 2 networks (public & private) on 2 different switches

Immediate Observations:

  • During the core network switch activity, both the public and private network interfaces experienced disturbances of 30 to 60 seconds each, and these fluctuations recurred over a period of about 2 hours
  • Observed event IDs 1127 (network interface failed), 1135 (node removed from cluster membership), 1177 (quorum lost) and 5120 (CSV disconnected)

[Screenshots: Event IDs 1127, 5120 (CSV disconnection), 1135 and 1177]

Immediate actions performed

  • As heartbeats were being missed on both interfaces during the fluctuations, we increased the cluster's heartbeat tolerance by raising SameSubnetThreshold from its default of 5 heartbeats to 28. This value was chosen by referring to the blog linked below.
  • Microsoft's general recommendation is not to set SameSubnetThreshold higher than 20 (about 20 seconds at the default 1-second delay). However, since the VLAN flapping we observed lasted about 25 seconds, we set SameSubnetThreshold (the heartbeat threshold) to 28 as a test. Note that RouteHistoryLength cannot be set higher than 40 due to a limitation.

[Screenshot: cluster heartbeat settings]

SameSubnetDelay = 1000 (1000 milliseconds, i.e. 1 second)
SameSubnetThreshold = 5 (5 missed heartbeats)

So, by default, the total heartbeat tolerance in a cluster is Delay × Threshold = 1 sec × 5 heartbeats, i.e. the cluster can tolerate 5 missed heartbeats over 5 seconds before taking recovery action.

Delay – This defines the frequency at which cluster heartbeats are sent between nodes. The delay is the interval before the next heartbeat is sent (SameSubnetDelay is specified in milliseconds).

Threshold – This defines the number of heartbeats which are missed before the cluster takes recovery action

For example, setting SameSubnetDelay to send a heartbeat every 2 seconds and SameSubnetThreshold to 10 missed heartbeats before taking recovery action gives the cluster a total network tolerance of 20 seconds (2 sec × 10 heartbeats) before recovery action is taken.
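As an illustration, these values can be inspected and changed with the FailoverClusters PowerShell module. A minimal sketch of the change described above (28 is the value we used; treat it as environment-specific rather than a general recommendation):

  # Requires the FailoverClusters module on a cluster node
  Import-Module FailoverClusters

  # Inspect the current heartbeat-related settings
  Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold, RouteHistoryLength

  # Raise the same-subnet threshold (number of missed heartbeats tolerated)
  (Get-Cluster).SameSubnetThreshold = 28

  # RouteHistoryLength is commonly kept at roughly twice the threshold,
  # but (as noted above) it cannot be raised beyond 40 here
  (Get-Cluster).RouteHistoryLength = 40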


Changing the threshold to 28 heartbeats would not meet our requirement, as the network fluctuations lasted more than 30 seconds, so we sought Microsoft support for any other best practice and received the recommendations below.

  • The SameSubnetDelay and SameSubnetThreshold values are specific to the heartbeat settings between cluster nodes/hosts. Changing them only delays the heartbeat checks between the nodes/hosts.
  • The above changes do not control the SMB multichannel connections used by Cluster Shared Volumes (CSV). The moment a TCP connection is dropped during the network maintenance activity, the SMB channel is impacted.
  • As the SMB connection drops, it affects the VMs hosted on the CSV volumes; they are no longer able to get the CSV metadata over the SMB channel.
  • Because of problems on the CSV network (SMB channel), event ID 5120 is logged for the CSV volumes, which impacts VM availability.

So, based on the above points, if there is a network outage beyond 10-20 seconds, the cluster will be impacted and there is no way to avoid the impact on the VM resources. It is recommended to move the VMs to nodes that will not be affected by the network work, or to bring them offline gracefully, before the network maintenance activity.
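Where spare, unaffected nodes exist, draining a node ahead of the maintenance window live-migrates its roles off in a controlled way. A minimal sketch using the FailoverClusters cmdlets (the node name is a placeholder):

  Import-Module FailoverClusters

  # Drain the node: live-migrate its VMs/roles to other nodes and pause it
  Suspend-ClusterNode -Name "HV-NODE01" -Drain -Wait

  # ... perform the network maintenance ...

  # Bring the node back into the cluster and fail roles back immediately
  Resume-ClusterNode -Name "HV-NODE01" -Failback Immediate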

Microsoft also responded as below to the additional option we asked them to check; it is not viable in a CSV environment because the dropped SMB connection still impacts CSV:

Removing VMs from high availability and adding them back is time consuming and not an easy option, and Microsoft does not suggest this option in a CSV environment.

Ref: https://blogs.msdn.microsoft.com/clustering/2012/11/21/tuning-failover-cluster-network-thresholds/

All VMs failing over (restarting) unexpectedly from one node to another in a 15-node Hyper-V cluster after a storage firmware upgrade

Environment:
OS: Windows Server 2012
Model: IBM Flex System 8721 (Chassis)
Hyper-V servers are spread across 2 chassis
Storage: IBM SVC7000 over FC
Multipathing with IBM DSM

Immediate Observations:
  • VMs were impacted after the storage firmware upgrade activity
  • 2-4 hours after the activity completed, event IDs 5120 (STATUS_IO_TIMEOUT) and 5142 were observed for all CSVs at different times
  • Continuous event IDs 129 and 153 were observed on all Hyper-V host servers from the time the storage activity started

[Screenshots: Event IDs 5120, 5142, 129, 153 and 5]

Immediate actions performed

  • Planned to reboot all Hyper-V servers one by one, starting with the coordinator node (the host that owned the CSV disks) in order to release the locks and bring the VM failovers under control immediately.
  • After each Hyper-V host was rebooted, we moved the CSV disks to the rebooted server (see the PowerShell sketch after this list). After 3 or 4 Hyper-V servers had been restarted, the VM failovers were under control; however, a few VMs could not be moved or failed over manually due to locks.
  • Therefore, as a good practice, we restarted all Hyper-V servers so that the storage paths were re-established cleanly.
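Checking which node currently coordinates (owns) a CSV and moving that ownership can be done with the FailoverClusters cmdlets; a minimal sketch (the CSV and node names are placeholders):

  Import-Module FailoverClusters

  # See which node currently owns (coordinates) each CSV
  Get-ClusterSharedVolume | Format-Table Name, OwnerNode, State

  # Move coordination of a CSV to a freshly rebooted node
  Move-ClusterSharedVolume -Name "Cluster Disk 2" -Node "HV-NODE03"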

After resolving the issue, we started looking for the root cause of the multipathing failure.

We analyzed the event IDs above (129, 153, 5120 & 5142) as follows.

Each cluster node has direct access to a CSV LUN, as well as redirected access over the network through the node that is the coordinator (owner) of the CSV resource. Event 5120 indicates a failure of redirected I/O, while 5142 indicates a failure of both redirected and direct I/O.

The warning events are logged to the System event log with the storage adapter (HBA) driver's name as the Source. Windows' STORPORT.SYS driver logs this message when it detects that a request has timed out; the HBA driver's name appears in the event because it is the miniport associated with Storport.

The most common causes of Event ID 129 errors are unresponsive LUNs or dropped requests. Dropped requests can be caused by faulty routers or other hardware problems on the SAN. If you are seeing Event ID 129 errors in your event logs, you should start investigating the storage and the Fibre Channel network.

Event 153 is similar to event 129. The difference is that event 129 is logged when the Storport driver itself times out a request to the disk, whereas event 153 is logged when the Storport miniport driver times out a request.

The miniport driver may also be referred to as an adapter driver or HBA driver; this driver is typically written by the hardware vendor.
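To pull these storage timeout warnings out of the System log for review, a Get-WinEvent query along these lines can be used (a sketch; adjust the IDs and event count as needed):

  # Retrieve recent Storport/miniport timeout warnings (Event IDs 129 and 153);
  # Get-WinEvent reports an error if no matching events exist
  Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 129, 153 } -MaxEvents 100 |
      Select-Object TimeCreated, Id, ProviderName, Message |
      Format-Table -AutoSize -Wrap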

Finally, we clearly understood that there was a connectivity issue somewhere in the storage stack, between MPIO (IBM DSM) and the HBA driver, and we involved the storage vendor for a deeper analysis from the storage end.

From the storage team we came to know that, before the storage upgrade activity, read/write abnormalities had been found on the volumes (very high read/write latency); however, they had fixed this before the upgrade.

From the above statement, and by referring to a few blogs, we understood that in the Draining state a volume pends all new I/Os and any failed I/Os. As the storage vendor confirmed that the read/write latency on the volumes was abnormal, it would have delayed the completion of I/O for the CSV volumes, which then went into the paused state with I/O timeout errors.

There is one timer per logical unit and it is initialized to -1.  When the first request is sent to the miniport the timer is set to the timeout value in the SRB.

The timer is decremented once per second.  When a request completes, the timer is refreshed with the timeout value of the head request in the pending queue.  So, as long as requests complete the timer will never go to zero.  If the timer does go to zero, it means the device has stopped responding.  That is when the STORPORT driver logs the Event ID 129 error.  STORPORT then has to take corrective action by trying to reset the unit.
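Purely as a conceptual illustration of the timer behaviour described above (this is not real Storport code, and the request timeouts and completion times are made-up values):

  # Simulated per-LUN timer: as long as requests keep completing, it never reaches zero
  $pending = New-Object System.Collections.Queue
  1..5 | ForEach-Object { $pending.Enqueue(@{ Name = "Req$_"; Timeout = 10 }) }

  $timer = $pending.Peek().Timeout          # first request sent: timer armed from its SRB timeout
  for ($second = 1; $second -le 120; $second++) {
      $timer--                              # decremented once per second

      $deviceResponding = $true             # flip to $false to see the Event 129 case
      if ($deviceResponding -and ($second % 4 -eq 0) -and $pending.Count -gt 0) {
          $null = $pending.Dequeue()        # a request completed
          # refresh with the head request's timeout, or disarm (-1) when the queue is empty
          $timer = if ($pending.Count -gt 0) { $pending.Peek().Timeout } else { -1 }
          if ($timer -eq -1) { Write-Host "All requests completed; timer disarmed"; break }
      }

      if ($timer -eq 0) {
          Write-Warning "Timer reached zero: the device is unresponsive - this is when Storport logs Event ID 129 and resets the unit"
          break
      }
  }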

Additionally, it was recommended to upgrade the HBA driver, which was very old, and the CSV filter drivers (CsvFlt.sys and CsvFs.sys) by following KB3013767.

Ref:

https://blogs.msdn.microsoft.com/ntdebugging/2011/05/06/understanding-storage-timeouts-and-event-129-errors/

https://blogs.msdn.microsoft.com/clustering/2014/12/08/troubleshooting-cluster-shared-volume-auto-pauses-event-5120/

https://blogs.msdn.microsoft.com/clustering/2014/02/26/event-id-5120-in-system-event-log/

 

 

Unable to start any VM on one of the Hyper-V cluster nodes

Issue:
Unable to start any VM on one node (Windows Server 2012) of the Hyper-V cluster.

Observations

  • The issue started after a Symantec upgrade; the Symantec upgrade was not successful
  • Unable to migrate any VM from another node in the cluster
  • Unable to start any VM on that node; after clicking Start, the VM stays in the Starting state and throws an error after 2-5 minutes

Error Messages Seen:
TEST failed to start worker process: "Server execution failed" (0x80080005)

Troubleshooting done:

  • Gave Full Control to Everyone on the registry key below: HKCR\AppID\{8BC3F05E-D86B-11D0-A075-00C04FB68820}
  • After granting the permission, VMs could be started on the host.
  • Compared the registry key permissions with another, working machine and found that CREATOR OWNER was missing on the problem machine. Added it and removed Everyone; VMs could still be started (a PowerShell sketch of this follows after this list).
  • To fix WMI and DCOM errors, re-registered the DLLs by running the command below, which resolved the issue: for /f %s in ('dir /b *.dll') do regsvr32 /s %s
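The permission check and fix from the third bullet can also be scripted. A rough PowerShell sketch of granting CREATOR OWNER full control on that AppID key (this assumes the account running it is allowed to change the key's ACL, which may require taking ownership of the key first):

  $keyPath = "Registry::HKEY_CLASSES_ROOT\AppID\{8BC3F05E-D86B-11D0-A075-00C04FB68820}"

  # Review the current permissions first
  (Get-Acl -Path $keyPath).Access | Format-Table IdentityReference, RegistryRights, AccessControlType

  # Add a Full Control entry for CREATOR OWNER (mirroring the working host)
  $acl  = Get-Acl -Path $keyPath
  $rule = New-Object System.Security.AccessControl.RegistryAccessRule -ArgumentList "CREATOR OWNER", "FullControl", "ContainerInherit", "None", "Allow"
  $acl.AddAccessRule($rule)
  Set-Acl -Path $keyPath -AclObject $acl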

Attached is the detailed document with all screenshots. Unable to Start VM -Doc

Enabling Jumbo Frames on Cisco UCS blades - Hyper-V

How to enable Jumbo Frames for Hyper-V hosts running on Cisco UCS blades

The Jumbo Frames setting can be enabled from UCS Manager; when the servers are hosted on Cisco UCS blades, no changes need to be performed on the Windows side.

You need to make 3 changes:

  • Set the System Class MTU to 9216
  • Create a QoS policy for the MTU
  • Set the vNIC MTU to 9000 and attach the QoS policy you created

Jumbo Frames on UCS are configured as a QoS policy; the configuration guide is in the link below:

http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/sw/gui/config/guide/2-2/b_UCSM_GUI_Configuration_Guide_2_2/configuring_quality_of_service.html

Since Hyper-V is the OS in use, the following configuration guide is quite useful for understanding which components on the UCS need to be configured to enable Jumbo Frames:

http://www.cisco.com/c/en/us/support/docs/servers-unified-computing/ucs-b-series-blade-servers/117601-configure-UCS-00.html
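Once the UCS side is configured, a quick sanity check from a Hyper-V host can confirm that jumbo frames work end to end. A sketch (the adapter name and target IP are placeholders):

  # Check the jumbo packet size currently advertised by the NIC/driver
  Get-NetAdapterAdvancedProperty -Name "Ethernet 2" -RegistryKeyword "*JumboPacket"

  # Send an 8972-byte ping with "don't fragment" set (8972 bytes of payload
  # plus 28 bytes of ICMP/IP headers = a 9000-byte frame)
  ping.exe 192.168.10.20 -f -l 8972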

Find the document with screenshots below:

Document -Jumbo Frames enablement-CISCO UCS

VHD/VHDX Disk Compact/Shrink: differences between the 2008 & 2012 versions

The difference between Compact and Shrink in Windows Server 2008 and 2012 is that in Windows Server 2008 (for VHD), the Compact option both compacts and shrinks the disk, whereas in 2012 (for VHDX) you need to use Shrink if you want to reduce the VHDX size; Compact only compacts.

Note:

  • You will get the Shrink option only for VHDX-format files (a PowerShell sketch for shrinking a VHDX follows after this list).
  • You can run PowerShell commands to shrink VHD files – this is good for VHD-format files as it compacts and shrinks from the command line:
    • Mount-VHD -Path "C:\ClusterStorage\Volume7\Test\test.vhd" -ReadOnly   –> This mounts the VHD (the VM needs to be turned off)
    • Optimize-VHD -Path "C:\ClusterStorage\Volume7\Test\test.vhd" -Mode Quick   –> This compacts & shrinks the VHD
    • Dismount-VHD "C:\ClusterStorage\Volume7\Test\test.vhd" –> This dismounts the VHD from Disk Management
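For VHDX files on Windows Server 2012, the shrink itself can also be scripted. A small sketch (the path is a placeholder; the VM must be off, and the partition inside the VHDX must already have been shrunk so that unallocated space exists at the end of the disk):

  # Shrink a VHDX file to the minimum size its current partition layout allows
  Resize-VHD -Path "C:\ClusterStorage\Volume7\Test\test.vhdx" -ToMinimumSize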

[Screenshot: PowerShell commands for compacting/shrinking a VHD]

[Screenshot: Windows Server 2012 – Edit Disk options for VHDX files]

[Screenshot: Windows Server 2008 – Edit Disk options for VHD files]

Note:

  • The above posts were created based on personal experience & knowledge, for personal reuse.
  • If you wish to use the above articles, apply the steps only after proper testing; the reader is responsible for any outcome.