Virtualization - Cloud

November 26, 2016

Hyper-V Live Migration terms: brownout, blackout and dirty pages

You may not know about brownouts, blackouts and dirty pages in Hyper-V Live Migration, but these terms are useful for monitoring virtual machine live migrations.

Hyper-V Live Migration is undoubtedly the most sought-after feature of Hyper-V because of its ability to move virtual machines (VMs) between clustered hosts without noticeable service interruption. But in fact, Live Migration does cause brief disruptions in service, even if end users rarely notice them.

As an admin, you should understand some lesser-known Hyper-V Live Migration terms that help monitor and troubleshoot service interruption.

Hyper-V event logs contain information about live migration disruptions that can briefly affect VMs. For every VM live migration, these logs report three events: a brownout event, a blackout and dirty-pages event, and a summary of the live-migration process. Understanding these terms also helps you troubleshoot live migrations that take too long or that block administrative tasks.

You’ll find the Live Migration logs under Applications and Services Logs -> Microsoft -> Windows -> Hyper-V-Worker.

These Hyper-V Live Migration events use the following IDs:

Brownout event: 22508
Blackout and dirty-pages event: 22509
Blackout event: 20415 (logged on success; includes the blackout time)
Live Migration summary event: 22507
Successful Live Migration event: 20418
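
If you prefer to pull these events with PowerShell rather than browsing the log viewer, a minimal sketch along the following lines works. The channel name 'Microsoft-Windows-Hyper-V-Worker-Admin' is an assumption here; verify it on your build with Get-WinEvent -ListLog *Hyper-V-Worker*.

    # List recent Live Migration-related events from the Hyper-V-Worker log.
    # Channel name is assumed; confirm with: Get-WinEvent -ListLog *Hyper-V-Worker*
    Get-WinEvent -FilterHashtable @{
        LogName = 'Microsoft-Windows-Hyper-V-Worker-Admin'
        Id      = 22508, 22509, 20415, 22507, 20418
    } | Select-Object TimeCreated, Id, Message | Sort-Object TimeCreated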

[Figure: Hyper-V-Worker log showing a successful Live Migration event]

 

Live Migration brownout event

The Hyper-V-Worker event log lists the brownout stage first. In the context of virtualization, a brownout is the amount of time it takes to complete the memory-transfer portion of a Hyper-V Live Migration. The term brownout is a good metaphor for this event, because the VM is not completely cut off (as the term blackout would suggest). The VM is still responsive, but you can’t perform configuration changes or other administrative functions during this stage of the live migration.

[Figure: brownout event 22508 in the Hyper-V-Worker log]

The figure above shows that the brownout took 19.43 seconds. This time depends on the amount of active RAM the VM uses and the speed of the Live Migration transport network. During this stage, the VM remains fully responsive while its memory pages are copied to the destination node. This stage moves most of the VM’s state to the other node, but not quite all of it. Since the VM stays responsive, users most likely never know that a migration to another node is in progress, although VM responses may be briefly delayed. You can monitor this delay by constantly pinging the VM with the command ping SERVERNAME -t; you’ll notice brief periods of longer response times, without a total disruption of service.
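
To watch for that delay during a migration, a simple timestamped ping loop is enough. A minimal PowerShell sketch (SERVERNAME is a placeholder for the VM being migrated; on Windows PowerShell the reply object exposes ResponseTime in milliseconds):

    # Timestamped ping loop to spot brief response-time spikes during the brownout stage.
    while ($true) {
        $reply = Test-Connection -ComputerName "SERVERNAME" -Count 1 -ErrorAction SilentlyContinue
        if ($reply) {
            "{0}  {1} ms" -f (Get-Date -Format 'HH:mm:ss'), $reply.ResponseTime
        } else {
            "{0}  request timed out" -f (Get-Date -Format 'HH:mm:ss')
        }
        Start-Sleep -Seconds 1
    }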

Live Migration blackout and dirty-pages event

The final stage of a Hyper-V Live Migration is when the VM fully migrates to the destination node of the cluster. This is called the blackout stage: to finally move the VM and the rest of its memory, there is a brief pause in service. During the brownout stage, the host attempts to move all active memory to the destination node, but the memory is never completely transferred until this final step. A final snapshot provides a last representation of the remaining memory, which is known as dirty pages. When the dirty pages are migrated to the destination node, the blackout occurs.

 

The blackout period is by no means comparable to the longer saved state in Hyper-V’s former Quick Migration feature, because Live Migration usually moves a very small amount of data during this final stage. But a slight disruption will occur, usually about one to two seconds, or one dropped ping. Unlike during the brownout stage, the VM is not responsive. The event log indicates how long the blackout period was and how many dirty pages were moved during the migration’s final stage (see the figure below).

 

[Figure: blackout and dirty-pages event 22509 showing blackout duration and dirty-page count]

Note that for servers with a higher transaction workload, live migrations show longer blackout times and a greater number of dirty pages.

These two Hyper-V Live Migration terms are important because the blackout and dirty-pages event is a troubleshooting tool. The log tells you how long a VM was unavailable, which is useful information when a live migration takes longer than expected or when there is a noticeable disruption in service.
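
When one particular VM shows a noticeable pause, the 22509 events can be filtered by VM name to read off its blackout time and dirty-page count. A hedged sketch (VM01 and the channel name are placeholders/assumptions, as above):

    # Pull blackout/dirty-pages events (22509) that mention a specific VM.
    Get-WinEvent -FilterHashtable @{
        LogName = 'Microsoft-Windows-Hyper-V-Worker-Admin'
        Id      = 22509
    } | Where-Object { $_.Message -like '*VM01*' } |
        Select-Object TimeCreated, Message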

Live Migration summary event

The final event, 22507, gives a concise summary of the duration of the live migration process.

[Figure: Live Migration summary event 22507]

Note: The above article is based on Hyper-V 2008 R2, but it applies to later versions as well.

Multiple virtual machines went into a paused state in a Hyper-V cluster with the error “Disk(s) running out of space” – Typical Issue

Environment:
OS: Windows Server 2012, 10-node Hyper-V cluster
Model: ProLiant BL460c Gen8
Storage: FC, HP 3PAR
Error Message:
Multiple virtual machines went into a paused state with the error “Disk(s) running out of space”, even though 30% free space was available.

Immediate Observations:

  • Only a few virtual machines went into a paused state, and all of them were hosted on a single volume, say Volume 1
  • Volume 1 had 30% free space (see the sketch after this list for a quick way to check CSV free space). Checked with the storage team whether the LUN was thick or thin, since a thin-provisioned LUN can be exhausted; it was a thick LUN and no issues were observed from the storage side
  • Tried to resume the VMs, but after some time they moved back into the paused state
  • Tried to move the CSV disk from one node to another – no luck
  • Observed that not a single recommended hotfix was installed on any node in the cluster, although all nodes were up to date with all critical patches
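
For reference, a minimal PowerShell sketch to check free space per CSV from a cluster node (assumes the FailoverClusters module; the property names come from the cluster’s CSV objects):

    # List each Cluster Shared Volume with its mount path, free space and percent free.
    Import-Module FailoverClusters
    Get-ClusterSharedVolume | ForEach-Object {
        $info = $_.SharedVolumeInfo
        [pscustomobject]@{
            Name        = $_.Name
            Path        = $info.FriendlyVolumeName
            FreeGB      = [math]::Round($info.Partition.FreeSpace / 1GB, 1)
            PercentFree = [math]::Round($info.Partition.PercentFree, 1)
        }
    }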

Resolution

  • Based on the above findings, we assumed some known issue was involved, since the error was misleading: the disks were reported as running out of space even though 30% free space was available
  • As not a single recommended hotfix was installed in the cluster, we initially installed the Hyper-V 2012 hotfixes on one node and restarted it
  • Moved the affected volume to the recently restarted node – no issues were observed
  • After a period of observation, we installed all the recommended hotfixes on the remaining nodes
Finally, I am not sure exactly which hotfix resolved the issue, but I have seen forum threads where these issues are addressed by KB2791729, which is a private hotfix.

Ref:
http://support.microsoft.com/kb/2784261 – Windows Server 2012 Hotfixes

In a 15-node cluster, the majority of the virtual machines failed over (restarted) due to network fluctuations during a network maintenance activity

Environment:
OS: Windows Server 2012
Model: IBM Flex System 8721 (chassis); the Hyper-V servers are spread across 2 chassis
Network: 2 networks (public and private) on 2 different switches

Immediate Observations:

  • During the core network switch activity, both the public and private network interfaces were disturbed for 30 to 60 seconds at a time, and these fluctuations continued over a period of about 2 hours
  • Observed event IDs 1127 (network interface failed), 1135 (node removed from cluster membership), 1177 (quorum lost) and 5120 (CSV disconnection)

[Figure: event 1127 – cluster network interface failed]
[Figure: event 5120 – CSV disconnection]
[Figure: event 1135 – node removed from active cluster membership]
[Figure: event 1177 – quorum lost]

 

Immediate actions performed

  • As the heartbeat was not available and the other interface was not interrupting, we increased the heartbeat tolerance by raising SameSubnetThreshold from the default of 5 heartbeats (5 seconds) to 28 heartbeats (28 seconds). This value was chosen by referring to the blog linked below
  • As per Microsoft’s recommendation, SameSubnetThreshold should not be set above 20 heartbeats. Since we observed that the VLAN flapping took about 25 seconds, we set SameSubnetThreshold, which is the heartbeat count, to 28 as a test. However, RouteHistoryLength cannot be set to more than 40 due to a limitation

[Figure: cluster heartbeat parameters]

SameSubnetDelay = 1000 (1000 milliseconds, i.e. 1 second)
SameSubnetThreshold = 5 (5 heartbeats)

So, by default the total heartbeat tolerance in a cluster is Delay × Threshold = 1 second × 5 heartbeats, i.e. the cluster can tolerate 5 missed heartbeats over 5 seconds before taking recovery action.

Delay – This defines the frequency at which cluster heartbeats are sent between nodes.  The delay is the interval before the next heartbeat is sent (the SameSubnetDelay value is specified in milliseconds).

Threshold – This defines the number of heartbeats which are missed before the cluster takes recovery action

For example, setting SameSubnetDelay to send a heartbeat every 2 seconds and setting SameSubnetThreshold to 10 missed heartbeats before taking recovery action means that the cluster has a total network tolerance of 20 seconds (2 sec × 10 heartbeats) before recovery action is taken.
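
For reference, the current values can be checked and adjusted with PowerShell on a cluster node; a minimal sketch (assumes the FailoverClusters module; adjust the threshold to your own tolerance requirement):

    # View the current heartbeat settings for this cluster.
    Import-Module FailoverClusters
    Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold, RouteHistoryLength

    # Raise the same-subnet threshold to 28 heartbeats (28 seconds with the default 1000 ms delay).
    (Get-Cluster).SameSubnetThreshold = 28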


Even 28 heartbeats would not meet our requirement, since the network fluctuations lasted more than 30 seconds, so we sought Microsoft support for any other best practice and received the recommendations below.

  • The SameSubnetDelay and SameSubnetThreshold values are specific to the heartbeat settings between the cluster nodes/hosts. These changes only delay the heartbeat checks between the nodes/hosts.
  • The above changes do not control the SMB multichannel connections used by Cluster Shared Volumes (CSV). The moment a TCP connection is dropped during the network maintenance activity, the SMB channel is impacted.
  • As the SMB connection drops, it affects the VMs hosted on the CSV volumes; they can no longer obtain the CSV metadata over the SMB channel.
  • Due to problems over the CSV network (SMB channel), you will see event ID 5120 for the CSV volumes, which impacts VM availability.

So, based on the above points, if there is a network outage beyond 10-20 seconds, the cluster will be impacted and there is no way to avoid the impact on the VM resources. It is recommended to move the VMs to nodes that will not be affected by the network work, or to bring them offline gracefully, before the maintenance activity starts.
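
One way to do that is to drain a node before the maintenance window and resume it afterwards. A minimal sketch (assumes the FailoverClusters module; HOST01 is a placeholder node name):

    # Drain the roles off a node before network maintenance, then bring it back afterwards.
    Import-Module FailoverClusters
    Suspend-ClusterNode -Name "HOST01" -Drain          # moves the roles off the node
    # ... perform the network maintenance affecting HOST01 ...
    Resume-ClusterNode -Name "HOST01" -Failback Immediate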

Also, Microsoft responded as below to the additional option we asked them to check; it is not viable in a CSV environment because the SMB connection drops impact CSV:

Removing the VMs from high availability and adding them back is time-consuming and not an easy option, and Microsoft does not suggest this option in a CSV environment.

Ref: https://blogs.msdn.microsoft.com/clustering/2012/11/21/tuning-failover-cluster-network-thresholds/

All the VMs failed over (restarted) unexpectedly from one node to another in a 15-node Hyper-V cluster, post storage firmware upgrade

Environment:
OS: Windows Server 2012
Model: IBM Flex System 8721 (chassis); the Hyper-V servers are spread across 2 chassis
Storage: IBM SVC7000 over FC
Multipathing: IBM DSM

Immediate Observations:
  • VMs were impacted after the storage firmware upgrade activity
  • 2-4 hours after the activity completed, event IDs 5120 (STATUS_IO_TIMEOUT) and 5142 were observed for all CSVs at different times
  • Continuous event IDs 129 and 153 were observed on all Hyper-V host servers from the time the storage activity started

[Figure: event 5120]
[Figure: event 5142]
[Figure: event 129]
[Figure: event 153]
[Figure: related event log entry]

Immediate actions performed

  • Planned to reboot all the Hyper-V servers one by one, starting with the coordinator node (the host owning the CSV disk), to release the locks and stop the VM failovers immediately.
  • After a Hyper-V host was rebooted, started moving CSV disks to that rebooted server (see the sketch after this list). Once 3 or 4 Hyper-V servers had been restarted, the VM failovers were under control; however, a few VMs still could not be moved or failed over manually due to locks.
  • Therefore, as a good practice, restarted all the Hyper-V servers so that the storage paths were re-established without any issues.
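
The CSV ownership moves were done along these lines; a minimal sketch ("Cluster Disk 1" and HOST01 are placeholder names):

    # Move ownership (coordination) of a CSV to a freshly rebooted node.
    Import-Module FailoverClusters
    Get-ClusterSharedVolume                                    # list CSVs and their current owner nodes
    Move-ClusterSharedVolume -Name "Cluster Disk 1" -Node "HOST01"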

After resolving the issue, we started looking for the root cause of the multipathing failure.

We analyzed the situation as below, based on the event IDs 129, 153, 5120 and 5142.

Each cluster node has direct access to a CSV LUN as well as redirected access over the network through the node that is the coordinator (owner) of the CSV resource. Event 5120 indicates a failure of redirected I/O, and event 5142 indicates a failure of both redirected and direct I/O.
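
To see whether a CSV is currently in direct or redirected access on each node, a minimal sketch using the built-in cmdlet (Windows Server 2012 and later):

    # Show the access mode of each CSV per node (Direct, FileSystemRedirected or BlockRedirected).
    Import-Module FailoverClusters
    Get-ClusterSharedVolumeState |
        Select-Object Name, Node, StateInfo, FileSystemRedirectedIOReason, BlockRedirectedIOReason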

These warning events are logged to the System event log with the storage adapter (HBA) driver’s name as the source.  Windows’ STORPORT.SYS driver logs the message when it detects that a request has timed out; the HBA driver’s name is used in the event because it is the miniport associated with storport.

The most common causes of Event ID 129 errors are unresponsive LUNs or dropped requests.  Dropped requests can be caused by faulty routers or other hardware problems on the SAN.  If you are seeing Event ID 129 errors in your event logs, you should start investigating the storage and fibre network.

An event 153 is similar to an event 129: both relate to timed-out disk requests.  The difference is that a 129 is logged when the storport driver times out a request, while a 153 is logged when the storport miniport driver times out a request.

The miniport driver may also be referred to as an adapter driver or HBA driver; it is typically written by the hardware vendor.
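
A quick way to pull these events from a host is a standard Get-WinEvent query; a minimal sketch (adjust the time window as needed):

    # Pull storport/miniport timeout events (129 and 153) from the System log for the last 7 days.
    $since = (Get-Date).AddDays(-7)
    Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 129, 153; StartTime = $since } |
        Select-Object TimeCreated, Id, ProviderName, Message |
        Sort-Object TimeCreated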

Finally, we understood that there was an issue somewhere in the storage stack, between MPIO (the IBM DSM) and the HBA driver, and we involved the storage vendor for a deeper analysis from the storage end.

From the storage team, we came to know that before the storage upgrade activity, read/write abnormalities had been found on the volumes, i.e. very high read/write latency; however, they had fixed this before the upgrade.

From the above statement, and after referring to a few blogs, we understood that in the Draining state a volume pends all new I/Os and any failed I/Os. As the storage vendor confirmed that the read/write latency on the volumes was abnormal, this would have delayed I/O completion for the CSV volumes, causing them to go into a paused state with I/O timeout errors.

There is one timer per logical unit and it is initialized to -1.  When the first request is sent to the miniport the timer is set to the timeout value in the SRB.

The timer is decremented once per second.  When a request completes, the timer is refreshed with the timeout value of the head request in the pending queue.  So, as long as requests complete the timer will never go to zero.  If the timer does go to zero, it means the device has stopped responding.  That is when the STORPORT driver logs the Event ID 129 error.  STORPORT then has to take corrective action by trying to reset the unit.

Also, it was recommended to upgrade the HBA driver, as it was very old, and to update CsvFlt.sys and CsvFs.sys by following KB3013767.

Ref:

https://blogs.msdn.microsoft.com/ntdebugging/2011/05/06/understanding-storage-timeouts-and-event-129-errors/

https://blogs.msdn.microsoft.com/clustering/2014/12/08/troubleshooting-cluster-shared-volume-auto-pauses-event-5120/

https://blogs.msdn.microsoft.com/clustering/2014/02/26/event-id-5120-in-system-event-log/

 

 
