Virtualization - Cloud

Author: Ram Prasad

Hyper-V VM Snapshot Deletion Activity - 1.9 TB - Challenges

Issue:

In one of our customer environments, a snapshot on one of the VMs grew to 1.9 TB. It was created by an engineer as part of an IS upgrade, but he forgot to delete it afterwards.

Environment

  • Hyper-V: 2012 R2 cluster (4 nodes)
  • 2 volumes: Volume1 – 7 TB (1.18 TB free), Volume2 – 7 TB (980 GB free)
  • VM role: standalone critical VM hosting MS SQL (2008 R2) databases; the combined size of all 100 databases is 1.3 TB.

Challenges:

  • VM-level backups do not exist due to a backup licensing issue; however, regular database backups are taken with the backup tool. As of now, the SQL and backup teams have not tested restoration.
  • Additional free space was expected from storage, as snapshot deletion (merging) requires equivalent VHD free space. Due to a storage credentials issue, the storage team was unable to provide any support.

Due to the above two challenges, the following options were planned and completed as prerequisites:

  • Removed all unwanted files from Volume2 and freed up 1.6 TB, so that the snapshot deletion (merge) would not run into a space issue.
  • Built a new VM (SQL Server) and restored the databases to it. This test was done to estimate the restoration time and check database consistency.

Implementation plan:

Prerequisites

  • As there is no VM-level backup, the backup team needs to take a FULL database backup and a differential backup after the downtime starts.
  • Shut down the VM.
  • Move CSV Volumes 1 & 2 to the Hyper-V server where the VM is hosted, to provide better I/O.
  • Make sure only this one VM is hosted on the Hyper-V server, to provide better performance; we have sufficient resources to dedicate the node to a single VM.

Implementation Plan:

  • Go to Hyper-V Manager -> select the VM -> right-click -> Delete Snapshot

Note: If the merge process takes longer than expected, we cannot cancel it midway, as there is a high chance of corruption.
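If you prefer doing the same from PowerShell (which also makes it easier to keep an eye on the merge), here is a minimal sketch; the VM name and CSV path are hypothetical, and in our activity the deletion itself was done from Hyper-V Manager as described above.

```powershell
# Hypothetical VM name; run on the Hyper-V host that owns the VM
$vmName = "SQLPROD01"

# List the existing snapshots (checkpoints) and when they were created
Get-VMSnapshot -VMName $vmName | Select-Object Name, CreationTime

# Delete all snapshots of the VM; Hyper-V then merges the AVHDX files back into the parents
Get-VMSnapshot -VMName $vmName | Remove-VMSnapshot

# With the VM powered off there is no merge progress bar, so watch the VM status
# and the shrinking AVHDX files on the CSV volume instead (path is hypothetical)
Get-VM -Name $vmName | Select-Object Name, Status
Get-ChildItem "C:\ClusterStorage\Volume1\SQLPROD01" -Filter *.avhdx |
    Select-Object Name, @{ n = 'SizeGB'; e = { [math]::Round($_.Length / 1GB, 1) } }
```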

Roll Back Plan:

  • The backup team needs to restore the SQL databases directly to the new VM that was prepared as standby.
  • Change its hostname & IP to the production values.
  • The SQL team needs to change the hostname at the SQL instance level.
  • The application team needs to check connectivity.

 

VM file sizes before execution and post snapshot deletion (all sizes in GB):

| VHD file | Drive letter in OS | Parent file (before) | Snapshot file (before) | Total (before) | VM file size (post deletion) | VM storage volume |
|---|---|---|---|---|---|---|
| Drive0.VHDX | | 83.3 | 46.8 | 130.1 | 87.2 | Volume 1 |
| Drive1.VHDX | E | 1540 | 437.7 | 1977.7 | 1540 | Volume 1 |
| Drive3.VHDX | F | 221.6 | 214.8 | 436.4 | 271.2 | Volume 1 |
| Drive4.vhdx | G | 1950 | 1240 | 3190 | 1950 | Volume 2 |
| Total | | 3794.9 | 1939.3 | 5734.2 | 3848.4 | |

Time taken for deletion of the 1.9 TB snapshot offline was 5 hrs (12:30 to 5:30 A.M.); space reclaimed was 1885.8 GB.

VM Backups failing on only one Node in a 2012R2 Cluster

Issue:

In a 5-node Hyper-V 2012 R2 cluster, VM backups suddenly started failing on only one node (HOST2), i.e., the backup team is unable to take a backup of any VM hosted on HOST2.

Observation:

  • When the backup team fires a VM-level backup on HOST2, the backup terminates with a VSS snapshot error.
  • If the VM is migrated to another node, the backup succeeds for the same VM.
  • The issue is not specific to any VM or any Cluster Shared Volume; it occurs only when a VM is hosted on HOST2.

Troubleshooting:

  • As the issue is specific to HOST2, tested a VM backup with the Windows native backup tool. Unable to take a backup; it terminates while creating the VSS snapshot.
  • Created a new VM on the local D drive and tested with the Windows backup tool. The backup succeeds if the VM is hosted on a local drive; it fails only if the VM is on cluster shared storage.
  • As the issue is specific to one server and its CSV writer, started troubleshooting from the CSV writer side.
  • Did a deep analysis of the event logs, which pointed towards the CSV writer being unregistered (check the screenshot below).
  • Ran the command "vssadmin list providers" on HOST2 and compared the output with other servers; it was observed that the provider "Microsoft CSV Shadow Copy Provider" is missing from HOST2 (screenshot attached).
  • As the CSV provider is missing on the problematic HOST2, fixed the issue by exporting the provider's CLSID registration from a working server and importing it on HOST2 (see the sketch and screenshots below).
  • After the import, ran "vssadmin list providers" again; the provider list now matches the working servers.
  • Backups are working fine after the fix.
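For reference, a rough sketch of the comparison and fix. The registry key shown is where the provider registration pointed in our environment, but the exact key (and whether the missing piece is the provider entry or its COM CLSID) should always be confirmed on a working node before exporting anything.

```powershell
# On HOST2 and on a working host (e.g. HOST3), compare the registered VSS providers
vssadmin list providers

# On the working host, inspect the CSV provider registration
# (provider ID 400a2ff4-5eb1-44b0-8a05-1fcac0bcf9ff, as reported in the event logs)
# to find the COM CLSID it points to
reg query "HKLM\SYSTEM\CurrentControlSet\Services\VSS\Providers\{400a2ff4-5eb1-44b0-8a05-1fcac0bcf9ff}" /s

# Export the missing key from the working host (replace <CLSID> with the GUID found above),
# copy the .reg file to HOST2 and import it there
reg export "HKCR\CLSID\<CLSID>" C:\Temp\csv-provider-clsid.reg
reg import C:\Temp\csv-provider-clsid.reg

# Verify the provider list on HOST2 now matches the working hosts
vssadmin list providers
```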

Error Screenshots

 

 

Volume Shadow Copy Service (VSS) provides the ability to create a point-in-time image (shadow copy) that can be used to perform backups. In our environment, the backup of any VM hosted on the HOST2 node failed immediately once it showed "Snapshot Processing", which means the snapshot operation was not happening. The provider ID (400a2ff4-5eb1-44b0-8a05-1fcac0bcf9ff) reported in the Event Viewer logs belongs to the Microsoft CSV Shadow Copy Provider, which did not exist in the registry on HOST2; it had apparently been unregistered.

Working Server(HOST3)

Not working (HOST2) -> CLSID is missing

Final Screenshot


One of the Hyper-V Nodes in a 2012 R2 Cluster Keeps Changing to the Paused State Automatically

In one of my customer environments, we have a 5-node Hyper-V 2012 R2 cluster. Among these 5 nodes, Node1 keeps changing to paused mode automatically every 30 minutes.

Issue:

Node1 goes into the Paused state (with DO NOT FAIL ROLES BACK), i.e., the node pauses without moving its VMs.

Observation:

  • The issue is resolved only after stopping the SCVMM agent service on Node1 (BHHV-A01).
  • There was a recent migration (approx. 2 months ago) from Hyper-V 2012 to 2012 R2 and from SCVMM 2012 to 2012 R2.
  • No scheduled tasks were configured.

In an ideal scenario, a Hyper-V node goes into paused mode only if an administrator puts it into maintenance mode, or if SCVMM pauses the node because Dynamic Optimization or PRO is configured. However, these settings are not configured in SCVMM.

The issue looked quite tricky, as only one node was impacted and the issue resolved when we stopped the SCVMM agent on Node1.

I knew that SCVMM was the culprit, as the issue resolved after stopping the SCVMM agent service. I asked the customer to reinstall the SCVMM agent on Node1, but he was not convinced.

Started searching SCVMM known issues in forums and found the resolution below.

Solution:

It was observed that SCVMM had been installed with the RTM version, and there is a known pause issue listed as fixed in Update Rollup 5.

The latest rollup is Update Rollup 10, and the issue above was fixed in Update Rollup 5.
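Before applying the rollup, you can confirm whether the node is still running the RTM agent. This is only a sketch; the service name and binary path of the VMM agent are assumptions that can differ between SCVMM builds.

```powershell
# On Node1: check the SCVMM agent service and the file version of its binary.
# Service name and install path are assumptions and may differ between SCVMM builds.
$svc = Get-WmiObject Win32_Service -Filter "Name='SCVMMAgent'"
$svc | Select-Object Name, State, PathName

# The product version tells you whether the agent is still RTM or already on an Update Rollup
(Get-Item $svc.PathName.Trim('"')).VersionInfo.ProductVersion
```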


Pass-through disk addition in Highly Available VM – Difference in 2012 & 2008

Steps to add Pass-through Disk in Highly Available VM –  2012R2

  • Shut down the VM if it is powered on (best practice).
  • Make sure the disk is online at the HOST level and note down the disk number.
  • Go to the Failover Cluster console -> add the disk to the cluster. After adding, it will be placed in "Available Storage". Note the disk number shown in the console for later verification.
  • Check whether the disk owner shown in the Failover console is the server you are currently working on; otherwise, perform all the remaining steps after logging on to the disk owner server.
  • In the Failover console -> under the Disks section -> right-click the disk -> Assign to VM Role -> select the VM to which you want to assign it.
  • After adding the disk to the failover cluster and assigning it to the VM role, ensure that the disk is online on the HOST. If it is offline when you perform the remaining steps, the disk will be read-only in the VM, with no way to fix it other than starting over.
  • In Failover -> Roles -> go to the VM -> check under the Resources section -> under Virtual Machine, the "Virtual Machine Configuration" resource should be online.
  • In the Failover console -> go to VM Settings -> add a virtual SCSI adapter -> attach the pass-through disk (Disk 4 in this case). A PowerShell sketch follows this list.
  • Start the VM and check whether the disk is accessible.
  • Test Live Migration.
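The same attachment can also be scripted with the Hyper-V PowerShell module. This is a minimal sketch assuming a hypothetical VM name and the disk number noted earlier; for a highly available VM the disk still has to be added to the cluster and assigned to the VM role first, exactly as in the steps above.

```powershell
# Run on the cluster node that currently owns the VM and the disk
$vmName     = "SQLVM01"   # hypothetical VM name
$diskNumber = 4           # physical disk number noted at the HOST level

# Confirm the physical disk and its state at the host level (online throughout on 2012 R2)
Get-Disk -Number $diskNumber | Select-Object Number, FriendlyName, OperationalStatus, IsOffline

# Attach the physical (pass-through) disk to the VM's virtual SCSI controller
Add-VMHardDiskDrive -VMName $vmName -ControllerType SCSI -DiskNumber $diskNumber

# Verify the attachment
Get-VMHardDiskDrive -VMName $vmName | Select-Object ControllerType, ControllerNumber, DiskNumber
```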

In 2008 or 2008 R2

The DISK should be offline at the HOST level; otherwise it will go into READ-ONLY mode. Blogs confirmed the same, and I have seen the same issue too.

A new disk must be brought online and initialized before it can be used. This process writes a disk signature to the disk so the cluster can use it. Once the disk has been initialized, it can be placed offline again. No partitioning is required, as that will be done inside the virtual machine.

The difference between adding a pass-through disk in 2008 and 2012: in 2008 the disk should be initialized and then taken offline, whereas in 2012 it should remain online throughout the process.


Pass-through Disk Addition Issue in a Cluster – Disk Read-Only Issue After Adding a Pass-through Disk

Issue

  • Unable to add a pass-through disk in the Failover console to make two virtual machines highly available with the pass-through disk.
  • Multiple VMs already had pass-through disks, and there was no issue with any of them.
  • The issue occurred after one of my team members removed a pass-through disk following a VM shutdown.
  • Able to add the pass-through disk when the VM is not made highly available.

Initial Troubleshooting

  • One of my team members removed a pass-through disk and shut the VM down as part of a planned maintenance activity. After the VM was started again, the disk went into read-only mode in the guest OS.
  • Due to limited time, I tried to remove the pass-through disk from the VM without shutting it down; the disk kept changing to read-only mode.
  • As the disk kept changing to read-only mode, I assumed the disk needed to be kept offline at the host level. Therefore, my only option was to turn on maintenance mode for the disk in the Failover console.
  • In the Failover console, put the disk into maintenance mode and added the pass-through disk to the VM. This worked fine, and the disk is in normal mode in the guest OS.

Keeping the disk in maintenance mode does not impact any functionality. Enabling this mode just disables a few disk checks performed by the cluster service, such as file/device system checks, IsAlive and LooksAlive.

Maintenance mode will remain on until one of the following occurs:

  • You turn it off.
  • The node on which the resource is running restarts or loses communication with other nodes (which causes failover of all resources on that node).

I took downtime, as I needed to turn off disk maintenance mode and resolve the issue permanently.

Next Troubleshooting:

  • Removed VM & Disk  from High Availability and Re-added to Failover Console -> No Luck
  • Moved VM to different Host server’s and tested the same steps to isolate issue from Host level ->No Luck
  • Created Test VM and executed similar to isolate issue from VM level ->No Luck
  • Tested by assigning Cluster disk’s with different servers to isolate issue from Disk ownership -> No luck
  • Tried Pass-through Disk by keeping in Disk Maintenance mode ( Previous state) -> No Luck
  • Removed VM & Disk from HA and added only in Hyper-v Manager -> It is working without High Availability 

Next Observations:

  • Before adding it to the cluster, when bringing the disk online, it automatically appeared in Windows Explorer with a drive letter. The drive letter appears because the pass-through disk is not new (fresh); it is already used in production with a drive letter, so it mounts directly.
  • When adding the cluster disk in the Failover console (say on HOST1), the disk ownership changes to HOST2 after it is added to the cluster. This is the main difference between this VM and the other VMs.
  • Received an error while adding the pass-through disk to the VM in the Failover console: "An error occurred while updating the virtual machine configuration settings, Error code: 0x8007100c, Not Supported".

Involved Microsoft support to check this tricky issue; below are the root cause and solution for it.

  • The UI (Failover console) was trying to check permissions, due to which we received an error on the disk we are presenting as pass-through, as it is presented from the SAN.
  • When we add the disk as pass-through to the VM, it gets added with the MPIO path of the disk. Because of this, when we add it from Failover Cluster Manager to the VM, it fails to update that path in the VM configuration file, as it needs certain permissions which it cannot see; we cannot add permissions on the path \\?\mpio#disk&ven_dgc&prod_raid_5&rev_0532#1&7f6ac24}

Error:

'Virtual Machine DBL' failed to start.

‘DBL’ failed to start. (Virtual machine ID XXXXXXXXXX)

‘DBL’ Synthetic SCSI Controller (Instance ID XXXXXXX): Failed to Power on with Error ‘General access denied error’ (0x80070005). (Virtual machine ID XXXXXXX)

‘DBL’: Hyper-V Virtual Machine Management service Account does not have permission to open attachment ‘\\?\mpio#disk&ven_dgc&prod_raid_5&rev_0532#1&}’. Error: ‘General access denied error’ (0x80070005). (Virtual machine ID XXXXXXX)

The highlighted portion is the path of the disk on which we cannot add the permission.

To force that path to be updated in the VM configuration file, we have to run the following PowerShell command:

Update-ClusterVirtualMachineConfiguration -VMId XXXXXX-XXXX-XXXX

The above command updated the path successfully in the VM configuration, and the VM booted successfully.
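For reference, the VM ID does not have to be typed by hand; a minimal sketch that pulls it with Get-VM (using the VM name from the error above) and then refreshes the cluster configuration:

```powershell
# Look up the VM ID (GUID) of the affected VM, then refresh its cluster configuration.
# This forces the MPIO pass-through disk path to be rewritten in the
# "Virtual Machine Configuration" cluster resource.
$vmId = (Get-VM -Name "DBL").VMId
Update-ClusterVirtualMachineConfiguration -VMId $vmId
```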

Error Screenshots

 

 

References:

How to add a Pass-through disk to a Highly Available Virtual Machine running on a Windows Server 2012 R2 Failover Cluster

Read-only pass-through disk after you add the disk to a highly available VM in a Windows Server 2008 R2 SP1 failover cluster

 

 

 

 

VM Registration Failure – VM Missing – VM Failed States in Failover Console

I encountered the scenarios below a few months back, and the issues were addressed using the following approach.

  1. A VM appeared in the Failover console but disappeared from Hyper-V Manager after an unexpected reboot of the host server; it failed over and became unregistered. Unable to find the VM on any host in Hyper-V Manager.
  2. After assigning a pass-through disk in the Failover console, it gave an access denied error, but I started the VM to check. The VM tried to start on all 16 nodes and failed. After that, I was unable to start the VM or open its settings, and unable to find the VM on any host in Hyper-V Manager.

Troubleshooting

  • Restarted the VMM service – no luck. Had the thought of re-registering with the import option, as the event ID said it failed to unregister.
  • Imported with the registration option from Hyper-V Manager – the VM registered successfully.

And for the 2nd case, registered the VM with the mklink command:

  • Hyper-V operates using a list of symbolic links in a specific directory: C:\ProgramData\Microsoft\Windows\Hyper-V\Virtual Machines
  • Each of these is a link to the actual VM configuration file in its own respective subdirectory; whether stored locally or on shared storage, the nature of the link does not change.
  • First you need to identify the GUID of the specific VM.
  • As an example, we will use the LitwareSpeech VM, located at D:\VMs\LitwareSpeech. In the "D:\VMs\LitwareSpeech\Virtual Machines" path is the configuration file for this VM, named "D546B942-76AF-4C3B-97C6-9EE74828BC91.XML".

Using the VM GUID that you determined above in Step 1, run the following command:

Syntax: mklink <GUID>.xml "<VMConfigPath.xml>", or in our example:

mklink D546B942-76AF-4C3B-97C6-9EE74828BC91.xml "D:\VMs\LitwareSpeech\Virtual Machines\D546B942-76AF-4C3B-97C6-9EE74828BC91.xml"

The above command restores the reference to your VM in Hyper-V Manager.

Ideally, when you create a VM, Hyper-V creates a security entry (ACE) on this symbolic link for the SID of the VM's worker process. Unfortunately, this ACE isn't re-created when you recreate the symbolic link using mklink as detailed above.

If you try to start your re-registered VM at this point, you may receive a permissions error.

To address this issue, follow these steps:

Using the same GUID, run the following command to grant the required permissions.

Syntax: icacls "C:\ProgramData\Microsoft\Windows\Hyper-V\Virtual Machines\<GUID>.xml" /grant "NT VIRTUAL MACHINE\<GUID>":(F) /L

icacls "C:\ProgramData\Microsoft\Windows\Hyper-V\Virtual Machines\D546B942-76AF-4C3B-97C6-9EE74828BC91.xml" /grant "NT VIRTUAL MACHINE\D546B942-76AF-4C3B-97C6-9EE74828BC91":(F) /L

The above command regenerates the necessary ACE on the symbolic link (rather than on the configuration file itself) using the service SID of the VM, replicating the initial state of the symbolic link.

Once this command has been run successfully, you should be able to start your VM without further issues.

If the above steps do not work, you can also try stopping the Hyper-V Virtual Machine Management service first.
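Putting the pieces together, a rough sketch of the whole re-registration sequence, using the example GUID and path from above (adjust both to your own VM's configuration file):

```powershell
# Run from an elevated PowerShell prompt on the host that should own the VM
Stop-Service vmms          # Hyper-V Virtual Machine Management service

$guid    = "D546B942-76AF-4C3B-97C6-9EE74828BC91"
$linkDir = "C:\ProgramData\Microsoft\Windows\Hyper-V\Virtual Machines"
$config  = "D:\VMs\LitwareSpeech\Virtual Machines\$guid.xml"

# Recreate the symbolic link that registers the VM (mklink is a cmd built-in)
cmd /c mklink "$linkDir\$guid.xml" "$config"

# Re-grant the VM's service SID full control on the link itself (/L), not the target file
icacls "$linkDir\$guid.xml" /grant ("NT VIRTUAL MACHINE\$guid" + ':(F)') /L

Start-Service vmms
```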

How to unregister a VM

  • In a few scenarios, we may need to unregister a VM.
  • Deleting the symbolic link file just deletes the link and unregisters the VM from Hyper-V Manager; the VM and its configuration remain on your disk.
  • You can follow the steps above to register the VM again.

Error Screenshots

 

 

Delivery Controller vs Data Collector

 

| Delivery Controller | Data Collector |
|---|---|
| No LHC. | LHC (Local Host Cache). |
| Connection Leasing. | No Connection Leasing. |
| Pulls all information, static as well as dynamic, from the central Site database. | Has static as well as dynamic (run-time) information cached locally. |
| There is no direct communication between Delivery Controllers. No scheduled communication between the VDAs and/or Site database, only when needed. | Communicates with the IMA store, peer Data Collectors and its session hosts (within its own zone) on a scheduled interval, or when a Farm configuration change has been made. |
| Is responsible for brokering and maintaining new and existing user sessions only. | Often hosts user sessions, but can be configured as a dedicated Data Collector as well. |
| Can have a different operating system installed than the server and desktop VDAs. | Needs to have the same operating system as all other session hosts and DCs within the same Farm. |
| Core services installed only. The HDX stack is part of the VDA software. | Has all the XenApp 6.5 or earlier bits and bytes fully installed. |
| Zones are optional. When configured, they need at least one Delivery Controller present. | Each zone has one Data Collector. Having multiple Data Collectors means multiple zones. |
| Election does not apply. Deploy multiple Delivery Controllers, at least 2 per site/zone (one per zone is the minimum). | Can, and sometimes needs to be, elected. Configure at least one other session host per zone that can be elected as a Data Collector when needed. |
| When the central Site DB is down, no site-wide configuration changes are possible. By default, Connection Leasing kicks in, enabling users to launch sessions that were assigned to them at least once during the last 2 weeks prior to the DB going offline. | When the IMA DB is down, no farm-wide configuration changes are possible. Everything else continues to work as expected due to the LHC present on the Data Collectors and session hosts in each zone. |
| A Delivery Controller can have a direct connection (API) with a hypervisor or cloud platform of choice. | Does not have any direct connection (API) with a hypervisor or cloud platform management capabilities. |
| Almost all communication flows directly through a Delivery Controller to the central Site DB. | Session hosts as well as Data Collectors communicate directly with the IMA database. |
| VDAs need to successfully register themselves with a Delivery Controller. | When a XenApp server boots it needs the IMA service, but it will not register itself anywhere. |

 

Citrix IMA vs. FMA…

| IMA back then… | FMA as it is today… |
|---|---|
| IMA – Independent Management Architecture. | FMA – FlexCast Management Architecture. |
| Farm. | Site. |
| Worker Group. | Machine Catalog / Delivery Group. |
| Worker / Session Host / XenApp server. | Virtual Delivery Agent (VDA). There is a desktop OS VDA as well as a server OS VDA, including Linux. |
| Data Collector (one per zone). | Delivery Controller (multiple per Site). |
| Zones. | Zones (as of version 7.7). |
| Local Host Cache (LHC). | Connection Leasing. |
| Delivery Services Console / AppCenter. | Citrix Studio (including StoreFront) and Director. |
| EdgeSight monitoring (optional). | Partly built into Director. |
| Application folders. | Application folders (new feature in 7.6) and Tags (all 7.x versions). |
| IMA data store. | Central Site database (SQL only). |
| Load evaluators. | Load management policies. |
| IMA protocol and service. | Virtual Delivery Agents / TCP. |
| Farm Administrators. | Delegated Site administration using roles and scopes, which are configurable as well. |
| Citrix Receiver. | Citrix X1 Receiver. It will provide one interface for both XenApp / XenDesktop as well as XenMobile. |
| SmartAuditor. | Session Recording. |
| Shadowing users. | Microsoft Remote Assistance, launched from Director. |
| USB 2.0. | USB 3.0 support. |
| Session Pre-launch and Session Lingering. | Session Pre-launch and Session Lingering. Both have been re-introduced. |
| Power and capacity management. | Basic power management from the GUI; advanced via PowerShell. |
| Web Interface / StoreFront. | Web Interface / StoreFront. |
| Single Sign-on for all or most applications. | There is no separate Single Sign-On component available for XenApp 7.x. This is now configured using a combination of StoreFront, Receiver and policies. |
| Installed hotfixes inventory. | Installed hotfixes inventory from Studio. |
| Support for Windows Server 2003 and 2008 R2. | FMA 7.x supports Windows Server 2008 R2, Server 2012 R2 and 2016. |

 

Hyper-V Live Migration terms: brownout, blackout and dirty pages

You may not know about brownouts, blackouts and dirty pages in Hyper-V Live Migrations, but these terms are useful for monitoring virtual machine live migrations.

Hyper-V Live Migration is undoubtedly the most sought-after feature of Hyper-V because of its ability to move virtual machines (VMs) between clustered hosts without noticeable service interruption. But in fact, Live Migration can cause brief disruptions in service that end users may not notice.

As an admin, you should understand some lesser-known Hyper-V Live Migration terms that help monitor and troubleshoot service interruption.

Hyper-V event logs contain information about live migration disruptions that can briefly affect VMs. For every VM live migration, these logs report three events: a brownout event, a blackout and dirty-pages event, and a summary of the live-migration process. Understanding these terms also helps you troubleshoot live migrations that take too long or that block administrative tasks.

You'll find the Live Migration logs under Applications and Services Logs -> Microsoft -> Windows -> Hyper-V-Worker.

These Hyper-V Live Migration events are numbered as follows:

  • Brownout event - 22508
  • Blackout and dirty-pages event - 22509
  • Blackout event - 20415 (this is the success event ID that includes the blackout time)
  • Live migration summary event - 22507
  • Successful live migration event - 20418

[Screenshot: successful live migration event]
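If you prefer pulling these events with PowerShell instead of browsing Event Viewer, here is a minimal sketch. The log name is an assumption that can vary between Hyper-V versions, and some of the success events (20415/20418) may be written to the Hyper-V-VMMS log instead.

```powershell
# Pull the live-migration related events listed above from the Hyper-V worker log
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Windows-Hyper-V-Worker-Admin'   # assumption: name may vary by OS version
    Id      = 22508, 22509, 22507, 20415, 20418
} -MaxEvents 50 |
    Select-Object TimeCreated, Id, Message |
    Format-Table -Wrap
```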

 

A Live Migration brownout event 

A Hyper-V-Worker event log lists the brownout stage first. In the context of virtualization, a brownout is defined as the amount of time it takes to complete the memory-transfer portion of Hyper-V Live Migration. And the term brownout is a good metaphor for this event, because a VM is not affected completely (as the term blackout suggests). The VM is still responsive, but you can’t perform configuration changes or other administrative functions during this stage of the live migration

[Screenshot: live migration brownout event]

The figure above indicates that the brownout took 19.43 seconds. This time depends on the size of the active RAM the VM uses and the speed of the Live Migration transport network. During this time, the VM is completely responsive as the memory pages move to the destination node. This stage of live migration gets most of a VM's state over to the other node, but not quite all. Since the VM is responsive, users most likely never know that a migration to another node is in process, but VM responses may be delayed. You can monitor this delay by constantly pinging the VM with the command ping SERVERNAME -t. You'll notice brief periods of longer response times, without a total disruption of service.

Live Migration blackout and dirty-pages event

The final stage in Hyper-V Live Migration is when a VM fully migrates to the destination node of the cluster. This process is called the blackout stage, where, to finally move a VM and all its memory, there is a brief pause in service. During the brownout stage, the host attempts to move all active memory to the destination node. But server memory isn’t completely emptied until this final process, where data is moved to the destination node. A final snapshot provides a last file representation of the remaining memory, which is known as dirty pages. When dirty pages are migrated to the destination node, the blackout occurs.

 

The blackout period is by no means comparable to the longer saved state in Hyper-V’s former Quick Migration feature, because Live Migration usually moves a very small amount of data during this final stage. But a slight disruption will occur, usually about one to two seconds, or one dropped ping. Unlike during the brownout stage, a VM is not responsive. The event log indicates how long the blackout period was and how many dirty pages were moved during the migration’s final stage (see below Figure)

 

[Screenshot: live migration blackout and dirty-pages event]

Note that during a live migration for servers with a higher transaction workload, longer blackout times and a greater number of dirty pages occur.

These two Hyper-V Live Migration terms are important because the blackout and dirty-pages event is a troubleshooting tool. The log tells you how long a VM was unavailable, which is useful information when a live migration takes longer than expected or when there is a noticeable disruption in service.

Live Migration summary event

The final event, 22507, gives a nice summary of the duration of the live migration process.

[Screenshot: live migration summary event]

Note: The above article was written based on Hyper-V 2008 R2, and it is applicable to later versions too.

Multiple Virtual Machines Went Into Paused State in a Hyper-V Cluster Throwing the Error "Disk(s) running out of space" – Typical Issue

Environment:
OS: Windows Server 2012, 10-node Hyper-V cluster
Model: ProLiant BL460c Gen8
Storage: FC, HP 3PAR

Error Message:
Multiple virtual machines went into paused state throwing the error "Disk(s) running out of space", even though 30% free space was available.

Immediate Observations:

  • Only a few virtual machines went into paused state, and it was observed that all of them reside on a single volume, say Volume1.
  • Volume1 has 30% free space. Checked with the storage team whether the LUN is thick or thin, as a thin LUN can get exhausted on the array side; it is a thick LUN and no issues were observed from storage.
  • Tried to resume the VMs; after some time they moved back to the paused state.
  • Tried to move the CSV disk from one node to another – no luck.
  • Observed that not a single recommended hotfix was installed on any node in the cluster, although all nodes were up to date with all critical patches (see the sketch after this list).
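To quickly compare which hotfixes each node actually has (and spot node-to-node differences like the one above), a small sketch assuming the FailoverClusters module is available and the nodes are reachable remotely:

```powershell
# Compare installed hotfixes across all nodes of the cluster
Import-Module FailoverClusters
$nodes = (Get-ClusterNode).Name      # or list the 10 node names explicitly

$hotfixes = foreach ($node in $nodes) {
    Get-HotFix -ComputerName $node |
        Select-Object @{ n = 'Node'; e = { $node } }, HotFixID, InstalledOn
}

# Sorting by hotfix ID makes it easy to spot IDs that are missing on some nodes
$hotfixes | Sort-Object HotFixID, Node | Format-Table -AutoSize
```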

Resolution

  • Based on the above findings, assumed there was a known issue, as the error misleadingly says the disks are running out of space even though 30% free space is available.
  • As not a single recommended hotfix was installed in the cluster, initially installed the Hyper-V 2012 hotfixes on one node and restarted it.
  • Moved the affected volume to the recently restarted node – no issues observed.
  • After a period of observation, installed all recommended hotfixes on the remaining nodes.

Finally, I am not sure exactly which hotfix resolved the issue, but I have seen forums where these issues are addressed by KB2791729, which is a private hotfix.

Ref:
http://support.microsoft.com/kb/2784261 – Windows Server 2012 Hotfixes