Virtualization - Cloud

Year: 2017

Hyper-V VM Integration Services: List of Build Numbers

Hyper-V Integration Services are a bundled set of software components which, when installed in the virtual machine, improve integration between the host server and the virtual machine. Integration services (often called integration components) are services that allow the virtual machine to communicate with the Hyper-V host. The suite is designed to enhance the performance of a virtual machine’s guest operating system.

In short, the integration services are a set of drivers that let the virtual machine make use of the synthetic devices Hyper-V provisions to it.

Hyper-V Integration Services optimizes the drivers of the virtual environment to give end users the best possible experience. The suite improves virtual machine management by replacing generic operating system driver files for the mouse, keyboard, video, network, and SCSI controller components. It also synchronizes time between the guest and host operating systems and can provide file interoperability and a heartbeat.

Below is the list of Integration Services version numbers.

Windows Server 2008

Build Number Knowledge Base Article ID Comment
6.0.6001.17101 n/a Windows Server 2008 RTM
6.0.6001.18016 KB950050 Windows Server 2008 RTM + KB950050
6.0.6001.22258 KB956710 Windows Server 2008 RTM + KB956710
6.0.6001.22352 KB959962 Windows Server 2008 RTM + KB959962
6.0.6002.18005 KB948465 Windows Server 2008 Service Pack 2
6.0.6002.22233 KB975925 Windows Server 2008 RTM + KB975925

Windows Server 2008 R2

Build Number Knowledge Base Article ID Comment
6.1.7600.16385 n/a Windows Server 2008 R2 RTM
6.1.7600.20542 KB975354 Windows Server 2008 R2 RTM + KB975354
6.1.7600.20683 KB981836 Windows Server 2008 R2 RTM + KB981836
6.1.7600.20778 KB2223005 Windows Server 2008 R2 RTM + KB2223005
6.1.7601.16562 n/a Windows Server 2008 R2 Service Pack 1 Beta
6.1.7601.17105 n/a Windows Server 2008 R2 Service Pack 1 RC
6.1.7601.17514 KB976932 Windows Server 2008 R2 Service Pack 1 RTM

Windows Server 2012

Build Number Knowledge Base Article ID Comment
6.2.9200.16384 n/a Windows Server 2012 RTM
6.2.9200.16433 KB2770917 Windows Server 2012 RTM + KB2770917
6.2.9200.20655 KB2823956 Windows Server 2012 RTM + KB2823956
6.2.9200.21885 KB3161609 June 2016 update rollup for Windows Server 2012

Windows Server 2012 R2

Build Number Knowledge Base Article ID Comment
6.3.9600.16384 n/a Windows Server 2012 R2 RTM
6.3.9600.17415 KB3000850 Windows Server 2012 R2 RTM + KB3000850
6.3.9600.17831 KB3063283 Windows Server 2012 R2 RTM + KB3063283
6.3.9600.18080 KB3063109 Windows Server 2012 R2 RTM + KB3063109
6.3.9600.18339 KB3161606 June 2016 update rollup for Windows Server 2012 R2
6.3.9600.18398 KB3172614 July 2016 update rollup for Windows Server 2012 R2
6.3.9600.18692 KB4022720 June 27, 2017 preview of monthly rollup for Windows Server 2012 R2
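
To check which build a given guest is actually running before comparing against the tables above, here is a minimal PowerShell sketch. It assumes Windows Server 2012 or later on the host, where Get-VM exposes an IntegrationServicesVersion property; the guest-side registry location is an assumption based on common guidance for the Data Exchange component, so verify it on your build.

  # On the Hyper-V host: list every VM with its Integration Services version
  Get-VM | Select-Object Name, IntegrationServicesVersion

  # Inside the guest: read the version recorded by the integration components
  # (registry location assumed from common guidance; verify on your build)
  Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Virtual Machine\Auto' |
      Select-Object -ExpandProperty IntegrationServicesVersion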


Hyper-V BIN file removal to reclaim storage space

A Hyper-V virtual machine uses the following files:

  • .XML: This file contains the VM configuration details
  • .VHD and .VHDX: These files are virtual disks that hold the current virtual disk data, including partitions and file systems
  • .BIN: This file contains the memory of a virtual machine or snapshot that is in a saved state
  • .VSV: This file contains the VM’s saved device state
  • .AVHD and .AVHDX: These files are differencing virtual disks, commonly used for snapshots and Hyper-V checkpoints

The BIN file created in the virtual machine’s folder is equal in size to the memory of the virtual machine and acts as a placeholder so the virtual machine state can be saved in the event that the Hyper-V host shuts down.

The BIN file contains the memory of a VM and is located inside the GUID folder. If the VM is in a powered-off state, no BIN file is present. The file is equal in size to the memory provisioned to the VM in Hyper-V Manager.

In Windows Server 2008 and Windows Server 2008 R2, starting a virtual machine caused Hyper-V to create a .BIN file matching the size of the memory assigned to the virtual machine. Microsoft did this to ensure there was always enough disk space available to create a saved state (which is particularly critical if the physical computer is shutting down and the virtual machine is configured to save state when the physical computer shuts down).

The BIN file simply sits idle while the virtual machine is powered on; it is pre-allocated so that its space is guaranteed to be available if needed and so a save action responds quickly. However, many people did not like to see their disk space being “wasted” like this, since the BIN file is idle while the VM is running.

To address this, starting with Windows Server 2012 Microsoft made a simple change: Hyper-V only pre-creates the .BIN file if you choose “Save the virtual machine state” as the Automatic Stop Action for the virtual machine. If you choose “Turn off the virtual machine” or “Shut down the guest operating system”, no RAM-sized BIN file is created.

It is still possible to save the state manually as long as there is enough room for the file. The Automatic Stop Action setting only applies when the physical computer shuts down.

By default, all virtual machines have an Automatic Stop Action of Save, which means the state of the virtual machine is saved to disk. However, best practice is that once Integration Services are enabled, the Automatic Stop Action should be changed to “Shut down the guest operating system”, which performs a clean shutdown and no longer needs a BIN file to hold the memory contents.

Considerations:

  • Keeping the BIN file is not recommended in a cluster environment: since VMs are configured for high availability, a VM fails over to another node if the physical computer shuts down, so there is no advantage to keeping the BIN file.
  • Consider keeping the BIN file if the Hyper-V servers are not clustered (standalone) and there are no storage space constraints.
  • A VM moves into a saved state only when the Hyper-V host is gracefully shut down; it will not move to a saved state if the Hyper-V host shuts down or restarts unexpectedly.
  • Microsoft does not recommend keeping VMs in a saved state for applications such as Domain Controllers, databases, etc. Hence, change the Automatic Stop Action from “Save state” to “Shut down” as per Microsoft’s recommendations.

Steps to save storage space by removing the BIN file (see the PowerShell sketch after this list)

  • Power off the VM
  • Go to VM Settings -> Automatic Stop Action -> change the option from “Save the virtual machine state” to “Shut down the guest operating system”
  • Power on the VM
  • Repeat the same steps for each VM
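
For more than a handful of VMs, the same change can be scripted. Here is a minimal sketch using the built-in Hyper-V module; the VM name is a placeholder, and each VM still needs a shutdown window:

  # Find VMs whose Automatic Stop Action is still "Save"
  Get-VM | Where-Object { $_.AutomaticStopAction -eq 'Save' } |
      Select-Object Name, State, AutomaticStopAction

  # Change the setting (the VM must be powered off first), then power it back on
  Stop-VM  -Name 'APPVM01'        # 'APPVM01' is a placeholder name
  Set-VM   -Name 'APPVM01' -AutomaticStopAction ShutDown
  Start-VM -Name 'APPVM01'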

Note:
This change was successfully implemented across multiple customer environments, which in turn benefited customers by reclaiming terabytes of storage space.

Local Host Cache Reintroduction: A Long-Awaited Feature

Local Host Cache (LHC) & Evolution

Local Host Cache was a core feature of the Independent Management Architecture (IMA) introduced with Citrix MetaFrame XP 1.0 in 2001. It was still in use up to Citrix XenApp 6.5, and has now been reintroduced in XenApp/XenDesktop 7.12.

Technically, the LHC is a simple Access database that stores a subset of the data store on each Presentation (XenApp) server. The IMA service running on each Presentation (XenApp) server downloads the information every 30 minutes, or whenever a configuration change is made in the farm.

The LHC’s primary functions are to permit a server to keep functioning in the absence of a connection to the data store, and to improve performance by caching application information.

The LHC contains information about servers, published applications, the domain, and licensing. LHC evolved considerably over the years and, in its last release with XenApp 6.5, allowed SQL downtimes for an indefinite period.

If the data store is unreachable, the LHC contains enough information about the farm to allow normal operations for an indefinite period, if necessary. However, no new static information can be published, or added to the farm, until the farm data store is reachable and operational again.

The disappearance of LHC

With the release of the awful version 7.0 of XenApp in 2013 and the move to the XenDesktop FlexCast Management Architecture (FMA), Citrix decided to remove the Local Host Cache feature (and many others) without offering any alternative. To be fair, Citrix converged XenApp into XenDesktop, which had already been using the FMA design since version 5, without a Local Host Cache equivalent. This decision immediately made the SQL infrastructure a critical piece of any XenApp implementation. Any downtime on the SQL infrastructure would immediately cause downtime for new sessions on the XenApp infrastructure as well. It could also have side effects with the old Citrix Web Interface.

Citrix recommends having a highly available SQL infrastructure to host XenApp and XenDesktop databases. While you can successfully implement HA for your SQL infrastructure, it does not necessarily mean that you will avoid downtimes, as many components are to be considered.

The pseudo rebirth of LHC with Connection Leasing (CL)

Facing a storm of complaints, Citrix finally started to listen to its customers and released XenDesktop 7.6 in September 2014 with the Connection Leasing (CL) feature enabled by default.

Unfortunately, CL was not a full replacement for LHC; it was an alternative offered in its place, limited to frequently used and assigned applications/desktops (up to 2 weeks by default). For users not using Citrix frequently, or using pooled desktops, CL was completely useless and did not resolve anything. There are also many limitations: load management, workspace control, and power actions are not supported.

The reintroduction of LHC

Citrix reached a milestone with the XenDesktop 7.12 release in December 2016. This time, they claimed to bring back all the Local Host Cache (LHC) features from XenApp 6.5, even adding a few improvements to make it more reliable. The LHC feature is offered for Cloud and on-premises implementations alongside Connection Leasing in 7.12, and is considered the primary mechanism for brokering connections when connectivity to the site database is disrupted. Surprisingly, the Local Host Cache feature is disabled by default. Let us hope Citrix enables it by default in the next version.
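
Enabling it is a single site-level switch from any Delivery Controller. A minimal sketch with the Citrix Broker PowerShell snap-in, following the documented approach for 7.12 (LHC and Connection Leasing cannot be enabled at the same time):

  # Load the Citrix snap-ins on a Delivery Controller
  Add-PSSnapin Citrix*

  # Enable Local Host Cache and disable Connection Leasing
  Set-BrokerSite -LocalHostCacheEnabled $true -ConnectionLeasingEnabled $false

  # Verify the site-level flags
  Get-BrokerSite | Select-Object LocalHostCacheEnabled, ConnectionLeasingEnabled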

When installing XenDesktop 7.12 and later, a SQL Server Express instance (LocalDB) is installed locally on each Delivery Controller to store the Local Host Cache. The Config Synchronizer Service (CSS) takes care of synchronization between the remote database and the Local Host Cache (LocalDB). The secondary broker service (Citrix High Availability Service) takes over from the principal broker when an outage is detected and handles all registration and brokering operations.

There are many limitations to consider with this version of LHC:

  • LocalDB is a runtime version of SQL Server with specific licensing that limits usage to four cores.
  • No support for pooled desktops, which is a huge downside.
  • No change can be made to the farm (assignments, publications, power actions, etc.); you cannot even open the consoles (Director & Studio) or use PowerShell.
  • There is no control over the LHC election process, and only a single Delivery Controller handles all VDA registrations and brokers sessions for the whole zone during an outage, which limits a zone to 5,000 VDAs (not enforced).
  • Most importantly, communication between the LHC and the remote SQL database is one-way only.
  • Even the new version of the Local Host Cache does not assure zero downtime. There is also a delay before users can actually connect: when the remote database goes down, VDAs still have to re-register with the newly (and only) elected Delivery Controller. This can result in users not having icons in StoreFront, or being unable to start new sessions, for a short period.

In conclusion, it took Citrix almost 4 years to deliver a rough equivalent of the good old Local Host Cache for XenDesktop 7.x. The database is no longer a single point of failure in a XenDesktop/XenApp deployment. However, customers with large deployments are not supported with this version of the Local Host Cache, and some of the (huge) limitations may discourage you from using the feature.


PVS Streaming Service Abrupt Termination: Cache Mode Change Procedure for a Production vDisk

Issue:

The PVS Stream Service was terminating abruptly and intermittently (approximately once a month), causing user sessions to freeze and leaving users unable to launch HSDs.

Environment:

2 Citrix PVS servers (VMs), version 7.6
2000-3000 concurrent users
86 HSDs & 6 golden images
Microsoft Hyper-V 2012 R2 (15 nodes) on Cisco UCS

Observations:

  • The issue occurred once or twice a month with no common pattern in days or hours, and recurred on both PVS servers at the same time
  • No changes in the environment
  • The onsite engineer advised that the issue had existed for 3 months and was resolved each time by restarting the PVS servers
  • One day the same issue repeated, but was not resolved by restarting the PVS servers -> the issue was escalated to the support team (me)
  • Observed Event ID 11: “Detected one or more hung threads, DbAccess error: <Record was not found> <-31754> (in ServerStatusSetDeviceCount() called from SSProtocolLogin.cpp:2903” -> indicating thread hangs under the Stream Service and DB access errors
  • Observed multiple vDisk retries on the problematic target devices: 11 at boot time and approximately 611 per hour during the session
  • Observed that the recommended McAfee exclusions were not in place -> stopped the McAfee service and restarted the PVS server -> the PVS Stream Service was stable for some time on one PVS server and then terminated again -> due to time constraints, logged a call with the vendor (Citrix)
  • After 2 hours, Citrix support joined the call and started collecting CDF traces and a procdump of the terminating Stream Service
  • After a few hours, the issue resolved on its own and Citrix support was unable to find a root cause from the collected logs
  • Within 2 months the issue repeated twice, and the customer grew frustrated because no root cause had been found for the intermittent, abrupt Stream Service terminations
  • The support team (myself) analysed the environment and observed that the cache mode was configured as “Cache on server”, which is not recommended for a production environment; best practice is “Cache in device RAM with overflow on hard disk”, which reduces load on the PVS server and gives optimal performance -> shared this observation with Citrix support and requested their input

I explained to the customer that missing best practices can lead to this type of intermittent issue. Since no root cause had been found, and server-side caching is not a best practice for a production environment, we prepared a plan to change the cache configuration to “Cache in device RAM with overflow on hard disk”.
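
For reference, the cache mode can also be changed through the PVS PowerShell snap-in rather than the console. This is a rough sketch under stated assumptions: the Citrix.PVS.SnapIn module shipped with PVS 7.x, WriteCacheType value 9 for “Cache in device RAM with overflow on hard disk”, and placeholder store/site/vDisk names; the vDisk must not be in use while the mode is changed.

  # Load the PVS snap-in (default console install path)
  Import-Module 'C:\Program Files\Citrix\Provisioning Services Console\Citrix.PVS.SnapIn.dll'

  # Fetch the vDisk object (placeholder names)
  $disk = Get-PvsDisk -DiskLocatorName 'GoldenImage01' -SiteName 'Site1' -StoreName 'Store1'

  # 9 = Cache in device RAM with overflow on hard disk
  $disk.WriteCacheType = 9
  $disk.WriteCacheSize = 512      # MB of device RAM to use before overflowing to disk
  Set-PvsDisk $disk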

The current PVS storage configuration for the cache is as follows:

PVS1 (VM) -> 1700 GB allocated through virtual HBA (total golden image size is 440 GB; the remainder is for write cache)

PVS2 (VM) -> 1700 GB allocated through virtual HBA (total golden image size is 440 GB; the remainder is for write cache)

Proposed storage configuration change:

After consulting multiple blogs, the write cache proposed for each image (profile) is 20 GB -> therefore, for 86 HSDs, approximately 1,720 GB (86 x 20 GB) is required, and it should be presented to the complete Hyper-V cluster, as the HSDs are hosted on the cluster.

Citrix XenApp/XenDesktop/XenServer Servicing Options

Citrix provides servicing options to give greater flexibility and choice in how to adopt new XenApp, XenDesktop, and XenServer functionality, while giving greater predictability for maintaining and managing the support of your environment.

Last year, Citrix introduced two new XenApp/XenDesktop servicing options: the LTSR, which stands for Long Term Service Release, and the CR, a.k.a. Current Release. In 2016, Citrix announced the first LTSR of XenApp and XenDesktop 7.6, and in 2017 the first LTSR for XenServer 7.1, which is available for download on Citrix.com.

What is LTSR?

As a benefit of Software Maintenance, Long Term Service Releases (LTSR) of XenApp, XenDesktop, and XenServer enable enterprises to retain a particular release for an extended period of time while receiving minor updates that provide fixes, typically void of new functionality. An LTSR is ideal for large enterprise production environments where you would prefer to retain the same base version for an extended period.

A Long Term Service Release guarantees 5 years of mainstream support and an optional 5 years of extended support (which needs to be purchased separately). This includes cumulative updates every 4 to 6 months, a new LTSR version of XenApp/XenDesktop every 12 to 24 months, and any potential (hot)fixes.

A valid Software Maintenance (SM) contract is needed to make use of the LTSR or CR servicing option.

The ideal environment for an LTSR is a customer who typically follows a 3-5 year version upgrade cycle.

Long Term Service Releases will have a regular cadence of Cumulative Updates that will typically contain only fixes.

What is Current Release?

Any new release of XenApp/XenDesktop/XenServer will be labeled a Current Release. With the CR servicing option you can always make use of (install) the most recent XenApp and/or XenDesktop versions including all the latest enhancements and additions that come with it.

Its release cycles are much shorter, with a new version generally announced every three to nine months.

Citrix recommends that large enterprise customers have a combination of Current Release and Long Term Service Release environments.

Switching from LTSR to CR servicing, and vice versa, is always possible as well.

All initial releases of XenApp/XenDesktop/XenServer will be a Current Release. There will likely be multiple Current Releases of a major XenApp/XenDesktop/XenServer version (e.g. 7.6, 7.6 FP1, 7.6 FP2, 7.6 FP3, 7.7, 7.8, 7.9, 7.11, 7.13, 7.14); however, there will likely be only one LTSR release of that version, declared after the release is considered customer-tested and industry-proven (e.g. 7.6 FP3).

How will the customer know if their environment is Long Term Service Release compliant?

Citrix support and engineering have developed the LTSR Assistant tool, which scans your environment and compares it with the necessary LTSR components to determine whether you are compliant. The tool provides a report outlining the updates required to achieve compliance. The LTSR Assistant tool is available for download at http://support.citrix.com/article/CTX209577.

Will a customer running an LTSR-compliant environment be supported if they also have non-compliant components?

Citrix does not recommend mixing non-compliant components. For example, if a customer implements Provisioning Services 7.7, which is not compliant with the current 7.6 LTSR environment, and the customer then has an issue with Provisioning Services 7.7, the customer may be asked to move to the latest Provisioning Services Current Release to receive public fixes.

How often will Citrix release a Long Term Service Release of XenApp and XenDesktop or XenServer?

Citrix will release a Long Term Service Release of XenApp and XenDesktop or XenServer based on the number of features, implementations, customer support cases, and general feedback. As very general guidance, it can be expected that Citrix will release a new Long Term Service Release every 12-24 months; however, Citrix reserves the right to alter those timelines.

Is Citrix discontinuing the process of providing Hotfix Rollup Packs (HRP) for XenApp and XenDesktop?

With LTSR, Cumulative Updates will replace Hotfix Rollup Packs (HRP). Hotfix Rollup Packs (HRP) will still be made available for XenApp 6.5.

Will 7.6 LTSR support XenApp for Windows Server 2008 R2 for 10 years?

Windows Server 2008 R2 will not be eligible for extended support. Citrix will continue to monitor Windows 2008 R2 lifecycle dates for future determination of lifecycle milestones.


Hyper-V VM Snapshot Deletion Activity (1.9 TB): Challenges

Issue:

In one of our customers’ infrastructures, a snapshot for one VM had grown to 1.9 TB. It had been created by an engineer as part of an Integration Services upgrade, who then forgot to delete it.

Environment

  • Hyper-V: 2012 R2 cluster (4 nodes)
  • 2 volumes (Volume1: 7 TB, 1.18 TB free; Volume2: 7 TB, 980 GB free)
  • VM role: standalone critical VM hosting MS SQL (2008 R2) databases; the total size of all 100 databases is 1.3 TB

Challenges:

  • VM-level backups did not exist due to a backup licensing issue; regular database backups were taken with the backup tool, but as of that date the SQL and backup teams had not tested restoration.
  • Additional free space was expected from storage, as a snapshot deletion (merge) requires the equivalent of the VHD’s free space -> due to a storage credentials issue, the storage team was unable to provide any support.

Due to the above two challenges, the options below were planned and completed as prerequisites:

  • Removed all unwanted files from Volume2, increasing its free space to 1.6 TB, so that the snapshot deletion (merge) would not run into space issues
  • Built a new VM (SQL Server) and restored the databases to it -> this test estimated the restoration time and checked database consistency

Implementation plan:

Prerequisites

  • As there is no VM-level backup, the backup team needs to take a FULL database backup and a differential backup after the downtime starts
  • Shut down the VM
  • Move CSV Volumes 1 & 2 to the Hyper-V server where the VM is hosted -> to provide better I/O
  • Make sure only this one VM is hosted on the Hyper-V server -> to provide better performance; we had sufficient resources to dedicate the host to one VM

Implementation Plan:

  • Go to Hyper-V Manager -> select the VM -> right-click -> Delete Snapshot (see the PowerShell sketch below)

Note: if the merge process takes longer than expected, do not cancel it midway, as there is a high chance of corruption.
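
The same deletion can be driven from PowerShell, which also makes the offline merge easy to watch. A minimal sketch with the built-in Hyper-V module; the VM name is a placeholder, and the exact Status text during a merge is an assumption based on what Hyper-V Manager displays:

  # Review what will be merged before deleting anything
  Get-VMSnapshot -VMName 'SQLVM01'

  # Delete the snapshot; with the VM powered off, the AVHDX merges into its parent
  Get-VMSnapshot -VMName 'SQLVM01' | Remove-VMSnapshot -IncludeAllChildSnapshots

  # Poll the merge from the host (Status typically shows "Merging disks (...%)")
  while ((Get-VM -Name 'SQLVM01').Status -like '*merg*') {
      Get-VM -Name 'SQLVM01' | Select-Object Name, Status
      Start-Sleep -Seconds 60
  }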

Roll Back Plan:

  • The backup team needs to restore the SQL databases directly to the new VM that was prepared as a standby
  • Change the hostname & IP to the production values
  • The SQL team needs to change the hostname at the SQL instance level
  • The application team needs to check connectivity


All sizes in GB; the “before execution” figures cover the parent, snapshot, and total columns, and “post snapshot deletion” is the resulting VM file size.

VHD File    | Drive letter in OS | Parent file | Snapshot file | Total  | VM file size after deletion | Storage volume
Drive0.VHDX |                    | 83.3        | 46.8          | 130.1  | 87.2                        | Volume 1
Drive1.VHDX | E                  | 1540        | 437.7         | 1977.7 | 1540                        | Volume 1
Drive3.VHDX | F                  | 221.6       | 214.8         | 436.4  | 271.2                       | Volume 1
Drive4.vhdx | G                  | 1950        | 1240          | 3190   | 1950                        | Volume 2
Total       |                    | 3794.9      | 1939.3        | 5734.2 | 3848.4                      |

Deleting the 1.9 TB snapshot offline took 5 hours (12:30 to 5:30 A.M.); the space reclaimed was 1885.8 GB.

VM Backups Failing on Only One Node in a 2012 R2 Cluster

Issue:

In a 5-node Hyper-V 2012 R2 cluster, VM backups suddenly started failing on only one node (HOST2), i.e. the backup team was unable to take a backup of any VM hosted on HOST2.

Observation:

  • When the backup team fires a VM-level backup on HOST2, the backup terminates with a VSS snapshot error
  • If the VM is migrated to another node, the backup succeeds for the same VM
  • The issue is not specific to a VM or to any Cluster Shared Volume -> it occurs only when VMs are hosted on HOST2

Troubleshooting:

  • As the issue was specific to HOST2, tested a VM backup with the Windows native backup tool -> unable to take a backup; it terminated while creating the VSS snapshot
  • Created a new VM on the local D drive -> tested with the Windows backup tool -> the backup succeeds if the VM is hosted on a local drive; it fails only if the VM is on cluster shared storage
  • As the issue was specific to one server and the CSV writer on HOST2 -> started troubleshooting from the CSV writer side
  • Performed a deep analysis of the event logs -> which pointed towards the CSV writer being unregistered -> see the screenshot below
  • Ran the command “vssadmin list providers” on HOST2 and compared with the other servers -> observed that the provider “Microsoft CSV Shadow Copy Provider” is missing from HOST2 -> screenshot attached
  • As the CSV provider was missing on the problematic HOST2 -> fixed the issue by exporting the provider’s CLSID key from a working server and importing it on HOST2 (see the command sketch after this list) -> see the screenshot below
  • After the import, ran “vssadmin list providers” again -> the provider list now matches the working servers
  • Backups worked fine after the fix
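
For reference, the export/import itself is a two-command operation. A sketch of what was done; the registry path is inferred from the provider ID seen in the event logs, so verify the exact GUID against vssadmin output on a working node before importing anything:

  # Compare registered VSS providers on HOST2 against a working node
  vssadmin list providers

  # On a working node (e.g. HOST3): export the CSV shadow copy provider key
  reg export "HKLM\SYSTEM\CurrentControlSet\Services\VSS\Providers\{400a2ff4-5eb1-44b0-8a05-1fcac0bcf9ff}" C:\Temp\csv-provider.reg

  # Copy the .reg file to HOST2 and import it there
  reg import C:\Temp\csv-provider.reg

  # Confirm the provider list now matches the working nodes
  vssadmin list providers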

Error Screenshots


Volume Shadow Copy Service (VSS) provides the ability to create a point-in-time image (shadow copy) that can be used to perform backups. In our environment, the backup of any VM hosted on the HOST2 node failed immediately once it showed “Snapshot Processing”, meaning the snapshot operation was not happening. The provider ID (400a2ff4-5eb1-44b0-8a05-1fcac0bcf9ff) reflected in the Event Viewer logs belongs to the Microsoft CSV Shadow Copy Provider, which did not exist in the registry, as it had presumably been unregistered.

Working server (HOST3)

Not working (HOST2) -> CLSID is missing

Final Screenshot


One Hyper-V Node in a 2012 R2 Cluster Changing to Paused State Automatically

In one of my customers’ infrastructures, we have a 5-node Hyper-V 2012 R2 cluster. Among these 5 nodes, Node1 kept changing to a paused state automatically every 30 minutes.

Issue:

Node1 goes into the Paused state (with DO NOT FAIL ROLES BACK), i.e. the node pauses without moving its VMs.

Observation:

  • The issue is resolved only after stopping the SCVMM agent service on BHHV-A01 (Node1)
  • A migration from Hyper-V 2012 to 2012 R2 and SCVMM 2012 to 2012 R2 had happened recently (approx. 2 months earlier)
  • No scheduled tasks were configured

In an ideal scenario, a Hyper-V node goes into pause mode only if an administrator puts it in maintenance mode, or if SCVMM pauses it because Dynamic Optimization or PRO is configured in SCVMM -> but these settings were not configured in SCVMM.
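
When this happens, the node state is quick to check and reverse from PowerShell. A minimal sketch with the FailoverClusters module (the node name is a placeholder):

  # Show which nodes are Up and which are Paused
  Get-ClusterNode | Select-Object Name, State

  # Resume the paused node and fail the roles back immediately
  Resume-ClusterNode -Name 'Node1' -Failback Immediate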

The issue looked very peculiar, as only one node was affected and the issue resolved whenever we stopped the SCVMM agent on Node1.

I knew SCVMM was the culprit, as the issue resolved after stopping the SCVMM agent service -> I asked the customer to reinstall the SCVMM agent on Node1, but he was not convinced.

I started searching SCVMM known issues in forums and found the resolution below.

Solution:

It was observed that SCVMM had been installed at the RTM version, and there is a known pause issue listed as fixed in Update Rollup 5.

The latest rollup at the time was Update Rollup 10, and the issue below was fixed in Rollup 5.


Pass-through Disk Addition in a Highly Available VM: Differences Between 2012 & 2008

Steps to add a pass-through disk to a highly available VM in 2012 R2 (see the PowerShell sketch after this list):

  • Shut down the VM if it is powered on (best practice)
  • Make sure the disk is online at the HOST level and note down the disk number
  • Go to the Failover Console -> add the disk to the cluster -> after adding, it will be placed in “Available Storage” -> note the disk number in the console for later verification
  • Check whether the disk owner shown in the failover console is the server you are currently working on; otherwise, perform all remaining steps logged on to the disk’s owner server
  • In the Failover Console -> under the Disks section -> right-click the disk -> Assign to VM Role -> select the VM to which you want to assign it
  • After adding the disk to the failover cluster and assigning it to the VM role, ensure the disk is online on the HOST. If it is offline while you perform the remaining steps, the disk will be read-only in the VM with no way to fix it but to start over
  • In Failover -> Roles -> go to the VM -> check under the Resources section -> under Virtual Machine -> the “Virtual Machine Configuration” resource should be online
  • In the Failover Console -> go to VM Settings -> add a virtual SCSI adapter -> select the pass-through disk noted earlier (Disk 4 in this example)
  • Start the VM -> check whether the disk is accessible
  • Test live migration
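
Parts of this flow can be scripted as well. A minimal sketch under stated assumptions: the Hyper-V and FailoverClusters modules on the disk’s owner node, with disk number 4 and the VM name as placeholders; assigning the disk to the VM role is still done in Failover Cluster Manager as described above.

  # Confirm the disk is online at the host and note its number
  Get-Disk | Select-Object Number, FriendlyName, OperationalStatus

  # Add every eligible disk to the cluster's Available Storage
  Get-ClusterAvailableDisk | Add-ClusterDisk

  # Attach the physical disk to the VM's SCSI controller as a pass-through disk
  Add-VMHardDiskDrive -VMName 'SQLVM01' -ControllerType SCSI -DiskNumber 4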

In 2008 or 2008 R2

The DISK should be offline at the HOST, otherwise it will go into READ-ONLY mode -> blogs confirm this, and I have seen the same issue myself.

A new disk must be brought online and initialized before it can be used. This process writes a disk signature to the disk so the cluster can use it. Once the disk has been initialized, it can be placed offline again. No partitioning is required, as that will be done inside the virtual machine.

The difference between adding a pass-through disk in 2008 and 2012 is this: in 2008, the disk should be initialized and then taken offline, whereas in 2012 it should stay online throughout the process.
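
On 2008/2008 R2 there is no Set-Disk cmdlet for the offline step, so diskpart does the work; a minimal sketch, with disk 4 as a placeholder (run on the host after initializing the disk):

  # Write a diskpart script and run it to take disk 4 offline at the host
  Set-Content -Path C:\Temp\offline-disk4.txt -Value 'select disk 4', 'offline disk'
  diskpart /s C:\Temp\offline-disk4.txt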


Pass-through Disk Addition Issue in a Cluster: Disk Read-Only After Adding a Pass-through Disk

Issue

  • Unable to add a pass-through disk in the failover console to make virtual machines (2 VMs) highly available with a pass-through disk.
  • Multiple VMs had pass-through disks, and there was no issue with any of the others.
  • The issue occurred after one of my team members removed a pass-through disk following a VM shutdown
  • Able to add the pass-through disk as long as it was not added to HA

Initial Troubleshooting

  • One of my team members removed a pass-through disk and shut down the VM as part of a planned maintenance activity. After the VM started, the disk went into read-only mode in the guest O.S.
  • Being short on time, I tried to remove the pass-through disk from the VM without shutting it down -> the disk kept changing to read-only mode
  • As the disk kept changing to read-only mode, I assumed the disk needed to be offline at the host level -> therefore my only option was to turn on maintenance mode for the disk in the failover console
  • In the failover console -> put the disk into maintenance mode -> added the pass-through disk to the VM in the failover console -> this worked; the disk is in normal mode in the guest O.S.

Putting the disk into maintenance mode does not impact functionality. Enabling this mode just disables a few disk checks performed by the cluster service, such as file/device system checks and the IsAlive and LooksAlive probes (see the PowerShell sketch below).

Maintenance mode will remain on until one of the following occurs:

  • You turn it off.
  • The node on which the resource is running restarts or loses communication with other nodes (which causes failover of all resources on that node).
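
Maintenance mode can be toggled from PowerShell too; a minimal sketch with the FailoverClusters module (the resource name is a placeholder, matching how the disk appears in the console):

  # Turn maintenance mode on for the disk resource (suspends the health checks above)
  Suspend-ClusterResource -Name 'Cluster Disk 4'

  # ... perform the disk work ...

  # Turn maintenance mode off again
  Resume-ClusterResource -Name 'Cluster Disk 4'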

I took downtime, as I needed to turn off disk maintenance mode and resolve the issue permanently.

Next Troubleshooting:

  • Removed VM & Disk  from High Availability and Re-added to Failover Console -> No Luck
  • Moved VM to different Host server’s and tested the same steps to isolate issue from Host level ->No Luck
  • Created Test VM and executed similar to isolate issue from VM level ->No Luck
  • Tested by assigning Cluster disk’s with different servers to isolate issue from Disk ownership -> No luck
  • Tried Pass-through Disk by keeping in Disk Maintenance mode ( Previous state) -> No Luck
  • Removed VM & Disk from HA and added only in Hyper-v Manager -> It is working without High Availability 

Next Observations:

  • Before adding to the cluster, when bringing the DISK online -> the disk automatically appeared in Windows Explorer with a drive letter -> the drive letter appears because the pass-through disk is not new; it was already in use in production with a drive letter, so it mounts directly.
  • When adding the cluster disk in the failover console (let’s say on HOST1) -> disk ownership changed to HOST2 after adding it to the cluster -> this is the main difference between this VM and the other VMs
  • Received an error while adding the pass-through disk to the VM in the failover console -> error “An error occurred while updating the virtual machine configuration settings, Error code: 0x8007100c, Not Supported”

We involved Microsoft support to examine this peculiar issue; below are the root cause and solution.

  • The UI (failover console) was trying to check permissions, which is why we received an error on the disk being presented as pass-through, as these disks are presented from the SAN.
  • When we add a disk as pass-through to the VM, it gets added with the MPIO path of the disk. Because of this, when we add it from Failover Cluster Manager to the VM, it fails to update that path in the VM configuration file, as this requires certain permissions which cannot be applied to the path \\?\mpio#disk&ven_dgc&prod_raid_5&rev_0532#1&7f6ac24}

Error:

‘Virtual Machine DBL’ failed to start.

‘DBL’ failed to start. (Virtual machine ID XXXXXXXXXX)

‘DBL’ Synthetic SCSI Controller (Instance ID XXXXXXX): Failed to Power on with Error ‘General access denied error’ (0x80070005). (Virtual machine ID XXXXXXX)

‘DBL’: Hyper-V Virtual Machine Management service Account does not have permission to open attachment ‘\\?\mpio#disk&ven_dgc&prod_raid_5&rev_0532#1&}’. Error: ‘General access denied error’ (0x80070005). (Virtual machine ID XXXXXXX)

The path shown above is the disk path on which we cannot add the permission.

To force that path to be updated in the VM configuration file, we have to run the following PowerShell command:

Update-ClusterVirtualMachineConfiguration -VMId XXXXXX-XXXX-XXXX

The command above updated the path successfully in the VM configuration, and the VM booted successfully.
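
For reference, the VM ID does not have to be typed by hand; it can be pulled straight from Hyper-V. A minimal sketch (the VM name ‘DBL’ is taken from the error above):

  # Look up the VM's ID and refresh its clustered configuration with it
  $vm = Get-VM -Name 'DBL'
  Update-ClusterVirtualMachineConfiguration -VMId $vm.VMId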

Error Screenshots


References:

How to add a Pass-through disk to a Highly Available Virtual Machine running on a Windows Server 2012 R2 Failover Cluster

Read-only pass-through disk after you add the disk to a highly available VM in a Windows Server 2008 R2 SP1 failover cluster

