Page 7 of 8

In a 15-node cluster, the majority of the virtual machines failed over (restarted) due to network fluctuations during a network maintenance activity

November 26, 2016 / Ram Prasad

Environment:

OS: Windows Server 2012

Model: IBM Flex System 8721 (Chassis); the Hyper-V servers are in 2 chassis

Network : 2 (Public & private), 2 different Switches

Immediate Observations:

During the core network switch activity, both the public and private network interfaces were disturbed for 30 to 60 seconds at a time, and the disturbances recurred over a period of 2 hours.
Observed event IDs 1127 (network failure), 1135 (removal from cluster membership), 1177 (quorum lost) and 5120 (CSV disconnection).

hyper-v-event-1127

hyper-v-event-csv-disconnection

hyper-v-event-1135

hyper-v-event-1177

Immediate Actions performed

As heartbeats were being lost and neither interface stayed up without interruption, we increased the heartbeat tolerance by raising SameSubnetThreshold from the default of 5 (5 seconds at the default delay) to 28 (28 seconds). This value was chosen with reference to the blog linked below.
Microsoft recommends keeping the heartbeat tolerance at 20 seconds or less. Since we observed that VLAN flapping was taking about 25 seconds, we set SameSubnetThreshold to 28 as a trial. Note that RouteHistoryLength cannot be set higher than 40 due to a limitation.

hyper-vclus

SameSubnetDelay = 1000 (1000 milliseconds, i.e. 1 second)

SameSubnetThreshold = 5 (i.e. 5 heartbeats)

So by default the total heartbeat tolerance of a cluster is Delay * Threshold = 1 sec * 5 HB = 5 sec, i.e. the cluster tolerates 5 missed heartbeats over 5 seconds.

Delay – This defines the frequency at which cluster heartbeats are sent between nodes. The delay is the time before the next heartbeat is sent; the value is configured in milliseconds.

Threshold – This defines the number of heartbeats which are missed before the cluster takes recovery action.

For example, setting SameSubnetDelay to send a heartbeat every 2 seconds and setting SameSubnetThreshold to 10 missed heartbeats before taking recovery means that the cluster has a total network tolerance of 20 seconds (2 sec * 10 HB) before recovery action is taken.
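The tolerance arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name heartbeat_tolerance_seconds is made up for this example and is not a cluster API, and it assumes the delay is given in milliseconds, as in the SameSubnetDelay value above.

```python
def heartbeat_tolerance_seconds(delay_ms: int, threshold: int) -> float:
    """Return how long heartbeats may be missed before recovery starts.

    delay_ms  -- interval between heartbeats, in milliseconds (hypothetical name)
    threshold -- number of missed heartbeats tolerated (hypothetical name)
    """
    if delay_ms <= 0 or threshold <= 0:
        raise ValueError("delay_ms and threshold must be positive")
    return delay_ms * threshold / 1000.0


if __name__ == "__main__":
    # Defaults quoted above: 1000 ms delay, 5 missed heartbeats.
    print(heartbeat_tolerance_seconds(1000, 5))
    # Example from the text: 2 second delay, 10 missed heartbeats.
    print(heartbeat_tolerance_seconds(2000, 10))
    # Value tried during this incident: 1 second delay, 28 missed heartbeats.
    print(heartbeat_tolerance_seconds(1000, 28))
```

Even the 28-second setting is shorter than the 30 to 60 second disturbances observed during this activity.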

hyper-v-disclaimer

Raising the threshold to 28 heartbeats did not meet our requirement, as the network fluctuations lasted more than 30 seconds, so we asked Microsoft support for any other best practice and received the recommendations below.

The SameSubnetDelay and SameSubnetThreshold values are specific to the heartbeat settings between cluster nodes/hosts. Changing them helps delay the heartbeat checks between the nodes/hosts.
These changes do not control the SMB multichannel connections used by the Cluster Shared Volumes (CSV). The moment the TCP connection is dropped during the network maintenance activity, the SMB channel is affected.
When the SMB connection drops, it affects the VMs hosted on the CSV volumes. They will not be able to get the metadata for the CSV over the SMB channel.
Due to problems on the CSV network (SMB channel), you will see event ID 5120 for the CSV volumes. This impacts VM availability.

So based on the above points, if there is a network outage beyond 10-20 seconds, the cluster will be impacted and there is no way to avoid the impact on the VM resources. It is recommended to move the VMs to nodes where there will be no network impact, or to bring them offline gracefully, before the network maintenance activity.

We had also asked Microsoft about removing the VMs from HA during the activity. Their response, below, is that this may not be viable in a CSV environment, since SMB connection drops still impact the CSV:

Removing the VMs from HA and adding them back is time-consuming and not an easy option, and Microsoft does not suggest it in a CSV environment.

Ref: https://blogs.msdn.microsoft.com/clustering/2012/11/21/tuning-failover-cluster-network-thresholds/

All the VMs are failing over (restarting) unexpectedly from one node to another in a 15-node Hyper-V cluster after a storage firmware upgrade

November 26, 2016 / Ram Prasad

Environment:
OS: Windows Server 2012
Model: IBM Flex System 8721 (Chassis)
Hyper-V servers are in 2 chassis
Storage: IBM SVC7000 over FC
Multipathing: in place, using IBM DSM

Immediate Observations:
  VMs were impacted after the storage firmware upgrade activity
  2-4 hours after the activity completed, observed event IDs 5120 (STATUS_IO_TIMEOUT) and 5142 for all CSVs at different times
  Observed continuous event IDs 129 and 153 on all Hyper-V base servers from the time the storage activity started

hyper-v-event-5120

hyper-v-event-5142

hyper-v-event-129

hyper-v-event-153

hyper-v-event-5

Immediate Actions performed

Planned to reboot all Hyper-V servers one by one, starting with the coordinator node (the Hyper-V host that owns the CSV disk), to release the locks and bring the VM failovers under control immediately.
After each Hyper-V host was rebooted, moved the CSV disk to that host. After 3 or 4 Hyper-V servers had been restarted, the VM failovers were under control. However, a few VMs could not be moved or failed over manually because of locks.
Therefore, as a good practice, restarted all Hyper-V servers so that the storage paths would be re-established without any issues.

After resolving the issue, we started looking for the root cause of the multipathing failure.

Our analysis, based on event IDs 129, 153, 5120 and 5142 above, is as follows.

Each cluster node has direct access to a CSV LUN, as well as redirected access over the network through the node that is the coordinator (owner) of the CSV resource. A 5120 error indicates a failure of redirected I/O, and a 5142 indicates a failure of both redirected and direct I/O.

Warning events are logged to the system event log with the storage adapter (HBA) driver’s name as the Source. Windows’ STORPORT.SYS driver logs this message when it detects that a request has timed out; the HBA driver’s name is used in the error because it is the miniport associated with storport.

The most common causes of Event ID 129 errors are unresponsive LUNs or a dropped request. Dropped requests can be caused by faulty routers or other hardware problems on the SAN. If you are seeing Event ID 129 errors in your event logs, you should start by investigating the storage and fibre network.

An event 153 is similar to an event 129. The difference is that a 129 is logged when the storport driver times out a request to the disk, while a 153 is logged when the storport miniport driver times out a request.

The miniport driver may also be referred to as an adapter driver or HBA driver; this driver is typically written by the hardware vendor.

Finally, we understood that there was a connectivity issue somewhere in the storage driver stack, between MPIO (IBM DSM) and the HBA driver, and we involved the storage vendor for a deep analysis from the storage end.

From the storage team, we learned that read/write abnormalities (very high read/write latency) had been found on the volumes before the storage upgrade activity; however, they had fixed these before the upgrade.

From the above statement and a few blogs, we understood that in the Draining state a volume pends all new I/Os and any failed I/Os. As the storage vendor confirmed abnormal read/write latency on the volumes, this would have delayed I/O completion for the CSV volumes, which then went into a paused state with I/O timeout errors.

There is one timer per logical unit and it is initialized to -1. When the first request is sent to the miniport the timer is set to the timeout value in the SRB.

The timer is decremented once per second. When a request completes, the timer is refreshed with the timeout value of the head request in the pending queue. So, as long as requests complete the timer will never go to zero. If the timer does go to zero, it means the device has stopped responding. That is when the STORPORT driver logs the Event ID 129 error. STORPORT then has to take corrective action by trying to reset the unit.
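The timer behaviour described above can be modelled with a small Python sketch. This is a toy model written for this post, not Storport code: the class name LunTimer and its methods are hypothetical, requests are assumed to complete in queue order, and the "reset the unit" step is not modelled.

```python
from collections import deque

IDLE = -1  # timer value while no request is outstanding


class LunTimer:
    """Toy model of the per-logical-unit timer described above."""

    def __init__(self) -> None:
        self.timer = IDLE
        self.pending = deque()  # timeout values (seconds) of outstanding requests

    def send(self, timeout_s: int) -> None:
        """Queue a request; the first outstanding request arms the timer."""
        self.pending.append(timeout_s)
        if self.timer == IDLE:
            self.timer = timeout_s

    def complete(self) -> None:
        """Complete the head request and refresh the timer from the new head."""
        self.pending.popleft()
        self.timer = self.pending[0] if self.pending else IDLE

    def tick(self) -> bool:
        """Advance one second; True means the timer is at zero (Event ID 129)."""
        if self.timer == IDLE:
            return False
        if self.timer > 0:
            self.timer -= 1
        return self.timer == 0


if __name__ == "__main__":
    lun = LunTimer()
    lun.send(2)          # first request arms the timer with its timeout
    lun.send(5)
    lun.tick()           # one second passes
    lun.complete()       # completion refreshes the timer from the next request
    for second in range(1, 6):
        if lun.tick():   # nothing completes any more, so the timer runs down
            print(f"timeout after {second} more seconds: Event ID 129")
```

As long as complete() keeps being called, the timer is refreshed and never reaches zero, which matches the description above.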

Also, it is recommended to upgrade the HBA driver, as it is outdated, and to update CSVFLT.sys and CSVFS.sys by following KB3013767.

Ref:

https://blogs.msdn.microsoft.com/ntdebugging/2011/05/06/understanding-storage-timeouts-and-event-129-errors/

https://blogs.msdn.microsoft.com/clustering/2014/12/08/troubleshooting-cluster-shared-volume-auto-pauses-event-5120/

https://blogs.msdn.microsoft.com/clustering/2014/02/26/event-id-5120-in-system-event-log/

Unable to Start any VM in one of the Hyper-V node cluster

October 23, 2016 / Ram Prasad

Issue:
Unable to start any VM on one node of a Hyper-V (Windows Server 2012) cluster.

Observations

The issue started after a Symantec upgrade; the Symantec upgrade was not successful
Unable to migrate any VM to this node from another node in the cluster
Unable to start any VM on that node. After clicking Start, the VM stays in the Starting state and throws an error after 2-5 minutes.

Error Messages Seen:
TEST failed to start worker process: “server execution failed (0x80080005)”

Troubleshooting done:

Gave Full Control to Everyone on the registry key below: HKCR\AppID\{8BC3F05E-D86B-11D0-A075-00C04FB68820}
After granting the permission, we were able to start VMs on the host.
Compared the registry key permissions with another working machine and found that Creator Owner was missing on the problem machine. Added it and removed Everyone; VMs could still be started.
To fix the WMI and DCOM errors, re-registered the DLLs by running the command below: for /f %s in ('dir /b *.dll') do regsvr32 /s %s

Attached is the detailed document with all screenshots. Unable to Start VM -Doc

Issues encountered post deployment of Netscaler VPX 10.5

July 16, 2016 / Ram Prasad

Issues encountered post deployment of Netscaler 10.5

Requirement:

The customer imported NetScaler 10.5 VPX into Hyper-V and asked us to complete the remaining configuration

Issue 1: NetScaler URL is not opening over the internet

Observations & changes done:

NetScaler has 3 interfaces (DMZ, LAN zone & loopback)

Netscaler Interface

Netscaler IP’s as below

Netscaler IP

172.16.8.X is the DMZ virtual IP. It must be properly NATed to the public IP 192.X.X.X; only then will the NetScaler Access Gateway web page open over the internet.
The network team will add internal routes from 172.16.8.X to the core switches so that traffic reaches the Citrix infrastructure servers
Note that 172.16.8.X is the virtual IP that you configure on the Gateway virtual server
Make sure that ports 80 (STA port), 443 (STA port), 1494 and 2598 are open bidirectionally between the NetScaler virtual IP (172.16.8.X) and the Citrix infrastructure servers

After the above configuration, the NetScaler web page opened over the internet, but we observed certificate errors and an authentication issue

Issue 2:

Users get an error that the credentials are incorrect when logging in to NetScaler

Resolution:

The LDAP configuration was not as per the article http://support.citrix.com/article/CTX108876; correcting it resolved the incorrect username/password error.

Netscaler LDAP-1

Netscaler LDAP-2

Netscaler LDAP-3

Issue 3: Certificates errors on Netscaler.

Observations & changes done:

Observed that the intermediate and root certificates were missing on NetScaler, which was also causing authentication issues.
From the client end, users got the authentication prompt but could not establish a full session
Using the openssl command, we verified that the certificate chain is complete and linked on the VPN virtual server on NetScaler Gateway.
- # /usr/bin/openssl s_client -connect <ip:port> -showcerts

As per article http://support.citrix.com/article/CTX114146

Issue 4:

VDI launch works with the internal URL but not externally; it throws a VDI error

Observations & Changes done

Observed that the session policies were incorrectly configured; created 2 session policies (Web and Receiver policy)

Using the article http://support.citrix.com/article/CTX139963

Netscaler Session-1

Netscaler Session-2

Netscaler Session-3

Netscaler Session-4

For Receiver, the account services address needs to be configured (similar to the XenApp Services URL)

Netscaler Session-5

Issue 5:

Error: “Cannot complete request” appears before logging in to the NetScaler web page, and the issue is the same from the internal URL too.

Observations:

The load balancing virtual name (VDIDesktopxx.locaL) is configured in the session profile, but this load balancing VIP (SF1+SF2) was hosted on a separate load balancer and there was some issue with it
The customer removed the StoreFront load balancing IP configuration and asked us to point NetScaler to one StoreFront server (SF1) only.
After the load balancing configuration was removed, we got the error “Cannot complete request” because NetScaler was unable to find the load balancing IP

Changes done:

The certificate was bound to the local load balancing virtual name (VDIDesktopxx.locaL). To keep using it, we created an alias entry for the SF1 server so that the same URL is accessed internally and is also reachable from NetScaler
Observed that XML trust was set to false on the DDC; recommended setting it to true, so ran the command Set-BrokerSite -TrustRequestsSentToTheXmlServicePort $true

After all the above changes, users are able to launch VDI externally and internally without any issues

XenApp- Applications are unable to launch from DR Web Interface server’s

July 6, 2016 / Ram Prasad

Issue:

Applications are unable to launch from DR Web Interface server’s.

Troubleshooting:

Troubleshooting started with the Notepad application, mapping it to different XenApp servers, Web Interface servers and zone data collectors from Pune & Delhi.
The issue was observed at the DR zone data collectors (ZDC): qfarm /load does not return any value when run from either ZDC
As no value was returned from the ZDCs, we suspected that the ZDC was not contacting the database to load dynamic information.
Observed that the DR ZDC MF20.dsn (database connection file) was pointing to the Pune SQL database. This is incorrect, as it is a single farm and the farm database is active on the Delhi SQL server.

Solution:

Reconfigured the Pune ZDC02 server to use the Delhi SQL database by running the dsmaint config command with the new username/password
After MF20.dsn was reconfigured, the zone data collector returns load values for qfarm /load and applications launch without any issues

Observations & Recommendations:

As the farm connects to only one database, if there is no sync between the primary and DR SQL servers we need to restore the latest backup copy of the production database and reconfigure MF20.dsn during the DR drill. This is a significant step during a DR drill
SQL mirroring can be configured from the production to the DR SQL servers to avoid the above step.
No hotfixes are installed; the hotfix rollup pack needs to be installed at the same level as production, or the latest. This is critical to avoid known issues

Enabling Jumbo Frames on CISCO UCS blades -Hyperv

July 6, 2016 / Ram Prasad

How to enable Jumbo Frames for Hyper-v hosted on CISCO UCS blades

Jumbo frames can be enabled from UCS Manager; no changes are needed on the Windows side if the servers are hosted on Cisco UCS blades

You need to make 3 changes:

Set the System Class MTU to 9216
Create a QoS policy for the MTU
Set the vNIC to use an MTU of 9000 and the QoS policy you created

On UCS, jumbo frames are configured as a QoS policy; the configuration guide is at the link below:

http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/sw/gui/config/guide/2-2/b_UCSM_GUI_Configuration_Guide_2_2/configuring_quality_of_service.html

If you are planning to use Hyper-V as your OS, the following configuration guide is useful for understanding which UCS components you need to configure to enable jumbo frames:

http://www.cisco.com/c/en/us/support/docs/servers-unified-computing/ucs-b-series-blade-servers/117601-configure-UCS-00.html

Find the document with screenshots

Document -Jumbo Frames enablement-CISCO UCS

XenDesktop Controller Hotfix Update Procedure

July 6, 2016 / Ram Prasad

Implementation Plan

Take a full backup of the Citrix databases, both locally on the server and to tape.
Take a snapshot of DDC01 (Controller 1)
Download and install Hotfix update 1 (CTX135207) on DDC01 (Controller 1)
Reboot DDC01
Test VDI by stopping the services on DDC02 so that sessions are established through DDC01.
Take a snapshot of DDC02 (Controller 2)
Install Hotfix update 1 on DDC02 (Controller 2), following the same procedure as for DDC01
Reboot DDC02
Test VDI by stopping the services on DDC01 so that sessions are established through DDC02
Observe for 1 week and remove the snapshots.

Roll Back

Uninstall the component from ARP/Programs and Features.
Restore the data store as described in Knowledge Center article CTX135207.
Install the desired level of the component (base or other hotfix).
Restart the Controller even if not prompted to do so

Revert to the snapshot that was taken before installation

Find the document with screenshots in the attachment

Document – Xen Desktop7.5 Hotfix Update Installation Procedure

Additional Information

Users are unable to launch the applications, license errors were seen during launching of application

June 26, 2016 / Ram Prasad

Observation:

Citrix license errors were seen while logging in to the server through RDP (screenshot 1)
Citrix Licensing was in a stopped state, but there were established ICA sessions on both Citrix servers.
Tried to start the service manually; it threw an error with error code 1067 (screenshot 2)
Found an application error with code 1000 for lmadmin (screenshot 3)
SA license expired

Workaround:

Renamed the concurrent_state.xml and activation_state.xml files.
Restarted the Citrix Licensing service to recreate the concurrent_state.xml and activation_state.xml files

Cause:

The concurrent_state.xml and/or activation_state.xml files become unusable, and the Citrix Licensing Service, lmadmin.exe, does not handle the unusable file properly and crashes. This may be due to corruption of the XML files
Office scan exclusions were not configured, so there is a high chance that the XML files were corrupted because scanning blocked them; XML files are easily corrupted.
These file corruptions are known issues. Please check the articles http://support.citrix.com/article/CTX129747 and http://support.citrix.com/article/CTX200151

Recommended Action:

Plan to upgrade the Citrix license server from 11.90 to a higher version (>11.10) to address all known issues. Make sure a valid SA is in place, otherwise the upgrade is not possible.
Make sure to follow the antivirus exclusions for XenApp folders.

Screenshots

Hotfix Name Changes for XenApp/XenDesktop 7.5

June 25, 2016 / Ram Prasad

Information

This article explains the changes for Citrix hotfix naming conventions in XenDesktop 7.1/7.5 with the introduction of XenApp 7.5.

Hotfix Name Changes

With the reintroduction of XenApp in version 7.5, the same underlying components are used for XenApp and XenDesktop. As a result, the ‘XA’ and ‘XD’ designation will not appear in the hotfix name. Instead, the component name is prepended to the hotfix name.

Example 1

A hotfix previously named XD750DStudioWX86001 will now be DStudio750WX86001.

Hotfix Version Number Association

The following components did not change between version 7.1 and 7.5. From now on, updates to these components will only contain the 7.5 association in the name. The hotfixes will be available and compatible with both the 7.1 and 7.5 component versions.

Broker Agent
Desktop OS VDA
Director VDA Plugin
Enhanced Desktop Experience
Personalization AppV – Studio
Personalization AppV – VDA
Server OS VDA
StoreFront Privilege Service
Universal Print Client
Universal Print Server
WMI Proxy Plugin

Example 2

A hotfix previously named XD710ICAWSWX86006 will now be ICAWS750WX86006.

VDA Core Services Hotfixes and Machine Type Association

For the VDA core services hotfixes, the OS type is designated in the hotfix name. ‘TS’ for Terminal Server, the hotfix would apply to a Windows Server operating system. ‘WS’ for workstation, the hotfix would apply to a desktop Windows operating system. There is no correlation between the hotfix numbering for the Server OS (TS) and Desktop OS (WS) hotfixes. The ICATS hotfix ending in 007 might not have the same fixed issues as the ICAWS hotfix ending in 007.

Example 3

A hotfix named ICAWS750WX86007 is a VDA core services hotfix for a Windows 7, 8, or 8.1 operating system (32-bit).

A hotfix named ICATS750WX64007 is a VDA core services hotfix for a Windows Server 2008R2 , 2012, or 2012R2 operating system (64-bit).
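The naming pattern in the examples above can be picked apart with a short Python sketch. The layout assumed here (component name, 3-digit version, a literal W, X86 or X64, 3-digit hotfix number) is inferred only from the example names in this article and is not an official Citrix specification; the function name parse_hotfix_name is hypothetical.

```python
import re
from typing import Optional

# Layout inferred from the example names above (an assumption, not a Citrix spec):
# <component><3-digit version>W<X86|X64><3-digit hotfix number>
HOTFIX_RE = re.compile(
    r"^(?P<component>[A-Za-z]+)(?P<version>\d{3})W(?P<arch>X86|X64)(?P<number>\d{3})$"
)


def parse_hotfix_name(name: str) -> Optional[dict]:
    """Split a new-style hotfix name into its parts, or return None if it does not fit."""
    match = HOTFIX_RE.match(name)
    if match is None:
        return None
    info = match.groupdict()
    component = info["component"]
    # For VDA core services (ICA*) the suffix gives the OS type: TS = server, WS = desktop.
    os_type = None
    if component.startswith("ICA"):
        if component.endswith("TS"):
            os_type = "server"
        elif component.endswith("WS"):
            os_type = "desktop"
    info["os_type"] = os_type
    return info


if __name__ == "__main__":
    for name in ("ICAWS750WX86007", "ICATS750WX64007", "DStudio750WX86001"):
        print(name, parse_hotfix_name(name))
    # An old-style name with the XD prefix does not fit the new layout.
    print(parse_hotfix_name("XD750DStudioWX86001"))
```

Old-style names such as XD750DStudioWX86001 return None here, because the version follows the product prefix instead of the component name.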

Note: The previously released version 7.1 hotfixes will not be rebuilt with the new version identification, but the hotfix readme documents will reflect their support for versions 7.1 and 7.5.

Ref: http://support.citrix.com/article/CTX200156

Error: You cannot access this session because no licenses are available -Xen Desktop

June 25, 2016 / Ram Prasad

Symptoms or Error

A XenDesktop session fails to start, with the following error: “You cannot access this session because no licenses are available.” Event ID: 9027, Event ID: 1163

Event Source: Citrix ICA Service
Error 0 received while obtaining a license for a Citrix XenApp client connection.
The license request has been rejected.
- Note: The license error “flex code -18” shown in the DDC log stands for “License server system does not support this feature”.

The Broker log might contain the following errors:
- Controller:EventLogManager decided to log event CDS_EVENT_LICENSE_NONE_CHECKED_OUT of type Warning with arguments:
  - This is based on event log groups LicensingCheckout
- Licensing:MFLic_GetLicense result Success, request result Rejected
- Licensing:License request rejected, flex code -18

Solution

Resolution 1 – Event ID: 9027

Check the configuration of the site: XenDesktop License edition and model using PowerShell command “Get-BrokerSite” which shows license configuration for the site.

For Virtual Machine hosted applications

AppLicenseEdition
ApplicationLicenseModel

For XenDesktop sessions

DesktopLicenseEdition
DesktopLicenseModel

Compare the configuration with the licenses used in the environment from license server. Check the licenses in the license server. Change the site configuration using the Set-BrokerSite PowerShell command.

Examples

To configure the site to use the Platinum edition, use the following command:

Set-BrokerSite -DesktopLicenseEdition PLT

To set up Virtual Machine hosted applications to use Platinum edition:

Set-BrokerSite -AppLicenseEdition PLT

Resolution 2:

Download and install the most recent version of the licensing server.
On the DDC, restart the Citrix services: (This can be done on a live system and will not affect the users)
- Citrix AD Identity Service
- Citrix Broker Service
- Citrix Configuration Service
- Citrix Diagnostic Facility COM Service
- Citrix Host Service
- Citrix Machine Creation Service
- Citrix Machine Identity Service
On the License Server, restart Citrix services.
- Citrix Licensing
- Citrix Licensing Config Service
- Citrix Licensing Support Service
Access Desktop Studio > Configuration > Licensing > Change Licensing Server > Verify.
If the issue persists, restart all the DDCs in the farm one by one.

Resolution 3 – Event ID: 1163

In this case, upgrade the database as prompted by XenDesktop Studio.

Restart the Desktop Studio Server, error should no longer appear.

Resolution 4

Confirm all licenses are visible in License admin console.
Verify end date of subscription advantage of licenses are relevant to version of product.
Confirm start up server is showing in License admin console and hosts are communicating.
Import startup.lic or reinstall license admin console, which should create a new startup.lic.
- Note: The issue occurs mainly due to either a corrupt startup.lic or no startup.lic under C:\Program Files\Citrix\Licensing\MyFiles
Restart Citrix licensing service.
Relaunch application, error should no longer appear.

Problem Cause

Event ID: 9027 Licensing is not set up properly in XenDesktop Site; licenses are not checked out on the license server when the user tries to start a new XenDesktop session.
License Model: Could be User/Device or Concurrent License Edition: Could be Platinum (PLT), Enterprise (ENT) or Advanced (ADV).
Event ID: 1163 XenDesktop Upgrade did not complete successfully and the following error was observed in the XenDesktop Studio: “A database upgrade is available. Learn more about this upgrade”
The startup license file (startup.lic) under C:\Program Files\Citrix\Licensing\MyFiles is missing or corrupt.

Additional Resources

http://support.citrix.com/article/CTX136266
