Intermittent abrupt termination of the PVS Stream Service (approximately once a month), causing user sessions to freeze and preventing users from launching HSDs.
- The issue occurs once or twice a month with no common pattern in days or hours, and it recurs on both PVS servers at the same time.
- No changes in the environment.
- The onsite engineer reported that the issue had existed for three months and had been resolved each time by restarting the PVS servers.
- One day the same issue recurred but was not resolved by restarting the PVS servers -> Issue escalated to the support team (me).
- Observed Event ID 11: "Detected one or more hung threads, DbAccess error: <Record was not found> <-31754> in ServerStatusSetDeviceCount() called from SSProtocolLogin.cpp:2903" -> Indicates thread hangs under the Stream Service and DB access errors.
- Observed multiple vDisk retries on the problematic target devices: 11 at boot time and approximately 611 per hour during sessions.
- Observed that the recommended McAfee exclusions were not in place -> Stopped the McAfee service and restarted the PVS server -> The PVS Stream Service stayed stable for some time on one PVS server and then terminated again -> Due to time constraints, logged a call with the vendor (Citrix).
- After 2 hours, Citrix support joined the call and started collecting CDF traces and a procdump of the terminating Stream Service.
- After a few hours, the issue resolved on its own, and Citrix support was unable to find a root cause from the collected logs.
- Within 2 months the issue repeated twice, and the customer grew frustrated that no root cause had been found for the intermittent, abrupt Stream Service terminations.
- The support team (myself) analyzed the environment and observed that the cache mode was configured as "Cache on server", which is not recommended for production environments; best practice is "Cache on RAM with overflow to HDD", which reduces load on the PVS server and gives optimal performance -> Shared this observation with Citrix support and requested their input.
Explained to the customer that missing best practices can lead to this type of intermittent issue. Since no root cause was found and caching on the server is not a best practice for a production environment, prepared a plan to change the cache configuration to "Cache on RAM with overflow to HDD".
Current PVS storage configuration for cache:
PVS1 (VM) -> 1700 GB allocated through virtual HBA (total golden image size is 440 GB; the remainder is for write cache)
PVS2 (VM) -> 1700 GB allocated through virtual HBA (total golden image size is 440 GB; the remainder is for write cache)
Proposed storage configuration change:
After referring to multiple blogs, the proposed write cache for each image (profile) is 20 GB -> Therefore, for 86 HSDs, 1820 GB is required, and it should be presented to the whole Hyper-V cluster since the HSDs are hosted on the cluster.
- 1820 GB -> Allocated to the Hyper-V cluster (to create a 20 GB write-cache disk for each VM)
- New LUN of 800 GB to PVS1 (VM) -> To store golden images (a new LUN was taken for easy migration from the old LUN, with extra space for future requirements)
- New LUN of 800 GB to PVS2 (VM) -> To store golden images (a new LUN was taken for easy migration from the old LUN, with extra space for future requirements)
- After the cache change and migration, the old LUNs (1700 GB and 1700 GB) are planned to be released back to the storage team.
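The sizing above can be sanity-checked with simple arithmetic. This is an illustrative sketch only: the 86-HSD count, 20 GB per-VM cache, and LUN sizes come from the plan, while reading the extra space on the cluster LUN as headroom is an assumption.

```python
# Sanity check of the proposed storage sizing (figures from the plan above;
# treating the surplus on the cluster LUN as headroom is an assumption).
HSD_COUNT = 86
CACHE_PER_VM_GB = 20

write_cache_gb = HSD_COUNT * CACHE_PER_VM_GB        # 86 * 20 = 1720 GB
cluster_lun_gb = 1820                               # presented to the Hyper-V cluster
headroom_gb = cluster_lun_gb - write_cache_gb       # 100 GB spare

golden_image_gb = 440
pvs_store_lun_gb = 800                              # new golden-image LUN per PVS server
store_free_gb = pvs_store_lun_gb - golden_image_gb  # 360 GB for future growth

print(write_cache_gb, headroom_gb, store_free_gb)   # 1720 100 360
```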
How to change the cache mode for an existing production vDisk:
- Provision the new storage drive as explained above and assign it the same drive letter on PVS1 and PVS2, since they are load balanced.
- Log into PVS server 1 (initially make the changes on only one PVS server).
- Copy the golden image (.VHD and .PVP files) to the new drive.
- Defragment the vDisk: right-click the VHD, select Mount, and defragment the mounted volume -> This step follows best practice for optimal performance.
- Create a new store on the PVS server and map its path to the new drive.
- Import the vDisk into the new store.
- Change the vDisk to Private Image mode.
- Create a new device or modify an existing one (to boot the vDisk in Private Image mode).
- Go to the device properties in the device collection and change the vDisk path (to the newly imported vDisk) -> Make sure to change the boot option from Network to vDisk.
- Go to VM Settings -> Remove the DVD drive under the IDE controller (the write-cache drive needs to be mapped as the D: drive).
- Under the IDE controller, create one 20 GB fixed-size HDD and assign it to the HSD (this becomes the VM's write-cache drive).
- Power on and log into the VM, initialize the new disk as MBR and format it -> Assign drive letter D (volume name "Write Cache"), and confirm the vDisk is in Private mode via the vDisk tray icon.
- Restart VM.
- Optionally, as a best practice, redirect the page file to the write-cache drive (the page file is planned as 4 GB after referring to a few blogs).
- After the VM restarts, go to the page file settings and configure a 4 GB page file on the D: drive only.
- Go to the PVS server -> vDisk Pool -> Change the vDisk from Private to Standard Image mode (note: the cache options are visible only in Standard Image mode).
- Change the vDisk cache to "Cache on RAM with overflow to HDD" and assign 4 GB of RAM to the cache (4 GB was decided after referring to a few blogs).
- Boot the HSD again and observe the cache status (it should show "Cache on RAM with overflow to HDD" and the vDisk as read-only).
- Assign this vDisk to a test VM (device) and ask users to test.
- If the image is fine, shut down the devices attached to the vDisk -> Copy the VHD and PVP files (do not copy the .LOK files) to the second PVS server -> Make sure replication shows green.
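Before checking the replication status, the manually copied vDisk files can be verified to match on both PVS servers by comparing checksums. A minimal sketch; the file paths shown are hypothetical examples, not the environment's actual store paths.

```python
# Verify a manually copied vDisk file matches on both PVS servers by
# comparing SHA-256 hashes (sketch; paths below are hypothetical examples).
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a large file in fixed-size chunks so memory use stays flat."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

src = Path(r"E:\Store\GoldenImage.vhd")         # new store on PVS1 (example)
dst = Path(r"\\PVS2\E$\Store\GoldenImage.vhd")  # copy on PVS2 (example)

if src.exists() and dst.exists():
    if sha256_of(src) == sha256_of(dst):
        print("Copies match")
    else:
        print("Copy differs - recopy before bringing PVS2 into rotation")
```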
To replicate to the other devices (VMs):
- Follow the same steps 11 through 15 above (creating the 20 GB write-cache disk through configuring the page file) for the rest of the VMs.
- Reboot the HSDs.
- Test HSD accessibility and the cache configuration.
Since the exact root cause was not found, the following vendor-provided and self-analyzed explanation is the likely cause of the abrupt, intermittent Stream Service terminations:
When each target device boots up, the OS is not aware of the write cache and writes to the logical disk it is presented with (the vDisk). The PVS driver then redirects this data at the block level to the write cache, which in this environment is held on the PVS server. When the BNIStack driver (the transport stack for communicating with the PVS server) is initialized, it pulls down chunks of the vDisk as and when they are needed from the vDisk store, which is also on the PVS server. The BNIStack driver also accesses the write cache directly when it needs to issue additional writes from the target device. This communication is carried out between the BNIStack driver on the target device and the Stream Service on the PVS server. The case notes state that the environment has approximately 2000 users, which I suspect puts a large overhead on the Stream Service, since it is responsible for servicing all write-cache requests.
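The block-level redirection described above can be modeled as a simple copy-on-write overlay: reads come from the shared read-only vDisk unless that block was previously written, in which case the device's private write cache wins. This is a conceptual sketch only, not the actual PVS driver logic.

```python
# Conceptual model of PVS write-cache redirection (copy-on-write overlay).
# Not the real driver: just illustrates why writes never touch the shared
# vDisk and why every write-cache request becomes server-side work when
# the cache is held on the PVS server.

class CowDisk:
    def __init__(self, vdisk_blocks):
        self.vdisk = vdisk_blocks   # shared, read-only golden image
        self.write_cache = {}       # per-device private write cache

    def write(self, block_no, data):
        # Writes are redirected to the write cache, never to the vDisk.
        self.write_cache[block_no] = data

    def read(self, block_no):
        # Modified blocks shadow the originals; everything else comes
        # from the shared vDisk.
        return self.write_cache.get(block_no, self.vdisk[block_no])

disk = CowDisk(["base0", "base1", "base2"])
disk.write(1, "changed")
print(disk.read(0))  # base0   - unmodified, served from the vDisk
print(disk.read(1))  # changed - served from the write cache
```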
The target device's BNIStack driver is also responsible for retries (because UDP provides no retransmission). The base timeout for a packet is 10 seconds. If the server responds quickly, this value is halved repeatedly, down to a floor of 1 second. Correspondingly, if the server responds slowly, the timeout doubles, up to a ceiling of 10 seconds.
A retry timeout of 1 second or less may cause excessive I/O retries, leading to slow response and hanging target devices, which ultimately leads to a Stream Service failure. Since we are seeing the Stream Service hang and crash, and we already know about the large write-cache overhead placed on it, I suspect the load on the Stream Service is too great, eventually grinding it to a halt. This then presents itself as the symptoms seen at the target devices, e.g. slow logins and sluggishness inside the session.
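The adaptive timeout behavior described above can be sketched as a halving/doubling scheme between the 1-second floor and the 10-second base. This is an illustration of the mechanism as described, not the BNIStack driver's internal code.

```python
# Sketch of the adaptive retry timeout described above: start at 10 s,
# halve on fast responses down to a 1 s floor, double on slow responses
# back up to the 10 s ceiling. (Illustrative; the real logic is internal
# to the BNIStack driver.)
BASE_TIMEOUT_S = 10.0
MIN_TIMEOUT_S = 1.0

def next_timeout(current: float, server_responded_fast: bool) -> float:
    if server_responded_fast:
        return max(MIN_TIMEOUT_S, current / 2)
    return min(BASE_TIMEOUT_S, current * 2)

t = BASE_TIMEOUT_S
for _ in range(5):  # five consecutive fast responses
    t = next_timeout(t, True)
print(t)            # 1.0 - at the floor, where excessive retries can begin
```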
As explained above, the recommended mitigation is to move the write cache off the PVS server and place the processing overhead on the target devices instead.
Recommendations & Reference
- We have been observing the environment for about 2 months now, and the issue has not occurred since we disabled the antivirus services.
- As a best practice, upgrade all environment components (servers and target device software) to a supported level: either PVS 7.6 LTSR CU3 or PVS 7.13 CR.
- FYI: the RAM cache recommendation is to start with 256-512 MB for desktop operating systems and 2-4 GB for server operating systems.
- PVS RAM cache considerations -> https://www.citrix.com/blogs/2015/01/19/size-matters-pvs-ram-cache-overflow-sizing/
- PVS antivirus best practices -> https://support.citrix.com/article/CTX124185