Issue:

The PVS Stream Service terminates abruptly and intermittently (approximately once a month), which causes user sessions to freeze and leaves users unable to launch HSDs.

Environment:

2 Citrix PVS servers (VMs), version 7.6
2000-3000 concurrent users
86 HSDs and 6 golden images
Microsoft Hyper-V 2012 R2 (15 nodes) on Cisco UCS

Observations:

  • The issue occurs once or twice a month with no common pattern in day or time, and it affects both PVS servers at the same time.
  • No changes had been made to the environment.
  • The onsite engineer reported that the issue had existed for 3 months and was resolved each time by restarting the PVS servers.
  • One day the issue recurred and restarting the PVS servers did not resolve it -> issue escalated to the support team (me).
  • Observed Event ID 11: “Detected one or more hung threads, DbAccess error: <Record was not found> <-31754> (in ServerStatusSetDeviceCount() called from SSProtocolLogin.cpp:2903)” -> indicates hung threads in the Stream Service and DB access errors.
  • Observed multiple vDisk retries on the affected target devices: 11 at boot time and approximately 611 per hour during the session.
  • Observed that the recommended McAfee exclusions were not in place -> stopped the McAfee service and restarted the PVS server -> the PVS Stream Service was stable for some time on one PVS server and then terminated again -> due to time constraints, logged a call with the vendor (Citrix).
  • After 2 hours, Citrix support joined the call and started collecting CDF traces and procdump captures for the terminating Stream Service.
  • After a few hours the issue resolved on its own, and Citrix support could not find a root cause in the collected logs.
  • Over the next 2 months the issue recurred twice, and the customer was frustrated that no root cause had been found for the intermittent Stream Service terminations.
  • The support team (myself) analyzed the environment and observed that the cache mode was configured as “Cache on Server”, which is not recommended for production environments. The best practice is “Cache on RAM overflow to HDD”, which reduces load on the PVS server and improves performance -> shared this observation with Citrix support and requested their input.

Explained to the customer that missing best practices can lead to this type of intermittent issue. Since no root cause had been found and keeping the cache on the server is not a best practice in a production environment, prepared a plan to change the cache configuration to “Cache on RAM overflow to HDD”.

Current PVS storage configuration for cache:

PVS1 (VM) -> 1700 GB allocated through virtual HBA (total golden image size is 440 GB; the remainder is for write cache)

PVS2 (VM) -> 1700 GB allocated through virtual HBA (total golden image size is 440 GB; the remainder is for write cache)
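
As a quick check of the figures above, the space left for write cache on each PVS server can be worked out with a few lines of Python. This is only a rough sketch: the function name is made up for illustration, and it assumes the golden images are the only other consumer of the allocated disk.

```python
# Figures taken from the current storage configuration (all values in GB).
ALLOCATED_GB = 1700      # presented to each PVS VM through the virtual HBA
GOLDEN_IMAGES_GB = 440   # combined size of the golden images

def write_cache_space_gb(allocated_gb: int, images_gb: int) -> int:
    """Space left for write cache on one PVS server.

    Assumption: nothing other than the golden images uses the disk.
    """
    return allocated_gb - images_gb

per_server = write_cache_space_gb(ALLOCATED_GB, GOLDEN_IMAGES_GB)
print(f"Write cache space per PVS server: {per_server} GB")
```

With the numbers listed, this leaves 1260 GB per PVS server for write cache while the cache stays on the server.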

Proposed storage configuration change:

After referring to multiple blogs, the write cache proposed for all images (profiles) is 20 GB per HSD. For 86 HSDs this comes to 1720 GB (86 x 20 GB), and it should be presented to the complete Hyper-V cluster because the HSDs are hosted on the cluster.
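
The sizing can be sanity-checked with a small Python sketch. The function name and the optional headroom parameter are assumptions added for illustration. The straight product of 86 HSDs and 20 GB is 1720 GB, and any figure above that would be extra headroom chosen by whoever does the sizing.

```python
HSD_COUNT = 86               # number of HSDs in the environment
WRITE_CACHE_PER_HSD_GB = 20  # proposed write cache disk per HSD

def total_write_cache_gb(hsd_count: int, per_hsd_gb: int, headroom_gb: int = 0) -> int:
    """Total write cache storage to present to the Hyper-V cluster.

    headroom_gb is a hypothetical extra buffer; it defaults to 0 so the
    result is the plain product of HSD count and per-HSD cache size.
    """
    return hsd_count * per_hsd_gb + headroom_gb

required = total_write_cache_gb(HSD_COUNT, WRITE_CACHE_PER_HSD_GB)
print(f"Write cache storage required: {required} GB")
```

If the number of HSDs or the per-HSD cache size changes later, only the two constants need to be updated.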