Citrix Provisioning, WriteCache and the impact of “The After reboot jobs”

Edwin de Bruin
Apr 3, 2023
4 min read

A while ago my friend and colleague JP Ruitenbeek and I were “flown in” to investigate performance issues at a customer. We found multiple issues, but in this post we focus on a specific one: “The After-reboot jobs.”

The Start: Lately, the IT department of this customer had received a lot of complaints about performance, latency and login issues. The users were simply losing trust in the environment.

To sum up some of the complaints:

· Login times very high

· Slowness within the desktop

· Black screens

· Connection issues (Storefront errors and unregistered devices)

Since neither of us is primarily involved with this customer, we could start fresh.

Get the emotion out of the equation and start measuring at multiple levels.

The problem when an issue has been running for a longer period of time is that there are usually a lot of emotions and probably assumptions. It’s the network, we’re not on the latest version, buy new hardware! My tip: start measuring. Maybe you need new hardware, maybe it's the version, but just validate it.

Example: When you have a kid at home t(y)elling you “I have a fever, I am sick”, the first thing you normally do is get the thermometer and get an idea of what’s going on (ideally you would do this 24/7 so you could see a breaking trend and receive an alert, but the kids don’t accept a sensor 24/7). What I'm trying to say is: you don't just assume it is correct, you always validate.

That's the same with these kinds of issues. I found a cool meme which is very accurate:

So, we needed to start capturing data. Which tools did we use?

SexiGraf – graphs the hardware virtualization layer
InfluxDB – an additional database to store the captured data, which we could add as an extra layer to the SexiGraf dashboards.

The finding, WriteCache filling up

When we started a new session with our test user, we opened up the PVS status tray:

The Citrix WriteCache is already quite full. Almost half is gone at the start of the session and, as most Citrix engineers know, a WriteCache that fills up completely means a crash of the user session.

Can we graph this to get the whole picture? How does this develop during the day? Does this happen to multiple users? I wrote a script to collect the data and inject it into InfluxDB so we get a visualization:

Well, this is problematic: as you can see here, the issue is way bigger. Remember, 0% free means a session crash. Multiple dots are near or on the 0% line. That explains the unregistered state and connection errors.
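The script itself is not part of this post. As a rough sketch of the idea only: the Python below turns free-space readings into InfluxDB line protocol and flags machines close to the 0% line. The measurement name `pvs_writecache`, the field `free_pct`, the 10% threshold, the database name and the URL are all made-up assumptions, and the write helper assumes an InfluxDB 1.x style `/write` endpoint. How the free percentage is read from each target is left out on purpose.

```python
import urllib.request
from typing import Dict, List

MEASUREMENT = "pvs_writecache"  # hypothetical measurement name


def escape_tag(value: str) -> str:
    """Escape commas, equals signs and spaces, which are special in tag values."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def to_line(host: str, free_pct: float, timestamp_s: int) -> str:
    """Build one line-protocol record: measurement,tag=value field=value timestamp."""
    return f"{MEASUREMENT},host={escape_tag(host)} free_pct={float(free_pct)} {int(timestamp_s)}"


def near_empty(samples: Dict[str, float], threshold_pct: float = 10.0) -> List[str]:
    """Return the hosts whose free write cache is at or below the threshold."""
    return sorted(h for h, free in samples.items() if free <= threshold_pct)


def write_lines(lines: List[str],
                base_url: str = "http://localhost:8086",  # assumed URL
                database: str = "citrix") -> int:         # assumed database name
    """POST records to an InfluxDB 1.x style /write endpoint (second precision)."""
    url = f"{base_url}/write?db={database}&precision=s"
    body = "\n".join(lines).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Made-up readings, only to show the shape of the data.
    samples = {"VDI-001": 52.0, "VDI-002": 4.5, "VDI-003": 0.0}
    now = 1700000000  # fixed example timestamp in seconds
    for host, free in samples.items():
        print(to_line(host, free, now))
    print("near empty:", near_empty(samples))
```

Run on a schedule, something like this gives one point per machine per interval, which is enough to draw the "dots near the 0% line" kind of graph described above.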

Why does the Write Cache get filled up so much?

Ivanti Automation:

Well, since this customer is using the Ivanti suite (Workspace Control and Automation), a usual suspect is the “After reboot jobs”, which add or remove software on the non-persistent VDI machine after the machine has booted.

Ah there we go:

I cannot share the content of these jobs, but one of them was to remove and reinstall Google Chrome and Adobe Reader... Sigh... I get why, but please just create a new image (or a new version of the vDisk; in my belief system you build a new one, but even adding a version to an existing vDisk is better than injecting this into the write cache).

But did you know Ivanti Workspace Control can also initiate Automation tasks? An example here:

Install software on login. As you can see, this MSI alone is almost 255 MB… Compressed….

So, we did some cleaning up and had some proper discussions with the customer about the tasks. Results are starting to show. We handed out homework to optimize even more but this is a good start.

Good news: We received the news that the crashed sessions/reconnect issues are dropping.

And the graph reflects this. We are not there yet, but it is promising nonetheless:

Wait, there’s room for a little more! (Feeling like the ending of LOTR ROTK already?)

As we started measuring, we noticed some strange peaks. We recorded the session count in InfluxDB so we could graph it in SexiGraf and correlate it with CPU Ready and disk latency.

1. Sessions really start to rise between 07:00 and 09:00 (not unexpected)

2/3/4. Strange peaks in CPU Ready and disk latency, which correlate with the boot schedule configured in Citrix. Some additional load is expected, but this is way too much.

Guess what… When the “After reboot Jobs” and Ivanti Workspace Control jobs were cleaned up… a massive drop in disk latency at the boot schedule, and in CPU Ready both at that same boot schedule and around 08:00 at user logins. The after-reboot and login tasks created a cascading effect... Mind the scaling on latency... (we are aware of the relatively high CPU Ready overall; maybe more on that in a next blog)
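We did this correlation by eye on the dashboards. If you want a number to go with it, a Pearson correlation between two series sampled at the same timestamps is one simple option. The sketch below is an illustration only: the sample values are invented, and it assumes both series are already aligned on the same time buckets.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long, aligned series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Invented example values per time bucket, not measurements from this customer.
    booting_machines = [0, 0, 40, 40, 5, 0]
    disk_latency_ms = [2.0, 2.5, 30.0, 28.0, 6.0, 2.2]
    print(f"correlation: {pearson(booting_machines, disk_latency_ms):.2f}")
```

A value close to 1 means the two series rise and fall together. It does not prove cause and effect, so treat it as a pointer for where to look next.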
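A side note on reading CPU Ready: vSphere exposes it as a summation counter in milliseconds per sample interval, so a conversion to a percentage makes values easier to compare. The sketch below assumes you are looking at that raw summation value with the 20-second real-time interval; your dashboard may already show a percentage, in which case no conversion is needed. The function and parameter names are made up for this example.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Convert a CPU Ready summation value (ms) to a percentage of the interval.

    interval_s is the sample interval of the chart (20 s for real-time data).
    vcpus spreads a VM-level total over its vCPUs to get an average per vCPU.
    """
    if interval_s <= 0 or vcpus < 1:
        raise ValueError("interval_s must be positive and vcpus at least 1")
    return ready_ms * 100.0 / (interval_s * 1000.0 * vcpus)


if __name__ == "__main__":
    # 1000 ms of ready time in a 20 s sample, on a single vCPU
    print(cpu_ready_percent(1000.0))
    # the same total spread over 4 vCPUs
    print(cpu_ready_percent(1000.0, vcpus=4))
```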

Conclusion:

Fun story and all, but what are you telling me?

Actually 2 things:

1. Start measuring and collect data, so you can talk about facts with the customer (and fellow engineers!)

How cool is it to show that the changes you make actually have an effect? To help you in the discussion about why to do things, or not? Even better would be to collect this data 24/7 so you always have it at hand. That was not the case here, so we used some custom tools, but there are products out there that will do this out of the box!

2. Don’t, I mean DON'T, use “After Reboot Jobs” or install software on login with non-persistent VDI if not absolutely necessary. There are valid scenarios, but if you do use them, know the why and plan to mitigate the impact... a simple “that’s the way we’ve always done it” should not suffice!

Any questions or remarks, or want to know how we captured some stuff? Please don't hesitate to let me know :-)
