In this post, I won’t discuss the merits – good or bad – of Citrix’s Session Reliability feature; that topic is best left to Citrix engineers. Instead, I’ll focus on the importance of understanding and managing the performance of the underlying network to ensure the best possible end-user experience, with an emphasis on potential connectivity issues that may – or may not – be related to the Session Reliability feature.
A common understanding is that Citrix Session Reliability may mask real network issues. This is true to a large extent: if a connection drops because of some fault on the network path between the Citrix client and server, Session Reliability will attempt to reestablish the connection so that the user can continue to work. The user may observe intermittent application freezes, after which their session resumes as normal. So the underlying “network error” – whatever that may be – may remain undetected, causing delays and degrading the experience of other users.
Whose error is the “network error”?
“Network error” is a loaded term, as is “network timeout”. Both actually have little to do with the network understood as a system of switches, routers, cables, access points and so on. When we speak about network connections in a Citrix context, we are referring to the TCP sessions over which Citrix communication occurs, and to the operating systems and applications that are responsible for establishing these connections – not “the network.” The network itself is responsible for carrying packets (IP over Ethernet, typically) that the client and server TCP stacks handle. When a TCP connection breaks, the Citrix session breaks and the end user’s application window disappears – but this has more to do with the application than with the network. The trick is that the “application” may actually be a packet scheduler on a router, a firewall’s packet inspection engine, load-balancer software, or the OS-level settings on clients and servers that decide how the TCP stack behaves. Each of these may affect how network packets are handled and may sometimes decide to break the connection.
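To make the “OS-level settings” point concrete, here is a minimal Python sketch of how a client can tune TCP keepalive behavior on a single socket. It illustrates that the endpoint – not the network – decides when a silent connection is declared dead. The specific option names (`TCP_KEEPIDLE` and friends) are Linux-specific assumptions; other platforms expose different knobs.

```python
import socket

def configure_keepalive(sock: socket.socket,
                        idle: int = 60,
                        interval: int = 10,
                        probes: int = 5) -> socket.socket:
    """Enable OS-level TCP keepalive probes on a socket.

    These settings illustrate how endpoint configuration, not "the
    network", decides when an idle connection gets torn down.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific knobs; guarded because other platforms name them differently.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)      # idle seconds before first probe
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)  # seconds between probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)      # failed probes before reset
    return sock
```

With the defaults above, a connection that stops responding would be reset by the client OS after roughly idle + interval × probes seconds – a decision made entirely at the endpoint.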
Session Reliability is designed to help in such situations, allowing a Citrix user to continue working despite intermittent connectivity issues. So, in the same way that doctors refine the benefits of different doses and combinations of medicine through repeated lab tests, Citrix admins and network performance engineers should refine how Session Reliability works for their end users.
An important point to remember is that Citrix Session Reliability is not the equivalent of the ICA keep-alive feature. Citrix online documentation and this blog are good sources of information on the differences between the two. The bottom line is that if you use Session Reliability, ICA keep-alives don’t matter; if you don’t use Session Reliability, ICA keep-alives are very important. However, from the network perspective and from the end-user perspective they are related: both are designed to let Citrix users continue working with the application should the TCP connection break.
Understanding the impact of Session Reliability
The impact of Session Reliability on end-user experience can be measured by observing Citrix session flows between end users and Citrix servers, recording Session Reliability hand-off events. This is a core metric that helps you understand how Session Reliability works in your environment: when, where and how frequently is it invoked.
Session Reliability events can be measured using a network probe with a specialized network protocol decode for the Citrix network protocols ICA and CGP. (In fact, CGP is responsible for session reliability and it tunnels ICA.) The advantage of this approach is that visibility into Session Reliability events can be gained for the entire population of Citrix users from a single passive inspection point, providing insight into the impact of “application freezes” on end-users – before they become frustrated – and at the same time exposing the underlying network issues masked by Session Reliability that may warrant further investigation.
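A specialized CGP decode is the robust way to record hand-off events, but the idea behind the metric can be illustrated with a much cruder proxy. The sketch below assumes a hypothetical flow-record format – `(client, start_ts, end_ts)` tuples for TCP connections to the CGP port (2598) – and counts cases where a client’s connection closed and a new one opened shortly afterwards, a plausible sign of a Session Reliability reconnect. This is an illustration of the counting logic only, not the actual probe implementation.

```python
from collections import defaultdict

def count_reconnects(flows, max_gap: float = 60.0) -> dict:
    """Count probable reconnect events per client.

    `flows` is an iterable of (client, start_ts, end_ts) tuples for
    hypothetical CGP (port 2598) connection records. A new connection
    starting within `max_gap` seconds of the previous one ending is
    counted as one reconnect event.
    """
    by_client = defaultdict(list)
    for client, start, end in flows:
        by_client[client].append((start, end))

    events = defaultdict(int)
    for client, spans in by_client.items():
        spans.sort()
        # Compare each connection's end with the next connection's start.
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if 0 <= next_start - prev_end <= max_gap:
                events[client] += 1
    return dict(events)
```

A per-client, per-location aggregation of exactly this kind of count is what lets you see when, where and how frequently Session Reliability is invoked.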
Given a user population of thousands, the occurrence of a small number of Session Reliability events is likely inconsequential. This is exactly what Session Reliability is designed to accomplish: it works to soften the rough edges of the whole user activity stream.
What about real network issues?
In order to confirm that Session Reliability is not masking important network issues, one needs to look at metrics that characterize how efficiently the network forwards data packets. Two key metrics are round-trip time, which reflects network path latency, and retransmission rate, which reflects the packet loss ratio on the network path. (For those looking to understand the mechanics of packet retransmissions: Gary Kaiser’s series of blog posts explains it in good detail.)
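The retransmission-rate metric itself is simple arithmetic once a probe has counted data packets and retransmitted segments per path; a minimal sketch:

```python
def retransmission_rate(total_data_packets: int, retransmitted: int) -> float:
    """Retransmission rate as a fraction of data packets on a path.

    On a reasonably clean path this approximates the packet loss ratio,
    since TCP retransmits roughly once per lost segment.
    """
    if total_data_packets == 0:
        return 0.0
    return retransmitted / total_data_packets
```

For example, 5 retransmissions out of 1,000 data packets gives a rate of 0.5% – which, depending on RTT, may already be enough to throttle TCP throughput noticeably.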
A breakdown by client location confirms that the network is generally clean: a marginal loss rate and low latency are reflected in the Network Performance metric, which synthesizes these two measurements into a single, easy-to-grasp KPI. For those who are curious: Network Performance reflects the percentage of application traffic transferred in optimal conditions, where optimal is defined by a combination of thresholds on RTT and loss rate, and a comparison to each location’s performance baseline.
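The synthesis described above can be sketched as follows. The threshold values here are illustrative assumptions (the real KPI also compares against each location’s baseline, which is omitted for brevity), and the `(bytes, rtt_ms, loss_rate)` sample format is hypothetical:

```python
def network_performance(samples,
                        rtt_threshold_ms: float = 150.0,
                        loss_threshold: float = 0.01) -> float:
    """Percentage of traffic (by bytes) transferred in 'optimal' conditions.

    `samples` is a list of (bytes, rtt_ms, loss_rate) measurement
    intervals. An interval is optimal when both RTT and loss rate are
    within their thresholds. Thresholds are illustrative; a production
    KPI would also factor in per-location baselines.
    """
    total = sum(b for b, _, _ in samples)
    if total == 0:
        return 100.0
    good = sum(b for b, rtt, loss in samples
               if rtt <= rtt_threshold_ms and loss <= loss_threshold)
    return 100.0 * good / total
```

Two equal-sized intervals, one within thresholds and one with excessive RTT, would score 50% – a compact way to express “half of this location’s traffic moved under degraded conditions.”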
Observation over a longer period of time – in order to analyze network performance trends – confirms what level of network performance is normal and how it changes over time. Such a view is useful to spot small but steady changes that may go undetected when performance baselines are used to detect anomalies. It is also valuable for confirming that corrective actions taken a couple of weeks ago had the desired effect on network performance.
Network performance and the Citrix user experience
Looking at these key network performance indicators – retransmission rate and round-trip time – is an effective method of determining whether network performance is really affecting the applications delivered over the network. Typically this is not the case, and Citrix Session Reliability is not a means of improving network performance. It does not mask network performance issues either – if we keep in mind our precise definition of what network performance means.
There are higher-level metrics that can be used to understand and confirm the performance delivered by the network to individual Citrix sessions. Realized Bandwidth is a precise measurement of the network throughput experienced by each client connecting to a Citrix server. This is how the client really experiences “network speed.” A common but perhaps dated Citrix rule of thumb is that available bandwidth should exceed 28 kbps per active user, but the reality of modern applications dictates that it should be above 128 kbps to ensure acceptable user experience while working with Citrix-delivered applications.
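The Realized Bandwidth computation and the rule-of-thumb check are straightforward; a minimal sketch, with the 128 kbps guideline taken from the text above:

```python
def realized_bandwidth_kbps(bytes_transferred: int, transfer_seconds: float) -> float:
    """Throughput actually delivered to a client, in kilobits per second.

    Derived from measured data volume and transfer time for a session flow.
    """
    if transfer_seconds <= 0:
        return 0.0
    return bytes_transferred * 8 / 1000 / transfer_seconds

def meets_guideline(kbps: float, per_user_min_kbps: float = 128.0) -> bool:
    """Check a session's throughput against the ~128 kbps per-user rule of thumb."""
    return kbps >= per_user_min_kbps
```

So a session that moved 16,000 bytes in one second realized exactly 128 kbps – right at the recommended minimum for an acceptable Citrix user experience.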
This Realized Bandwidth measurement relies on precise tracking of client-server session flows to measure data transfer times and volumes over the network. In this case we can clearly see that throughput experienced by the users is well above the minimum recommended values, once again proving that the network is not a performance culprit.
Although Session Reliability nicely addresses Citrix connectivity hiccups of various kinds, it is not a cure for connectivity failures along the application delivery chain. It is effectively a patch – very convenient for Citrix admins, but, like any patch, it requires management and refinement in close cooperation with network and application managers.
TCP connections are established by clients and servers, driven by applications that interface with TCP stacks and managed at the TCP session layer. This layer requires a different approach to monitoring and analysis: looking at packets and bytes is not enough. Instead, the monitoring tool has to reconstruct the TCP session flow for each monitored client-server connection, analyzing in real time each session’s TCP state model.
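To give a flavor of what “analyzing each session’s TCP state model” means, here is a deliberately simplified sketch. It assumes the monitor feeds it one flag label per observed packet (`"SYN"`, `"SYNACK"`, `"ACK"`, `"FIN"`, `"RST"`); a real tool tracks both directions, sequence numbers and timers, so treat this only as an illustration of the state-machine idea, not a working monitor.

```python
class TcpSessionTracker:
    """Toy per-connection TCP state tracker driven by observed packet flags."""

    def __init__(self):
        self.state = "CLOSED"

    def observe(self, flags: str) -> str:
        """Advance the session state for one observed packet; return new state."""
        if flags == "RST":
            self.state = "RESET"            # abortive close: the interesting "error" case
        elif flags == "SYN" and self.state == "CLOSED":
            self.state = "SYN_SENT"         # client opens the connection
        elif flags == "SYNACK" and self.state == "SYN_SENT":
            self.state = "SYN_RECEIVED"     # server answers the handshake
        elif flags == "ACK" and self.state == "SYN_RECEIVED":
            self.state = "ESTABLISHED"      # handshake complete, data may flow
        elif flags == "FIN" and self.state == "ESTABLISHED":
            self.state = "CLOSING"          # orderly shutdown begins
        elif flags == "ACK" and self.state == "CLOSING":
            self.state = "CLOSED"           # orderly shutdown complete
        return self.state
```

Even this toy model already distinguishes an orderly FIN/ACK close from an abortive RST – the distinction that matters when deciding whether a “TCP error” reflects a real problem.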
With such an analysis in place, we can look at TCP errors and analyze the real connectivity issues between clients and servers. We will have a look into this in the next blog post, where we’ll analyze why not every TCP “error” is an error that matters, and the value of understanding application logic in the analysis of TCP connectivity issues.