In last week’s post I looked at Citrix Session Reliability and its relationship to key network performance indicators. I concluded by saying that measuring TCP connectivity issues is essential to an understanding of the user’s overall experience, and also that connectivity (or availability) from the user’s perspective likely has little to do with the underlying network performance. Instead, connectivity needs to be analyzed with application flow specifics in mind, starting with TCP session behavior analysis and moving up toward the application layer. I’ll look at this approach in this blog.
It’s important to understand whether or not a TCP “error” is really an error that affects the application user. We’ll use the following illustration to emphasize the importance of focusing on what matters before calculating availability; this filtering can often make a big difference in the results. First, a look at a report showing TCP Connectivity along with TCP Errors.
The vast majority of server-side TCP session terminations initiated by TCP resets occur simply because the server closes idle client connections (I’ll discuss this in more detail later in this blog), so we can filter them out from the time chart. The picture changes significantly.
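The filtering step above can be sketched in a few lines. This is a minimal illustration, not any product's actual logic: the session records and field names (`end`, `idle_secs`) are hypothetical, and the idle threshold is an assumed value.

```python
# Hypothetical session records, roughly as a monitoring tool might export them.
# Field names and values are illustrative only.
sessions = [
    {"id": 1, "end": "FIN",   "idle_secs": 0},
    {"id": 2, "end": "RESET", "idle_secs": 620},   # server reclaiming an idle connection
    {"id": 3, "end": "RESET", "idle_secs": 2},     # reset right after activity: suspicious
    {"id": 4, "end": "RESET", "idle_secs": 900},   # another idle close
]

def is_idle_close(session, idle_threshold=300):
    # A RESET arriving after a long idle period is almost certainly the server
    # (or a middlebox) closing an abandoned connection, not a real failure.
    return session["end"] == "RESET" and session["idle_secs"] >= idle_threshold

raw_errors = [s for s in sessions if s["end"] == "RESET"]
real_errors = [s for s in raw_errors if not is_idle_close(s)]

print(f"raw TCP 'errors': {len(raw_errors)}, after filtering idle closes: {len(real_errors)}")
```

With these toy records, three of four sessions end in a RESET, but only one survives the filter, which is exactly the kind of shift the time chart shows.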
Understanding what the “error” really means is essential to assessing real availability problems from the application perspective. Distinguishing real connectivity issues from the “noise” helps prevent a common source of false alarms and the anxiety triggered by “there are a lot of errors, what should I do?”
Understanding the impact of connectivity issues: who and where?
Once we’re able to focus on the metrics that matter, the next step is to look at the big picture: is the TCP connectivity problem severe, affecting everyone, or is it isolated to some specific servers or locations? A look at all Citrix users from all locations at branch offices and the Internet will reveal whether connectivity issues are isolated or pervasive.
In this example, we see that TCP connectivity issues point to some locations more than others, and also that the overall volume of those errors is very low. We can interpret this to mean that TCP connections are generally stable. Sure there will always be some outliers, and therefore it is even more important to track trends and baselines to pick up anomalies.
Network equipment failures are very rare these days. Equipment like servers, switches, firewalls, load balancers, and WAN optimization controllers is reliable at the physical connectivity layer. Even with intermittently overloaded network devices, logical connections will be sustained; sessions may slow down, but what we will observe is some network packet drops. TCP reliability takes care of such packet loss via retransmissions, and the application performance monitoring tool should report it under a metric such as Loss Rate (or retransmission rate – a subtle difference that I may explain another time). Although retransmissions are generally bad – and can sometimes severely impact application response time and end user experience – they are rarely the cause of connectivity issues. The chart below shows no connectivity problems even in the presence of a higher-than-normal packet loss rate.
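The distinction between loss and connectivity can be made concrete with a toy calculation. The per-interval counters below are invented for illustration; the point is simply that a lossy interval need not show any connection failures.

```python
# Toy per-interval counters, loosely modeled on what a network probe reports.
intervals = [
    {"pkts": 10_000, "retrans": 20,  "failed_conns": 0},
    {"pkts": 9_500,  "retrans": 310, "failed_conns": 0},  # lossy interval, zero failures
    {"pkts": 11_000, "retrans": 15,  "failed_conns": 0},
]

for i, iv in enumerate(intervals):
    # Retransmission rate: retransmitted packets as a share of all packets seen.
    rate = 100.0 * iv["retrans"] / iv["pkts"]
    print(f"interval {i}: retransmission rate {rate:.2f}%, connection failures {iv['failed_conns']}")
```

The middle interval is an order of magnitude lossier than its neighbors, yet every connection still completes, mirroring the chart described above.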
Why do connectivity errors occur?
Connectivity issues that impact users typically occur at the TCP connection level – connections are abandoned, closed or reset as a conscious action by the TCP stack on the client or server. Let’s take a closer look.
Clients generally initiate “graceful” TCP connection closures via a TCP FIN packet, and both server and client agree on and confirm the closure. When the server terminates the connection, it usually sends a TCP RESET packet, which can be reported as a “server session termination error”. However, this typically occurs not because of an error, but because the server no longer sees the client – for example, the client strayed out of WLAN range – so the server chooses to close the connection. As a good application citizen, the server notifies the client (just in case the client is still listening) by sending the TCP RESET message.
So a TCP RESET from the server should typically not be perceived as an indicator of an application failure. Application performance monitoring should measure these resets and track baselines, but not report them as connectivity errors – as illustrated in the screenshot below.
Far less common are cases where a server sends a RESET because of an application failure or security violation – these are quite rare, but can still occur. Therefore, looking just at the “server session termination error” (or TCP RESETs from the server) is not enough to distinguish desired connection termination from abnormal behavior. Application-level metrics are needed to supplement this data; for Citrix, these are errors signaled in ICA messages and Session Reliability events when an affected client reconnects to the Citrix server.
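The classification logic this implies can be sketched as follows. The function and its inputs are hypothetical; in particular, `app_error_signaled` stands in for whatever application-level signal (such as an error in an ICA message) the monitoring layer can supply, and the idle timeout is an assumed value.

```python
def classify_server_reset(idle_secs, app_error_signaled, idle_timeout=600):
    """Classify a server-sent TCP RESET.

    TCP-level data alone cannot separate a routine idle close from a genuine
    failure, so an application-level signal is used as the tie-breaker.
    """
    if app_error_signaled:
        # The application itself reported a problem: treat as a real failure.
        return "application failure"
    if idle_secs >= idle_timeout:
        # Long-idle connection reset by the server: expected housekeeping.
        return "idle close (benign)"
    # Reset shortly after activity with no application error: needs correlation
    # with reconnect events before it can be judged either way.
    return "suspect - correlate with reconnect events"
```

A usage example: `classify_server_reset(900, False)` lands in the benign bucket, while `classify_server_reset(5, True)` is flagged as a failure.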
Note that “server” here can be any device that terminates the TCP session. It does not have to be the Citrix server – it can be a stateful firewall or load balancer in the path between the Citrix server and the client. As a result, there may be many possible conditions, including:
- The firewall closes an idle client connection because the client disappeared from the network (in the case of a stateful firewall)
- The load balancer closes the client connection for the same reason (the load balancer may leave the server connection open at the same time or close it gracefully)
- The server closes an idle client connection
- The server closes a connection because the application decided the user should be logged off due to inactivity
There are many similar situations where some combination of these results in an element of the application delivery chain deciding to close a connection it deems no longer needed. WLAN roaming is a typical case resulting in the abandonment of TCP connections, requiring these abandoned connections to be closed by a RESET. Idle connection timeout settings on firewalls, servers and load balancers are typical triggers of these connection resets. Inefficient – often detrimental – values may be configured for these timeouts in multiple places, and these configurations are frequently the root cause of undesirable connection resets, ultimately impacting the end user.
Consider this: Citrix Session Reliability needs nearly 30 seconds to detect loss of connectivity to the server (e.g., because the device lost the WLAN connection and checked in at another access point). After detecting this, it initiates an attempt to reconnect. The reconnect itself may take 100 seconds or more in the worst case. If there is a well-behaved stateful firewall in between with a relatively low connection timeout configured (this is common to conserve resources), it may decide to gracefully close the connection to the server because the client is no longer connected. When the client’s Session Reliability algorithm finally reconnects to the firewall, and the firewall opens a connection to Citrix again, Citrix no longer has a session for the client to reconnect to. The effect? The user will have to log on again. The solution? Set appropriately lengthy idle connection timeout values on the firewall, allowing the server to determine when to close an idle connection.
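The arithmetic behind this scenario is worth making explicit. Using the figures above (roughly 30 seconds to detect the loss, up to 100 seconds to reconnect in the worst case), a quick sketch shows which firewall idle timeouts survive a roaming event; the safety margin is an assumption for illustration, not a Citrix-specified value.

```python
# Figures taken from the scenario described above.
SR_DETECT_SECS = 30           # Session Reliability loss-detection time
SR_RECONNECT_WORST_SECS = 100  # worst-case reconnect duration

def firewall_timeout_is_safe(idle_timeout_secs, margin_secs=30):
    # The firewall must hold the idle server-side connection at least as long
    # as the full detect-and-reconnect cycle, plus some margin; otherwise the
    # Citrix session is torn down before the roaming client returns.
    return idle_timeout_secs >= SR_DETECT_SECS + SR_RECONNECT_WORST_SECS + margin_secs

print(firewall_timeout_is_safe(60))   # a low timeout: the session is lost
print(firewall_timeout_is_safe(300))  # a generous timeout: roaming survives
```

In other words, any firewall idle timeout below about 160 seconds can silently defeat Session Reliability in this scenario.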
Controlling connectivity issues
A successful connection means that a TCP session has been established and data exchange occurred between client and server. It is important to distinguish such desired cases from situations where a connection is established, but no communication occurs and the connection is eventually reset or closed by the server. Again, looking at TCP session flow from the application perspective is essential to understanding real connectivity issues.
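That distinction reduces to a simple rule, sketched below. The function and its parameters are hypothetical, standing in for whatever per-session fields a monitoring tool exposes.

```python
def connection_outcome(established, bytes_from_client, bytes_from_server, end):
    # "Established" alone is not success: a session that opens, exchanges
    # nothing, and is then reset should count toward real connectivity errors.
    if not established:
        return "failed"
    if bytes_from_client == 0 and bytes_from_server == 0 and end == "RESET":
        return "established but unused (error)"
    return "successful"

# Usage: a normal session versus an empty one that was reset.
print(connection_outcome(True, 1_200, 48_000, "FIN"))
print(connection_outcome(True, 0, 0, "RESET"))
```

Only the second category should feed an availability calculation; the first is ordinary traffic.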
Network equipment needs to be configured with parameters that are reasonable for Citrix environments. For example, Citrix servers (or NetScalers) should allow connections to remain active for some time, even if the client disappears intermittently. Make sure the Network team is aware of these Citrix connectivity requirements, including the client throughput requirements.
The user’s interaction with the application is also important to understand.
- Unless we are speaking about a kiosk-like point of presence for an application, Citrix should be configured to close inactive sessions for two reasons: security and conserving Citrix server resources. End users should be aware that they may find their application gone when they return from a coffee break; that’s normal behavior.
- The application itself (e.g., SAP) will typically log off a user if a period of inactivity is detected. From an end user perspective this may look like “I have been disconnected again.” The combination of timeouts may cause a sequence of prompts that results in user frustration. For example, a user may reconnect to a terminated Citrix session only to find an application window that says “your session has been closed due to inactivity, please log in again.”
Since we don’t expect end users to become experts in the subtleties of Citrix and application timeout values, it is important to configure these carefully. When the discussion comes down to analysis of specific network session behavior, network packet level evidence can be priceless. See Mike Hick’s blog post on the power of this approach.
The benefits of a Citrix solution are well-understood, and in spite of seemingly significant complexity, understanding connectivity behavior is really a matter of visibility. Monitoring the activity and errors for all users at both the Citrix and application layers provides a unified perspective for understanding whether network quality and capacity issues exist or are imminent, and whether potential misconfigurations along the application delivery chain are affecting the user’s experience. Separating these from more traditional application delays is an important first step in performance analysis.