Co-written by: Kris Ziemianowicz
It’s always been true that vendors like to tout impressive performance numbers; if nothing else, it makes competitive marketing quite simple (and simplistic). 40Gbps throughput! Line-rate SSL decryption! Unlimited storage! It follows that vendors adopting this approach will choose measurement criteria that best highlight their point, even though those criteria may not reflect real-world network and application requirements.
Remember the router wars of the early 1990s? Kevin Tolly and Ed Mier built careers and companies on lab evaluation and certification services, with the goal of validating marketing claims by standardizing and documenting test criteria; packet size and interface mix were often factors that could make or break the marketability of the results. “Winning” to me seemed very rewarding; I worked for a small router company that claimed greater throughput than Cisco, claims that were briefly proven in Ed’s lab. What happened to that company? In today’s lexicon, it failed fast; raw speed couldn’t make up for limited abilities – such as usability, manageability, serviceability, scalability, and reliability. (To say nothing of the burning smell often emanating from the chassis.)
Whether you call the solution AA NPM, Transaction-centric NPM, probe-based APM, or something similar, the question of solution sizing will at some point become important. For this blog, I’ve asked Kris Ziemianowicz, our senior DC RUM product manager, to put some context around the performance numbers often associated with probe-based application and network monitoring solutions, and expand the discussion to include a broader perspective of what we consider scalability, with the emphasis on “ability.” (I’ve also injected some of my own commentary where I thought emphasis would be valuable.)
What is throughput?
We’re all familiar with the term throughput, probably from considering the performance of routers or switches, where throughput is the rate of successful packet forwarding. We also know that this measurement depends on a number of factors such as packet size and traffic type, factors that contribute to the amount of processing required to make forwarding decisions. For an AA NPM monitoring solution, throughput can be taken to mean something quite similar: the traffic rate that the analysis engine can process. But unlike routers and switches, which are fundamentally stateless because they forward each packet independently, most monitoring solutions (certainly those that are relevant to this discussion) must create and maintain state tables for traffic analysis. Clearly, then, the characteristics of the analysis will weigh heavily on throughput. If the analysis is rather light – let’s say the probe is simply counting packets – throughput will be greater than if the analysis is complex – let’s say the probe is parsing variable-length transaction parameters and user names from application payload.
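To make the contrast between light and complex analysis concrete, here is a minimal sketch in Python. The class names are hypothetical and this is not how any particular probe is implemented; it only shows that counting packets needs a single counter, while payload analysis needs state that grows with every concurrent flow.

```python
from collections import defaultdict


class PacketCounter:
    """Light analysis: one counter, no per-flow state."""

    def __init__(self):
        self.packets = 0

    def observe(self, flow_id, payload: bytes) -> None:
        self.packets += 1


class FlowTracker:
    """Heavier analysis: a state table keyed by flow (hypothetical design)."""

    def __init__(self):
        # One reassembly buffer per flow; memory grows with concurrent flows.
        self.buffers = defaultdict(bytearray)

    def observe(self, flow_id, payload: bytes) -> None:
        # Append this packet's payload to the flow's buffer so that
        # application content can be parsed once enough has arrived.
        self.buffers[flow_id] += payload


counter = PacketCounter()
tracker = FlowTracker()
for flow_id, payload in [("a", b"GET /"), ("b", b"POST /x"), ("a", b"index")]:
    counter.observe(flow_id, payload)
    tracker.observe(flow_id, payload)
```

After these three packets the counter holds a single integer, while the tracker holds two buffers that must be kept, searched, and eventually expired. That bookkeeping is the extra work that lowers throughput.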
Even with similar analysis, traffic characteristics matter. Using large TCP packets will usually inflate reported performance measurements since there are fewer interrupts to service and less content reassembly work. (In fact, artificial traffic “shaping” could inflate throughput results by more than 100%.)
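A quick back-of-the-envelope calculation shows why packet size matters so much. The sketch below assumes a hypothetical 10 Gbps of monitored traffic and ignores Ethernet framing overhead; the function name and the packet sizes are illustrative choices, not figures from any test.

```python
# Illustrative arithmetic only: how packet size changes the packet rate
# a probe must handle at a fixed bit rate. All numbers are assumptions.

def packets_per_second(bits_per_second: float, packet_bytes: int) -> float:
    """Return the packet rate needed to carry the bit rate at one packet size."""
    return bits_per_second / (packet_bytes * 8)


LINK_BPS = 10e9  # hypothetical 10 Gbps of monitored traffic

for size in (1500, 512, 64):
    pps = packets_per_second(LINK_BPS, size)
    print(f"{size:>5}-byte packets: {pps:,.0f} packets/s")
```

At the same bit rate, 64-byte packets arrive more than twenty times as often as 1500-byte packets, so a test run entirely with large packets asks far less per-packet work of the probe.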
So the implication is clear: performance measurements should reflect real-world environments, not artificial lab tests designed to inflate numbers (and egos). At Dynatrace, we design performance tests based on customer-observed application traffic patterns and profiles, while applying the full range of transaction-centric analytics that are unique to DC RUM. So when we publish throughput values for HTTP analysis, for example, the number is driven by real-world web traffic, applying analytics including full page flow re-assembly, content decompression, user name recognition and tracking across all pages, parameter parsing from requests and responses, and parsing errors from the response content using pattern matches. Unfortunately, many of these capabilities are not available elsewhere, making direct throughput comparisons difficult.
In recent blogs, we’ve highlighted some of the important value propositions inherent in DC RUM’s transaction-centric approach to NPM. These include:
- Business collaboration
- Development collaboration
- End-user experience driven problem resolution
Important here is the emphasis on “metrics that matter,” where more value is delivered by analyzing and surfacing a handful of clearly actionable performance insights than would be by offering hundreds or even thousands of metrics of questionable value. These insights and benefits quickly overshadow any focus on raw throughput numbers.
Scale in the real world
Let’s expand the conversation beyond simple throughput and get at the core issue of scalability. There are more practical approaches to monitoring large distributed networks and complex application landscapes than simply building a bigger probe. In many cases, deep transaction visibility may not be necessary for all applications; instead, NetFlow has proven to be a valuable and well-understood source of application flow and volume metrics, with NBAR2 offering comprehensive and quite granular application classification.
There are two basic cases where NetFlow is frequently used to complement DC RUM:
- Visibility into large volumes of traffic within a data center, providing basic who, what, when, where, and how much insight into network traffic. This approach is used to complement DC RUM’s transaction-centric analysis applied to critical business applications.
- Visibility into intra-branch traffic, and inter-branch traffic where branches can communicate directly with each other (and the Internet). Since the traffic doesn’t flow through a central point or data center, NetFlow offers core visibility without the requirement for deploying probes.
Of course, NetFlow doesn’t offer the deep transaction-centric insights that are core to DC RUM’s value. But for DC RUM, integration with NetFlow isn’t an either-or proposition; where both a probe and NetFlow report on the same traffic, sophisticated deduplication algorithms surface the richer (i.e., probe-based) measurements for reporting and alerting.
Given the combination of probe-based measurements and NetFlow, we have customers monitoring two data centers and 1500 branch offices using just two network probes (one per data center).
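The deduplication idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not DC RUM's actual algorithm: the `Measurement` fields, the `SOURCE_PRIORITY` table, and the sample addresses are all hypothetical. When a probe and NetFlow both report the same flow in the same interval, the probe-based record is kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    flow_key: tuple   # hypothetical (client, server, port) identifier
    interval: int     # start of the reporting interval
    source: str       # "probe" or "netflow"
    byte_count: int


# Higher number wins when two sources report the same flow and interval.
SOURCE_PRIORITY = {"netflow": 0, "probe": 1}


def deduplicate(measurements):
    """Keep one measurement per (flow, interval), preferring the richer source."""
    best = {}
    for m in measurements:
        key = (m.flow_key, m.interval)
        current = best.get(key)
        if current is None or SOURCE_PRIORITY[m.source] > SOURCE_PRIORITY[current.source]:
            best[key] = m
    return list(best.values())


records = [
    Measurement(("10.0.0.5", "10.1.0.9", 443), 0, "netflow", 1200),
    Measurement(("10.0.0.5", "10.1.0.9", 443), 0, "probe", 1180),
    Measurement(("10.2.0.7", "10.3.0.4", 80), 0, "netflow", 640),
]
result = deduplicate(records)
```

Here the first flow is reported by both sources, so only its probe-based record survives, while the branch-to-branch flow seen only by NetFlow is retained as is.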
While scaling up may boost egos, scaling out builds character. There are many practical justifications for a scale-out approach that warrant the additional development and architectural investments that Dynatrace has made in DC RUM.
In the case of a single data center, DC RUM can load-balance traffic between multiple network probes and consolidate the measurements at reporting/alerting time. No matter how large the probe, there will always be cases where a single probe cannot provide the required throughput. This load-balancing approach becomes even more important in the case of multiple data centers, where probes must be distributed but measurements consolidated into a single view of application and network performance.
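One common way to split traffic across several stateful analyzers is to hash each conversation to a probe, so that both directions of a flow land on the same device. The Python sketch below illustrates that general technique only; the function name and parameters are hypothetical, and nothing here describes how DC RUM or any specific load balancer distributes traffic.

```python
import hashlib


def probe_for_flow(ip_a: str, port_a: int, ip_b: str, port_b: int,
                   probe_count: int) -> int:
    """Map a conversation to a probe index, independent of packet direction."""
    # Sort the two endpoints so client->server and server->client packets
    # produce the same key and therefore reach the same probe.
    endpoints = sorted([(ip_a, port_a), (ip_b, port_b)])
    digest = hashlib.sha256(repr(endpoints).encode()).digest()
    return int.from_bytes(digest[:8], "big") % probe_count


request_side = probe_for_flow("10.0.0.5", 51000, "10.1.0.9", 443, 3)
response_side = probe_for_flow("10.1.0.9", 443, "10.0.0.5", 51000, 3)
```

Because requests and responses map to the same index, each probe sees complete conversations and can maintain its own state tables, leaving only the resulting measurements to be consolidated.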
For DC RUM, multiple geographically dispersed probes appear as a single distributed probe from a reporting and alerting perspective. Similarly, multiple report servers can be connected into a cluster that, from the end user perspective, looks like a single report server. These scale-out capabilities address real requirements from customers with multiple data centers, avoiding ugly alternatives such as separate monitoring systems or forwarding raw packet data to a remote probe.
Fault-tolerance has to go hand in hand with scalability
The requirement for high analysis throughput simply means that there are many important applications to be monitored. If APM is “the translation of IT metrics into business meaning,” then it follows that important management processes and decisions rely on APM data. (If that’s not the case, then it may be time to revisit the factors driving APM adoption.) This means that the APM system must not fail and thereby stop providing business insight. DC RUM is designed with redundancy and failover capabilities in mind: network probes and report servers can be configured for hot-standby failover, ensuring uninterrupted monitoring and alerting should a hardware failure occur.
Security considerations for SSL and PCI compliance
A related architectural consideration addresses important security requirements. The network probe is a device that has access to the SSL keys and insight into all data traversing the network, including sensitive information. Even though it does not store this data, it is still considered a node on the network that has access. In a security-conscious environment, you would not want to connect a device to both the PCI-secured zone (where the network probe requires access) and the user zone (where APM specialists access performance reports). Instead, a PCI firewall is used to isolate the solution tier that must have access to the network data (the probe tier) from the tier that operates on “safe” data (the report tier). DC RUM’s architecture accommodates the use of a PCI firewall between the probes and report servers, ensuring that no sensitive data leaves the PCI zone. This consideration should be independent of – or in addition to – any vendor promises of security.
There is of course no “one size fits all” solution. Experience suggests that you keep two factors in mind:
- Begin by adopting a solution that meets your current maturity level
- Ensure that the solution can readily scale and expand as your maturity level increases
The assumption – an important one – is that you’ll achieve quick successes and clear value, and that you’ll progress rapidly on your Performance Journey.