This guide contains troubleshooting procedures and workarounds for Synthetic Classic tests running on Backbone nodes that have experienced issues: failed tests, slow response times, missing alerts, etc.
The initial goal in a Backbone node investigation is to determine if the issues were caused by:
- Website content or configuration changes
- Dynatrace node issues
- Real world issues
Failed tests and/or slow response times
Step 1: Gather information
Were there any changes to your website?
Did your team deploy new website content or code, or change any configuration? Have you checked with your engineering/development group? If the website changed, when did the changes happen, and did they coincide with the Synthetic Classic test issues?
Check the Synthetic Classic Portal system status
Check the Service and Node Status Notifications page. Did Dynatrace deploy any product or node upgrades, or did the node experience any known issues?
Determine the time frame during which the issues occurred
Create a raw scatter chart that plots the failed tests with a time period well before and after the issue occurred. Sometimes the issue has already been occurring without your knowledge. You may also recognize a pattern to the failures; for example, they may only happen at peak usage times or when the servers are rebooted in the middle of the night.
Look for a correlation between the failed tests and any recent changes to your site content or configuration. Also check for Dynatrace product releases or node upgrades that were deployed near the time of the failed tests and that may have caused (or fixed) the issue.
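If you export the failure timestamps from the chart, a short script can make a time-of-day pattern easier to spot. The following is a minimal sketch, not a Portal feature; the function name and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def failures_by_hour(failure_times):
    """Count failed runs per hour of day (0-23), sorted by hour."""
    counts = Counter(t.hour for t in failure_times)
    return dict(sorted(counts.items()))


# Hypothetical failure timestamps exported from a raw scatter chart.
sample = [
    datetime(2020, 1, 1, 2, 5),
    datetime(2020, 1, 2, 2, 40),
    datetime(2020, 1, 2, 14, 0),
]
print(failures_by_hour(sample))
```

A cluster in one hour (for example, the middle of the night) suggests a scheduled event such as a reboot or batch job rather than a random failure.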
Determine the test's availability
Chart the availability of the test. Is the test failing during 1% of the runs, or 100%? Is there any obvious failure pattern?
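As a rough illustration of the availability figure, the percentage is simply the share of successful runs in the period you are examining. This sketch assumes you have the run outcomes as a list of booleans; the name availability and the sample data are hypothetical.

```python
def availability(results):
    """Return the percentage of successful runs (True = success)."""
    if not results:
        raise ValueError("no test runs to evaluate")
    return 100.0 * sum(1 for ok in results if ok) / len(results)


runs = [True, True, False, True]  # hypothetical outcomes for four runs
print(f"{availability(runs):.1f}%")  # prints 75.0%
```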
Determine the scope of the issue
- Compare nodes – Are the failed tests occurring on all of the configured nodes? Generally speaking, the more nodes with failures, the lower the chance that it is a node issue, but there are exceptions. Be mindful that product releases and node upgrades may affect multiple nodes.
- Compare tests – Is this issue occurring on other tests that are running in your account?
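The node comparison above can be sketched as a per-node failure rate. This is an illustration only; it assumes you have exported the runs as (node name, succeeded) pairs, and the node names shown are made up.

```python
from collections import defaultdict


def failure_rate_by_node(runs):
    """runs: iterable of (node_name, succeeded) tuples.

    Returns {node_name: fraction of runs that failed}.
    """
    totals = defaultdict(int)
    failed = defaultdict(int)
    for node, succeeded in runs:
        totals[node] += 1
        if not succeeded:
            failed[node] += 1
    return {node: failed[node] / totals[node] for node in totals}


# Hypothetical export: failures concentrated on one node.
sample = [
    ("Node A", True), ("Node A", True),
    ("Node B", False), ("Node B", True),
]
print(failure_rate_by_node(sample))
```

The same grouping works for comparing tests: replace the node name with the test name.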
Step 2: Check the health of the nodes that had the issue
Check the following Dynatrace Community pages for notices that may be related to your investigation:
When you are logged in to the Community, you can click Watch at the top of the page to be notified of page updates.
Step 3: Investigate the details of the failed tests
Review the problem analysis
The Problem analysis page shows information for the 6-hour period ending with the selected test execution, so you can analyze the problem in context and zero in on the root causes.
Use Root Cause Analysis
Root Cause Analysis analyzes the test execution events surrounding the time when a suspected abnormality occurred in a Backbone (including Mobile-over-Backbone) or Private Last Mile test. It identifies the most common performance, availability, and page content issues that may have led to the abnormality. For each issue discovered, Root Cause Analysis lists the top-ranked items that contributed most to the issue.
Enable Screen Capture on Error
Screen Capture on Error (SCoE), also called Advanced Analytics, is often very helpful in determining the cause of an issue. If it isn't already enabled for your tests, turning it on should be one of the first things you do after an issue is detected. SCoE will record a screen capture at the time the test failed. It also provides the HTTP requests and responses (or lack thereof) for each object.
- Look at the screenshots for error messages like "Service not available" or "500 Internal Server error".
- In the HTTP headers section, look for an object that has no reply. This may give you a good starting point for your investigation.
There may be an additional cost for SCoE. Consult your contract, or contact your salesperson or account manager for more information.
Enable Trace Route on Error
Trace Route on Error (TRoE) runs when a test page fails without downloading any objects, because of a network failure.
Network errors might include:
- Server Response Missing Status Line (10001)
- Network Reset (10052)
- Socket Connection Reset (10054)
- Connection Refused (10061)
- Connection Timeout (11005)
- Socket Receive Timeout (12000)
- Unknown Connection Error (19999)
The results of the TRoE are located on the last page of the SCoE.
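When you process exported results in a script, the network error codes listed above can be kept in a lookup table. This is a minimal sketch; the codes and names come from the list in this guide, which may not be exhaustive, and the function name is hypothetical.

```python
# Network error codes as listed in this guide (the list may be incomplete).
NETWORK_ERRORS = {
    10001: "Server Response Missing Status Line",
    10052: "Network Reset",
    10054: "Socket Connection Reset",
    10061: "Connection Refused",
    11005: "Connection Timeout",
    12000: "Socket Receive Timeout",
    19999: "Unknown Connection Error",
}


def describe_error(code):
    """Return the network error name, or None if the code is not in the list."""
    return NETWORK_ERRORS.get(code)


print(describe_error(10061))  # prints Connection Refused
```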
Run instant tests
There are several tools in the Backbone instant test that may help you diagnose the cause of an issue, especially if it is still occurring. They will run on the Backbone node that you specify:
- DNS issues – Use Dig or NSLookup to get real-time DNS configuration information.
- Routing issues – Use TcpTrace to get real-time routing information. TcpTrace generally provides more information than tracert. Make sure you use Port:80 for HTTP addresses and Port:443 for HTTPS addresses. TcpTrace is similar to an HTTP request, whereas tracert uses the lower-level Internet Control Message Protocol (ICMP). Some sites may have ICMP turned off to prevent denial-of-service (DoS) attacks.
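To illustrate why the port matters, the sketch below picks port 80 or 443 from a URL and times a plain TCP connection to it. This is not TcpTrace and does not run on a Backbone node; it only checks reachability from the machine where you run it. The function names are hypothetical.

```python
import socket
import time
from urllib.parse import urlsplit


def target_port(url):
    """Return the explicit port in the URL, else 443 for HTTPS and 80 otherwise."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return 443 if parts.scheme == "https" else 80


def tcp_connect_seconds(host, port, timeout=5.0):
    """Time a TCP handshake to host:port; raises OSError if it fails."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        return time.monotonic() - start


print(target_port("https://www.example.com/login"))  # prints 443
```

Because the check uses TCP on the web port, it still works for sites that block ICMP.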
Look for slow or missing objects
When you find an object that was not served or that had a slow response time, the IP addresses for the object can be found in the Portal's waterfall charts. You can search the web for "IP lookup" to get the owner of the IP address from which the object originates. Look in the name for Akamai or other recognizable Content Distribution Network (CDN) providers. If the object with an issue is being served by a CDN, you can ask the CDN to open a Support ticket to investigate.
Check the component times of the failed tests
Open the failed tests in a waterfall chart, and look for long bars. Always keep the relative time scale on the X axis of the graph in perspective so as not to be misled by the scale. Look for the bars that represent long First Byte Times. Long First Byte Times are usually a sign that a site server is under heavy load.
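If you have the per-object timings outside the Portal, a simple filter can surface the long First Byte Times. This is a sketch under assumptions: the field names url and first_byte_ms and the 1000 ms threshold are hypothetical, not Portal export fields or recommended values.

```python
def slow_first_byte(objects, threshold_ms=1000):
    """Return objects whose first byte time meets the threshold, slowest first."""
    slow = [o for o in objects if o["first_byte_ms"] >= threshold_ms]
    return sorted(slow, key=lambda o: o["first_byte_ms"], reverse=True)


# Hypothetical waterfall data for one page.
page = [
    {"url": "/index.html", "first_byte_ms": 120},
    {"url": "/api/cart", "first_byte_ms": 4200},
    {"url": "/img/logo.png", "first_byte_ms": 1500},
]
for obj in slow_first_byte(page):
    print(obj["url"], obj["first_byte_ms"])
```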
Check your server logs
You can search your logs for issues that occurred at the time of the failed tests.
- Search your server and error logs for the user agent string GomezAgent to find requests from our nodes at the time of the errors. You may find issues that prevented or delayed serving the object.
- Check your server metrics at the time of the failed tests for high CPU usage, high memory usage, number of sessions, etc. If the metric values are too high, you may need to change the site configuration or add server capacity.
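The log search described above can be sketched as a simple filter on the GomezAgent user agent string. The log lines below are made-up examples in a generic access-log style; your server's format will differ.

```python
def gomez_requests(lines, agent="GomezAgent"):
    """Return the log lines whose text contains the given user agent string."""
    return [line.rstrip("\n") for line in lines if agent in line]


# Hypothetical access-log lines.
log = [
    '192.0.2.10 - - [01/Jan/2020:02:05:11] "GET / HTTP/1.1" 500 "Mozilla/5.0 GomezAgent 3.0"\n',
    '198.51.100.4 - - [01/Jan/2020:02:05:12] "GET / HTTP/1.1" 200 "Mozilla/5.0"\n',
]
for line in gomez_requests(log):
    print(line)
```

To scan a real file, pass the open file object instead of the list, for example gomez_requests(open(path)) with your own log path.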
You can request a trial of Dynatrace AppMon and User Experience Management (UEM) to gather very detailed performance monitoring data for future incidents.
Check node IP addresses
The IP addresses of our public Backbone nodes (the locations from which Synthetic Classic runs the tests) are listed in the Synthetic Classic Portal under the Node Manager. The IP addresses are static; they don't change. If your site requires white-listed IP addresses, you will need to add the address of each node that is running your tests.
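A quick way to confirm that every node address is covered by your allowlist is to compare the two lists. This sketch uses Python's standard ipaddress module; the addresses shown are documentation placeholders, not real node IPs, and the function name is hypothetical.

```python
import ipaddress


def missing_from_allowlist(node_ips, allowlist):
    """Return the node IPs not covered by any allowlist entry.

    Allowlist entries may be single addresses or CIDR ranges.
    """
    networks = [ipaddress.ip_network(entry, strict=False) for entry in allowlist]
    return [
        ip for ip in node_ips
        if not any(ipaddress.ip_address(ip) in net for net in networks)
    ]


# Placeholder addresses; copy the real ones from the Node Manager.
nodes = ["192.0.2.10", "203.0.113.7"]
allowed = ["192.0.2.0/24"]
print(missing_from_allowlist(nodes, allowed))  # prints ['203.0.113.7']
```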
Step 4: Open a Support ticket to check the node metrics
If the preceding steps don't identify and resolve the test issues, open a Support ticket. Contact your Dynatrace Guardian, or open the Support ticket through the Support page. Ask the Guardian or Support team to check whether the nodes that are experiencing test issues are functioning within their performance thresholds.
Step 5: Try workarounds
If the test is failing because of timeout errors, you can increase the timeout limits as a workaround.
Timeouts exist at the action level, step level, and script level. If you significantly increase the timeouts, all three levels may need to be adjusted. Contact Customer Support to increase the timeouts at the step or script level. Keep in mind that the Browser Agent has a hard stop of 300 seconds (5 minutes). If the hard stop is hit, the data will be automatically reclassified ("auto-cut") and will not be displayed in the Portal. Do not increase the timeouts more than needed.
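Before requesting new timeout values, it can help to check that they are consistent with each other and with the hard stop. The sketch below assumes that an action timeout should not exceed its step timeout, and a step timeout should not exceed the script timeout; that nesting is an assumption for illustration, not documented behavior. Only the 300-second hard stop comes from this guide.

```python
HARD_STOP_SECONDS = 300  # Browser Agent hard stop stated in this guide


def timeout_warnings(action_s, step_s, script_s):
    """Return a list of warnings for a proposed set of timeouts (in seconds)."""
    warnings = []
    if action_s > step_s:  # assumed nesting: action within step
        warnings.append("action timeout exceeds step timeout")
    if step_s > script_s:  # assumed nesting: step within script
        warnings.append("step timeout exceeds script timeout")
    if script_s >= HARD_STOP_SECONDS:
        warnings.append("script timeout reaches the 300-second hard stop")
    return warnings


print(timeout_warnings(action_s=90, step_s=60, script_s=400))
```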
If the timeout is happening at the beginning of a step and there have been no objects downloaded for the step, chances are the issue occurred in the previous step. Look for the response code 299 Request Aborted at the end of the previous step, which indicates the script timed out and the browser shut down before the objects arrived.
If the timeout is happening in the middle of a step, you can try adding an extra Wait for Network action after the default Wait for Page Complete action. This allows extra time for asynchronous calls (AJAX) to complete.
Why you shouldn't increase timeouts
Most of the default timeouts are 30 to 60 seconds. If a real user has to wait 30 to 60 seconds for a page or control to render, chances are they will abandon the page. That is a bad user experience, so your focus should be on improving the performance of the website or web application rather than increasing the scripted timeout values.
Validate at the end of each step
We recommend adding a Validate action or a Wait for Validate action at the end of every step if possible. Copy the CSS/DOM locators from the Click action in the following step into the Validate action. That will contain the error in the correct step and will aid in troubleshooting. Because of real-world differences in browsers, websites, and technology, sometimes one action works and the other doesn't, and sometimes neither works as expected.
Move the test to another node
If you have not found the cause of the issues, one possible workaround is to move the test to another node, preferably a node in the same city or geographic region. This does not fix the issue, but it may temporarily fulfill your monitoring requirements. Follow up with Support to see if there were any real-world issues on the network.
Switch to a different Browser Agent
There are many reasons this option may come into play; for example, your website may no longer support the browser version used by the Dynatrace monitoring nodes. You can try changing between the available Internet Explorer, Firefox, and Chrome Browser Agents.
Data points missing from charts in the Portal
Your test may be hitting the hard timeout limit of 300 seconds and being automatically terminated. If this happens, the data is automatically reclassified ("auto-cut") and is not displayed in the Portal. The Support team can query the backend data to determine if the test runs were reclassified.
Less common cause
Missing data may be caused by an issue on the node, such as test replication or the test schedule not functioning correctly. This may require further investigation by the Support team.
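Before contacting Support, you may want to list exactly where the data points are missing. The sketch below finds gaps between consecutive run times that are longer than the expected test frequency; the function name, the one-minute tolerance, and the sample times are hypothetical.

```python
from datetime import datetime, timedelta


def find_gaps(run_times, expected_interval, tolerance=timedelta(minutes=1)):
    """Return (earlier, later) pairs where consecutive runs are too far apart."""
    ordered = sorted(run_times)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > expected_interval + tolerance:
            gaps.append((earlier, later))
    return gaps


# Hypothetical run times for a test scheduled every 15 minutes.
times = [
    datetime(2020, 1, 1, 10, 0),
    datetime(2020, 1, 1, 10, 15),
    datetime(2020, 1, 1, 11, 0),  # the 10:30 and 10:45 runs are missing
]
for start, end in find_gaps(times, timedelta(minutes=15)):
    print(start, "->", end)
```

Including the gap boundaries in the Support ticket gives the team a precise window to query.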
Do the following to troubleshoot alerts that are not received:
- Check whether the test was executed during an alerting maintenance window. If it was, no alerts would be sent even if the test met an alerting threshold. Go to the test settings to see which maintenance windows the test is assigned to.
- Check that the alert is present in the Alert logs page.
- Make sure the email addresses in the alert destination can accept email from outside your company's firewall.
- Check that the global alerts are turned on at the account level, in the Account Settings.
- Check that the individual alert is turned on in the alert settings for the test.
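The first check in the list above, whether a run fell inside a maintenance window, can be sketched as a simple interval test. This assumes you have the window start and end times from the test settings; the function name and sample values are hypothetical.

```python
from datetime import datetime


def in_maintenance_window(run_time, windows):
    """Return True if run_time falls inside any (start, end) window.

    The start is inclusive and the end is exclusive.
    """
    return any(start <= run_time < end for start, end in windows)


# Hypothetical nightly window from 02:00 to 03:00.
windows = [(datetime(2020, 1, 1, 2, 0), datetime(2020, 1, 1, 3, 0))]
print(in_maintenance_window(datetime(2020, 1, 1, 2, 30), windows))  # prints True
```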