Data investigation guide for Keynote Synthetic Monitoring

Overview

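This guide contains troubleshooting procedures and workarounds for Keynote Synthetic Monitoring measurements that have experienced issues. Issues can include:

  • Failed tests and/or slow response times
  • Missing data points from charts in the portal
  • Missing alerts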

The initial investigation goals are to determine if the issues were caused by:

  1. Website content or configuration changes
  2. Keynote monitoring agent issues
  3. Real world issues

Issue: Failed tests and/or slow response times

Step 1: Gather information

  • Were there any changes to your website—Did you deploy new website content or code, or change any configuration? Have you checked with your engineering/development group? If so, when did the changes take place, and did they coincide with the synthetic monitoring test issues?
  • Determine the time frame during which the issues occurred—Create a Scatter Plot chart that graphs the failed tests with a time period well before and after the issue occurred. Sometimes, the issue could already have been occurring before you first noticed it. You may also recognize a pattern to the failures, e.g., they only occur at peak usage times or when you reboot the web servers in the middle of the night.

Important

Look for a correlation between the failed tests and any recent changes to your site content or configuration. Also check for Keynote synthetic monitoring product releases or agent upgrades that were deployed around the time of the failed tests and that might have caused (or fixed) the issue.

  • Determine the availability—Chart the availability of the test. Is the test failing during 1% or 100% of the runs? Is there any obvious failure pattern?
  • Determine the scope of the issue
    • Compare agents—Are the failed tests occurring on all the configured agents? Generally speaking, the more agents with failures, the less chance that it is an agent-specific issue, but there are exceptions. Again, be mindful that product releases and agent upgrades can affect multiple agents.
    • Compare tests—Is this issue occurring on other tests that are running in your account?
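The availability and agent comparison checks above can also be done offline on exported results. The following is a minimal sketch, assuming you have one (agent, passed) record per test run; the record layout, the agent names, and the function name are illustrative assumptions, not a Keynote export format.

```python
from collections import defaultdict

# Hypothetical export: one (agent, passed) pair per test run.
runs = [
    ("New York", True),
    ("New York", False),
    ("New York", True),
    ("New York", True),
    ("London", True),
    ("London", True),
    ("London", True),
    ("London", True),
]

def availability_by_agent(runs):
    """Return {agent: percentage of runs that passed}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for agent, passed in runs:
        totals[agent] += 1
        if passed:
            passes[agent] += 1
    return {agent: 100.0 * passes[agent] / totals[agent] for agent in totals}

result = availability_by_agent(runs)
for agent, pct in sorted(result.items()):
    print(f"{agent}: {pct:.1f}% available")
```

If only one agent shows reduced availability, an agent-specific cause becomes more likely; if all agents show it, look first at the site or at a release that affected multiple agents.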

Step 2: Check the health of the agents that had the issue

Agent Status Dashboard—Dynatrace provides an operations dashboard that reports current Keynote monitoring agent status at http://status.keynote.com/agents/.

Step 3: Investigate the details of the failed tests

  • Screenshots on error—Error screenshots are often very helpful in determining the cause of an issue. If the Screenshots on error setting is not already enabled for your measurements, enabling it should be one of the first things you do after an issue is detected (there might be an additional cost, so please consult your Keynote salesperson or account manager). With this feature enabled, you receive a screen capture taken at the time the measurement failed and, once a day, a reference screenshot of a successful page load. This feature also gathers the HTTP request and response headers for additional information.
    • Look at the screenshots for error messages like “Service not available” or “500 Internal Server Error.”
    • In the HTTP headers section, look for an object request that has no reply, which can be a good starting point for your investigation.
  • Diagnostics and instant tests—Running an Instant Test from the MyKeynote Diagnostics menu can help you diagnose the cause of an issue, especially if it is still occurring. The Diagnostics section also provides the Network Diagnostics tool.
  • Look for slow or missing objects—When you find an object that was not served or had a slow response time, you can find the IP addresses for the object in the MyKeynote waterfall charts. Search the web for “IP lookup” to find the owner of the IP address from which the object originates. Look in the owner name for “Akamai” or other recognizable Content Distribution Network (CDN) providers. If the object with an issue is being served by a CDN, you can open a support ticket with the CDN to investigate.
  • Check the component metric times of the failed tests—Open the failed tests in a waterfall chart and look for long bars. Always keep the relative time scale on the X axis of the graph in perspective so as not to be misled by the scale. Look for the bars that represent long First Byte Download times. These times are usually a sign of a site server under heavy load.
  • Check your server logs—You can search your logs for issues that occurred at the time of the failed tests.
    • Search your server and error logs for the user agent string “KTXN” for Transaction Perspective agents or “KHTE” for Application Perspective agents to find requests from our agents at the time of the errors. You might find issues that prevented or delayed serving the object in question.
    • Also, check your server metrics at the time of the failed tests for high CPU usage, memory usage, number of sessions, etc. If the metrics are too high, you might need to change the site configuration or add server capacity. You can also request a trial of Dynatrace Application Monitoring (AppMon) and/or User Experience Management (UEM) to gather detailed performance monitoring data for future incidents.
  • Agent IP addresses—If you should need the IP addresses of our monitoring agents (the location from where Keynote synthetic monitoring measurements are run), please open a support case. If your website requires white-listed IP addresses, you will need to add the address of each agent that is running your measurements.
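The server log search described under "Check your server logs" can be sketched as follows. This is a minimal example that assumes an access log with the timestamp in square brackets (as in the common combined log format). The sample lines, IP addresses, and full user agent values are made up for illustration; only the "KTXN" and "KHTE" markers come from this guide.

```python
import re

# User agent markers named in this guide.
AGENT_MARKERS = ("KTXN", "KHTE")

# Assumption: the log timestamp appears in square brackets.
TIMESTAMP = re.compile(r"\[([^\]]+)\]")

def find_agent_requests(lines, markers=AGENT_MARKERS):
    """Return (timestamp, line) for every log line that mentions a marker."""
    hits = []
    for line in lines:
        if any(marker in line for marker in markers):
            match = TIMESTAMP.search(line)
            hits.append((match.group(1) if match else None, line))
    return hits

# Made-up sample lines; real logs will differ.
sample_log = [
    '203.0.113.7 - - [12/Mar/2016:02:15:01 +0000] "GET / HTTP/1.1" 500 312 "-" "Mozilla/5.0 (KTXN)"',
    '198.51.100.4 - - [12/Mar/2016:02:15:03 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '203.0.113.9 - - [12/Mar/2016:02:16:10 +0000] "GET /cart HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (KHTE)"',
]

hits = find_agent_requests(sample_log)
for timestamp, line in hits:
    print(timestamp, "->", line)
```

The match is a plain substring search over the whole line, so it can also catch the marker in a URL or referrer. Compare the timestamps of the hits with the times of the failed tests to narrow down which requests to examine.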

Step 4: Contact Dynatrace to open a support ticket and check agent metrics

Contact your Dynatrace Guardian or open a support ticket. Ask them to check whether the agent that is experiencing issues is functioning within expected metrics.

Step 5: Workarounds

  • Timeouts—If the measurement is failing because its timeouts are too low, you can increase the script timeouts as a workaround. You can download the scripts in question from MyKeynote on the > Current chart data > (select the script(s) > Download Script). Then open the scripts with the KITE recording tool (downloaded from http://kite.keynote.com) and increase the script timeout value in the script properties. Note that timeouts should not exceed an average of 12 seconds per page. For example, a 7-page transaction should have a script timeout of no more than 84 seconds (7 pages x 12 seconds per page). The default script timeout is 60 seconds, which is sufficient for up to 5 transaction pages (5 pages x 12 seconds per page = 60 seconds). If the script timeout is increased beyond an average of 12 seconds per page, additional costs will be incurred; please consult your contract, salesperson, or account manager. Also note that a real user who has to wait 30-60 seconds for a page or control to render is likely to abandon the page. That is a bad user experience, so focus on improving the performance of the website or web application rather than only increasing the scripted timeout value.
  • Validate at the end of each step—We recommend adding text validation at the end of every step, if possible. At the end of a page load, text validation evaluates the page content to determine whether the expected content was loaded. You can define required text that you want to see, or error text that you do not want to see and that should result in an immediate measurement error.
  • Switch agents—If you have not found the cause of the issue, one possible workaround is to move the measurement to another agent, preferably in the same city or geographic region. This does not fix the issue but can temporarily meet your monitoring requirements. Follow up with Dynatrace Support to see if there are any real-world issues on the network, or check http://status.keynote.com.
  • Switch to a different browser—There are many reasons to switch to a different browser; for example, your website might no longer support the browser version used by the Keynote monitoring agents. In that case, try changing between the available Internet Explorer, Firefox, and Chrome browser options.
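The timeout arithmetic from the Timeouts workaround can be expressed as a small check. This is an illustrative sketch only; the function and constant names are assumptions, while the 12-second average per page and the 60-second default come from this guide.

```python
SECONDS_PER_PAGE = 12   # average per-page limit stated in this guide
DEFAULT_TIMEOUT = 60    # default script timeout stated in this guide

def max_script_timeout(pages):
    """Largest script timeout (in seconds) that stays within 12 s per page."""
    if pages < 1:
        raise ValueError("a transaction needs at least one page")
    return pages * SECONDS_PER_PAGE

def exceeds_page_budget(pages, timeout_seconds):
    """True if the timeout averages more than 12 s per page."""
    return timeout_seconds > max_script_timeout(pages)

print(max_script_timeout(7))                    # 84
print(exceeds_page_budget(5, DEFAULT_TIMEOUT))  # False
print(exceeds_page_budget(5, 90))               # True
```

A timeout for which exceeds_page_budget returns True is one that, according to the note above, incurs additional costs.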

Issue: Missing data points from charts in the portal

Possible cause

  • Your measurement could be hitting the hard timeout of 300 seconds and being automatically terminated. The Dynatrace support team can query back-end data to determine if measurement runs timed out and were “tagged” (i.e., hidden from MyKeynote).

Less common causes

  • An issue with the agent, such as the global measurement scheduler being unable to execute your measurement instance. To investigate these issues, please contact Dynatrace Support.
  • Miscellaneous errors can occur due to conditions that create an uncharacterized error. Make sure your MyKeynote preferences are set to show miscellaneous errors ( menu > Settings > Charts > View miscellaneous errors in Scatter Plot graphs).
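Before contacting support, it can help to pin down exactly when data points are missing. The sketch below assumes you have the timestamps of the data points that do appear in the chart and that you know the measurement frequency. The function name, the sample timestamps, and the one-minute tolerance are illustrative assumptions.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, interval, tolerance=timedelta(minutes=1)):
    """Return (earlier, later) pairs spaced more than interval + tolerance apart."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > interval + tolerance
    ]

# Hypothetical data points for a measurement scheduled every 15 minutes.
points = [
    datetime(2016, 3, 12, 2, 0),
    datetime(2016, 3, 12, 2, 15),
    datetime(2016, 3, 12, 3, 0),   # the 02:30 and 02:45 runs are missing
    datetime(2016, 3, 12, 3, 15),
]

gaps = find_gaps(points, timedelta(minutes=15))
for earlier, later in gaps:
    print(f"no data between {earlier:%H:%M} and {later:%H:%M}")
```

The time ranges this reports are the ones to give to Dynatrace Support when asking whether runs hit the hard timeout and were tagged.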

Issue: Missing alerts

  • Make sure that MyKeynote alarm configurations are set to accept emails from outside your company’s firewall.
  • Check that the desired performance or availability alarms are enabled in MyKeynote alarm configurations.
  • Check if the expected alert is present in the MyKeynote Alarm Log.

Keynote Synthetic Monitoring terminology

  • Account = Collection of tests
  • User login = Credentials that allow you to log in to a portal, recorder, or community site
  • Script = Steps or workflow that you want to test
  • Measurement = Script + configuration (e.g., frequency, agents, browser)