It’s Velocity time, and the people who care about Performance, Continuous Delivery and DevOps are gathered in sunny Santa Clara, California. Thankful to be here, I want to share my notes with our readers who don’t have the chance to experience it live. Let’s dig right into it!
Interview with Steve Souders himself 🙂
I had the honor of getting Steve Souders, the founding father of Web Performance Optimization and the host of Velocity, in front of my PurePerformance microphone 🙂 -> we chatted about metrics, metrics and more metrics. Check it out:
Wednesday, June 22nd
Keynotes on Wednesday, June 22nd
There were several keynote presentations. Two that really struck me were from Bruce Lawson from Opera and Artur Bergman from Fastly. I also really enjoyed the talk from Richard Cook on Poised to Deploy.
Making bad ads sad. Rad!
I kept it to meeting notes – I hope this is OK for my readers:
- com: If you load this page once a day for a month you spend $9.5 to download ads just on your mobile data bill!!
- Fact: 2/3 of people in India and Indonesia block ads
- Fact: #1 browser extensions are ad blockers
- The revenue model of these extensions is selling “a spot on the white list”
- Opera has a built-in ad-blocker plus comparison feature with and without ads!!
- 7% data savings for end users when blocking ads. 10% savings for companies
- Why are ads so slow? Redirects, slow responses, DOM modifications, …
- Good Ads that always work: native (image + text natively in the html) + video
- IAB: Follow the LEAN Guidelines: lightweight, encrypted, ad-choice supporting, non-invasive
- UX pays dividends: 140% more revenue generated from users that have good user experience.
- Fact: Ad blockers are going to double in the US by 2020: you have 48 months to innovate -> do it or die
Artur walked us through the very painful story of the DDoS attack that Fastly went through earlier this year. The audience could feel the pain. The main messages I took from it:
- It’s not your fault
- There is no timeline
- Prepare for it
- There are different motivations for DDoS
I really enjoyed his talk – make sure you watch the video once it is online. Warning: there is a small bit of profanity in the video.
Session: Using Machine Learning to determine drivers for conversion and bounces
Pat Meenan and Tammy Everts on how Google and SOASTA partnered up to build a machine learning system to analyze and predict the impact of performance on conversions and bounces. They achieved 96% accuracy with their data and machine learning system. The most interesting part of the presentation was which metrics they found to correlate best with conversion rate and bounce rate. This is great input for new best practices on building websites. Most interesting was that DNS lookup and especially Start Render metrics had almost no correlation with conversion or bounce rate!
Key items when they built their system
- Balancing the data is essential
- Validation data: train on 80% of the data; Validate on 20% to prevent over-fitting
- Machine learning works best on normally distributed data -> makes me wonder whether real user data is really normally distributed!!
- Data they have in their beacon, e.g.: domain, time stamp, SSL, user agent, Geo, Bandwidth, Timers, custom metrics, HTTP Headers, user session information, …
- Metrics correlated best with conversion rate? # of elements, # of images on page, # of scripts, front-end load time, full page load time
- Metrics correlated best with bounce rate? DOM load, full page load time, # of page elements, front-end load time, # of scripts, back-end load time
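To make the balancing and the 80/20 split from the list above a bit more concrete, here is a minimal Python sketch. It is not what Pat and Tammy actually ran: the field names (`load_time_ms`, `converted`), the downsampling approach and the sample data are my own assumptions for illustration.

```python
import random


def balance(rows, label_key="converted", seed=0):
    """Downsample the majority class so both outcomes are equally represented."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key]]
    negatives = [r for r in rows if not r[label_key]]
    n = min(len(positives), len(negatives))
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced


def train_validation_split(rows, train_percent=80, seed=0):
    """Shuffle a copy of the rows and split it into (train, validation)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


# Hypothetical sessions: 10 that converted, 90 that did not.
sessions = [{"load_time_ms": 1000 + i, "converted": i < 10} for i in range(100)]

balanced = balance(sessions)
train, validation = train_validation_split(balanced)
print(len(balanced), len(train), len(validation))
```

The model is trained on `train` only; `validation` is held back so that over-fitting shows up as a gap between the two accuracy numbers.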
Findings on data they analyzed
- #2: the more complex a session (the number of steps) the worse the conversion -> so the longer users have to be on your page the less they convert -> makes sense!
- #3: # of images: the fewer images on a page the better the conversion -> make it lean! Follow general WPO best practices
- #4: The higher DOM Ready time the higher the bounce rate -> makes sense -> the longer a page takes to load the more people will drop out!
- #5: The higher Full Load Time the higher the bounce rate -> makes sense again -> just as high DOM Load Time
- #6: Mobile specific metrics weren’t meaningful predictors!
- #7: Conventional metrics (DNS, Start Render) are NOT important at all and don’t correlate to real conversion rates
- Do this with your own data
- Gather your RUM data
- Run the machine learning against it
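If you want a first look at your own RUM data before setting up real machine learning, a plain correlation ranking is a reasonable starting point. Note that this is a much simpler technique than the system presented in the session: the sketch below only computes a Pearson correlation between each metric and the bounce flag. The beacon field names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no signal
    return cov / math.sqrt(var_x * var_y)


def rank_metrics(beacons, outcome_key, metric_keys):
    """Sort metrics by the absolute strength of their correlation with the outcome."""
    outcomes = [1.0 if b[outcome_key] else 0.0 for b in beacons]
    scores = {m: pearson([b[m] for b in beacons], outcomes) for m in metric_keys}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical RUM beacons; real data would have thousands of rows.
beacons = [
    {"dom_ready_ms": 800, "dns_ms": 30, "bounced": False},
    {"dom_ready_ms": 1200, "dns_ms": 10, "bounced": False},
    {"dom_ready_ms": 3500, "dns_ms": 30, "bounced": True},
    {"dom_ready_ms": 4200, "dns_ms": 10, "bounced": True},
]

ranking = rank_metrics(beacons, "bounced", ["dom_ready_ms", "dns_ms"])
for metric, score in ranking:
    print(metric, round(score, 2))
```

Correlation only tells you which metrics move together with bounces in your data; it does not replace the validation step described above.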
Session: Wild West in Media Performance: A Vox Story
A story of developers working in a legacy codebase – could be very interesting 🙂 -> here are the session details.
Here are my notes from this really very informative and very practical session:
- Started with SpeedCurve: they liked that it is based on WebPageTest, as they also run their own private WebPageTest cluster
- After deployment run a set of SpeedCurve tests to validate that the deployment really worked
- Tip: use Git Annotations for deployments
- Pro-Tip: wait for “cooldown” after a deployment. Right at the deployment you will see some strange perf data. Wait until the deployment runs stable. I assume these anomalies are caused by empty caches, …
- Introducing Lighbike – their own Resource/Performance Budget tool. Don’t allow any commits or pushes to production if they are outside the budget. Because: once it is in production it is hard to get it out. That’s why they prevent this in dev already!
- Use RUM and Synthetic: you need both!
- Tips on Images: Huge Performance Improvements with WebP Image Formats; Preload via Picture Element; Better image quality control; https://github.com/thumbor/thumbor
- Fonts: Using https://fontfaceobserver.com/
- Ad Testing: they test every Ad and mark them with Green, Yellow, Red -> depending on their performance impact
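The performance budget idea from the notes above is easy to prototype. The following Python sketch is my own illustration and not how Vox's tool works: the metric names and limits are made up, and the measured values would normally come from a synthetic test run.

```python
def budget_violations(measured, budget):
    """Return (metric, measured value, limit) for every metric over budget.

    A metric that is missing from the measurement counts as a violation,
    so a broken test run cannot silently pass.
    """
    violations = []
    for metric, limit in budget.items():
        value = measured.get(metric)
        if value is None or value > limit:
            violations.append((metric, value, limit))
    return violations


# Hypothetical budget and measurement for one page.
BUDGET = {"page_weight_kb": 1500, "requests": 80, "speed_index": 3000}
measured = {"page_weight_kb": 1720, "requests": 64, "speed_index": 2900}

violations = budget_violations(measured, BUDGET)
for metric, value, limit in violations:
    print(f"{metric}: {value} exceeds budget of {limit}")
```

In a commit hook or CI job you would fail the build whenever `violations` is non-empty, which is what keeps the regression out of production in the first place.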
Session: Taking back control over Third Party Content
Yoav Weiss from Akamai giving us an overview of the nasty business with third parties – such as Ads.
I suggest you get the video and slides from his talk. Lots of very interesting information on how to speed up your pages when loading third party code. He also addressed some new W3C working groups on, e.g., controlling the feature set of browsers to speed up pages or make them more secure by preventing certain things like scripting by third party code. Long story short: make sure you read up on what Yoav has to say – he clearly has a lot of know-how on that topic. Fun fact: seems Yoav likes Lego 🙂
Session: Operational Visibility on Global Scale
Sangeeta Narayanan from Netflix on how Netflix troubleshoots on a global scale, identifying issues that their more than 81M subscribers run into.
Here are my notes:
- Making sure they have DELIGHTED users – but how to measure delight?
- You need Business Insights AND Operational Insights
- Word on the Street: “Netflix is a Metrics Generation Engine that also Streams Movies” 🙂
- Introducing us to their monitoring platform that they built
- Stream Processing Platform Mantis -> more on their blog: http://techblog.netflix.com/2016/03/stream-processing-with-mantis.html
- On-Demand Metric Definition -> allows them to turn streaming events into metrics they can analyze without changing any instrumentation. It’s built into the platform
- Anomaly Detection: Looks like they are building what many other tools have as well -> providing an “analytics” option to diagnose and detect patterns
Overall I have to say that I expected a bit more. Netflix has been producing a lot of great content and stories over the last years. So I guess it is OK that this session may not have been what I was expecting!
Thursday, June 23rd
You got your design team in my security team!
Eleanor Saitta from Etsy on the fact that security has to be a system property.
Security issues are either those that shouldn’t exist, DDoS, those that are math problems, or those involving people
- Shouldn’t exist: Try Langsec today!
- Security is the set of activities that reduce the likelihood of a set of adversaries successfully frustrating the goals of a set of people we like
- Kill your ego before you kill your user: what matters is what you enable people to do – not exactly what you or your team does
- Security is just another system property
- The role of security at Etsy is to be everyone’s friend and make sure engineering can focus on what they have to do!
There was much more in her talk. Follow her online in case you want to make sure you have the right focus on security!
“Microsoft is irrelevant to me. I use a Mac”: Is this you?
Quick 5-minute pitch on how Microsoft has started to change over the last couple of years.
Turning high-velocity data into leverage for people
Ozan Turgut from SignalFx giving an overview of what they are doing with measure analytics. He showed some very interesting visualizations of metrics, but more interesting is the approach of proactively alerting on potential upcoming issues, answering questions such as: Top hosts by CPU in the last hour? Which hosts will soon run out of disk space?
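The “which hosts will soon run out of disk space?” question can be approximated with a simple trend line. The sketch below is my own assumption of how one could do it, not SignalFx's implementation: it fits a least-squares slope to recent usage samples and extrapolates from the last observed value. The sample values and the capacity are hypothetical.

```python
def hours_until_full(samples, capacity_gb):
    """Estimate the hours left until the disk is full.

    samples: list of (hour, used_gb) tuples, oldest first.
    Returns None when there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return None
    # Least-squares slope: growth in GB per hour.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / denom
    if slope <= 0:
        return None
    _, last_used = samples[-1]
    return max(0.0, (capacity_gb - last_used) / slope)


# Hypothetical host growing by 2 GB per hour on a 100 GB disk.
usage = [(0, 50), (1, 52), (2, 54), (3, 56)]
print(hours_until_full(usage, capacity_gb=100))
```

A proactive alert would then fire when the estimate drops below some lead time, for example the time it takes to provision more storage.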
Measuring What Matters
Stephen Ludin, Chief Architect @ Akamai talking about their Innovation Team trying to stay ahead of the curve. Interesting stat: only 25% of the Top 1000 websites make use of the W3C User Timings. Call to action from Akamai: figure out how to make monitoring through W3C easier!
Performance is about more than time series charts
Buddy Brewer from SOASTA proposing to not only look at time series and waterfalls. The question is not “What to fix?” but “What to fix first?!”. Try to find the top pages that matter most for your users (those with the biggest hit count) and focus on improving those first! Lots of interesting ideas on visualization as well as about predicting performance and user experience!
An open platform for government innovation
Diego Lapiduz on the challenge of trying to bring change to government agencies. Interesting story on how they tried to innovate and ended up building cloud.gov as the platform based on CloudFoundry that they can now use to run and deploy apps easier – avoiding the previously painful process of getting deployments approved as it had to go through many manual regulatory checks.
Lessons learned: learn what your users want; Ops in government is hard; Automation is key (for security, compliance & happiness); A solid platform can change your ops; Stay foolish
DevOps: More than just Automation
David Hayes from PagerDuty presenting a survey they did with leading devops teams around the world.
Most interesting answers:
- Regular Ops: 85% missed an alert
- DevOps Teams respond faster on outages
- Organizations transitioning to DevOps: 80% felt like they somehow fail!
The state of the W3C Web Performance Working Group
Todd and Philippe on the history and the current state of W3C. Check out the information from the W3C WebPerf group online. They are reaching out to developers to understand what else the group can define to be measured in the future. The biggest challenge right now is privacy, as a lot of data capture is seen by some folks as a potential compromise of user privacy. The other issue they face is adoption by all browser vendors.
They want YOUR feedback on the following items:
- We want to know what is still hard to measure
- What browser differences make measurement difficult?
- What knobs are missing on existing features?
Robust anomaly detection on real user data
LinkedIn on how they detect anomalies. They started off with a pretty interesting graph of their response time over the course of a year. They made improvements, but these basically had no impact in the end because a latency leak voided them over the course of the year. That’s why they invested in detecting these anomalies automatically.
- Avoid “Alert Black Hole” as people that are annoyed with the alerts will move them to a folder and nobody looks at them
- Metrics for anomaly detection: Business (PageView, Engagement, Geo, .. ) and Operational Metrics (CPU, GC time, Latency, …)
- Send the right information to the right people at the right time. Don’t send too many alerts; Send Root Cause data to people that can handle it
- Introducing their Anomaly Detection project Luminol -> find it on GitHub: https://github.com/linkedin/luminol -> really interesting approaches
- Don’t alert on a static threshold; alert by comparing with a baseline from the same timeframe in the past
- Root Cause Detection: if you have domain knowledge it’s going to be easier to identify what causes the anomaly -> correlate metrics
- Not all pages are equal: different confidence settings for alerting depending on “importance of pages”
- User Feedback: allow users to override anomaly detection and provide feedback to the system -> was this really an anomaly??
- Tip: Monitor your anomaly detection: how many false positives do you have? Optimize it to reduce them!
- Some great real time use cases of anomalies they had at LinkedIn
- Tips: Understand your use case; make your choice / trade-offs; Anomaly Detection is a sexy buzz word -> BUT: Not a Silver Bullet – Embrace the imperfection!
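Two of the tips above, comparing against a baseline from the same timeframe in the past and using different confidence settings per page, fit into a few lines of Python. This is a deliberately simple sketch of the idea and does not use Luminol or reflect LinkedIn's algorithms; the latency values and sensitivity settings are made up.

```python
from statistics import mean, stdev


def is_anomalous(current, history, sensitivity=3.0):
    """Compare a value with the same time slot from previous periods.

    history: values for the same slot in the past, e.g. Monday 09:00
             of the last few weeks.
    sensitivity: how many standard deviations away counts as an anomaly;
                 use a lower number for more important pages.
    """
    if len(history) < 2:
        return False  # not enough data to build a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) > sensitivity * spread


# Hypothetical page latency (ms) for the same hour in the last four weeks.
same_hour_last_weeks = [410, 395, 402, 398]

print(is_anomalous(415, same_hour_last_weeks))                   # regular page
print(is_anomalous(415, same_hour_last_weeks, sensitivity=2.0))  # important page
print(is_anomalous(450, same_hour_last_weeks))
```

Tracking how often these flags turn out to be false positives is the “monitor your anomaly detection” tip from the list.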
Part of the pipeline: Why continuous testing is essential
One of my most anticipated sessions as I have seen Adam Auerbach (@bugman31) present on what they did at CapitalOne in the last year around changing the way they deliver software. I keep using his examples from last year in my own presentations – so – this time I want to a) see what is new with Adam and b) finally thank him for sharing his story.
Adam is joined by Tapabrata Pal to explain why testing has to be part of continuous delivery – part of your pipeline:
Here are my meeting notes:
- The question is not What is DevOps. It is: Why DevOps? Deliver High Quality Working Software Faster
- A pipeline IS NOT a pipeline if you don’t have testing!!
- We used to have a “Hardening Sprint” because we didn’t have Continuous Testing. Now we have real Continuous Delivery because we test continuously -> no need for a Hardening Sprint!
- Challenges for Continuous Testing: Devices and Browsers, Test Environments, Test Data, Dependencies
- We replaced our traditional testing to do everything with Docker + Open Source Test Tools, e.g.: Docker with JMeter in AWS, Docker with Selenium in AWS -> we don’t need a large test lab with lots of load generation machines sitting idle when not needed. The cloud makes this possible!
- Test Data Challenge: get high quality data on-demand to the feature teams!
- Breaking Dependencies: we use Service Virtualization to allow “dependent” teams to test without having the other team’s service/app/component available!
- For frontend testing: Purpose is that the UI renders correctly on all devices. So we can virtualize the backend, as this is tested by other tests, and focus on the frontend only for our frontend tests. That is much faster and allows us to increase test coverage on that many devices
- Architect your Test Suites: figure out which tests to execute when. Goal is to fail fast in case there is a problem. Run fast API tests early
- Hygieia: Their open source project on “Tracking health of your software delivery pipeline!”. Really cool that they open sourced it!!
- Best Metrics for the efficiency of the Pipeline: Metrics showing the Build Flow, e.g: Avg Wait Time of a Build in a Pipeline Phase -> which phase do I need to speed up as my build times are impacted
- Tester Role is changing: you have to level-up; you need to understand how to interact and work with cloud environments; how to work with the pipeline; learn scripting; automate;
- How to Scale Continuous Testing: Building Testing Guilds; Office Hours; Open Spaces; Internal Conference
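The pipeline efficiency metric mentioned above, average wait time of a build per pipeline phase, can be computed from very little data. This Python sketch is my own illustration and has nothing to do with how Hygieia calculates it; the phase names and timestamps are hypothetical.

```python
from collections import defaultdict


def average_wait_by_phase(events):
    """Average time a build waits before a phase starts.

    events: iterable of (phase, queued_at, started_at) with times in seconds.
    """
    totals = defaultdict(lambda: [0, 0])  # phase -> [total wait, build count]
    for phase, queued_at, started_at in events:
        totals[phase][0] += started_at - queued_at
        totals[phase][1] += 1
    return {phase: total / count for phase, (total, count) in totals.items()}


# Hypothetical pipeline events for two builds.
events = [
    ("unit-test", 0, 30),
    ("unit-test", 100, 150),
    ("load-test", 200, 800),
    ("load-test", 900, 1300),
]

waits = average_wait_by_phase(events)
slowest_phase = max(waits, key=waits.get)
print(waits, slowest_phase)
```

The phase with the highest average wait is the one to look at first when builds back up in the pipeline.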
Last session of the conference on real world examples with HTTP/2 from Akamai.
- 300+ customers using H2 in production
- Tip: Use RUM to monitor the improvements – problem is that you will have H1 & H2 results – hard to differentiate and therefore not good for comparison. Therefore: do proper A/B testing
- Finding: Browsers that support H2 show different behavior
- Test your browser’s support for H2 by going to https://http2.akamai.com/demo
- Interesting to see tests comparing H1 & H2 in Chrome and FF for different pages. Rendering across the board is faster on H1
- Performance Rendering Tips:
- Do we still need domain sharding? NO! He gave some great examples and tips on how to “migrate” from H1 to H2 or better said “how to best support both worlds”
- Tip: Put critical components on the same domain!
- Tip: Use progressive jpgs!
- Tip: Large “hero” images will hog all the available bandwidth -> will slow down loading of other components
- Server Push: is best used to push content when network is idle!
- Interesting comparison examples with small files: check out his slides with the comparison between Chrome and FF on both H1 & H2
- Image Sprites: keep using sprites for performance; mobile decoding times can be slow; think about frequency of change
- Anti-patterns in H2 are still anti-patterns
- Keep combining files
- check rendering times carefully
- keep using sprites (if you use enough images)
- if you have to shard reuse the same connection
- Make sure critical content is on the same domain
Great talk – very informative. Lots of great test results. Great closure of another great Velocity!