There are many advantages to breaking an application into smaller services. When APIs and interfaces are well defined, teams can develop more independently on separate code bases, which keeps the risk low that a single code change breaks the whole app. Done right, it allows for more flexible and scalable deployments, and it even becomes possible to replace a service once a better one providing the same functionality is available.
Starting from a clean slate has, in my opinion, a higher chance of success, whereas breaking an existing monolith into smaller pieces by extracting individual features into services is often bound to fail. The following screenshot shows a Dynatrace Transaction Flow captured on a distributed .NET application that was recently migrated to a more service-oriented architecture (SOA). The page analyzed is a search result page displaying details for 33 result items. The architectural, scalability and performance problems are easy to spot by analyzing the key metrics highlighted in that view!
The example above already highlights a very common problem scenario when splitting applications into services: the end result is a very high number of service calls to achieve a “simple task”.
Before this application was split up into services, the original implementation already had some flaws. It executed SQL to retrieve a list of result item IDs. The code then iterated through that list of IDs and queried additional details for each item, adding an extra database roundtrip per item. This access pattern is referred to as the N+1 Query Problem (you can find a great explanation of this pattern on phabricator.com).
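The pattern is easy to reproduce on any developer machine. Here is a minimal sketch – using an in-memory SQLite table and a hypothetical `item` schema, not the app’s real code – that counts the database round trips for both variants:

```python
import sqlite3

# In-memory demo database (hypothetical schema for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO item VALUES (?, ?)",
                 [(i, f"item-{i}") for i in range(1, 34)])

queries = 0

def query(sql, params=()):
    """Execute a statement while counting the round trips."""
    global queries
    queries += 1
    return conn.execute(sql, params).fetchall()

# N+1 pattern: one query for the IDs, then one query per ID.
ids = [row[0] for row in query("SELECT id FROM item")]
details = [query("SELECT name FROM item WHERE id = ?", (i,))[0] for i in ids]
n_plus_1_queries = queries  # 34 round trips: 1 for the IDs + 33 detail queries

# The fix: fetch everything in a single statement.
queries = 0
details_single = query("SELECT id, name FROM item")
print(n_plus_1_queries, queries)  # 34 vs. 1
```

The same result set comes back either way; only the number of round trips changes.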
Instead of fixing that problem, the decision was made to simply split the current code base into services. After the migration, the initial SQL to retrieve the list of item IDs is still executed directly from the frontend implementation (you can see this in the Transaction Flow by following the line from the first ASP.NET node to the database – 6 SQLs in total). Because “Give me the Item Details” was extracted into a service, we now see 33 (!) individual service calls – one for each result item. Each service call seems to execute ~5 SQL statements to query the details for an item. Why 5 and not just 1? Because some of the common code base was also migrated to query a lot of context and configuration data each time the service is called. This implementation is now much worse than the classical N+1 Query Problem they started with, especially when you consider that making a service call requires the caller to open a physical connection to the service, marshal data and block threads while waiting for the remote call to finish. If these services get deployed to the cloud, you also need to factor in latency, bandwidth and the costs involved, as many cloud providers charge by transmitted data.
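Adding up the numbers from the Transaction Flow shows how the remote hops multiply (the per-call figures are the approximate ones from above):

```python
items = 33          # result items on the analyzed search page
frontend_sqls = 6   # SQLs executed directly from the ASP.NET frontend
sqls_per_call = 5   # approximate SQLs per "Item Details" service call

service_calls = items                              # one remote call per item
total_sqls = frontend_sqls + items * sqls_per_call # every call adds its own SQLs

print(service_calls, total_sqls)  # 33 service calls, 171 database round trips
```

And that is before accounting for connection setup, marshaling and blocked threads on each of those 33 remote hops.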
Now let me explain why I believe this happens and what developers, architects, testers and operations need to do to prevent it from happening!
Bad Architectural Decisions: But Why?
If you have additional reasons or disagree with my analysis, let me know. I know many of you out there have seen these situations and have a good understanding of why decisions were made that way. Here are my top reasons:
- What makes a Service a real Service that can be used in a SOA implementation? Core features such as a clean separation of concerns, proper state handling, marshaling and the ability to scale these services by adding more service instances. However, simply chopping up your codebase into smaller pieces and putting them into a service-context doesn’t give you this core functionality.
- Many architects do not consider the dependencies of existing code to shared resources and data. I’ve seen SOAs where each service called within a transaction had to query the same piece of configuration data separately from the database after getting rid of the shared configuration space that existed before.
- What used to be local method calls are now remote service calls. If you are lucky, the service is hosted on the same physical machine. But it might sit on a different virtualized box or even far away in a cloud instance on Amazon or Microsoft Azure. Now you are adding the impact of the network, latency and the transport protocol to the mix.
Top SOA Metrics to Identify Bad Architecture
If you have followed my previous blog posts you know that I am a big fan of performance and architectural metrics. If we look at the same Dynatrace Transaction Flow again, I can tell you which metrics I look at to identify bad SOA implementations early on. The good news is that we can do this already on a developer's machine during manual, automated and integration tests. No need to wait until you figure out the app is slow in production or during your first large-scale load test:
Now let me explain these metrics:
# of Service Calls: How many services do you call for a single use case, and how many times do you call the same service? Can you see the similarity with database access patterns? If you call the same service many times, this use case might be better supported by a single new service, e.g. “GetSearchResultsWithAllProductDetails” instead of a “GetProductDetails” that gets called for every product on the result page.
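The effect of such a coarser-grained service is easy to sketch. The service names below are the hypothetical ones from the paragraph above, and the counter stands in for real remote calls:

```python
remote_calls = 0  # stands in for the "# of Service Calls" metric

def get_product_details(product_id):
    """Fine-grained service: one remote call per product."""
    global remote_calls
    remote_calls += 1
    return {"id": product_id, "name": f"product-{product_id}"}

def get_search_results_with_all_product_details(product_ids):
    """Coarse-grained service: one remote call for the whole result page."""
    global remote_calls
    remote_calls += 1
    return [{"id": i, "name": f"product-{i}"} for i in product_ids]

ids = list(range(1, 34))  # 33 result items, as in the example above

# Chatty variant: one remote call per item.
remote_calls = 0
page = [get_product_details(i) for i in ids]
chatty_calls = remote_calls

# Batched variant: a single remote call for the same page.
remote_calls = 0
page = get_search_results_with_all_product_details(ids)
batched_calls = remote_calls

print(chatty_calls, batched_calls)  # 33 vs. 1
```

Both variants produce the same page; the batched one pays connection setup, marshaling and latency exactly once.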
Bytes Sent/Received: Calling remote services, from a coding perspective, has become as comfortable as calling a local method. But how much data do you actually transfer? Bytes, kilobytes or even megabytes for a single call? What does that mean if you move your service to Amazon, Azure or a different data center that is far away from the caller? Size immediately becomes a cost factor if you have to pay, e.g., Amazon for your data transfer volume. So – be smart and make sure you keep your transferred data slim.
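You can check this metric yourself before any bytes hit the wire. A quick sketch, assuming a hypothetical JSON response object: serialize the payload and measure it, once with the full record and once with only the fields the result page actually displays:

```python
import json

# Hypothetical response object: full record vs. the fields the page needs.
full_record = {
    "id": 42,
    "name": "product-42",
    "description": "x" * 2000,              # long text the page never displays
    "audit_log": ["created", "updated"] * 50,  # history nobody asked for
}
slim_record = {"id": full_record["id"], "name": full_record["name"]}

full_bytes = len(json.dumps(full_record).encode("utf-8"))
slim_bytes = len(json.dumps(slim_record).encode("utf-8"))

print(full_bytes, slim_bytes)  # the slim payload is a tiny fraction of the full one
```

Multiply the difference by 33 calls per page and by your page views per day, and the transfer-cost argument writes itself.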
# of Worker Threads: Here I look at both the caller and the callee side. Depending on whether you call these services asynchronously or synchronously, you will occupy your precious worker threads. Check out my blog on analyzing multi-threaded application problems.
# of SQL Calls: Ask yourself why you need to execute 500, 1000 or more SQL statements to query product details. A simple coding mistake? A configuration problem with Hibernate? Or maybe a good candidate for a stored procedure? Check out my Database Access Metrics blog for more details.
# of Same SQLs: If the same SQL is executed over and over again, you either have the classical N+1 Query Problem or a good candidate for static data that you should cache instead of retrieving it from the database every time you need it.
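Caching such static data is often a one-line change. A minimal sketch – the config key and the lookup function are made up for illustration – using Python's built-in `lru_cache` and counting the database hits:

```python
from functools import lru_cache

db_hits = 0

def load_config_from_db(key):
    """Stand-in for a SQL query against a configuration table."""
    global db_hits
    db_hits += 1
    return f"value-for-{key}"

@lru_cache(maxsize=None)
def get_config(key):
    """Static data: fetch once, then serve from the in-process cache."""
    return load_config_from_db(key)

# 33 service calls each asking for the same configuration entry.
values = [get_config("search.page_size") for _ in range(33)]

print(db_hits)  # 1 database round trip instead of 33
```

The trade-off is staleness: this only works for data that really is static for the lifetime of the process, which configuration data usually is.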
# of DB Connections: A common mistake is to execute each SQL query on a separate connection – losing the ability to reuse prepared statements. Don’t believe me? Check out this blog post: Top Performance Problems found in C#.
How to capture and identify good and bad patterns!
If you look at the examples above you understand that you don’t need a large load test or production load to find these problems. You don’t have to wait until the app is slow and your production monitoring system tells you it is slow. If you do your homework you can identify all these problems either on the local developer workstation, on your manual testing box or in your automated integration test suite executed by your Continuous Integration server. Here are some tips on how to do this:
Developers: use the profiling tools built into your IDE. If you are testing end-to-end scenarios where multiple services are involved, use one of the freely available performance monitoring tools such as Dynatrace. As you don’t want to check this manually all the time, I also suggest you hook up your integration tests with your profiling tools. If you can spot that the number of SQL statements just went up 10x due to a code change – simply by looking at the # of SQL statements executed by your unit test – you save a lot of time later on with debugging, profiling and fixing code! Check out my YouTube tutorial Dynatrace for Developers if you want to learn more.
Testers: For you it’s time to leave the comfort zone of your testing tools or your manual testing approach. Don’t be scared – it’s simple. If you are doing manual or automated tests, just install the same tools your developers use on the app you are testing. In addition to reporting “Functional OK” or “Functional FAILED” you can level up your reporting by saying: “Functional OK but Architectural FAILED”. All it takes is looking at, e.g., the Dynatrace Transaction Flow I’ve shown above and checking the highlighted values. This is an easy sanity check. Plus: once you have that data captured, it is easier to communicate these hard facts to your developers, as they trust technical monitoring tools that provide that level of detail for every single test you execute. Feel free to check out my Basic Diagnostics and Dynatrace for Load Testers YouTube tutorials.
Continuous Integration: If your developers and testers wrote unit, integration or functional tests, you can easily integrate these test executions with the same tools mentioned above. The key here is to pick tools that can automatically analyze these measures for every single build and test execution and automatically detect regressions. You don’t want to sit there and manually analyze the metrics of hundreds or thousands of tests per build. Check out our Dynatrace with Jenkins integration to automate regression detection in your CI. If you use other build servers, check out my Integration of Dynatrace in your Infrastructure YouTube tutorial.
Ops Team: Truth is, we can’t test every single scenario we face in production. Therefore you need to monitor your production environment and check how many service calls are really happening. You want to keep an eye specifically on transferred data and the response times of these services. When you deploy to a cloud provider this becomes even more critical, as you then have all the data to report the actual costs of these services for individual applications and features. The business side will love you for this data. And engineering will hopefully like you as well if you show them how the app really behaves in the real world. Here is a nice dashboard one of our users just shared with me – especially interesting are the consumed resources such as CPU and threads, but also the impact on failure rate and response time:
Which SOA mistakes have you encountered?
Now – this was my view of the world: my explanation of why I think we see SOA infrastructures like the one highlighted above, and my ideas for preventing these problems right from the beginning. I often get challenged on my statements – especially when speaking in front of practitioners at meetups or conferences. Feel free to challenge me as well – I am always eager to learn.
If you want to look at these metrics on your own machine, feel free to sign up for the Dynatrace Free Trial. You get 30 days to monitor 5 different machines – after that it stays free for use on your local machine, which makes it a perfect tool for developers and testers to keep.