Sunday, February 05, 2012

Scala Talk

The slide deck from our Scala talk earlier this week in Bern is available here: http://guild42.ch/2012-01-30/presentation.pdf

Tuesday, April 05, 2011

ContainerLess

I've recently worked on a project where we built a relatively small web application for a client. There was a hard release date, which gave us less than 4 weeks to build and deliver the application. There were several things we did to ensure that we would be able to release on time. One of them was making sure very early that we could deploy the new application into a production-like environment. During the first few days, we therefore invested time in building infrastructure to deploy the application automatically and continuously (i.e. after each build) to a production-like environment. Going through this process was useful in several different ways:

  • it made us think early in the project about what would be required to run the application in production (e.g. operating system, software, ports, certificates, etc.). This gave us sufficient time to react to unexpected obstacles;

  • by automating the deployment process early, we reduced the number of times we had to do this activity manually later on;

  • seeing the application work in a production-like environment was encouraging feedback that we would be able to release to production whenever our client required;

  • deploying the application while it was still very simple allowed us to focus on the deployment process without getting distracted by problems caused by the application.

Ditching the Container

When we first tried to deploy the application, it was not much more than a Hello World web application. We decided to use Jetty as a web server, since we had used it many times on projects before - both standalone and embedded. Whenever we used Jetty in the past, however, we would normally build a WAR file or a WAR-like directory structure, write a web deployment descriptor and finally deploy or load the artefact into Jetty.

Having spent a bit of time working on a Ruby on Rails application recently, the deployment process used in the past seemed cumbersome and complicated - even more so given the simplicity of the application at hand. That's when we realised that what we cared most about was the application itself and to a much lesser extent the container. In fact, we didn't care about the container at all. All we cared about was that our application would be able to serve incoming requests. At some point, we started to think of the application more as a self-contained program, which just happened to expose some of its functionality via HTTP. The diagram below illustrates this mind-shift.

Consequences

The biggest change was the mind-shift. It took some time to get used to the idea that the application is no longer "contained" in something else, but that it can be "self-contained". But there were also a number of other practical consequences that this change entailed:

  • simplified development process. It was very easy and fast for developers to run the application in the same way as it would run in production;

  • simplified testing. It was very easy to start and stop the application for automated functional tests;

  • simplified packaging and deployment process (e.g. no web.xml, no war file, no prescribed directory structure, etc.);

  • simplified application start (i.e. simply execute a Java program);

  • started using standard DI mechanisms to wire up Jetty as a dependency and injected it where needed.*

* Strictly speaking, you don't need to ditch the container to do this but we felt that the mind-shift we had gone through enabled us to see the potential for doing this.
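
To make the idea of a self-contained application a bit more concrete, here is a minimal sketch. It is written in Python with the standard library's http.server rather than Java and Jetty, so it is not the code from the project, and the names Application, HelloHandler and fetch are made up for illustration. The only point is that the program creates and owns its HTTP server, so starting the application means running a program, and a functional test can start and stop it in-process.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a fixed greeting."""

    def do_GET(self):
        body = b"Hello World"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress per-request logging so test output stays quiet


class Application:
    """The application owns its HTTP server instead of being deployed into one."""

    def __init__(self, port=0):
        # Port 0 asks the operating system for any free port (handy in tests).
        self._server = HTTPServer(("127.0.0.1", port), HelloHandler)
        self._thread = threading.Thread(
            target=self._server.serve_forever, daemon=True
        )

    @property
    def port(self):
        return self._server.server_address[1]

    def start(self):
        self._thread.start()

    def stop(self):
        # Assumes start() was called; shutdown() waits for serve_forever to exit.
        self._server.shutdown()
        self._server.server_close()
        self._thread.join()


def fetch(port, path="/"):
    """Issue a GET against the locally running application."""
    connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        connection.request("GET", path)
        response = connection.getresponse()
        return response.status, response.read().decode()
    finally:
        connection.close()
```

A functional test would create an Application, call start(), issue requests with fetch(app.port) and call stop() when it is done. In production, the same class would simply be started from the program's entry point with a fixed port.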

Conclusion

Making this architectural change simplified our development and deployment process. It also simplified the way we thought about the application. The only price we paid was losing the ability to deploy the application to a different web server without making changes. Fortunately, this price was purely hypothetical because we had neither the need nor the desire to use a different web server in production. On the positive side, we were now hiding the fact that we were relying on Java Servlets to realise our web functionality. In fact, seeing what some of the other communities are building (e.g. Sinatra, Node.js, etc.), I'd be tempted to try to write a web application that doesn't use Servlets at all. Partly for fun, partly to overcome some of the limitations inherent in the Servlet model.

Thursday, September 02, 2010

Agile Talk in Bern

Next Monday (September 6th), I'm giving a talk in Bern. I'll be talking about how our software delivery teams around the globe work and get stuff done.

The talk will start at 6pm at Restaurant Schmiedstube. More info and registration via Guild42.

Come along if you're in the area!

Sunday, April 11, 2010

Using Spring Java Config to wire up Dependencies

The following tests demonstrate how, in Spring, you can configure your beans explicitly using Java. I wasn't sure if the internal method calls would result in two instantiations of the dependency or not. Clearly, it works as expected, i.e. only one instance is created because the default mode for instantiating beans is singleton scope. The magic is called cglib.






Declarative Replacement of Annotation-Driven Bean Definitions in Spring

If you are using Spring 3 and configure your dependencies explicitly in a @Configuration class, it's fairly easy to replace one dependency with another. All you need to do is declare two dependencies with the same bean name in two separate config classes. The beans of the config class that is registered last with the ApplicationContext will simply override the beans defined in previously registered config classes.

Things get a bit more tricky, however, when you try to override a bean that has been wired up automatically using annotations. In this case, simply registering a config class that contains a bean with the same name as the annotated dependency doesn't yield the desired effect. It's quite possible that using a JavaConfigApplicationContext would solve the issue, but as long as JavaConfig integration with the latest Spring 3 is flaky, this doesn't seem to be an option. In the meantime, here's an example that demonstrates a solution to achieve the desired behaviour:




Friday, November 13, 2009

Lessons Learned (Part 2: Performance Testing and Garbage Collection)

This is a continuation of my previous blog post. The goal is to sum up some lessons that I've learned during the last couple of months while I was involved in performance tuning a large-scale distributed web-app.

Golden Rules

Although there are probably many more rules and good advice out there, these are the ones that I remember off the top of my head as being important:

Change one thing at a time
I often found myself very tempted to violate this rule. The problem with breaking it, however, is easily illustrated with an example: Imagine that you make two changes, c1 and c2, at the same time. If c1 results in a performance improvement of 20% and c2 in a performance penalty of 30%, you'll measure an overall performance deterioration of roughly 10%. Consequently, you'll decide not to implement either of the changes, even though c1 on its own would have resulted in better performance.
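
The arithmetic behind this example can be sketched in a few lines. The function below is hypothetical and assumes that each change scales throughput multiplicatively; under that assumption the combined effect of +20% and -30% is about -16% rather than -10% (the 10% figure comes from simply adding the two percentages). Either way, the single combined measurement is negative and hides the fact that c1 helped.

```python
def combined_change(*changes):
    """Net relative change when each change scales throughput multiplicatively.

    Each argument is a relative change, e.g. 0.20 for +20% or -0.30 for -30%.
    """
    factor = 1.0
    for change in changes:
        factor *= 1.0 + change
    return factor - 1.0


# Measured together, the two changes look like a clear regression ...
both = combined_change(0.20, -0.30)

# ... while c1 measured on its own is an improvement worth keeping.
c1_only = combined_change(0.20)
```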

Look at the system as a whole and fix the slowest running part
Even if you can make some part of the system thousands of times faster, it will not affect your application performance if the part you changed was not your primary bottleneck. For example, it doesn't make sense to optimise application code if the bottleneck is the result of a slow running database query. I'd even go as far as saying that it is harmful to optimise parts of the system when it's not needed. Firstly, it's a waste of time that could be used for tasks that provide more value. Secondly, making performance optimisations often introduces additional complexity at the code level. If you can't justify this extra complexity with a significant performance boost, don't do it. Of course, I'm not advocating against common sense and sound software design principles. For example, I know that making lots of fine-grained RPC calls is a bad idea, so I'll avoid it in the first place.
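
One way to sanity-check this rule is Amdahl's law, which relates the speed-up of one part of a system to the speed-up of the whole. The sketch below is only an illustration; the function name and the numbers in the usage note are invented and not taken from the project.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law.

    fraction: share of total execution time spent in the part being optimised
              (between 0.0 and 1.0).
    local_speedup: how many times faster that part becomes.
    Returns how many times faster the system as a whole becomes.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)
```

Making a piece of application code that accounts for 5% of the time 1000 times faster (overall_speedup(0.05, 1000)) gives an overall speed-up of only about 1.05. Merely halving the time of a database query that accounts for 70% of the time (overall_speedup(0.70, 2)) gives about 1.54.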

Don't optimise prematurely, i.e. without measuring
This one probably goes hand in hand with the rule above. Don't optimise unless you can prove that it will have an effect on overall system performance. Again, this rule is not an excuse for not using sound software design principles.

Performance Testing Cycle

Keeping the above rules in mind, we continuously iterated through the following cycle:
  1. Measure performance
  2. Identify single bottleneck (i.e. pick lowest hanging fruit)
  3. Fix single bottleneck
  4. Verify performance has improved
Once step 4 is complete, the cycle restarts. Sometimes, we would loop through this cycle several times a day. Other times, one loop would take us several days or even weeks. This process essentially continued until our release target was reached.
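
As a rough sketch, the cycle can be written down as a loop. Everything here is hypothetical: measure stands for a complete performance test run that returns a throughput figure, and each candidate fix is a (name, apply, revert) tuple. In reality, identifying the next bottleneck was the hard, manual part, which this sketch reduces to "take the next candidate from the list".

```python
def tuning_cycle(measure, candidate_fixes, target):
    """Apply fixes one at a time and keep only those that measurably help.

    measure: callable returning the current throughput (higher is better).
    candidate_fixes: list of (name, apply, revert) tuples, most promising first.
    target: throughput at which tuning stops.
    Returns the final throughput and the names of the fixes that were kept.
    """
    best = measure()                    # 1. measure performance
    kept = []
    for name, apply, revert in candidate_fixes:
        if best >= target:              # release target reached, stop tuning
            break
        apply()                         # 2. + 3. pick and fix a single bottleneck
        result = measure()              # 4. verify performance has improved
        if result > best:
            best = result
            kept.append(name)
        else:
            revert()                    # no measurable gain: undo the change
    return best, kept
```

Because only one change is in flight between two measurements, a harmful change cannot mask a helpful one.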

Measuring Performance

We used JMeter to generate load against the application under test. We set up the tests so that the generated load would increase over time and therefore put the application increasingly under more stress. While running the tests, we measured a number of parameters. The most important ones were throughput, average response time and CPU utilisation.

Looking at charts similar to the ones shown below, we got a fairly good understanding of how much load the application under test could handle.

In the above charts, for example, you can see that, at some point, application throughput reaches a plateau while the average response time per transaction continues to grow. At this point, the application reached some physical or logical limit that prevented it from doing more work. The challenge, of course, is to find out what those constraints are in order to increase throughput or reduce response times.
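
We read the plateau off the charts by eye, but the same idea can be sketched in code. The function below is a hypothetical helper, not something we actually used: given (load, throughput) samples ordered by increasing load, it returns the load level after which throughput no longer grows by at least min_gain (relative) from one sample to the next. The threshold of 5% is an arbitrary assumption.

```python
def find_plateau(samples, min_gain=0.05):
    """Return the load at which throughput stops growing, or None.

    samples: list of (load, throughput) tuples, ordered by increasing load.
    min_gain: minimum relative throughput gain between two consecutive
              samples for the system to count as "still scaling".
    """
    for (load, throughput), (_, next_throughput) in zip(samples, samples[1:]):
        if throughput > 0 and (next_throughput - throughput) / throughput < min_gain:
            return load
    return None


# Invented sample data: throughput flattens out after a load of 30.
samples = [(10, 100), (20, 198), (30, 290), (40, 300), (50, 302)]
plateau_load = find_plateau(samples)
```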

Identifying Bottlenecks

Bottlenecks created by hardware constraints are normally quite easy to identify. Usually, the symptoms are maximum CPU utilisation, reaching network bandwidth limits, etc. The solution is often to change and restructure application code. Identifying bottlenecks not directly created by hardware constraints is more difficult. Likely causes are slow running external systems, resource starvation, suboptimal configuration settings, etc.

In the last project, we eliminated the hypothesis that slow running external systems were constraining our system quite early by taking them out of the equation completely and using stub implementations instead. At the same time, this made our performance tests much faster, more robust and more reliable.

Fixing Bottlenecks

In the first few weeks of our performance tuning initiative, we made quite a lot of progress. There were a large number of easily identifiable bottlenecks which were relatively trivial to fix. These included simple programming errors, unnecessary database calls, unnecessary network calls, slow running SQL queries, no caching where data was easily cacheable, concurrency issues, etc.

After some time, however, it started to get more difficult to identify bottlenecks. In particular, there was one case that I think is worth writing about.

Garbage Collection
We had already spent several weeks trying to identify a bottleneck, which was not obviously caused by hardware constraints. Here are the things we noticed:
  • Throughput reached a plateau at point t
  • Response time grew significantly at the same point t
  • Hardware was far from being exhausted. CPU utilisation, for example, was about 60% at point t
  • Although total CPU utilisation was around 60%, one (of eight) cores was maxing out occasionally
The last point suggested that there was probably a CPU-intensive task executing in a single thread, and hence on a single core. One such task that we could think of was garbage collection. We verified this using Perfmon and found that GC was indeed taking up a large amount of processing time (up to 30%).

As a result, we did some reading on how .NET GC works. We learned that, by default, the GC is optimised for standalone apps running on single-core machines (called Workstation GC). On multiprocessor machines, however, there is an additional GC mode available (called Server GC). The difference between the two is basically that the latter creates a separate GC heap and GC thread for each processor and that collection occurs in parallel. Here's the change we made to our configuration:

<configuration>
  <runtime>
    <gcServer enabled="true" />
  </runtime>
</configuration>

After making the above configuration change, the throughput of our application increased by almost a factor of 3! At the same time, we were again reaching 100% CPU utilisation and average GC time was down to 2-3%.

Of course, this dramatic change meant that we were dealing with a completely new application profile. Consequently, we restarted the iterative cycle described above from the beginning in order to find the next bottleneck.

Conclusion

The fundamental prerequisite for doing effective performance tuning is to have a set of repeatable and reliable performance tests. Ideally, these tests are easy to execute, finish in a reasonable amount of time and give you rapid feedback on how the application is performing. Also, you'll need an isolated environment, which allows you to deploy new versions of the application easily and frequently. This gives you a good platform to experiment with changes. Measuring the effect of each change on overall application performance then gives you the ability to make informed choices.

RSpec and TextMate

Just struggled to get the RSpec bundle working with TextMate. After installing the bundle as described on the RSpec site, I kept getting the following error message:

Library/Application Support/TextMate/Bundles/RSpec.tmbundle/Support/lib/spec/mate.rb:4:in `join': can't convert nil into String (TypeError)

After a few hours of fruitless searching on the web, it occurred to me that I should maybe have a look at the file listed in the error message above (mate.rb:4).

File.join(ENV['TM_PROJECT_DIRECTORY'],'vendor','plugins','rspec','lib')

The bundle basically failed, because the TM_PROJECT_DIRECTORY variable was not set.

So, instead of trying to run the file in standalone mode, I pulled the spec file into a new project (⌃⌘N) and ran it again.

Et voilà. No errors this time round. ...and thanks so much for the helpful error message, RSpec-TextMate-bundle! *grrr*

Thursday, October 29, 2009

Lessons Learned (Part 1: Remembering Waldo)

Intro

I've spent the last couple of months trying to help improve performance and scalability of a large web-based system. Initially, the application could barely handle more than a handful of concurrent users, which was far away from the launch target of several million users per day. This will probably be the first in a series of posts, in which I'd like to talk about some of the more interesting challenges we've faced.

A bit of context first: The application was written in C#. The main technologies involved were WCF, MSMQ and SQLServer. Roughly speaking, the app consisted of a presentation tier (IIS/MVC), a business logic tier (WCF) and a data tier (SQLServer). Application logic was exposed through a large number of WCF service endpoints. Each service endpoint, in turn, exposed a similarly large number of fairly fine-grained service operations. Essentially, there were three groups of clients that consumed the exposed services: the presentation tier, other WCF services inside the business logic tier and a number of mobile clients (which I won't talk about here).

Intra-Tier Communication and Horizontal Scaling

Inside the business logic tier, there was quite a lot of communication going on between the individual WCF services. Initially, many of these calls were routed through the WCF stack. The rationale behind this initial design decision was that - if needed at a later stage - some services could be run individually on separate machines.

It seemed unlikely, however, that this would ever happen. In the meantime, the practical consequence of this design was that services communicated with each other via network calls, even though they were running inside the same process. And even if the WCF service layer were partitioned and distributed onto separate machines, what would happen if this still didn't give us the desired performance? Imagine you've got three WCF services: s1, s2 and s3. Assuming that s3 is the most hardware-hungry one, we could deploy s1 and s2 together on one machine and s3 separately on a dedicated machine. What happens, though, if the hardware on which s3 is running is still insufficient? At this point, we could start scaling the service out horizontally by adding a load balancer and more machines, each running a copy of s3. So, if we probably need to scale out horizontally at some stage anyway, what's the point of adding the overhead and complexity of network calls between services if they can run in the same address space? To emphasise this point, we measured how many WCF service calls we could make in a given period (using the net.tcp binding) and compared this against making direct in-memory calls to the same service instance. The not unexpected result: throughput for the latter was about 350 times higher. Consequently, we went through the codebase and replaced WCF service calls with normal method invocations wherever possible.
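
The kind of measurement described above is easy to reproduce in miniature. The following sketch is not our WCF benchmark: it is a Python stand-in that compares calling a function directly with calling the same function over a loopback TCP connection, using a made-up line-based protocol. The names lookup_price, serve, RemoteClient and calls_per_second are all hypothetical, and the absolute numbers will depend entirely on the machine.

```python
import socket
import threading
import time


def lookup_price(item_id):
    """Hypothetical service operation; stands in for any fine-grained call."""
    return item_id * 2


def serve(listener):
    """Answer one client: read one integer per line, reply with the result."""
    conn, _ = listener.accept()
    with conn, conn.makefile("rb") as reader, conn.makefile("wb") as writer:
        for line in reader:
            writer.write(b"%d\n" % lookup_price(int(line)))
            writer.flush()


def start_server():
    """Start the toy service on a free loopback port in a background thread."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    threading.Thread(target=serve, args=(listener,), daemon=True).start()
    return listener


class RemoteClient:
    """Calls lookup_price over a loopback TCP connection."""

    def __init__(self, port):
        self._sock = socket.create_connection(("127.0.0.1", port))
        self._reader = self._sock.makefile("rb")
        self._writer = self._sock.makefile("wb")

    def lookup_price(self, item_id):
        self._writer.write(b"%d\n" % item_id)
        self._writer.flush()
        return int(self._reader.readline())

    def close(self):
        self._reader.close()
        self._writer.close()
        self._sock.close()


def calls_per_second(call, duration=0.2):
    """Invoke call(7) repeatedly for `duration` seconds and return the rate."""
    count = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration:
        call(7)
        count += 1
    return count / (time.perf_counter() - start)
```

To compare the two, start the server with start_server(), connect a RemoteClient to listener.getsockname()[1], and pass first lookup_price and then client.lookup_price to calls_per_second. Even on loopback, with no real network involved, I would expect the direct call to win by a wide margin.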

In a recent email conversation, my colleague Martin Fowler drew an interesting analogy: "It's interesting that there continues to be this desire to distribute different functionality onto different nodes in the name of scalability when often the better route is to put all nodes in the same process and cluster the resulting app. This was exactly the wrong thought that distributed objects suffered from."

Inter-Tier Communication and Horizontal Scaling

The presentation tier was physically separated from the WCF services running in the business logic tier. Consequently, communication between the two tiers had to happen over the network.

Let's go back in time a little. Back in 1994, Waldo et al. wrote their excellent seminal paper called "A Note on Distributed Computing". In it, they argue that there are fundamental differences between in-process and inter-process calls in terms of latency, concurrency, partial failure scenarios etc. In the past, RPC systems have tried to abstract these differences away and make developers believe that there's no difference between calling an object in the same memory space and executing a procedure on a remote machine.

Don't get me wrong, I think that WCF is actually a pretty cool platform but, unfortunately, it also encourages people to continue building RPC apps in cases where other solutions might be more favorable. In fact, it makes it horribly easy to take a bunch of classes and expose them as remote objects. Calling them services doesn't mean that your application has now magically become service-oriented. Also, it doesn't change the fact that these now-called services are still remote objects including all the flaws that Waldo talked about.

Indeed, we've had to fix a lot of code where developers happily looped over hundreds or thousands of items in order to retrieve some data, unaware that in each iteration they were making a network call. It goes without saying that this had a significant impact on the performance of the system. After removing all unnecessary calls, we measured the overhead incurred by network communication again. On average, still more than 30% of our total service execution time was network overhead (i.e. serialisation, WCF, TCP, network latency...).
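
The pattern we kept finding, and the usual fix, look roughly like this. The sketch is in Python rather than C#, and CountingService is a made-up stand-in for a remote service: instead of going over the network, it just counts how many round trips a real client would have made.

```python
class CountingService:
    """Hypothetical remote service that counts round trips instead of making them."""

    def __init__(self, data):
        self._data = data
        self.round_trips = 0

    def get_item(self, item_id):
        """Fine-grained operation: one round trip per item."""
        self.round_trips += 1
        return self._data[item_id]

    def get_items(self, item_ids):
        """Coarse-grained operation: one round trip for the whole batch."""
        self.round_trips += 1
        return [self._data[item_id] for item_id in item_ids]


def load_chatty(service, item_ids):
    """Looks harmless, but hides a network call in every iteration."""
    return [service.get_item(item_id) for item_id in item_ids]


def load_batched(service, item_ids):
    """Same result with a single call."""
    return service.get_items(item_ids)
```

For a thousand items, load_chatty pays the network overhead a thousand times and load_batched pays it once, which is why the service interface needs a coarse-grained operation in the first place.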

Looking at the chatty WCF service interfaces and the tight coupling with the code in the presentation tier, it occurred to me that, in actual fact, we were not really building a distributed application but we were distributing an essentially monolithic application. Unfortunately, we didn't actually manage to change this one. I'm convinced, though, that a better way is to deploy two tightly coupled tiers together in the same process and then, again, scale out horizontally.

Monday, October 12, 2009

Remove Unversioned Files in Subversion

When using Subversion, I find myself in need of starting with a clean working copy more often than I'd like to admit. A normal (recursive) revert, however, only undoes changes to files that are already under version control. Files that have never been added to version control are left untouched and have to be removed separately.

In order to remove all unversioned files from your working copy in one go, you can execute the following command. Think twice before using it, though ;)

svn st | grep '^?' | sed 's/^? *//' | tr '\n' '\0' | xargs -0 rm -r
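
If you'd rather look at the list of victims before deleting anything, the same idea can be split into two steps. This is a sketch in Python, not a replacement for the one-liner: unversioned_paths and remove_paths are made-up helper names, and the parsing makes the same assumption as the sed expression, namely that an unversioned entry is a line starting with "?" followed by spaces and then the path.

```python
import os
import shutil


def unversioned_paths(status_output):
    """Extract the paths that `svn status` output marks with '?'."""
    paths = []
    for line in status_output.splitlines():
        if line.startswith("?"):
            # Drop the '?' marker and the padding spaces in front of the path.
            paths.append(line[1:].lstrip(" "))
    return paths


def remove_paths(paths, root="."):
    """Delete the given files and directories below root."""
    for path in paths:
        full_path = os.path.join(root, path)
        if os.path.isdir(full_path) and not os.path.islink(full_path):
            shutil.rmtree(full_path)
        else:
            os.remove(full_path)
```

The text passed to unversioned_paths would come from running svn st in the working copy, for example via subprocess. Print the resulting list, check it, and only then hand it to remove_paths.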