Friday, May 9, 2014

Devops, Cloud & Co.

Current Trends in Software Development and System Operations

Based on a talk I gave on November 6th, 2013.

The world of software development and system operations is changing drastically. While software used to be a product that was finished and delivered to a customer, it is now as often as not a continuous service provided to that customer. This concept of Software as a Service affects how developers and operators work.

DevOps

In the good old days of software as a product, software went through simple life cycles. A group of developers wrote a piece of software, tested it, and deemed it ready. Then a completely different group, operations, picked up this ready software, deployed it to a server, and maintained it there. If the software needed changes that were not in the original spec, the cycle began from the start. There was usually a very strict division between development and operations.

This model is still in use for infrastructure software like web servers or databases, but when software is written as a service rather than as a product, this separation of concerns breaks down. Software written as a service can be tailored to a specific deployment environment and also needs to be adapted to customer requests much more quickly.

That means developers can target and test in a specific deployment environment, and system operators need to deploy new versions much faster. Because of this, the two groups need to interact a lot more than in the traditional model, muddling any kind of clear separation there ever was.

This combination of Development and Operations for Software as a Service is called DevOps.

Developing for a Specific Environment

Automated testing makes it possible to check many aspects of a piece of software without the repetitive work this would require from a human tester. This starts with unit tests and goes up to functional and system tests, which exercise a whole software package. In continuous integration, these automated tests are run on every push to the revision control system. This ensures continuous and immediate feedback, resulting in a working application at almost every point in time.
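
To make this concrete, here is a deliberately tiny unit test in Python. The parse_price function exists only for this illustration, but the pattern is the same for real code, and a continuous integration server would run a whole suite of such tests on every push:

    import unittest

    def parse_price(text):
        """Tiny example function under test: "3.50 EUR" -> (3.5, "EUR")."""
        amount, currency = text.split()
        return float(amount), currency

    class ParsePriceTest(unittest.TestCase):
        def test_parses_amount_and_currency(self):
            self.assertEqual(parse_price("3.50 EUR"), (3.5, "EUR"))

        def test_rejects_garbage(self):
            # Malformed input should fail loudly rather than silently.
            with self.assertRaises(ValueError):
                parse_price("garbage")

    if __name__ == "__main__":
        unittest.main()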

When software is written for a specific environment, these automated tests can be run in that environment, and they can be written to tackle the specific requirements of that environment. Targeting a specific deployment environment also means that all the environments the software passes through, from development via testing and staging to production, can be kept as close to each other as possible to reduce the number of bugs introduced by environmental effects.

This results in a large number of setups that should be as similar to each other as possible. Development systems alone are numerous: every developer needs their own environment so they don't step on each other's toes, and these environments tend to break regularly and need to be set up again.

Recent advances in virtualization make this possible. All systems can start from an identical base image and be configured to the requirements of the target environment. To make it easier to configure a large number of machines in a reproducible way, the configuration of servers is automated as well: configuration management tools use a configuration description to set up any number of servers in exactly the same way.
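
Real tools like Puppet, Chef or Ansible are of course far more capable, but the core idea can be sketched in a few lines of Python; the Server class and its ensure method here are made up purely for illustration:

    # Declarative description of the desired state of a web server.
    WEBSERVER = {
        "packages": ["nginx", "ntp"],
        "files":    {"/etc/nginx/nginx.conf": "nginx.conf.template"},
        "services": ["nginx"],
    }

    class Server:
        """Toy stand-in for a managed machine; real tools talk to actual hosts."""
        def __init__(self, name):
            self.name = name

        def ensure(self, action, item):
            # Idempotent by design: applying the same description twice
            # leaves the server in the same state.
            print(f"{self.name}: ensure {action}: {item}")

    def apply(description, server):
        for package in description["packages"]:
            server.ensure("package installed", package)
        for path, template in description["files"].items():
            server.ensure("file rendered from " + template, path)
        for service in description["services"]:
            server.ensure("service running", service)

    # One description, any number of identically configured servers.
    for name in ["dev-1", "test-1", "staging-1", "prod-1", "prod-2"]:
        apply(WEBSERVER, Server(name))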

Using this, development, testing, staging and production all run on servers that look almost identical, reducing possible problems due to configuration and environmental differences. Continuous integration means that the software on the central repository is constantly checked for problems, both making it more likely to find problems before deployment even begins, and also helping to make sure that there is a deployment-ready version of the software at all times.

Deploying Releases Faster

All software has bugs, though. When a bug is found, developers will find a fix and commit it to their source repository. The next goal is to reduce the time between when a bug fix is committed and when that fix is deployed to production.

Just to give a bit of perspective here: in a typical Software as a Service environment, it is not unusual to aim for two to three releases a day, and to be able to handle ten releases a day if necessary. This is drastically different from the turnaround time of classic software development.

Again, automation is key here. The same way a code push triggers a test run in continuous integration, a successful test run can trigger a deploy of the software to a staging system where acceptance tests are run, and if they succeed, the system can automatically deploy the software to production. If required, there can also be a manual step where a decision maker can take a last look, but even then the manual work is kept to a minimum, with deployment being fully automated.

The full cycle for a bug fix then becomes a sequence of automated steps. The software is checked in by development, which triggers a test run on the continuous integration server. If this fails, development gets immediate feedback and can try again. If the tests pass, the software is automatically deployed to staging and acceptance tests are run. Again, if they fail, development gets immediate feedback. And if they pass, the software is automatically deployed to production.
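
A minimal sketch of such a pipeline script, assuming placeholder make targets and a deploy.sh that stand in for whatever your project actually uses, could look like this:

    import subprocess
    import sys

    def step(name, command):
        """Run one stage of the pipeline; stop and report on the first failure."""
        print(f"==> {name}")
        if subprocess.run(command, shell=True).returncode != 0:
            print(f"FAILED: {name} -- feedback goes straight back to development")
            sys.exit(1)

    # Triggered automatically by a push to the central repository.
    step("unit and system tests", "make test")               # placeholder commands:
    step("deploy to staging",     "./deploy.sh staging")     # substitute whatever
    step("acceptance tests",      "make acceptance-test")    # your project actually
    step("deploy to production",  "./deploy.sh production")  # uses for these stages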

This reduces the time between a bug fix being written and being deployed to little more than the time the tests need to run, without requiring any manual work from operations to make it happen. Operations is then tasked not with deploying software, but with keeping the infrastructure that enables this development model running.

Cloud

Optimizing software for this work pattern changes the way software is designed.

To make full use of continuous integration and delivery, every application gets its own, well-defined environment, which stays the same between development, testing, staging and production. This allows applications to be broken down further into small components that each have their own specific environment. These components communicate with each other via a well-defined interface, usually via sockets. If you have a web server that talks via sockets to a cache on one side and to an application server on the other, and the application server in turn talks to a database via sockets, you can switch out each component independently of the others. This decoupling encourages replacing monolithic applications with smaller pieces of software.
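
As a toy illustration of such a socket contract, here is a little "application server" that just echoes requests back; it is not any particular product, but it shows that the component in front of it only needs to know the host, the port and the protocol spoken over them:

    import socket

    # A toy "application server": listens on a socket and answers requests.
    # The web server in front of it only needs host, port and protocol;
    # either side can be swapped out as long as this small contract is kept.
    def serve(host="127.0.0.1", port=8081):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
            srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv.bind((host, port))
            srv.listen()
            while True:
                conn, _ = srv.accept()
                with conn:
                    request = conn.recv(1024).decode()
                    conn.sendall(f"handled: {request}".encode())

    if __name__ == "__main__":
        serve()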

Once software is decoupled in this way, the components become a powerful tool for scaling.

New features can be added by introducing further process types, not only by adding features to a single process. More workload can then be handled by starting multiple instances of the same process type. Better still, running the same process multiple times gives the whole product fault tolerance, as a single process dying does not result in a complete loss of functionality.
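
As a small sketch of this process-type idea (the worker and its "jobs" queue name are made up for illustration), scaling out is nothing more than starting additional identical processes:

    from multiprocessing import Process
    import os
    import time

    def worker(queue_name):
        """One instance of a 'worker' process type; any number can run side by side."""
        while True:
            print(f"[pid {os.getpid()}] handling work from '{queue_name}'")
            time.sleep(1)

    if __name__ == "__main__":
        # More load? Start more instances of the same process type.
        # If one instance dies, the remaining ones keep the product working.
        workers = [Process(target=worker, args=("jobs",)) for _ in range(4)]
        for p in workers:
            p.start()
        for p in workers:
            p.join()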

As software is written to consist of a large number of independent processes, virtualization allows multiple independent servers to run on the same hardware, and configuration management makes it possible to configure a large number of servers, it's easy to see how you can end up with a large number of servers that are each dedicated to specific application processes.

This creates interesting possibilities.

Servers no longer need in-place software upgrades, with all the problems and side effects those can have. Instead, operations can simply install a server with the new OS version and the same configuration, test it, and, if it works, just shut down the server with the old version.

Likewise, if a server exhibits a weird bug, instead of spending a long time trying to track it down, operations can simply start up a new server with the same configuration, test it, and replace the broken server with it.

And of course, this allows the scaling people often associate with cloud services. Your server has a high load? Well, run a few more servers of the same type. This can be scheduled, too. If you know that your users come home at 5 p.m. and start using your services, you can simply spool up a few load handlers at 4:30. At 11 p.m., when your users go to bed, you can shut them down again. Or you can do this responsively: instead of having Nagios call you at 3 a.m. because your main server is dying, the system can simply start a few more instances automatically and inform you of it in the morning. A handful of instances for a few hours costs very little money, and is certainly cheaper (and more humane) than waking up a specialist in the middle of the night.
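
A rough sketch of such a responsive scaling loop; the load and instance helpers here are toy stand-ins for whatever your monitoring system and cloud provider actually offer:

    import random
    import time

    # Toy stand-ins for the real monitoring and cloud provider APIs.
    instances = ["load-handler-1"]

    def current_load():
        return random.random()                  # pretend fleet load between 0 and 1

    def start_instance():
        instances.append(f"load-handler-{len(instances) + 1}")
        print("started", instances[-1])

    def stop_instance():
        print("stopped", instances.pop())

    HIGH, LOW = 0.8, 0.2                        # scale-up / scale-down thresholds

    def autoscale(interval=60):
        """React to load automatically instead of waking an operator at 3 a.m."""
        while True:
            load = current_load()
            if load > HIGH:
                start_instance()
            elif load < LOW and len(instances) > 1:
                stop_instance()
            time.sleep(interval)

    if __name__ == "__main__":
        autoscale(interval=1)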

Or, to get really wild, you can make this proactive. You can have a program monitor large news sites like Hacker News or Slashdot, and if it finds your site mentioned there, simply spool up a few more servers even before the hits start coming.
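
A sketch of that, too; example.com stands in for your own domain, and the actual scale-up call is left as a placeholder:

    import time
    import urllib.request

    SITE = "example.com"                        # your own domain goes here
    FRONT_PAGE = "https://news.ycombinator.com/"

    def mentioned_on_front_page():
        # Fetch the front page and do a simple check for a mention of the site.
        html = urllib.request.urlopen(FRONT_PAGE, timeout=10).read()
        return SITE in html.decode("utf-8", "replace")

    while True:
        if mentioned_on_front_page():
            print("front-page mention detected -- spool up extra servers now")
            # e.g. start_instance() from the sketch above, or your provider's API
        time.sleep(300)                         # check every five minutes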

All of this requires statistics and monitoring, two more topics that are very important in the devops world and really would deserve their own blog posts.

Summary

Software as a Service breaks down the separation between development and operations, leading to DevOps teams. These teams utilize the specific target environment to improve the quality of continuous integration to find more problems before deployment. They also utilize heavy automation with continuous delivery to decrease the time between fixing a bug and deploying the fixed software to production.

The desire to test software in comparable environments and the availability of virtualization options lead to small components with well-defined interfaces, running on dedicated servers to decrease interdependence, set up by configuration management tools to ensure reproducibility.

And this in turn leads to a situation where applications consist of small, decoupled components that can easily be run in parallel, which makes it possible to scale the application with load while providing fault tolerance at the same time.

And that’s why devops, cloud & co. is actually quite awesome.