Please note: this article was originally published on Pagero’s site.
At Pagero we are very proud of the technical architecture of our flagship product, Pagero Online. We’re successfully handling more document transactions than ever, as we see an ever increasing demand for e-document services. In this article I’d like to tell you a little about the journey we’ve taken, from humble beginnings almost ten years ago, to the present day and beyond. I’ll be talking a little about the technology stack we’ve chosen, including the business and technical reasoning behind our choices. If you’ve ever worked on a high availability, cloud based platform handling millions of events, or aspire to do so, you could be interested in our story.
This spring I was at the Craft conference in Budapest, which I thoroughly recommend by the way. There was a full program, with lot of great sessions, and interesting speakers. I did notice, browsing the program beforehand, that there were a lot of talks about Microservices, and Docker. Everyone seemed to have an opinion on the best deployment options, how to manage distributed data, building, testing, logging… This is clearly the hip and trendy way to build systems these days. I found this quite gratifying, since at Pagero we’ve been using a Microservices architecture for some time now, and have been using Docker in production since early 2014. It’s become our everyday life, not some hyped trend that we just heard about. Our reasons for going with Docker and Microservices are firmly rooted in the needs of our business.
Let me explain. Pagero Online is a cloud-based platform for exchange of electronic documents between businesses, for example invoices and orders. The point is, our customers can send their documents to us in whatever format their internal system produces, and we will deliver it in the format the recipient finds easiest to process in their internal systems. It’s clearly a valuable service, since we have an impressive year on year growth in document transactions.
The growth illustrates the challenge we’ve been meeting successfully for several years now – to scale our cloud system to handle ever increasing traffic. It’s of course a great problem to have, and we in the R&D department have worked hard to keep everything running smoothly throughout. The architecture we had when we started, is not the architecture we have now.
IN THE BEGINNING
Back in 2007, Enterprise Java Beans were the thing to do, and we felt confident we were building a future-proof, scalable system, using a JBoss container talking to a PostgreSQL database. Moore’s law meant that we could initially scale just by buying a bigger machine now and then. As time went by, we needed more, and started using the clustering capabilities built into the J2EE platform – i.e. several instances of the same code, receiving requests via a load balancer. At some point in about 2012 we realized this approach could no longer handle the increase in traffic that we were experiencing. We could no longer just add new instances of the same code, the slowdown from the communication overhead between them would be greater than the speedup from the increased CPU power. We needed to give more CPU power to just a few parts of the code that were doing the most intensive processing, without also hitting the communication bottlenecks.
ENTER MICROSERVICES AND DOCKER
Everything was pointing to a need to break apart our monolith into more manageable pieces. Microservices and Docker seemed the perfect match to our problems, so we spent the next year or so building the infrastructure needed. In February 2014 we deployed our monolith, packaged in a Docker container, together with some essential services for monitoring, service discovery, and message passing, (with protobuf over Rabbit MQ). Over the following months, the whole of our R&D department completed a course in the Scala programming language, and we built and deployed several more services for new features in the system. It worked! Since 2014 we have been able to quickly grow to about twenty services a year later, and sixty today.
We’ve realized the Microservices architecture enabled organizational streamlining too. Over the years our development team has grown from a handful of developers in the same room, to about 30 people split across three time zones. By breaking up the codebase, we can also divide up the development work more efficiently. We now have half a dozen ‘devops’ teams each responsible for a handful of Microservices. Both new and seasoned developers are more productive when working in these smaller codebases.
SCALING THE DATABASE
It was in around mid-2015, however, we started to see where the bottleneck had moved to, now the application code was performing better. Our trusty PostgreSQL database was handling a good many more gigabytes of data than ever before, and some transactions were getting a little slow. We concocted a plan to split it up too, just like we were doing with the monolith of code. We settled upon Cassandra and worked out how we were going to safely migrate all the document data out of Postgres, and into this distributed data store. The rest of the data will remain where it is, but just taking out the documents should free up a good deal of space, and release the main bottleneck. We of course need to do this without disrupting our service in any way, so one way to reduce the risk is to run the new Cassandra database in parallel with the existing Postgres, duplicating all the data. Only once we’ve done extensive testing, and we can see it’s working ok, will we remove the redundant copy
That’s kind of where we are now, we have just started this parallel running, and initial results are looking good.
THE BIG BREAK-UP
The next challenge is to continue to break apart our monolith of code, and create new services out of the pieces. Although all our new features are being built in Microservices, we still have the heart of the system in the former monolith. We’ve seen so many benefits to having Microservices, we’d like all our code to look like that. In some ways it’s a more daunting prospect than breaking up the database. This is a large quantity of tried and tested code that has been running in production for many years – breaking it up is not something you can do over a weekend!
We have to make this big change without any interruption to our production service, and we’ve thought carefully about what our strategy should be. One way to do a big risky change is to split it into a series of less-risky, smaller changes. The idea is that after every step in the break up, to run a battery of automated regression tests. The shorter the time the tests take to run, the smaller increments we can work with, and less risk of breaking anything. I’m personally pretty excited by this prospect. We’ve spent several years now building and improving our automated tests for Pagero Online, to the point where we feel pretty confident in taking on this challenge.
The other part of the strategy is to do the same as we have with the database migration. We’ll run both the old and new versions of the service in production for a while before we cut over to the new one. This should find any issues missed by the automated tests, without affecting any of our production traffic.
It’s going to be a real proof of how good our testing and deployment routines are. What kind of tests and deployment tools we’ve built, now that’s a topic for another blog post. If I’m lucky, I might even be telling you about the hip and trendy hot technologies that will be all over the agenda of the next Craft conference :-).