At AppScale Systems, our goal is to have production-quality code, so that when you run AppScale, it’s easy to use, robust, and fast. A couple of years ago, the code was, quite frankly, pretty bad. Unit tests? We don’t need no stinking unit tests! End-to-end tests? Forgetaboutit. Consistent style? Ha!
The code didn’t work. A phrase we said often was: “It works for me.” You were lucky if you got AppScale running on the first attempt. We had race conditions everywhere during startup. The backend had bugs galore and was slow as shit. Regressions would slip in and we wouldn’t catch them at release time.
We had no process. It was the wild west. As soon as we commercialized AppScale and started to have customers, this had to stop. It was getting too costly, both for our reputation and in the engineering hours spent tracking down bugs.
Time to Get Civilized
The first thing we did was switch to a Scrum process. Each sprint is three weeks of development followed by one week of QA. We’re able to get out monthly releases and we’re always shipping. Our sprints are publicly visible on Trello. We get a lot of feature requests and bug reports. By having this board open to the public, we get buy-in from our users and customers as to how important their issue really is. We also create cards when a user requests a feature or bug fix, and they can watch the status of their card as it goes from prioritized, checked out, doing, done, QA’ing, and QA’ed. You can see who is working on what, and we are as transparent as we can be.
We have multiple hurdles code has to go through to get into AppScale’s master repository. Here is what we require any feature or bug fix to go through:
Report Pylint results on new Python code
Show that your code has the latest from the master repo
Demo the bug fix or feature on a single node
Demo on a four node deployment
Pass all unit tests
Pass all fidelity tests (Test Compatibility Kit)
Code review (on GitHub)
Put the pull request on the Trello card
Comment on the Trello card on how to QA it
Pass the continuous testing system (tests on multiple clouds)
Each of these steps was put in place because a retrospective at the end of a sprint revealed that we needed a preventative measure to keep bugs from slipping in. Pylint helps make sure that the code meets style guidelines, while also pointing out potential bugs. The requirement to pull in the latest code makes sure that some other change did not break the current branch. Demos to another person often reveal issues, and we give the person receiving the demo the freedom to request additional challenges to catch corner cases. Quite often, a fix or feature would work in a single-node deployment but break fantastically on a multi-node deployment (distributed systems are fun! yay!). Our unit tests, fidelity tests, and CT system also make sure we have no regressions in the system. Lastly, the Trello board updates make sure that everyone on the team has access to the same information and can quickly QA a feature without having to ask the author of the code.
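To make the checklist concrete, here is a minimal sketch in Python of how you could track these gates for a pull request. This is not AppScale’s actual tooling. The gate names, the PullRequest class, and the helper functions are all made up for illustration, and the sketch only records which gates were cleared; it does not run Pylint, tests, or deployments itself.

```python
from dataclasses import dataclass, field

# Gate names paraphrase the checklist above. The identifiers are
# hypothetical and exist only for this sketch.
REQUIRED_GATES = (
    "pylint_reported",
    "merged_latest_master",
    "single_node_demo",
    "four_node_demo",
    "unit_tests_passed",
    "fidelity_tests_passed",
    "code_reviewed",
    "pull_request_on_card",
    "qa_instructions_on_card",
    "continuous_testing_passed",
)


@dataclass
class PullRequest:
    """Hypothetical record of which gates a change has cleared."""
    title: str
    cleared: set = field(default_factory=set)

    def clear(self, gate):
        # Reject typos so a misspelled gate can never count as cleared.
        if gate not in REQUIRED_GATES:
            raise ValueError(f"unknown gate: {gate}")
        self.cleared.add(gate)


def missing_gates(pr, required=REQUIRED_GATES):
    """Return the gates still outstanding, in checklist order."""
    return [gate for gate in required if gate not in pr.cleared]


def ready_to_merge(pr, required=REQUIRED_GATES):
    """A change is ready only when no required gate is outstanding."""
    return not missing_gates(pr, required)


if __name__ == "__main__":
    pr = PullRequest("Fix startup race condition")
    for gate in REQUIRED_GATES[:-1]:
        pr.clear(gate)
    print(missing_gates(pr))
    print(ready_to_merge(pr))
```

Passing a shorter `required` tuple, for example `("code_reviewed",)`, models a trivial change that only needs a subset of the gates.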
How I Learned to Stop Worrying and Love the Process
Not all code pushes require each and every step. Some changes, such as updating a comment, might only need a quick code review. We encourage you to reflect on the mistakes your team is making and put process in place to prevent them. At first it may seem like too much oversight, but over time you’ll see it saves more time than it takes up.