Sunday, October 08, 2017

Bluemix Cloud Foundry: 10 Lessons Learned

I have been meaning to document a bit more of my experience as a former Bluemix SRE and Operations Engineering Lead during 2014 and 2015, something I wrote about at the time I left. The perfect prompt was stumbling upon this Bluemix blog entry on the lessons the team collectively learned during that period.

This is the kind of blog entry where I must emphasize that opinions are my own and do not necessarily reflect the opinions of Bluemix SREs or the processes currently followed by the Bluemix team. It is also written largely for the benefit of the BOSH and Cloud Foundry development teams and should be read side by side with that blog entry.

Lesson 10: Tightly controlled change request

I like the idea of a central process to request and approve changes, and even the idea of a small team of approvers, but the lesson does not mention a key aspect of a good change request process: the inclusion of informed reviewers who can give the approvers input on the likelihood of disruption to their offerings.

Identifying the smallest possible list of reviewers is a challenge in its own right, lest you end up in endless debate over minutiae and unlikely worries raised by dozens of people, but it is still something that must be addressed in order to support the approvers' decisions.


Lesson 9: Audit deployments for health

No question there. I have a pet idea for performance management vendors: automatically reconfigure the screen layout of their health audit dashboards once the ops team drills down into one environment. The cost of manually pulling up the relevant (unhealthy) metrics cannot be overstated when multiple environments require attention at the same time.

The second pet idea is for the same vendors to surface commonalities amongst unhealthy environments. For instance, a DNS failure may affect 10 environments in the same way, and it may take a while for a human operator to drill down into a few of those environments before the common cause is identified.

One may argue that the ops team would also be notified about the DNS failure itself, but realizing the correlation between that alert and the multiple reports of unhealthy environments can sap precious time and energy from the operations team. Alert correlation is not a new concept, but somehow it seems like a concept in need of a new push nowadays.
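
To make this concrete, here is a minimal sketch, in Python and with entirely made-up alert records and field names, of the kind of grouping I have in mind: cluster the unhealthy-environment reports by the dependency they share, so that one probable common cause surfaces instead of ten seemingly independent alerts.

    # Toy alert records an ops dashboard might already hold (all names made up).
    from collections import defaultdict

    alerts = [
        {"env": "prod-us-south-1", "symptom": "router 5xx spike", "dependency": "dns"},
        {"env": "prod-us-south-2", "symptom": "cloud_controller timeouts", "dependency": "dns"},
        {"env": "prod-eu-gb-1", "symptom": "router 5xx spike", "dependency": "dns"},
        {"env": "prod-au-syd-1", "symptom": "disk pressure on cells", "dependency": "local-disk"},
    ]

    def correlate(alerts):
        """Group unhealthy environments by the dependency they report in common."""
        groups = defaultdict(list)
        for alert in alerts:
            groups[alert["dependency"]].append(alert["env"])
        return groups

    for dependency, envs in correlate(alerts).items():
        if len(envs) > 1:
            print(f"possible common cause '{dependency}' affecting: {', '.join(envs)}")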


Lesson 8: Log checking and monitoring

My pet requirement for "bosh logs" was the ability to perform some level of parallel searching on the target VMs. Whereas streaming logs to a central infrastructure has obvious merits, there are always cases where large-volume logs cannot be fully transferred out of thousands of VMs all the time.
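
As an illustration of what I mean by parallel searching, here is a rough Python sketch that fans a grep out to a handful of VMs over SSH. The addresses, user, and pattern are hypothetical, and a real setup would go through the director or a jumpbox rather than SSH straight to the job VMs.

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    HOSTS = ["10.0.16.11", "10.0.16.12", "10.0.16.13"]   # hypothetical job VM addresses
    PATTERN = "FAILED"
    LOG_GLOB = "/var/vcap/sys/log/cloud_controller_ng/*.log"

    def search(host):
        # 'grep -l' lists matching files; '|| true' keeps a host with no matches
        # from looking like an SSH failure.
        cmd = ["ssh", f"vcap@{host}", f"grep -l '{PATTERN}' {LOG_GLOB} || true"]
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
        return host, result.stdout.strip()

    with ThreadPoolExecutor(max_workers=10) as pool:
        for host, matches in pool.map(search, HOSTS):
            if matches:
                print(f"{host}:\n{matches}")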


Lesson 7: BOSH init woes

No comment, pure agreement, though I have related comments in Lesson 1.



Lesson 6: Migrate all custom software to BOSH releases

No comment, pure agreement, though I have related comments on delayed initialization in BOSH jobs in Lesson 2.


Lesson 5: Do not use PowerDNS (if possible)

Oh boy... add the reality that truly large environments like Bluemix will always have internal stages in their CI/CD pipeline, and that the CF API URL for those internal stages will resolve to an IP address on the Intranet, and your installation is virtually required to implement its own form of split-horizon DNS resolution: answering differently depending on whether traffic is coming from developers on the Intranet or from the software running at the IaaS provider.

Never, ever, ever use PowerDNS for it. If someone points out that the BOSH director already has PowerDNS running and that it is "only a couple of entries", laugh a little, then scowl hard and say "no".
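
For illustration only, the split-horizon behaviour in question boils down to something like the toy resolver below. The names, subnets, and addresses are invented, and in practice this belongs in a proper DNS server (BIND views, a dedicated resolver tier), not in application code and certainly not in the director's PowerDNS.

    import ipaddress

    INTRANET = ipaddress.ip_network("9.0.0.0/8")        # hypothetical corporate range
    RECORDS = {
        "api.internal-stage.example.com": {
            "intranet": "9.12.34.56",                    # answer for Intranet clients
            "iaas": "10.0.32.10",                        # answer for VMs at the IaaS provider
        },
    }

    def resolve(name, client_ip):
        view = "intranet" if ipaddress.ip_address(client_ip) in INTRANET else "iaas"
        return RECORDS[name][view]

    print(resolve("api.internal-stage.example.com", "9.1.2.3"))    # -> 9.12.34.56
    print(resolve("api.internal-stage.example.com", "10.0.48.7"))  # -> 10.0.32.10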


Lesson 4: Security updates are painful but important

Whereas some security updates could require drastic modifications to the OS (or even a new OS release altogether), others were relatively minor and unrelated to the runtime, such as patching the ssh stack used by the operations team.

My advice here is to define *two* security BOSH releases early on, so that not every security update requires a stemcell update. The first release covers security aspects that affect the runtime and has its jobs added as prerequisites to the other jobs running on the VM, whereas the second release targets security aspects unrelated to the runtime and is not listed as a runtime prerequisite to other jobs.
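
Sketching the split with hypothetical release and job names, a deployment would end up carrying something along these lines (shown here as a plain Python structure rather than the YAML a real manifest would use):

    import json

    manifest_fragment = {
        "releases": [
            {"name": "security-runtime", "version": "latest"},  # fixes that affect the runtime
            {"name": "security-ops", "version": "latest"},      # e.g. the operators' ssh stack
        ],
        "instance_groups": [
            {
                "name": "cloud_controller",
                "jobs": [
                    # runtime security job, meant as a prerequisite for the other jobs on the VM
                    {"name": "security-runtime-agent", "release": "security-runtime"},
                    {"name": "cloud_controller_ng", "release": "capi"},
                    # operator-facing security job; nothing on the VM depends on it
                    {"name": "security-ops-agent", "release": "security-ops"},
                ],
            },
        ],
    }

    print(json.dumps(manifest_fragment, indent=2))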

I know the argument that stemcell updates should not be feared and should be treated as any other deployment activity, but when you have to update thousands of DEAs (and now Diego cells), a stemcell update could take multiple days in larger deployments, even factoring in parallel VM recreations. Long deployment cycles, however safe, sap energy and attention from the operations team.
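
A back-of-the-envelope calculation shows why. Every number below is illustrative rather than measured; the order of magnitude is the point.

    cells = 3000          # DEAs / Diego cells in a large environment (illustrative)
    max_in_flight = 20    # VMs recreated in parallel (illustrative)
    minutes_per_vm = 15   # delete + recreate + drain + watch time per VM (illustrative)

    hours = cells / max_in_flight * minutes_per_vm / 60
    print(f"~{hours:.0f} hours just for the cells")   # ~38 hours, before canaries and retries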


Lesson 3: Multi-BOSH deployment

No comment, pure agreement, though I have related comments in Lesson 1.


Lesson 2: Deployments and updates are never 100% successful

This is also a way of saying "deep runtime stacks have uncertain timing characteristics" and "software has bugs".

In the first category, we often suffered from the odd request for new VMs failing or timing out at the IaaS layer: we could be ordering (and disposing of) thousands of them on a given day, and even the best virtualization layer has its limits when you are only one of the customers doing that type of aggressive recycling during a busy hour.

For the other part of the suffering - actual software bugs - development teams should avoid lazy initialization sequences in their BOSH jobs, which circumvent the purpose of having canaries during the deployment in the first place.

I have countless tales of BOSH deployments that cleared the canary phase and proceeded to the next component in the deployment, only to see the system destabilize minutes later (CF cloud controllers, I am looking at you).

I often wished BOSH would allow the insertion of a test step in the deployment to validate the integrity of the system after the canary VMs were updated and before the rest of the deployment proceeded. Some form of automatic rollback in case of failures would be nice too.
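
Something like the sketch below is what I had in mind: after the canaries are updated, poll a health endpoint for a settling period before letting the deployment continue. The endpoint, settling period, and rollback decision are all hypothetical placeholders.

    import time
    import urllib.request

    HEALTH_URL = "https://api.example.com/v2/info"   # hypothetical smoke-test target
    SETTLE_MINUTES = 10                              # how long the canaries must hold steady
    CHECK_INTERVAL = 30                              # seconds between checks

    def canaries_healthy():
        deadline = time.time() + SETTLE_MINUTES * 60
        while time.time() < deadline:
            try:
                with urllib.request.urlopen(HEALTH_URL, timeout=10) as resp:
                    if resp.status != 200:
                        return False
            except OSError:
                return False
            time.sleep(CHECK_INTERVAL)
        return True

    if not canaries_healthy():
        raise SystemExit("canaries destabilized the system; roll back instead of proceeding")
    print("canaries held steady; safe to let the deployment continue")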


Lesson 1: Backup your director database

I would also welcome the ability for BOSH to export its contents to a file and rebuild its state from that file, exposed either as a RESTful API or as a BOSH-level operation.

This would significantly help with disaster recovery AND with the migration from BOSH init to multi-BOSH (lessons 3 and 7).
