Sunday, October 08, 2017

Bluemix Cloud Foundry: 10 Lessons Learned

I have been meaning to document a bit more of my experience as a former Bluemix SRE and Operations Engineering Lead during 2014 and 2015, something I wrote about at the time I left. The perfect prompt was stumbling upon this Bluemix blog entry on the lessons the team collectively learned during that period.

This is the kind of blog entry where I must emphasize that opinions are my own and do not necessarily reflect the opinions of Bluemix SREs or the current processes followed by the Bluemix team. It is also written largely for the benefit of the BOSH and Cloud Foundry development teams and should be read side by side with that blog entry.

Lesson 10: Tightly controlled change request

I like the idea of a central process to request and approve changes, and even the idea of a small team of approvers, but the lesson does not mention a key aspect of a good change request process: the inclusion of informed reviewers who can give approvers input on the chances of disruption to their offerings.

Identifying the smallest possible list of reviewers is a challenge of its own, lest you end up in endless debate over minutiae and unlikely worries raised by dozens of people, but it is still something that must be addressed in order to support the approvers' decisions.


Lesson 9: Audit deployments for health

No question there. I have a pet idea for performance management vendors: automatically reconfigure the screen layout of their health audit dashboards once the ops team drills down into one environment. The cost of manually pulling in the relevant (unhealthy) metrics cannot be overstated when multiple environments require attention.

The second pet idea is for the same vendors to surface commonalities amongst unhealthy environments. For instance, a DNS failure may affect 10 environments in the same way, and it may take a while for a human operator to drill down into a few of those environments before the common cause is identified.

One may argue that the ops team would also be notified about the DNS failure itself, but realizing the correlation between that alert and the multiple reports of unhealthy environments can sap precious time and energy from the operations team. Alert correlation is not a new concept, but it seems to be a concept in need of a new push nowadays.
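
As a minimal sketch of the kind of correlation I have in mind, the snippet below groups hypothetical health alerts by the component suspected to be at fault, so that one candidate root cause is surfaced instead of ten separate unhealthy environments. The alert fields and the grouping rule are made up for illustration, not taken from any particular monitoring product:

```python
from collections import defaultdict

def correlate(alerts):
    """Group environment health alerts by the suspected shared component,
    so an operator sees one candidate root cause instead of N unhealthy
    environments. The alert structure here is hypothetical."""
    by_component = defaultdict(set)
    for alert in alerts:
        by_component[alert["component"]].add(alert["environment"])
    # Surface the components affecting the most environments first.
    return sorted(by_component.items(), key=lambda kv: len(kv[1]), reverse=True)

alerts = [
    {"environment": "env-01", "component": "dns"},
    {"environment": "env-02", "component": "dns"},
    {"environment": "env-03", "component": "dns"},
    {"environment": "env-04", "component": "blobstore"},
]

for component, environments in correlate(alerts):
    print(f"{component}: {len(environments)} environments affected -> {sorted(environments)}")
```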


Lesson 8: Log checking and monitoring

My pet requirement for "bosh logs" was the ability to perform some level of parallel searching on the target VMs. Whereas streaming logs to a central infrastructure has obvious merits, there are always cases where large-volume logs cannot be fully transferred out of thousands of VMs all the time.
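
For illustration, here is a minimal sketch of the kind of parallel search I had in mind, fanning a remote grep out over SSH to a list of job VMs. The host list, SSH access, and search pattern are assumptions; in practice the inventory would come from the deployment itself:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical list of job VMs; in practice this would come from your
# deployment inventory rather than a hard-coded list.
HOSTS = ["10.0.16.11", "10.0.16.12", "10.0.16.13"]
LOG_GLOB = "/var/vcap/sys/log/*/*.log"  # typical BOSH job log location

def search_host(host, pattern):
    """Run a remote grep on one VM and return its matching lines."""
    cmd = ["ssh", "-o", "BatchMode=yes", host, f"grep -H '{pattern}' {LOG_GLOB}"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return host, result.stdout

def parallel_search(pattern, hosts=HOSTS, workers=10):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for host, output in pool.map(lambda h: search_host(h, pattern), hosts):
            if output:
                print(f"=== {host} ===\n{output}")

if __name__ == "__main__":
    parallel_search("connection reset")
```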


Lesson 7: BOSH init woes

No comment, pure agreement, though I have related comments in Lesson 1.



Lesson 6: Migrate all custom software to BOSH releases

No comment, pure agreement, though I have related comments on delayed initialization in BOSH jobs in Lesson 2.


Lesson 5: Do not use PowerDNS (if possible)

Oh boy... add the reality that really large environments like Bluemix will always have internal stages in their CI/CD pipeline, and that the CF API URL for those internal stages will point to an IP address on the intranet. Your installation is then virtually required to implement its own form of split DNS resolution, answering differently depending on whether traffic is coming from developers on the intranet or from the software running at the IaaS provider.

Never, ever, ever use PowerDNS for it. If someone points out that the BOSH director already has PowerDNS running and that it is "only a couple of entries", laugh a little, then scowl hard and say "no".
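
For illustration only, here is a tiny sketch of the split-horizon decision itself: the same API hostname resolves to different addresses depending on where the query originates. The hostname, subnets, and addresses are made up, and a real setup belongs in a proper DNS server with view support, never in the director's PowerDNS:

```python
import ipaddress

# Hypothetical split-horizon table: same name, different answers
# depending on the network the query originates from.
INTRANET = ipaddress.ip_network("10.0.0.0/8")
RECORDS = {
    "api.internal-stage.example.com": {
        "intranet": "10.40.1.10",   # reachable by developers on the intranet
        "iaas":     "169.50.2.20",  # reachable by workloads at the IaaS provider
    }
}

def resolve(name, client_ip):
    """Pick the answer based on the client's source network."""
    view = "intranet" if ipaddress.ip_address(client_ip) in INTRANET else "iaas"
    return RECORDS[name][view]

print(resolve("api.internal-stage.example.com", "10.3.7.42"))    # -> 10.40.1.10
print(resolve("api.internal-stage.example.com", "169.50.9.99"))  # -> 169.50.2.20
```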


Lesson 4: Security updates are painful but important


Whereas some security updates could require drastic modifications to the OS (or even a new OS release altogether), others were relatively minor and unrelated to the runtime, such as patching the SSH stack used by the operations team.

My advice here is to define *two* security BOSH releases early on, so that not every security update requires a stemcell update. The first release covers security aspects that affect the runtime and has its jobs added as a prerequisite to the other jobs running on the VM, whereas the second release targets security aspects unrelated to the runtime and is not listed as a runtime prerequisite for other jobs.

I know the argument that stemcell updates should not be feared and should be treated like any other deployment activity, but when you have to update thousands of DEAs (and now Diego cells), a stemcell update can take multiple days in larger deployments, even factoring in parallel VM recreations. Long deployment cycles, however safe, sap energy and attention from the operations team.


Lesson 3: Multi-BOSH deployment

No comment, pure agreement, though I have related comments in Lesson 1.


Lesson 2: Deployments and updates are never 100% successful

This is also a way of saying "deep runtime stacks have uncertain timing characteristics" and "software has bugs".

In the first category, we often suffered from the odd request for new VMs failing or timing out at the IaaS layer, because we could be ordering (and disposing of) thousands of them on a given day, and even the best virtualization layer has its limits when you are only one of the customers doing that type of aggressive recycling during a busy hour.

For the other part of the suffering - actual software bugs - development teams should avoid lazy initialization sequences in their BOSH jobs, which circumvent the purpose of having canaries during the deployment in the first place.

I have countless tales of BOSH deployments that cleared the canary phase and proceeded to the next component in the deployment, only for the system to destabilize minutes later (CF cloud controllers, I am looking at you).

I often wished BOSH would allow the insertion of a test step in the deployment to validate system integrity after the canary VMs are updated and before the rest of the deployment proceeds. Some form of automatic rollback in case of failures would be nice too.
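
Since BOSH offered no such hook at the time, the closest approximation is an external gate around the deployment. Here is a minimal sketch of the idea, assuming the canary instances expose hypothetical health endpoints: keep probing them for a settling period after they are updated and abort the rest of the rollout if anything degrades.

```python
import sys
import time
import urllib.request

# Hypothetical health endpoints exposed by the canary instances; in a real
# setup these would come from whatever orchestrates the deployment.
CANARY_HEALTH_URLS = [
    "http://10.0.32.5:8080/healthz",
    "http://10.0.32.6:8080/healthz",
]

def canaries_healthy(urls, settle_seconds=300, interval=30):
    """Keep probing the canaries for a settling period; a single failure is
    enough to abort, catching jobs that come up 'green' and destabilize
    minutes later."""
    deadline = time.time() + settle_seconds
    while time.time() < deadline:
        for url in urls:
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    if resp.status != 200:
                        return False
            except OSError:
                return False
        time.sleep(interval)
    return True

if __name__ == "__main__":
    if not canaries_healthy(CANARY_HEALTH_URLS):
        print("canary validation failed, aborting the rest of the rollout")
        sys.exit(1)  # non-zero exit stops the surrounding deployment pipeline
    print("canaries stable, proceeding")
```

A wrapper like this sits outside BOSH itself, which is exactly the awkwardness a native test step would remove.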


Lesson 1: Backup your director database

I would also welcome the ability for BOSH to export its contents to a file and rebuild its state from that file, either through a RESTful API or as a BOSH-level operation.

This would significantly help with disaster recovery AND with the migration from BOSH init to multi-BOSH (lessons 3 and 7).
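
Until something like that exists, the basic hygiene this lesson calls for can be as simple as a scheduled dump of the director database. A minimal sketch, assuming a PostgreSQL-backed director and database credentials already available to the operator; the host, database name, and backup path are placeholders:

```python
import datetime
import subprocess

# Placeholder connection details; in practice these come from the
# director's configuration and a credential store.
DB_HOST = "10.0.0.6"
DB_NAME = "bosh"
DB_USER = "postgres"
BACKUP_DIR = "/var/backups/bosh-director"

def backup_director_db():
    """Dump the director database to a timestamped file using pg_dump."""
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    outfile = f"{BACKUP_DIR}/director-{stamp}.dump"
    subprocess.run(
        ["pg_dump", "-h", DB_HOST, "-U", DB_USER, "-Fc", "-f", outfile, DB_NAME],
        check=True,
    )
    return outfile

if __name__ == "__main__":
    print(f"wrote {backup_director_db()}")
```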

Monday, August 28, 2017

Recognition and meaning, by design

I watched with interest this TED presentation by Dan Ariely, titled "What makes us feel good about our work". I immediately noticed the relation to a couple of entries I had written in the past, on the topic of more meaningful relations in the workplace.

Whereas personal initiative remains an essential driving force behind individual progress, knowing that your work matters to someone is scientifically proven to make you feel more motivated. In fact, if you can take 15 minutes to watch Dan's talk, you are bound to be transfixed by this quote:
"...ignoring the performance of people is almost as bad as shredding their effort in front of their eyes."
In a world of increasingly complex activities, supply chains and work arrangements, measuring performance becomes equally more challenging, affecting the frequency and the quality of recognition amongst people working together.

Nevertheless, peer recognition and mutual dependency remain fundamental aspects of healthy work relations, which makes me believe that successful organizations must engineer (yes, engineer) policies that broaden the means and reach of recognition in the workplace.

No perks program, please!

As a common example of a well-intentioned recognition tool, organizations often set up a system of peer-to-peer recognition where employees can award each other points for a special contribution and where the recognized peers can later redeem the points for some sort of perk.

Unless these perks have a fundamental impact on the career, life, or practice of the recognized person - and they rarely do - perks programs are the perfect example of a poorly designed initiative. I could write a few paragraphs' worth of negatives, but I will just list the top three that come to mind: it is a chore that is not integrated into the workflow, it is unrelated to career development, and, worst of all, it is infrequent.
 

Engineered recognition

I particularly liked this toolkit article ("Managing Employee Recognition Programs") as a comprehensive overview of recognition policies and how they can be implemented, and I would add to the list this other article on gamification: "How to Use Gamification to Engage Employees".

Even with all those resources out there, there is still a long way to go, and it must be paved with a concerted effort to elevate peer recognition into a core driver of a culture of appreciation and meaning in the organization.

In concrete terms, such engineered recognition policy should observe the following principles:
  • Clarity on what is recognizable. Is it shorter time-to-market while meeting customer demands? Is it reducing operational costs while maintaining or improving quality? Is it addressing problems off-hours?

    The set of criteria must strike a balance between covering enough good behaviors and not overwhelming those responsible for evaluating against the criteria.

    For teams following an Agile process, a simple and effective mechanism is to use a "quest" tag on stories and let the team collectively vote on the stories that deserve the tag, typically things that are important for the whole team but fall through the cracks of collective attention, areas of expertise, and customer demands. Some examples would be periodically profiling the whole system for hotspots, or cutting build time in half.

    The exact technique doesn't matter and organizations can learn a lot about themselves in the process of defining the criteria.

  • Integral part of the work stream. Recognition must be a completion criterion for any project, broken down into smaller intervals if the project lasts longer than a few months.

    This point depends on the previous one, that is, the clarity of what is recognizable. Once again, teams following an Agile process can simply tally up the number of stories with the "quest" tag that were completed across the team and put extra emphasis on them during the sprint review meetings (see the sketch after this list).

    As with any part of a work stream, it should go without saying that low overhead is a goal too.

  • Integral part of career advancement. Whereas we still want to take into consideration the discretionary powers delegated to a self-organized hierarchy, recognition must be partially bound to feedback from peers.

    A holacracy constitution comes to mind. Everyone must be able to contribute and everyone must be subject to the same rights and duties, but not everyone's participation must necessarily carry the same weight (a meritocracy is better than a democracy, sorry folks). Conversely, recognition must be rooted in peer recognition and defensible against the recognition "constitution".

  • Publicly visible. Barring exceptional circumstances, everyone must have full access to everyone's recognition status.

    The previous example of using sprint review meetings may be supplemented in cases of outstanding effort. Extra points awarded to HR for working recognition data into the company directory, internal online tools, and the physical workspace. About the physical workspace: extra coolness points for displaying dynamic content on flat panels in public areas to broaden the impact of recognition.

  • Meaningful. This is probably the most important and challenging part of a recognition program: allowing people, as much as possible, to find out which parts of the activity they consider important for the larger mission, but also to themselves. This is very hard to achieve in a hierarchical organization, and it is the reason why I am a strong believer in a self-organized free market for people to find their place within large organizations.
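
To make the "quest" tally mentioned above concrete, here is a tiny sketch that counts completed quest-tagged stories per sprint from a hypothetical tracker export; the story fields are invented and the real data would come from whatever tool the team already uses:

```python
from collections import Counter

# Hypothetical export from the team's tracker.
stories = [
    {"id": 101, "sprint": "2017-S18", "tags": ["quest"], "done": True},
    {"id": 102, "sprint": "2017-S18", "tags": [], "done": True},
    {"id": 103, "sprint": "2017-S19", "tags": ["quest"], "done": True},
    {"id": 104, "sprint": "2017-S19", "tags": ["quest"], "done": False},
]

# Count completed quest stories per sprint for the sprint review meeting.
quests_per_sprint = Counter(
    s["sprint"] for s in stories if s["done"] and "quest" in s["tags"]
)

for sprint, count in sorted(quests_per_sprint.items()):
    print(f"{sprint}: {count} quest stories completed")
```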

I could explore examples of each of these points, but it is easy for anyone to extrapolate them to the particularities of their own organizations. It is also a fun and transformational exercise for any team out there.

Now imagine how these ideas would fundamentally change team dynamics if more of the work done on a day-to-day basis acquired more meaning, not through contrived directives, but by simply making the intrinsic meaning of the work more visible to the people executing and consuming the work.

Imagine if a significant fraction of individual drive and motivation were not lost over a casual lack of recognition for work well done (the virtual shredding of work before our very eyes). Remembering to say "thank you" still goes a long way and everyone should keep it in mind, but what about designing the workflow so that the "thank you" opportunities become a regular part of team relations?

There is much to be appreciated out there, in life and at work. We know we take a lot of it for granted. Being publicly thankful for all those things costs virtually nothing. It makes us feel better and it makes others feel great.

Now go thank someone. Repeat it everyday.

Wednesday, March 15, 2017

What is your problem? - Part 3: Descriptions

Months ago, I wrote about problem reporting within teams, making a general distinction between good problem reporting that leads to a solution versus insufficient reporting that causes the involved parties to lose precious time during a critical situation.

Now it is time to look at these from the perspective of software development, which may turn off audiences interested in the general topic, at which point I commit the blunder of assuming anyone besides my immediate colleagues reads these.

Technical notes, your problem is not someone else’s problem…

There are always those moments in software development where the overall quality assurance process fails our customers and our standards, at which point we must publish a technical note about the problem. For the project on which I based this post, the template for a technical note required six fields:
  1. problem description. A general view of the problem. This is a very difficult topic for most developers who have not been exposed to the problem reporting techniques covered in this series, in that "general" is often confused with "imprecise". This topic is therefore the focus of this post.
  2. symptom. A list of externally observable behaviors and facts about the system upon occurrence of the problem.
  3. cause. A list of internal and external triggers for the problem, with special emphasis on separating those that can be triggered (and hopefully fixed) by the customer from those that are internal to the product and require a product fix.
  4. affected environments. Complete list of prerequisite software and hardware where the problem can be observed, including versions and releases.
  5. problem diagnosis. Symptoms and causes give a good indication as to whether the problem matches what a customer is seeing; however, the customer needs certainty before moving on to the next field.

  6. problem resolution. The ultimate reason a customer ever reads through a technical note: how the problem can be either fixed or worked around. A common problem in our internal reviews was that original drafts committed the mortal sin of limiting themselves to listing the upcoming release where the problem would be fixed. The customer always expects an interim solution to the problem, even an imperfect one.
…so how do I know what is your problem?

To paraphrase one particularly troubled internal draft, we had the symptom, cause and description all rolled into a problem description field like this:
“search for records may be incomplete due to a [private] database being corrupted upon execution of a [series of commands]”.

At that point, we applied the criteria outlined in the previous posting to determine whether the problem report to the customer would lead to a decision or to confusion:
  • What is the expected behavior from the product?
The description can be somewhat ‘reversed’ to allow one to infer that searches for records should not be incomplete. However, this inference indicates what the product should not do instead of what it should do. For the technical types, this kind of wording tends to make the author look sloppy at best, disingenuous at worst.
  • What is the observed behavior in the product?
The description alludes to incomplete results, but results can be incomplete in so many ways, such as not containing all the records that would match the search criteria, or containing all records while missing some fields in each record.
  • Does the reported problem happen to all units of the product?
  • Does the reported problem affect the entire product or just portions of it? If so, which portions?
The ‘product’ here is the operation executed by the user. Are all searches affected, or only certain searches?
  • Does the reported problem happen in all locations where the product is used? (This forces the problem owner to have actually sampled the problem in all locations where the product is used.)
Locations can be read as systems. If the product can run on multiple operating systems and depend on various versions of middleware, is there a complete list of systems where the problem occurs? Is it all of them?
  • Does the reported problem happen in combination with other problems?
This particular point would not apply to the original problem description as the problem happened independently of other problems, as a function of search parameters and system operations preceding the searches.
  • When did the problem start? If you don’t know, make it clear that you don’t know and state when you first observed it.
When reporting the potential problem to a customer, the starting date would translate to the release number where the problem would be first observed.
  • What is the frequency? Continuous, cyclic, or random?
The problem description was reasonably clear about the problem being continuous. At least in my opinion, continuous can be assumed whenever considerations about cyclic or random occurrences are not explicit. In other words, I would consider it poor form for those types of frequencies to be omitted.
  • Is the problem specific to a phase in the product life-cycle?
The problem description was reasonably clear about the sequence of operations that would lead to the problem, indicating that the problem affects the runtime phase rather than planning, installation, configuration, or any other phase.
  • Is the symptom stable or worsening?
The problem description did not mention increasing degradation of results, but it is worth asking that question during a review process prior to publication of the technical note.

From problems to satisfied customers

This is an area to be approached with energy and patience while coaching people who are new to any field in the industry. Describing problems as a function of language and critical thinking is not an exact science, and it requires prolonged periods of practice and feedback to be mastered.

When someone without proper training in problem description encounters someone on the other side who will go out of their way to understand the problem, it is easy to mistake the positive interaction rooted in an act of kindness for the most efficient way of going about it. And whereas acts of kindness are still a core value in the workplace, on any given day we would rather have that kind person interacting with more people than spending it all on a single person working without proper training.

Once you have put the right effort behind training people in this topic (or training yourself), you will have started a transformational effect, ending up with people who ask the right questions to solve the right problems on their own.
