Sunday, October 08, 2017

Bluemix Cloud Foundry: 10 Lessons Learned

I have been meaning to document my experience as former Bluemix SRE and Operations Engineering Lead during 2014 and 2015 in a bit more detail, which is something I wrote about at the time I left. The perfect prompt was stumbling upon this Bluemix blog entry on the lessons collectively learned by the team during that period.

This is the kind of blog entry where I must emphasize opinions are my own and do not necessarily reflect the opinions of Bluemix SREs or current processes followed by the Bluemix team. It is also written largely for the benefit of the BOSH and Cloud Foundry development teams and should be read side-by-side with that blog entry.

Lesson 10: Tightly controlled change request

I like the idea of a central process to request and approve the changes and even the idea of a small team of approvers, but the lesson does not mention a key aspect of a good change request process: the inclusion of informed reviewers who can provide input to approvers about the chances of disruption to their offerings.

Identifying the smallest possible list of reviewers is a challenge on its own, lest you end up with endless debate on minutiae and unlikely worries raised by dozens of people, but still something that must be addressed in order to support the approver's decisions.

Lesson 9: Audit deployments for health

No question there. I have a pet idea for performance management vendors: automatically reconfigure the screen layout of their health audit dashboards once the ops team drills down into one environment. The value of having the relevant (unhealthy) metrics pulled into view automatically cannot be overstated when multiple environments require attention.

The second pet-idea is for the same vendors to surface commonalities amongst unhealthy environments. For instance, a DNS failure may affect 10 environments in the same way and it may take a while for a human operator to drill down into a few of these environments until the common cause is identified.

One may argue that the ops team would also be notified about the DNS failure, but realizing the correlation between the alert and the multiple reports of unhealthy environments can sap precious time and energy from the operations team. Alert correlation is not a new concept, but somehow it seems like a concept requiring a new push nowadays.
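
To sketch the idea, here is a minimal correlation pass (the data model is hypothetical, not tied to any specific monitoring product) that groups unhealthy environments by the dependency their health checks blame, so an operator sees one probable root cause instead of ten separate reports:

```python
from collections import defaultdict

def correlate_alerts(env_alerts):
    """Group unhealthy environments by the dependency they blame.

    env_alerts: dict mapping environment name -> list of failing
    dependencies reported by its health checks (e.g. "dns", "db").
    Returns a dict mapping each failing dependency to the sorted
    list of environments reporting it, most widespread cause first.
    """
    by_cause = defaultdict(set)
    for env, causes in env_alerts.items():
        for cause in causes:
            by_cause[cause].add(env)
    # Most widespread causes first: the likeliest shared root causes.
    return {
        cause: sorted(envs)
        for cause, envs in sorted(
            by_cause.items(), key=lambda kv: -len(kv[1])
        )
    }

alerts = {
    "prod-us-1": ["dns"],
    "prod-us-2": ["dns", "db"],
    "prod-eu-1": ["dns"],
}
print(correlate_alerts(alerts))
```

With that input, a single "dns" entry covering three environments surfaces at the top of the report, instead of three operators drilling into three dashboards.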

Lesson 8: Log checking and monitoring

My pet requirement for "bosh logs" was the ability to perform some level of parallel searching on the target VMs. Whereas streaming logs to a central infrastructure has obvious merits, there are always cases where large log volumes cannot be fully transferred out of thousands of VMs all the time.
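
In the absence of such a feature, a thin wrapper around "bosh ssh" gets you part of the way. This is only a sketch: it assumes a configured bosh CLI on the path, and the deployment and instance names are illustrative. The command runner is injectable to keep the fan-out logic testable:

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

def grep_command(deployment, instance, pattern,
                 log_glob="/var/vcap/sys/log/*/*.log"):
    """Build a `bosh ssh` invocation that greps logs on one instance."""
    return [
        "bosh", "-d", deployment, "ssh", instance,
        "-c", f"sudo grep -l '{pattern}' {log_glob}",
    ]

def parallel_grep(deployment, instances, pattern,
                  run=subprocess.run, workers=16):
    """Fan the grep out to many instances at once; map instance -> result."""
    def one(instance):
        return instance, run(grep_command(deployment, instance, pattern),
                             capture_output=True, text=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(one, instances))

print(grep_command("cf", "dea_next/4", "ERROR"))
```

Sixteen workers searching a thousand VMs beats a serial loop by a wide margin, and nothing leaves the VMs except the matching file names.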

Lesson 7: BOSH init woes

No comment, pure agreement, though I have related comments in Lesson 1.

Lesson 6: Migrate all custom software to BOSH releases

No comment, pure agreement, though I have related comments on delayed initialization in BOSH jobs in Lesson 2.

Lesson 5: Do not use PowerDNS (if possible)

Oh boy... add to this the reality that truly large environments like Bluemix will always have internal stages in their CI/CD pipeline, and that the CF API URL for those internal pipelines will resolve to an intranet IP address. Your installation is then virtually required to implement its own special DNS resolution, depending on whether traffic is coming from developers on the intranet or from the software running at the IaaS provider.

Never, ever, ever, use PowerDNS for it. If someone points out that the BOSH director already has PowerDNS running and it is "only a couple of entries", laugh a little, then scowl hard and say "no".

Lesson 4: Security updates are painful but important

Whereas some security updates could require drastic modifications to the OS (or even a new OS release altogether), others were relatively minor and unrelated to the runtime, such as patching the ssh stack used by the operations team.

My advice here is to define *two* security BOSH releases early on, so that not every security update requires a stemcell update. The first release is for security aspects that affect the runtime and has its jobs added as a prerequisite to the other jobs running on the VM, whereas the second release targets security aspects unrelated to the runtime and is not listed as a runtime prerequisite to other jobs.
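
As a sketch of what that split could look like in a deployment manifest (the release and job names here are hypothetical, used only to illustrate the shape), the runtime-affecting security release is wired in as a dependency of the colocated jobs, while the operations-only release stands alone:

```yaml
releases:
  - name: security-runtime    # patches on the runtime path
    version: latest
  - name: security-ops        # e.g. ssh stack used by operators
    version: latest

instance_groups:
  - name: dea
    jobs:
      - name: hardened-tls          # from security-runtime; other
        release: security-runtime   #   jobs depend on it starting first
      - name: ops-ssh-patches       # from security-ops; no other job
        release: security-ops       #   lists it as a prerequisite
      - name: dea_next
        release: cf
```

Updating `security-ops` then redeploys a single job across the fleet instead of forcing a stemcell roll.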

I know the argument that stemcell updates should not be feared and should be treated like any other deployment activity, but when you have to update thousands of DEAs (and now Diego cells), a stemcell update can take multiple days in larger deployments, even factoring in parallel VM re-creations. Long deployment cycles, however safe, sap energy and attention from the operations team.

Lesson 3: Multi-BOSH deployment

No comment, pure agreement, though I have related comments in Lesson 1.

Lesson 2: Deployments and updates are never 100% successful

This is also a way of saying "deep runtime stacks have uncertain timing characteristics" and "software has bugs".

In the first category, we often suffered from the odd request for new VMs failing or timing out at the IaaS layer, because we could be ordering (and disposing of) thousands of them on a given day, and even the best virtualization layer has its limits when you are only one of the customers doing that type of aggressive recycling during a busy hour.

For the other part of the suffering - actual software bugs - development teams should avoid lazy initialization sequences for their BOSH jobs, which defeat the purpose of having canaries during the deployment in the first place.

I have countless tales of BOSH deployments that cleared the canary phase and proceeded to the next component in the deployment, only to see the system destabilize minutes later (CF cloud controllers, I am looking at you).

I often wished BOSH would allow the insertion of a test step in the deployment to validate the system integrity after the canary VMs were updated and before the rest of the deployment progresses. Some form of automatic rollback in case of failures would be nice too.
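
The decision logic I had in mind is simple; here is a minimal sketch (a hypothetical hook, not an existing BOSH feature): run a battery of integration checks after the canary VMs are updated, and either let the deployment proceed or trigger a rollback:

```python
def post_canary_gate(checks, rollback):
    """Run integration checks after the canary update; roll back on failure.

    checks: list of (name, fn) pairs where fn() returns True if healthy.
    rollback: callable invoked with the list of failed check names.
    Returns True if the deployment may proceed past the canaries.
    """
    failed = [name for name, fn in checks if not fn()]
    if failed:
        rollback(failed)
        return False
    return True

# Example: two checks, one failing, so the gate halts the deployment.
events = []
ok = post_canary_gate(
    [("api-responds", lambda: True), ("router-registered", lambda: False)],
    rollback=lambda failed: events.append(("rollback", failed)),
)
```

The check functions would be the same system-integrity probes an operator runs by hand today; the point is having BOSH run them between the canary phase and the rest of the rollout.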

Lesson 1: Backup your director database

I would also welcome the ability for BOSH to export its contents to a file and to rebuild its state from that file, exposed either as a RESTful API or as a BOSH-level operation.

This would significantly help with disaster recovery AND with the migration from BOSH init to multi-BOSH (lessons 3 and 7).
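
To illustrate the round-trip I am asking for (everything below is hypothetical - BOSH offers no such operation, and the state slice shown is invented for the example), the desired behavior is simply a lossless export/import of the director's view of the world:

```python
import json
import os
import tempfile

def export_state(state, path):
    """Serialize the director's view of the world to a file."""
    with open(path, "w") as f:
        json.dump(state, f, indent=2, sort_keys=True)

def import_state(path):
    """Rebuild the in-memory state from a previous export."""
    with open(path) as f:
        return json.load(f)

# Hypothetical slice of director state: deployments and stemcells.
state = {
    "deployments": [{"name": "cf", "releases": ["cf/245"]}],
    "stemcells": [{"name": "bosh-ubuntu-trusty", "version": "3263"}],
}
path = os.path.join(tempfile.gettempdir(), "director-export.json")
export_state(state, path)
assert import_state(path) == state  # lossless round-trip
```

A new director (BOSH-init'ed or not) could then be pointed at the export file and take over the existing deployments, which is exactly the missing piece in both the disaster recovery and the multi-BOSH migration stories.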

Monday, August 28, 2017

Recognition and meaning, by design

I watched with interest this TED presentation by Dan Ariely, titled "What makes us feel good about our work". I immediately noticed the relation to a couple of entries I had written in the past, on the topic of more meaningful relations in the workplace.

Whereas personal initiative remains an essential driving force behind individual progress, knowing that your work matters to someone is scientifically proven to make you feel more motivated. In fact, if you can take 15 minutes to watch Dan's talk, you are bound to be transfixed by this quote:
"...ignoring the performance of people is almost as bad as shredding their effort in front of their eyes."
In a world of increasingly complex activities, supply chains and work arrangements, measuring performance becomes equally more challenging, affecting the frequency and the quality of recognition amongst people working together.

Nevertheless, peer recognition and mutual dependency remain fundamental aspects of healthy work relations, which makes me believe that successful organizations must engineer (yes, engineer) policies that broaden the means and reach of recognition in the workplace.

No perks program, please!

As a common example of a well-intentioned recognition tool, organizations often set up a system of peer-to-peer recognition where employees can award each other a number of points for a special contribution and where the recognized peers can later redeem the points for some sort of perk.

Unless these perks have a fundamental impact on the career/life/practice of the recognized person - and they rarely do - perks programs are the perfect example of a poorly designed initiative. I could write a few paragraphs' worth of negatives, but I will just list the top three that come to mind: it is a chore that is not integrated into the workflow, it is unrelated to career development, and worst of all, it is infrequent.

Engineered recognition

I particularly liked this toolkit article ("Managing Employee Recognition Programs") as a comprehensive overview of recognition policies and how they can be implemented, and I would add to the list this other article on gamification: "How to Use Gamification to Engage Employees".

Even with all those resources out there, there is still a long way to go, to be paved with concerted effort to elevate peer recognition to a core driver in a culture of appreciation and meaning in the organization.

In concrete terms, such engineered recognition policy should observe the following principles:
  • Clarity on what is recognizable. Is it shorter time-to-market while meeting customer demands? Is it reducing operational costs while maintaining or improving quality? Is it addressing problems off-hours?

    The set of criteria must strike a balance between covering enough good behaviors while not being too overwhelming for those responsible for evaluating against the criteria.

    For teams following an Agile process, a simple and effective mechanism is to use a "quest" tag on stories and let the team collectively vote on the stories that deserve the "quest" tag, typically things that are important for the whole team, but falling through the cracks of collective attention, areas of expertise, and customer demands. Some examples would be the profiling of the whole system periodically for hotspots, or maybe cutting down build time by half.

    The exact technique doesn't matter and organizations can learn a lot about themselves in the process of defining the criteria.

  • Integral part of work stream. Recognition must be a completion criterion for any project, broken down into smaller intervals if the project lasts longer than a few months.

    This point depends on the previous one, that is, the clarity of what is recognizable. Once again, teams following an Agile process can simply tally up the number of stories with the tag "quest" that were completed across the team and put extra emphasis on them during the sprint review meetings.

    As with any part of a work stream, it should go without saying that low overhead is a goal too.

  • Integral part of career advancement. Whereas we still want to take into consideration the discretionary powers delegated to a self-organized hierarchy, recognition must be partially bound by feedback from peers.

    A holacracy constitution comes to mind. Everyone must be able to contribute, everyone must be subject to the same rights and duties, but not everyone's participation must necessarily carry the same weight (a meritocracy is better than a democracy, sorry folks). Conversely, recognition must be rooted in peer recognition and defensible against the recognition "constitution".

  • Publicly visible. Barring exceptional circumstances, everyone must have full access to everyone's recognition status.

    The previous example about using sprint review meetings may be supplemented in cases of outstanding effort. Extra points awarded to HR for working these data points into the company directory, internal online tools, and the physical workspace. About the physical workspace: extra coolness points for displaying dynamic content on flat panels in public areas to broaden the impact of recognition.

  • Meaningful. This is probably the most important and challenging part of a recognition program: allowing people, as much as possible, to find out which parts of the activity they consider important for the larger mission, but also to themselves. This is very hard to achieve in a hierarchical organization, and the reason why I am a strong believer in a self-organized free market for people to find their place within large organizations.

I could explore examples of each of these points, but it is easy for anyone to extrapolate them to the particularities of their own organizations. It is also a fun and transformational exercise for any team out there.

Now imagine how these ideas would fundamentally change team dynamics if more of the work done on a day-to-day basis acquired more meaning, not through contrived directives, but by simply making the intrinsic meaning of the work more visible to the people executing and consuming the work.

Imagine if a significant fraction of individual drive and motivation was not lost over a casual lack of recognition for work well done (the virtual shredding of work before our very eyes). Remembering to say "thank you" still goes a long way and everyone should keep it in mind, but what about designing the workflow so that the "thank you" opportunities are made a regular part of team relations?

There is much to be appreciated out there, in life and at work. We know we take a lot of it for granted. Being publicly thankful for all those things costs virtually nothing. It makes us feel better and it makes others feel great.

Now go thank someone. Repeat it every day.

Wednesday, March 15, 2017

What is your problem? - Part 3: Descriptions

Months ago, I wrote about problem reporting within teams, making a general distinction between good problem reporting that leads to a solution versus insufficient reporting that causes the involved parties to lose precious time during a critical situation.

Now it is time to look at these from the perspective of software development, which may turn off audiences interested in the general topic, at which point I commit the blunder of assuming anyone besides my immediate colleagues reads these.

Technical notes, your problem is not someone else’s problem…

There are always those moments in software development where the overall quality assurance process fails our customers and our standards, at which point we must publish a technical note about the problem. For the project on which I based this post, the template of a technical note required six fields:
  1. problem description. A general view of the problem. This is a very difficult topic for most developers who have not been exposed to the problem reporting techniques covered in this series, in that general is confused with imprecise. This topic is therefore the focus of this post.
  2. symptom. List of externally observable behaviors and facts about the system upon occurrence of the problem
  3. cause. List of internal and external triggers for the problem, with special emphasis on those that can be triggered (and hopefully fixed) by the customer versus those that are internal to the product and require a product fix.
  4. affected environments. Complete list of prerequisite software and hardware where the problem can be observed, including versions and releases.
  5. problem diagnosis. Symptoms and causes give a good indication as to whether the problem matches what a customer is seeing; however, the customer needs certainty before moving on to the next field.

  6. problem resolution. The ultimate reason a customer ever reads through a technical note: how the problem can be fixed or worked around. A common problem in our internal reviews was that original drafts committed the mortal sin of limiting themselves to listing the upcoming release where the problem would be fixed. The customer always expects an interim solution to the problem, even if imperfect.
…so how do I know what is your problem?

To paraphrase one particularly troubled internal draft, we had the symptom, cause and description all rolled into a problem description field like this:
“search for records may be incomplete due to a [private] database being corrupted upon execution of a [series of commands]”.

At that point, we applied the criteria outlined in the previous posting to determine whether the problem reporting to the customer would lead to a decision or to confusion:
  • What is the expected behavior from the product?
The description can be somewhat ‘reversed’ and allow one to infer that search for records should not be incomplete. However, this inference indicates what the product should not do instead of what it should do. For the technical types, this kind of wording tends to make the author look sloppy at best, disingenuous at worst.
  • What is the observed behavior in the product?
The description alludes to incomplete results, but results can be incomplete in so many ways, such as not containing all the records that would match the search criteria, or containing all records while missing some fields in each record.
  • Does the reported problem happen to all units of the product?
  • Does the reported problem affect the entire product or just portions of it? If so, which portions?
The ‘product’ here is the operation executed by the user. Is it all searches that are affected or only certain searches?
  • Does the reported problem happen in all locations where the product is used? (this forces the problem owner to have actually sampled the problem in all locations where the product is used).
Locations can be read as systems. If the product can run on multiple operating systems and depend on various versions of middleware, is there a complete list of systems where the problem occurs? Is it all of them?
  • Does the reported problem happen in combination with other problems?
This particular point would not apply to the original problem description as the problem happened independently of other problems, as a function of search parameters and system operations preceding the searches.
  • When did the problem start? If you don’t know, make it clear you don’t know, and state when you first observed it.
When reporting the potential problem to a customer, the starting date would translate to the release number where the problem would be first observed.
  • What is the frequency? Continuous, cyclic, or random?
The problem description was reasonably clear about the problem being continuous. At least in my opinion, continuous can be assumed whenever considerations about cyclic or random occurrences are not explicit. In other words, I would consider poor form for those types of frequencies to be omitted.
  • Is the problem specific to a phase in the product life-cycle?
The problem description was reasonably clear about the sequence of operations that would lead to the problem, indicating the problem to affect the system runtime phase versus planning, installation, configuration, or any other.
  • Is the symptom stable or worsening?
The problem description did not mention increasing degradation of results, but it is worth asking that question during a review process prior to publication of the technical note.

From problems to satisfied customers

This is an area to be approached with energy and patience while coaching people who are new to any field in the industry. Describing problems as a function of language and critical-thinking is not an exact science and requires prolonged periods of practice and feedback to be mastered.

When someone without proper training in problem description encounters someone on the other side who will go out of their way to understand the problem, it is easy to mistake the positive interaction, rooted in an act of kindness, for the most efficient way of going about it. And whereas acts of kindness are still a core value in the workplace, on any given day we would rather have that kind person interacting with more people than spending all their time on a single person working without proper training.

Once you have put the right effort behind training people in this topic (or training yourself), you will have started a transformational effect on people, ending up with people who ask the right questions to solve the right problems on their own.

Monday, November 28, 2016

On discipline, agile, lean and kitchens

Discipline in planning and execution is not a culture war.

I used to see method and process as core execution disciplines, as background activities that one just took for granted on any project. For the past few years, I have witnessed an increasing swell of support for making them a matter of style and team culture.

I would place the shift starting a couple of years after the Lean Startup book came out and inadvertently normalized the notion of outcomes over processes even outside the startup world.

In fairness, that is not what the book and its underlying concepts propose, but there were large swaths of the software development field ready for a well-articulated message that advocated for less formalism, less planning, less checking, less verification... you get the idea.

The ensuing philosophical battles can be distilled into basic questions such as "how much time should one spend writing down the processes to be followed by a team?" or "how much time should one spend planning activities and estimating their costs before deciding on a course of action?"

The "it-is-boring" camp argues for less discipline in favor of faster execution, which allows for more iterations towards success. For this camp, the air is sucked out of the room the minute someone asks for agreement on processes and plans.

The "it-is-boorish" camp argues that unmanaged chaos can seize up execution progress and force part of the team into silently picking up the slack.

I can see a strong correlation between individual style and the choice of camps, which is why it is so easy to dismiss the entire unpleasantness of the debate as a matter of personal style or culture war. 

About "Let us choose"...

Sprinkle in some conflict aversion and superficial analysis, and we soon find the discussion abandoned behind the wall of false compromises: "let each team choose what works best for them".

I find that reasoning particularly disingenuous in that it implies that the only alternative is an arbitrary decision forcing freedom fighters into submission. Given time and the right audience, you may even see an Austrian economist or two being quoted in the discussion.

To be clear, teams should absolutely decide what works best for them, as long as the selection of "best" is made against a (preferably long) background of good and bad experiences.

And it goes without saying that the discussion should stay away from the extremes of Go-Horse programming*, with virtually no time assigned for planning, and of the waterfall model, where nothing meaningful ever hits the market (and no one has tried pure waterfall for at least two decades, nor would anyone in their right mind argue for its return).

This is the point where I confess to leaning towards the "it-is-boorish" camp, and my reasons are simple:

When no discipline is actually a lot of it

Hidden in the anecdotes about improved results due to less planning and less process, there are invariably teams with extensive practice with planning and processes. Behind each IPO wonder, you will find leaders with well-lived experiences ahead of other similar initiatives (if you absolutely must bring up Facebook, that is a different animal; leave a comment and I will respond to it).

Teams do not succeed because they have less discipline, they succeed because they have people who know enough about discipline and processes to hand pick the correct approach for the circumstance at hand. Moving from process to actual results, the parallel is that behind each story of frequent and short iterations leading to a winning design you will find people who have produced winning designs in the past and who have had the benefit of internalizing what worked, what failed, and why.

Many victories seemingly stemming from agility are the consequence of solid experience and discipline unhindered by minutiae. And yet, many of these victories may be short-lived if the technical debt incurred while executing with less discipline is not managed properly.

Tragedy of the cooks

The analogy here derives from the Tragedy of the Commons, with "order" being the shared resource. In a completely unregulated environment, either by intention or natural pressure, the participants tend to exhaust the shared resource for two reasons: (1) the assumption that the resource is infinite and (2) the expectation that other parties will consume or hoard the resources faster than everyone else competing for it.

Absent some notion of externally mandated order, you end up with an ecosystem where the participants are lifted to the same level of access to the resources and neglect tending to the shared resources along a spectrum of obliviousness and forced sociopathy.

The obliviousness comes in the form of people internalizing an experience where lack of discipline simply works, unaware of other efforts happening in parallel to restore the original order to the system. You know the drill: that coworker who was asked to maintain, improve and share a few guidelines here and there to ensure some bad customer situation was avoided in the future. Then came the point where people realized that the "few guidelines" were several pages long and everyone needed to be mandated to read and follow the guidelines because bad customer situations kept on happening. Poof! Fun is over and the productivity-sapping scapegoat is standing by to take the fall.

An even more insidious side effect is the internalization of these tragedy-of-the-commons experiences at an earlier stage of someone's career, where they become the lenses through which beginners will see work relationships.

On the sociopathy extreme of the spectrum, my analogy is simple: if your team is asked to prepare a four-course meal and there are no rules about who should clean up the kitchen, there is always someone who will be bothered by the mess before the others, and that person is usually someone who has dealt with dried-up batter on the counter before (give it a try). The boorish cycle is completed when the other cooks pat the cleaner on the back, proclaim he has a natural vocation for cleaning, and mentally excuse themselves from the task from that point on.

Sometimes your most productive cooks are simply the ones who can ignore a dirty sink the longest.

The age of WIT (Winging IT)

Though I currently lean towards the "it-is-boorish" camp, I can see the allure of reduced planning, and I have the feeling that some of the discomfort experienced at the hands of the "it-is-boring" camp amounts to growing pains in a generational shift. We just have to find ways to make it work at the scale it must work, then deal with the new state of things.

In this age of freemium web applications, where everyone expects everything to be free, the lines between outright market dumping and viable business models are becoming blurrier by the day - Uber and its driver's incentive program come to mind. This sort of expectation has become so ingrained in society that large swaths of the workforce simply accept the notion that products should be created under the successful (?) umbrella of freedom.

Where that camp loses me is in the expectation that (1) wonder startup efforts can be created out of thin air without something as basic as market research and (2) established organizations can be morphed into startups. The Lean Startup crowd is onto something that is very specific to the high-failure-rate model expected of actual startups developing as-a-Service offerings, but that is a topic for a different posting.

At some point, when you realize most people don't like cleaning the kitchen, chastising them into doing the chore may just drag down morale and push people out. And here is the moment where I acknowledge the lost battle while still staring at the prospect of dealing with a messy kitchen.

Planned chaos

For established organizations, the solution is not to chastise the workforce into doing chores, but to find ways of avoiding the mess in the first place. One can despair and give in to chaos, give up on creating new products, and go down the route of acquiring whichever small company survives the Darwinian grinder of the startup world, but that is hardly a system scalable or inclusive enough to support the industry as a whole. Even then, without a solution for the cultural aspects and the right balance of discipline and freedom, these acquisitions will be doomed from the start.

Learning fast and adaptability are a powerful combination of success factors, but ignoring past lessons baked into existing processes is a dangerous mix of irresponsibility and innovation.

The acceptable compromise between camps seems to require a bit of discipline and planning upfront on how much chaos (technical debt) is survivable, how it will be measured, and how it will be remediated. As a concrete example, if a team decides against committing to a service level agreement in its initial offering period, will the team agree on implementing enough monitoring to at least keep track of service levels? If the team does not want a mandatory training program for reuse of open source software (and there be dragons), should it spend a few hours publishing a list of licenses that are accepted?
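
The first example is cheap to implement. As a minimal sketch (the probe data is hypothetical, not tied to any monitoring stack), even a team with no committed SLA can record periodic health probes and compute the availability it would have delivered:

```python
def availability(probes):
    """Fraction of periodic health probes that succeeded.

    probes: list of booleans, one per probe interval (True = service up).
    """
    if not probes:
        raise ValueError("no probes recorded")
    return sum(probes) / len(probes)

# One day of 5-minute probes (288 total) with two failed checks.
probes = [True] * 286 + [False] * 2
print(f"availability: {availability(probes):.4%}")  # availability: 99.3056%
```

When the day comes to commit to an SLA, the team negotiates from measured history instead of guesswork, which is precisely the "planned chaos" being argued for here.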

Ultimately, a good conversation should pass through an examination and adaptation of processes.

And the discipline part? It just better be there.


* It amuses me to no end that cowboys and horses are used as common-place characters in analogies about poor practices.

Tuesday, May 31, 2016

Serverless, NoOps, and Silver Bullets

In the aftermath of serverlessconf, Twitter was abuzz with the #serverless tag, and it didn't take long for the usual NoOps nonsense to follow (Charity Majors' aptly named "Serverlessness, NoOps and the Tooth Fairy" session notwithstanding).

When you look at operations as the traditional combination of all activities necessary for the delivery of a product or service to a customer, "serverless" addresses the provisioning of hardware, operating system and, to an extent, middleware.

Even when we ignore the reality that many of the services used in the enterprise will still run on systems that are nowhere close to cloud-readiness and containerization, approaches like Docker will only take you so far.

Once you virtualize and containerize what does make sense, there are still going to be applications running on top of the whole stack. They will still need to be deployed, configured, and managed by dedicated operations teams. I wrote my expanded thoughts on the topic a couple of months ago.

One may argue that a well-written cloud-ready application should be able to take remedial action proactively, but those are certainly not the kind of applications showing up on conference stages. Switching from RESTful methods deployed on PaaS to event listeners in AWS Lambda will not make the resulting application self-healing.

Whereas I do appreciate the "cattle-not-pets" philosophy and the disposability in a 12-factor app, I have actually worked as a site reliability engineer for a couple of years, and we still needed to monitor and correct situations where we had cattle head dying too frequently, which often caused SLA-busting disruptions to end users expecting 5 9's reliability or better.
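
To put "5 9's" in perspective, the arithmetic is unforgiving: 99.999% availability leaves only a handful of minutes of downtime per year, which is why cattle dying too frequently still demands a dedicated operations response:

```python
def downtime_budget_minutes(availability, period_minutes=365 * 24 * 60):
    """Minutes of allowed downtime per period for an availability target."""
    return (1 - availability) * period_minutes

# Three nines allow a day-scale outage; five nines allow ~5.26 min/year.
for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {downtime_budget_minutes(target):8.2f} min/year")
```

No amount of disposability rhetoric buys back a budget that small once a few bad deploys start eating into it.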

#NoTools, #NoMethod

Leaving the NoOps vs. DevOps bone aside, when I look at event-based programming models such as AWS Lambda and IBM OpenWhisk, and put them in contrast with software development cycles, I start to wonder whether development shops have fully understood the model's overall readiness beyond prototyping.

What is the reality of design, development tooling, unit-testing practices, verification cycles, deployment, troubleshooting, and operations? As an example, when I look at OpenWhisk, I see NodeJS, Swift and... wait for it... Docker. There is your server in serverless, unless you are keen on retooling your entire shop around one of those two programming languages. Updated on 7/29: OpenWhisk has added support for Java since this post was originally written, but you definitely don't want to know how it is implemented behind the curtains.

At the peril of offering anecdotes in lieu of an actual study, some of the discussions on unit testing for event handlers range from clunky workarounds to casually redirecting developers towards functional testing. And that should be the most basic material after debugging, which is also conspicuously absent.
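To make the anecdote concrete, here is a minimal sketch of the kind of unit testing that should be table stakes for event handlers: treat the handler as a pure function of its event payload, with no cloud runtime involved. The handler, its fields, and the resizing logic are all hypothetical illustrations, not tied to any actual Lambda or OpenWhisk SDK.

```python
# Hypothetical Lambda/OpenWhisk-style event handler: a pure function of
# its event payload. All names and fields are illustrative, not from any SDK.
def resize_handler(event):
    width = event.get("width", 0)
    height = event.get("height", 0)
    if width <= 0 or height <= 0:
        return {"statusCode": 400, "body": "invalid dimensions"}
    # Scale the longest edge down to 1024 pixels, preserving aspect ratio.
    scale = min(1.0, 1024 / max(width, height))
    return {"statusCode": 200,
            "body": {"width": int(width * scale), "height": int(height * scale)}}

# Unit tests need nothing beyond the function itself: no gateway,
# no deployment, no cloud runtime.
def test_rejects_missing_dimensions():
    assert resize_handler({})["statusCode"] == 400

def test_scales_longest_edge():
    body = resize_handler({"width": 2048, "height": 1024})["body"]
    assert body == {"width": 1024, "height": 512}

test_rejects_missing_dimensions()
test_scales_longest_edge()
```

The point is that this much is easy; the clunkiness starts once the handler touches the platform's context objects, triggers, and bindings, which is exactly where the tooling discussions tend to trail off.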

Progress is progress, and the lack of a complete solution should never be a reason to shy away from innovation, but at the same time we have to be transparent about the challenges and benefits.

If the vision takes a sizable number of tinkerers building skunkworks on the new platforms, that is all good, but we have to realize there is also an equally sizable number of shops out there looking for the next silver bullet. These shops will be quick to blame their failures on the hype rather than on their own lack of understanding of the total cost of development and operations of a cloud-based offering.

Click-baiting of dead development methods is alive and well for a reason, until you realize the big development costs depend more on the Big and Complex stuff than on how much time developers spend tending to pet servers under their desks.

As the serverless drumbeat continues, it remains to be seen whether we will witness an accompanying wave of serious discipline prescribing the entire method before another one is put out as the next big thing.

The obvious next step would be codeless code, which is incidentally the name of one of my favorite blogs. It contains hundreds of impossibly well-written and well-thought-out pieces about software development, including this very appropriate cautionary tale on the perils of moving concerns up the stack without understanding how the lower layers work.

Wednesday, November 11, 2015

DevOps: On walls and trenches

Know thy wall, mind your trenches. Geekish alert.

A few years back, a colleague introduced the notion of DevOps to a large internal audience as "we want to be more like [insert your favorite SaaS startup here]", much to the delight of various management and executive teams, besieged with mounting back-pressure in the sales pipeline as a result of long customer deployment cycles.

The enthusiasm after those types of sessions stems from the general notion that a new world of productivity ensues once the walls come down between the development and operations teams, with multiple deployments a day, with every new feature reaching customer hands minutes after delivery to code streams. Ready for usage, ready for sale.

Beware the strawman. Within hours, one may have a demolition crew looking for the walls about to be brought down, hammer in one hand, clenched fist in the other, both united by a chest full of seething rage against the walls. If you ever find yourself leading such a mob, pause for a moment... actually for two moments, during which I need to offer you the most important advice in the art of bringing down walls: "know thy wall".

Is it a wall?

Sometimes organizations do not operate their products; they simply build them and sell them to other shops, who are then responsible for standing up hardware, loading up the software, and relying on a long chain of support streams to relay any software problem back to its manufacturers.

As a software developer, you are insulated from the good and the bad. There is little access to feedback on how the software is used by end users, even less feedback about how it is installed, configured and managed. There is also less contact with upset customers and minimal exposure to the funny hours at which the systems decide to act up on the myriad of defects that may escape the development cycle.

If an organization operates under that model, it is living inside a bunker. That is understandably the audience most attracted to the wall-bashing revolution, but for the wrong reasons. Energy would be better spent moving into a SaaS business model than attempting to influence operations teams likely outside their control.

Why was it built?

Assuming you passed the first test, you are doing at least SaaS and you have a proper wall between your development and your operations team.

Refrain from delusions of grandeur and realize your wall is not of the tyrannical, country-splitting kind, but of the garden variety, such as the ones built for property protection, sound insulation, or soil retention. In other words, unless the underlying motivations behind the construction of the wall were addressed over time, your wall still serves a purpose.

The reason most walls between development and operations were built is because (and brace for the bar brawl) software development and systems administration are fundamentally different activities.

A software developer is specialized in shaping up a deliverable from thin air, from inception, to elaboration, to construction (coding, testing), to transition (to operations). Resist the urge here, for a moment, to declare this the "old way" of building software, because these phases still exist even in the wildest agile lean-guild-squad-pizza-night-sleeping-in-the-office deployment cycles.

System administrators are specialized in planning deployment cycles, provisioning systems, wiring them together, loading them with software, rigging everything with probes and hoping the systems stay really quiet and out of sight while repeating the entire cycle.

A thinner wall is still a wall

I have been on both the development and operations sides of the wall, and it is really disheartening to see the amount of misinformed passion thrown into conflating the continuous pipelines advocated by the DevOps method with the conclusion that development and operations can be unified under a single organization (or tribe, if you are so inclined).

Mix passion, misinformation, and a pronounced shortage of trained system administrators, and many organizations may soon find themselves falling into the trap of really tearing down the walls and assigning their SaaS application developers to operate the platform. Soon they start to realize what was behind that wall: operating systems, security patching, operational architecture, scalability for log retention systems, compliance, alerting policies, escalation policies, on-call schedules, maintenance windows, war rooms, and many other tools and processes that will eat into development resources disproportionately to the time invested in retraining developers to perform those activities.

As a software development manager, if you are ever invited by someone with a hammer to a wall-tearing party, politely redirect the conversation to a proper read of "The twelve-factor app", and emphasize the need to get rid of the trenches (see below) rather than tear down the walls.

I have seen many debates along the lines of the excellent comments section on "I Don't Want DevOps. I Want NoOps", which conflates the reduced operational costs of running an application on top of a PaaS stack with having no operational needs whatsoever. I can attest to the reduced costs of development and operations in such an arrangement, but they are still distinct activities requiring different skill sets, unless one tries really hard to confuse the development of operational tools (e.g. an automated generator of trouble tickets) with the development of the application providing the function to end-users.

Beware of the trenches

The worst enemies of faster delivery cycles are not the walls between development and operations, but rather the trenches both camps have dug over time. The true DevOps allure is really in getting both sides out of the trenches and shaking hands.

A few examples of software features loved by the operations teams, where continuous interaction and improvements can really make the software shine on the operations floor:
  1. "Cattle, not Pets" architecture. All components should be horizontally scalable and disposable. Databases are a bit trickier when it comes to being disposable, but there are a number of resilient architectures based on redundant data and hot-standby servers.

  2. Database High Availability and Disaster Recovery as an integral part of the architecture. Many database technologies offer a whole spectrum of trade-offs in their many alternatives for HA+DR, and the application owners have to be explicit about the interrelationships between the application and these trade-offs. For instance, a database technology may offer different settings for transaction synchronization across primary and standby nodes, some favoring transaction speed, others geared towards complete reliability. There is a fine line between "my application can work with 2 of these 3 modes; mode A sometimes allows data to be lost and the system completely implodes when that happens" and "our app uses a database that supports HA+DR, I am sure it will be ok."

  3. Automated delivery pipelines *for good quality* software: A continuous pipeline delivering new software versions every hour may sound like a nightmare for an operations team, but only when the outcome of every build is full of regression problems. There is still room for behavioural changes in the software that may throw off the operational monitors and procedures, but there is always the next bullet.

  4. Documented key performance metrics: One of the most respected software developers in my book once said "read the code", but realistically, not everything under the operations roof is open-source, properly written, or simple enough to be as consumable as proper documentation. That list of metrics, paired with a written explanation of their implications to end-users, is a fundamental artifact for an operations team to rig the software with all their probes, watch for the right things, and trigger the right alarms.

  5. Documented configuration settings: Once again, "read the code" is just not enough. The operations team needs a full list of configurable settings, their data types, their ranges, and a few paragraphs about the implications of changing the values.

  6. Health end-points: It is a RESTful world out there; any self-respecting SaaS offering must have a simple URL available to the operations team to get an immediate internal view of the SaaS health, containing basic metadata (version, name, development support page, others as needed), connectivity data about the status of system dependencies (e.g. the database at a given URL is down), and the status of various system functions (e.g. console login is down). Structured APIs, please. JSON or XML are good starting points since they have readily available parsers for virtually all programming languages.

  7. Statistics end-points: Once again, it is a RESTful world out there. Whenever an end-user (or a probe) reports slow response times, applications must offer a URL that allows a system administrator to quickly gauge response times grouped by worst, best, mean, and median, calculated over different intervals of time, such as "last 5 minutes", "last 30 minutes", "last 12 hours", etc. One can successfully argue that the statistical aspect could be handled by the monitoring infrastructure, and one could be right.

  8. Support for synthetic transactions: Tracking down the causes for a slow system requires a deep understanding of the underlying sub-transactions invoked by the end-user system. The application should expose dedicated RESTful endpoints (in the form of different URLs, special headers or query parameters) that return a breakdown of the transaction across all component systems. Naturally, there should be documentation about the list of synthetic transactions, along with their respective breakdown and linkage to the exact address of the systems called in each sub-transaction.

  9. Administrative logs: End-points and synthetic transactions go a long way towards initial system troubleshooting, but when these less expensive means fail to surface what is happening to the system, it is time for painstaking scrubbing of system activity. A well-thought-out logging strategy with clear references to key moments in the system, using terminology aligned with the system architecture, is essential in guiding system administrators towards the root cause of a problem. A good minimum set would be log entries for lifecycle events (startup complete, shutdown request, shutdown complete, config change), remote call failures, and periodic self-checks that log any abnormalities found in #6, #7, and #8.

  10. Access to the QA testcases, hopefully written using a set of technologies agreed upon with the monitoring team. If you look hard enough, anything that assures the proper functioning of the system at development time may be useful during the regular operation of the system. Imagine, for instance, an expensive QA module that simulates an end-user creating a system account, changing the account password, logging out the user and logging back in with the new password. Now imagine how the actual production authorization system may be subject to load-balancing and replication policies where that particular sequence may break for a period of time and impact end-users. The operations team can definitely benefit from simply letting that testcase run under the monitoring layer on a continuous basis and alert operators in case of failures.
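To illustrate the health end-point described in item 6, here is a minimal sketch of how such a payload could be assembled from a set of named checks. The field names ("version", "status", "dependencies", "functions") and the up/down vocabulary are my own assumptions, not any standard.

```python
import json

def build_health_report(version, dependency_checks, function_checks):
    """Run each named check (a zero-argument callable returning True/False)
    and assemble a structured health payload."""
    def run(checks):
        return {name: ("up" if check() else "down")
                for name, check in checks.items()}
    dependencies = run(dependency_checks)
    functions = run(function_checks)
    all_statuses = list(dependencies.values()) + list(functions.values())
    status = "healthy" if all(s == "up" for s in all_statuses) else "degraded"
    return {"version": version, "status": status,
            "dependencies": dependencies, "functions": functions}

# Example: a reachable database dependency, but console login failing.
report = build_health_report(
    "1.4.2",
    {"db at db.internal:5432": lambda: True},
    {"console login": lambda: False})
print(json.dumps(report, indent=2))
```

In a real service the lambdas would be replaced by actual connectivity probes, and the resulting dictionary served as JSON from the health URL; the value to the operations team is the structure, not the transport.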

A few examples of things the development teams really appreciate from their operations organization:
  1. Access to the incident database: leaving aside surmountable aspects such as the eventual need to obfuscate the customer identity, there is obvious value in knowing about critical system failures, the timeline of resolution, the steps taken by the ops team to detect and to resolve the problem. All this information can be immediately applied to drive improvements to most points raised before, such as new tests in the delivery pipeline, additional performance indicators, additional information in the health endpoints, additional configuration settings, and many others.

  2. Access to the live data for health and statistics endpoints: once again, leaving surmountable concerns aside, such as security and credential management, there is immediate value for the development team to study the correlation between customer loads and the system metrics, such as increase in response times as the number and nature of requests change over time.

  3. Access to the application logs: in an age of SaaS offerings for log aggregation, application development teams really do not need much from their operations team in this regard, but if the organization strategy calls for in-house log aggregation systems, then it is imperative that application developers have complete access to their own application logs.

  4. Access to the monitoring data for synthetic transactions: the previous examples allow a development organization to build its own data collection and aggregation system, but the ensuing duplication of effort is rather counter-productive.

Many developers will point out that nothing stops them from coding back-doors into the system to get access to the system data, but there should always be full disclosure of such back-doors to the operations team, at best so that there is awareness, at worst so that compliance laws are not violated (e.g. a backdoor that allows access to user information could be in direct violation of privacy laws).
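As a companion to the statistics end-point described in item 7 of the first list, the aggregation behind it could be sketched roughly as follows; the window labels, field names, and sample format are illustrative assumptions.

```python
from statistics import mean, median

def response_time_stats(samples, now,
                        windows=((300, "last 5 minutes"),
                                 (1800, "last 30 minutes"),
                                 (43200, "last 12 hours"))):
    """samples: list of (unix_timestamp, response_time_ms) pairs.
    Returns worst/best/mean/median response times per time window."""
    report = {}
    for seconds, label in windows:
        times = [ms for ts, ms in samples if now - ts <= seconds]
        if not times:
            report[label] = None  # no traffic in this window
            continue
        report[label] = {"worst": max(times), "best": min(times),
                         "mean": round(mean(times), 1), "median": median(times)}
    return report

# Three requests within the last five minutes of "now".
samples = [(1000, 120), (1200, 340), (1290, 95)]
stats = response_time_stats(samples, now=1300)
```

A production implementation would keep the samples in a ring buffer or delegate to the monitoring stack entirely, as the text concedes; the sketch only shows the shape of the answer a system administrator needs at 3 a.m.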

OaaS, the trenches reinvented, for better or worse

It is a new world of productivity where smaller organizations can put out complex solutions that would rival those of a large organization from 10 years ago.

Development, provisioning and monitoring tools have become accessible to the point of reaching critical mass of adoption, whether as commodities available for local deployment in a data center or as full-fledged SaaS offerings that obviate the need for local deployments. That said, tools, systems, and processes are not at the singularity point where operations can be treated as an extension of a development cycle.

There are nascent efforts in Operations as a Service that will be very interesting to watch in the coming years, especially in relation to PaaS offerings and how much customization OaaS providers will offer to fit existing DevOps pipelines. It will be even more interesting to measure their success against DevOps pipelines being made increasingly available as add-ons in the PaaS offerings themselves.

Realistically, I think OaaS will be a niche offering akin to software development outsourcing, with the accompanying explanations of how this time it will be different (other than it won't).

In my opinion, the current crop of companies co-opting the acronym are doing a disservice to what true OaaS should be: a natural evolution of PaaS where a standard (we still do those, right?) will need to be created to establish the interfaces between applications and the operations floor. Only then can mass progress be made on shielding development organizations from having to master operations, while still allowing development teams to retain full control of their DevOps pipelines.

Monday, November 25, 2013

What is your problem? – Part 2: Real versus imaginary problems

To ask "What is my contribution?" means moving from knowledge to action.
The question is not: "What do I want to contribute?"
It is not: "What am I told to contribute?"
It is: "What should I contribute?"
Peter F. Drucker

In the first part I covered the general aspects of identifying and reporting problems. Now it is time to apply those concepts to my domain of choice: software engineering (minor apologies to the software gardeners out there, I will come back for you in a future posting).

I use the Agile method as the backdrop because it has completely overtaken the field to the point of erasing debate on alternatives (defeated waterfall proponents are still called through the backdoor to fill the gaps with valuable contributions, but I digress) .

As a short Agile method recap, work is delivered in small iterations called stories and executed over relatively short intervals, called sprints. If you can tell a story while on the run (sprints, running, see what I did there?), you know sprint durations can range from ear-bleeding single-week sprints to waterfall-bordering 8 weeks. Beyond 8 weeks, there be dragons and T-virus; walk backwards slowly towards the nearest door and avoid eye contact at all costs.

The drudgery of tasks…

A fundamental tenet of the Agile method is that stories be written in the form of “As a [role name] I need [a feature] so that [I can achieve a goal]”.

I personally prefer to replace “achieve a goal” with “solve [one of my] problems”. There are far more people in this world facing immediate problems than people who have goals, and even for the people who have both, the immediate problems tend to grind one’s will and resources to execute on long term vision.

That is not to say Agile should be cast as a reactionary method that can only tackle situations after they have become a problem, but rather that any complex project can be mapped to a mind-map of tasks, where each node can be represented as a problem to be solved.

… is no match for the challenge of a problem

The payoff for such mental gymnastics is that a problem statement engages, whereas a task dehumanizes, and success in software development hinges on engagement: on engagement between developer and customer, on engagement between user and solution, and as a more recent phenomenon, engagement amongst users.

It is part of many professions to walk into an engagement where the customer knows exactly what they need, is willing to pay for those services, watches you walk out of the door after completion, and then deals with the next task in a master project plan.

That is invariably not true of software development for two main reasons: (1) our largely INTJ subtlety-loving personality is prone to invent dozens of different ways of achieving the same goal with dozens of distinct advantages and disadvantages for each choice, (2) we collectively get bored of those solutions more often than we should, reinventing the field every couple of years in a way that is utterly incompatible with what we once thought to be a good idea.

With all that said, as a prospective customer of a solution requiring software, whenever you engage a developer or a development shop, the first part of your homework is to be absolutely sure the problem you want solved is one of your top-most problems, lest you (or your company) not sustain the motivation to see it through. As the person (or company) experiencing the problem, and this may seem outrageous since you are about to pay for contracted services, you will be an integral part of the solution: there will be hard questions to be answered at some cost of time, intermediary solutions to be attempted, and final validations to sign off, all activities that will require your attention and resources.

Know your problems…

As a prospective software provider, accepting a task at face value is a recipe for disaster.

A good software architect must know how to artfully act as a devil’s advocate while engaging a customer, not to blindly question motivations, but to understand why the customer needs a solution. In other words, a good architect will ask the customer “what is your [his] problem?”

I often hear from peers disgruntled with the fact that a great idea was not accepted by a potential customer, failing to recognize at the same time that the problem solved was not all that important to the customer.

… know thyself

In a previous project I joined a team which had developed an internal tool for analyzing log data from hundreds of products. At the time I was the enablement lead for the technology and really got behind it. We persevered for a long while to make our world-wide support team adopt the (internal) product. After a relatively extended period of... err... lukewarm responses, we changed our approach, meeting frequently with these support teams and also with another internal development team which was already successfully supplying tools to the support organization.

It became clear that our log analyzer tool, however sophisticated in what it could do with log files, required memory capacity that, although readily available to our development team, was unthinkable to a support engineer. This tool also had the ability, developed at great expense, to shift the memory requirements to a relational database to cut down on memory usage, but deploying and maintaining a relational database was equally unthinkable for an audience which had no expertise in managing such systems.

The question no one asked…

At the same time, meeting with the more successful internal development team revealed their key selling point to the support organization: their solution was based on a SaaS model and the support teams could access most of its function through web interfaces, avoiding the need for high-end systems and the costs of installing and maintaining new tools. Their tooling also integrated with another SaaS offering where customers submitted all the supporting information, including the all important log files, for any problem reported in the field.

In the end, building (or selling) a log analyzer to a support team which routinely performed log analysis seemed like a success story in the making, but it failed to recognize two key aspects:
  1. Their most common activity related to log analysis was to isolate error entries in log files, then use snippets of the log entry in Internet searches (incredibly effective), a feature absent in our tool.
  2. All the information used by the support teams resided on virtual services, requiring only a web browser on their machines, which side-stepped the need for high-end systems.
At the time, and this was a remedial approach to not having asked the “what is your problem” question first, we decided to harvest the analysis internals from the tool, put them under the web-based interface already being used by the support team, and surface the analysis results through a page that contained only warning and error messages, with a quick link to an Internet search based on the error message.
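The error-isolation workflow the support teams actually cared about can be sketched in a few lines; the log format and the snippet heuristic below are assumptions for illustration, not the internals of the tool in question.

```python
import re

def error_entries(log_lines):
    """Keep only WARN/ERROR lines and strip the timestamp prefix so the
    remaining message text can be pasted into an Internet search."""
    results = []
    for line in log_lines:
        # Assumed format: "<date> <time> <LEVEL> <message>"
        match = re.match(r"^\S+ \S+ (WARN|ERROR) (.+)$", line)
        if match:
            level, message = match.groups()
            results.append({"level": level, "search_snippet": message})
    return results

log = [
    "2013-11-25 10:02:01 INFO server started",
    "2013-11-25 10:02:05 ERROR java.net.ConnectException: Connection refused",
]
entries = error_entries(log)
```

The whole value to the support engineer is in the last step the tool never had: taking `search_snippet` straight to a search engine, rather than running a heavyweight analysis engine locally.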

…is the problem no one had

Had we asked the harder questions first, the answer would have been a SaaS version of the patent we filed a few months later, where log files submitted by customers were automatically analyzed, cross-searched on websites, and the results ranked according to their rate of incidence. At that point I left the team for other reasons, but I am told they kept on delivering on that vision.

Something fantastically useful eliminates a problem that is really at the center of someone’s attention, not something you set out to improve, however successfully. Success also involves far more than technique and technology. In fact, too much technology may just put the solution out of reach for the target audience, by requiring system upgrades, training, and adaptation.

In closing, I end with a quote that symbolizes the consequences of offering a solution on the basis of vision without sufficient understanding of the problem, resulting in one of the costliest mistakes of the kind in recorded history:

We could hardly dream of building a kind of Great Wall of France, which would in any case be far too costly. Instead we have foreseen powerful but flexible means of organizing defense, based on the dual principle of taking full advantage of the terrain and establishing a continuous line of fire everywhere.
André Maginot, December 1929
