Wednesday, November 11, 2015

DevOps: On walls and trenches


Know thy wall, mind your trenches. (Geekish alert.)

A few years back, a colleague introduced the notion of DevOps to a large internal audience as "we want to be more like [insert your favorite SaaS startup here]", much to the delight of various management and executive teams, besieged with mounting back-pressure in the sales pipeline as a result of long customer deployment cycles.

The enthusiasm after those types of sessions stems from the general notion that once the walls come down between the development and operations teams, a new world of productivity ensues, with multiple deployments a day and every new feature reaching customer hands minutes after delivery to the code streams, ready for usage and, therefore, ready for sale.

Beware the strawman: within hours one may have a demolition crew looking for the walls about to be brought down, hammer in one hand, clenched fist in the other, both united by a chest full of seething rage against the walls. If you ever find yourself leading such a mob, pause for a moment...actually for two moments, during which I need to offer you the most important advice in the art of bringing down walls: "know thy wall".

Is it a wall?

[Image: Berlin Wall]

Sometimes organizations do not operate their products; they simply build them and sell them to other shops, who are then responsible for standing up hardware, loading up the software and relying on a long chain of support streams to relay any software problem back to its manufacturer.

As a software developer, you are insulated from the good and the bad. There is little access to feedback on how the software is used by end users, even less about how it is installed, configured and managed. There is also less contact with upset customers and minimal exposure to the funny hours at which the systems decide to act up over the myriad of defects that escape the development cycle.

If an organization operates under that model, it is living inside a bunker. Its developers are understandably the audience most attracted to the wall-bashing revolution, but for the wrong reasons. Energy would be better spent moving into a SaaS business model than attempting to influence operations teams that are likely outside their control.


Why was it built?

Assuming you passed the first test, you are running at least a SaaS operation and you have a proper wall between your development and operations teams.

Refrain from delusions of grandeur and realize your wall is not of the tyrannical, country-splitting kind, but of the garden-variety blueprint, such as the ones built for property protection, sound insulation or soil retention. In other words, unless the underlying motivations behind the construction of the wall have been addressed over time, your wall still serves a purpose.

The reason most walls between development and operations were built is that (and brace for the bar brawl) software development and systems administration are fundamentally different activities.

A software developer is specialized in shaping a deliverable out of thin air, from inception, to elaboration, to construction (coding, testing), to transition (to operations). Resist the urge here, for a moment, to declare this the "old way" of building software, because these phases still exist even in the wildest agile lean-guild-squad-pizza-night-sleeping-in-the-office deployment cycles.

System administrators are specialized in planning deployment cycles, provisioning systems, wiring them together, loading them with software, rigging everything with probes and hoping the systems stay really quiet and out of sight while repeating the entire cycle.

A thinner wall is still a wall

[Image: Cranes, Pines, and Bamboo]

I have been on both the development and operational sides of the wall and it is really disheartening to see the amount of misinformed passion thrown into conflating the continuous pipelines advocated by the DevOps method with the conclusion that development and operations can be unified under a single organization (or tribe, if you are so inclined).

Mix passion, misinformation and a pronounced shortage of trained system administrators, and many organizations may soon find themselves falling into the trap of really tearing down the walls and assigning their SaaS application developers to operate the platform. Soon they start to realize what was behind that wall: operating systems, security patching, operational architecture, scalability for log retention systems, compliance, alerting policies, escalation policies, on-call schedules, maintenance windows, war rooms, and many other tools and processes that will eat into the development resources disproportionately to the time invested in retraining developers to perform those activities.

As a software development manager, if you are ever invited by someone with a hammer to a wall-tearing party, politely redirect the conversation to a proper read of "The twelve-factor app", and emphasize the need to get rid of the trenches (see below) rather than tearing down the walls.

I have seen many debates along the lines of the excellent comments section on "I Don't Want DevOps. I Want NoOps", which conflates the reduced operational costs of running an application on top of a PaaS stack with having no operational needs whatsoever. I can attest to the reduced costs of development and operations in such an arrangement, but they are still distinct activities that require different skill sets, unless one tries really hard to confuse the development of operational tools (e.g. an automated generator of trouble-tickets) with the development of the application providing the function to the end users.

Beware of the trenches

[Image: World War I Marines in a Trench, circa 1918]

The worst enemies of faster delivery cycles are not the walls between development and operations, but rather the trenches both camps have dug over time. The true DevOps allure is really in getting both sides out of the trenches and shaking hands.

A few examples of software features loved by the operations teams, where continuous interaction and improvements can really make the software shine on the operations floor:
  1. "Pets, not Cattle" architecture. With the exception of databases, all other components should be horizontally scalable and disposable.

  2. Database High Availability and Disaster Recovery as an integral part of the architecture. Many database technologies offer a whole spectrum of trade-offs in their many alternatives for HA+DR, and the application owners have to be explicit about the interrelationships between the application and these trade-offs. For instance, a database technology may offer different settings for transaction synchronization across primary and standby nodes, some favoring transaction speed, others geared towards complete reliability. There is a fine line between "my application can work with 2 of these 3 modes; mode A sometimes allows data to be lost and the system completely implodes when that happens" versus "our app uses a database that supports HA+DR, I am sure it will be ok."

  3. Automated delivery pipelines *for good quality* software: A continuous pipeline delivering new software versions every hour may sound like a nightmare for an operations team, but only when the outcome of every build is full of regression problems. There is still room for behavioural changes in the software that may throw off the operational monitors and procedures, but there is always the next bullet.

  4. Documented key performance metrics: One of the most respected software developers in my book once said "read the code", but realistically, not everything under the operations roof is open-source, properly written, or simple enough to be as consumable as proper documentation. That list of metrics, paired with a written explanation of their implications for end users, is a fundamental artifact for an operations team to rig the software with all their probes, watch for the right things and trigger the right alarms.

  5. Documented configuration settings: Once again, "read the code" is just not enough. The operations team needs a full list of configurable settings, their data types, their ranges, and a few paragraphs about the implications of changing the values.

  6. Health end-points: It is a RESTful world out there; any self-respecting SaaS offering must have a simple URL available to the operations team to get an immediate internal view of the SaaS health, containing basic metadata (version, name, development support page, others as needed), connectivity data about the status of system dependencies (e.g. database at a given URL is down), and the status of various system functions (e.g. console login is down). Structured APIs, please. JSON or XML are good starting points since they have readily available parsers for virtually all programming languages. A minimal sketch of such an end-point follows this list.

  7. Statistics end-points: Once again, it is a RESTful world out there; whenever an end user (or a probe) reports slow response times, applications must offer a URL that allows a system administrator to quickly gauge the worst, best, mean and median response times, calculated over different intervals of time, such as "last 5 minutes", "last 30 minutes", "last 12 hours", etc. One can successfully argue that the statistical aspect could be handled by the monitoring infrastructure, and one could be right. A sketch of such an end-point also follows this list.

  8. Support for synthetic transactions: Tracking down the causes of a slow system requires a deep understanding of the underlying sub-transactions invoked by the end-user system. The application should expose dedicated RESTful endpoints (in the form of different URLs, special headers or query parameters) that return a breakdown of the transaction across all component systems. Naturally, there should be documentation about the list of synthetic transactions, along with their respective breakdowns and linkage to the exact address of the systems called in each sub-transaction. A sketch follows this list.

  9. Administrative logs: End-points and synthetic transactions go a long way towards initial system troubleshooting, but when these less expensive means fail to surface what is happening to the system, it is time for a painstaking scrub of system activity. A well-thought-out logging strategy, with clear references to key moments in the system and terminology lined up with the system architecture, is essential in guiding system administrators towards the root cause of a problem.

  10. Access to the QA test cases, hopefully written using a set of technologies agreed upon with the monitoring team. If you look hard enough, anything that assures the proper functioning of the system at development time may be useful during the regular operation of the system. Imagine, for instance, an expensive QA module that simulates an end user creating a system account, changing the account password, logging out and logging back in with the new password. Now imagine how the actual production authorization system may be subject to load-balancing and replication policies where that particular sequence may break for a period of time and impact end users. The operations team can definitely benefit from simply letting that test case run continuously under the monitoring layer and alerting operators in case of failures; a sketch of that arrangement also follows this list.
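
To make item 6 concrete, here is a minimal sketch of such a health end-point, written in Python with Flask purely for illustration; the URL path, service name, version, support page and dependency probe are my own assumptions, not a prescription.

    # Minimal health end-point sketch (assumptions: Flask is installed; the
    # service name, version, support URL and dependency probe are hypothetical).
    from flask import Flask, jsonify

    app = Flask(__name__)

    def database_is_reachable():
        # Placeholder probe: replace with a real connectivity check against the database.
        return True

    @app.route("/health")
    def health():
        dependencies = {"database": "up" if database_is_reachable() else "down"}
        functions = {"console_login": "up"}
        status = "ok" if all(v == "up" for v in dependencies.values()) else "degraded"
        return jsonify({
            "name": "example-saas-app",
            "version": "1.4.2",
            "support": "https://example.com/dev-support",
            "status": status,
            "dependencies": dependencies,
            "functions": functions,
        })

    if __name__ == "__main__":
        app.run(port=8080)

An operations probe can then poll that single URL and raise an alarm whenever the status field reads anything other than "ok".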
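
For item 7, a sketch of what a statistics end-point could look like, again assuming Flask; the in-memory sample store and the window sizes are illustrative only, and a real deployment would likely delegate the aggregation to the monitoring stack.

    # Statistics end-point sketch (assumptions: Flask; response times are kept
    # in process memory, so they do not survive restarts or multiple workers).
    import time
    from statistics import mean, median
    from flask import Flask, g, jsonify

    app = Flask(__name__)
    _samples = []  # list of (timestamp, duration in seconds)

    @app.before_request
    def _start_timer():
        g.start_time = time.time()

    @app.after_request
    def _record_duration(response):
        _samples.append((time.time(), time.time() - g.start_time))
        return response

    def _window_stats(window_seconds):
        cutoff = time.time() - window_seconds
        durations = [d for (t, d) in _samples if t >= cutoff]
        if not durations:
            return None
        return {
            "count": len(durations),
            "best": min(durations),
            "worst": max(durations),
            "mean": mean(durations),
            "median": median(durations),
        }

    @app.route("/stats")
    def stats():
        return jsonify({
            "last_5_minutes": _window_stats(5 * 60),
            "last_30_minutes": _window_stats(30 * 60),
            "last_12_hours": _window_stats(12 * 3600),
        })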
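
For item 8, a sketch of a synthetic-transaction breakdown; the header name, transaction name and downstream addresses are made up for the example, and the point is only the shape of the response handed to the operations team.

    # Synthetic transaction sketch (assumptions: Flask and requests are
    # installed; the header, transaction and downstream URLs are hypothetical).
    import time
    import requests
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Hypothetical downstream systems touched by a "create order" transaction.
    SUBSYSTEMS = {
        "auth-service": "https://auth.internal.example.com/ping",
        "inventory-db": "https://inventory.internal.example.com/ping",
    }

    @app.route("/orders/synthetic")
    def synthetic_order():
        # Only produce the breakdown when the caller asks for it explicitly.
        if request.headers.get("X-Synthetic-Transaction") != "true":
            return jsonify({"error": "synthetic transaction header required"}), 400
        breakdown = []
        for name, url in SUBSYSTEMS.items():
            started = time.time()
            try:
                http_status = requests.get(url, timeout=2).status_code
            except requests.RequestException:
                http_status = None
            breakdown.append({
                "component": name,
                "address": url,
                "http_status": http_status,
                "elapsed_seconds": round(time.time() - started, 3),
            })
        return jsonify({"transaction": "create-order", "breakdown": breakdown})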
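
Finally, for item 10, a sketch of how an existing QA test case could be looped under the monitoring layer; the test body, the interval and the alerting hook are placeholders for whatever the QA and monitoring teams already have in place.

    # QA-test-as-monitor sketch (assumptions: the QA flow and the alerting hook
    # are placeholders for the real QA module and the real paging system).
    import time
    import traceback

    def qa_password_change_flow():
        # Placeholder for the real QA module: create an account, change its
        # password, log out, log back in with the new password; raise on failure.
        pass

    def alert_operations(message):
        # Placeholder: page the on-call operator through the existing alerting system.
        print("ALERT:", message)

    def run_forever(interval_seconds=300):
        while True:
            try:
                qa_password_change_flow()
            except Exception:
                alert_operations("password change flow failed:\n" + traceback.format_exc())
            time.sleep(interval_seconds)

    if __name__ == "__main__":
        run_forever()
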
A few examples of things the development teams really appreciate from their operations organization:
  1. Access to the incident database: leaving aside surmountable aspects such as the eventual need to obfuscate the customer identity, there is obvious value in knowing about critical system failures, the timeline of resolution, and the steps taken by the ops team to detect and resolve the problem. All this information can be immediately applied to drive improvements to most of the points raised before, such as new tests in the delivery pipeline, additional performance indicators, additional information in the health endpoints, additional configuration settings, and many others.

  2. Access to the live data for health and statistics endpoints: once again, leaving surmountable concerns aside, such as security and credential management, there is immediate value for the development team in studying the correlation between customer loads and the system metrics, such as the increase in response times as the number and nature of requests change over time.

  3. Access to the application logs: in an age of SaaS offerings for log aggregation, application development teams really do not need much from their operations team in this regard, but if the organization strategy calls for in-house log aggregation systems, then it is imperative that application developers have complete access to their own application logs.

  4. Access to the monitoring data for synthetic transactions: the previous examples allow a development organization to build their own data collection and aggregation system, but the ensuing duplication of efforts is rather counter-productive. 

Many developers will point out that nothing stops them from coding back-doors into the system to get access to the system data, but there should always be full disclosure of such back-doors to the operations team, at best so that there is awareness, at worst so that compliance laws are not violated (e.g. a backdoor that allows access to user information could be in direct violation of privacy laws).

OaaS, the trenches reinvented, for better or worse

It is a new world of productivity where smaller organizations can put out complex solutions that would rival those of a large organization from 10 years ago.

Development, provisioning and monitoring tools have become accessible to the point of reaching a critical mass of adoption, whether as commodities available for local deployment in a data center or as full-fledged SaaS offerings that obviate the need for local deployments. That said, tools, systems, and processes are not at the singularity point where operations can be seen as just an extension of a development cycle.

There are nascent efforts in Operations as a Service that will be very interesting to watch in the coming months, especially in relation to PaaS offerings and how much customization will be possible in the OaaS provider to fit existing DevOps pipelines, particularly when these pipelines are becoming increasingly available as add-ons in the PaaS offerings themselves.

Realistically, I think OaaS will be a niche offering akin to software development outsourcing, with the accompanying explanations of how this time it will be different from the first time (other than it won't).

In my opinion, the current crop of companies co-opting the acronym are doing a disservice to what true OaaS should be: a natural evolution of PaaS where a standard (we still do those, right?) will need to be created to establish the interfaces between applications and the operations floor before any mass progress can be made on shielding development organizations from attempting to master operations, while still allowing the development team to retain full control of the DevOps pipeline.
