Production readiness checklist: ensuring smooth deployments

Production readiness is the process of ensuring that software components are secure, reliable and able to perform at the expected level. Achieving it can reduce the chance of downtime, minimize the number of critical incidents or failures, and give users a better experience.

The idea stems from the production readiness review process described in Google’s SRE book. The process relies on many factors that all play a role in making software production-ready, but these factors look different from one engineering organization to the next.

Production readiness is difficult to achieve, partly because it requires applying a number of standards across the software development lifecycle, such as code review, testing, monitoring, security and access controls, documentation and deployment workflows. The process touches everything from code to post-production operations, and the organization’s production engineering requirements can change over time.

In simpler terms, production readiness is similar to the ‘definition of done’, an idea born from the product management world.

A ‘definition of done’ means that each stage must be completed in its entirety to be considered ‘done’. Nothing in the software world is ever truly ‘done’, though, because it has to be continually reviewed, monitored and maintained. For example, a service that was production-ready when it was scaffolded won’t necessarily remain that way, because requirements change and services (and their components) can degrade.

There are different ways to actually ensure the review of the production readiness checklist:

It is important that reviewing and maintaining checklists is easy and doesn’t involve manual work, even more so when services that have already been deployed need to be reviewed for production readiness.

Importance of a production readiness checklist

A production readiness checklist is exactly what it sounds like - a ready-made list of everything that you need to check about your software for production readiness.

Ensuring that software is production-ready is closely tied to software standardization; both encompass the steps needed to ensure smooth operation in a live environment.

The checklist is important because it confirms that, from an engineering viewpoint, the service has the appropriate resilience, security and performance. A slowdown, shutdown or breach could have a hugely detrimental impact on your business’ reputation and bottom line, and it could also have negative consequences internally for the engineering team. On the flip side, using a checklist can improve the user experience, and as a byproduct you can retain (and grow) customer trust and revenue.

Getting started with a production readiness checklist

Before we get into what you have to include in your list, there are some things you should bear in mind when planning your production readiness checklist:

The checklist shouldn’t be static

The software development life cycle is continually evolving, so new frameworks, dependencies and technologies should be factored into any checks. Because of this evolution, the checklist, just like the ‘definition of done’, should be treated as the first step of your production readiness checks rather than the last. Even if you’ve ensured there are no bugs or vulnerabilities before deployment, you still need a way to check that new vulnerabilities have not appeared, and an approach to resolving them should be embedded in the organization’s standards. This is where ongoing review of readiness comes into play.

Automated production readiness checks are vital

While the idea of a checklist sounds like a manual approach in itself, it should evolve into automated checks to improve efficiency and accuracy. The only truly manual parts are compiling the list itself and verifying with all stakeholders that it makes sense.

When it comes to the checks themselves, each organization will have its own approach, but manual checks that rely on spreadsheets, project management software or Configuration Management Databases (CMDBs) are inefficient and may not be up to date, which can undermine the trust that engineers have in the process. Automated checks, which rely on production readiness scorecards in an internal developer portal, can monitor and validate readiness criteria continuously and perform checks consistently, avoiding manual error. Automation also lets these checks go a step further by providing alerts when issues arise, enforcing policies, triggering tests and validating configurations, all of which makes the process more efficient and reliable.
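As a rough illustration of what an automated check could look like, the following Python sketch evaluates a service's metadata against a handful of rules and reports which ones fail. The `Service` fields, the rule names and the 70% coverage threshold are hypothetical and not taken from any particular portal's API; a real setup would pull this data from the portal or CI system.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Service:
    # Hypothetical metadata a portal or CI job might collect for a service.
    name: str
    owner: Optional[str]
    has_runbook: bool
    open_critical_vulns: int
    test_coverage: float  # fraction between 0.0 and 1.0


# Each rule maps a human-readable name to a predicate over the service.
# The rules and thresholds here are illustrative assumptions.
RULES: dict[str, Callable[[Service], bool]] = {
    "has an owner": lambda s: s.owner is not None,
    "has a runbook": lambda s: s.has_runbook,
    "no open critical vulnerabilities": lambda s: s.open_critical_vulns == 0,
    "test coverage at least 70%": lambda s: s.test_coverage >= 0.7,
}


def failing_rules(service: Service) -> list[str]:
    """Return the names of all rules the service currently fails."""
    return [name for name, check in RULES.items() if not check(service)]


def readiness_score(service: Service) -> float:
    """Return the fraction of rules the service passes."""
    return 1 - len(failing_rules(service)) / len(RULES)


if __name__ == "__main__":
    svc = Service("payments", owner=None, has_runbook=True,
                  open_critical_vulns=2, test_coverage=0.82)
    for rule in failing_rules(svc):
        # In practice this could raise an alert instead of printing.
        print(f"{svc.name}: failed check '{rule}'")
    print(f"score: {readiness_score(svc):.0%}")
```

Running something like this on a schedule, rather than once before launch, is what turns a static list into a continuous check.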

Checklists vary greatly and are difficult to put together

Creating a production readiness checklist is challenging due to the diverse requirements of different software components (e.g., APIs, microservices). These requirements vary based on numerous factors, including the infrastructure, underlying technology, and the role of each component within the overall engineering ecosystem.

Each organization requires its own set of production readiness metrics and checklists tailored to its unique:

Core components of a production readiness checklist

A comprehensive production readiness checklist for a service addresses multiple factors to ensure it’s good to go. These include:

Security

Scalability

Reliability

Observability

Ownership

Incident Management

Addressing these areas ensures software is production-ready, capable of meeting user demands and maintaining reliability throughout its lifecycle.

Not all services need to track every metric listed; additional metrics might include FinOps, specific Kubernetes standards, or application security standards, which aren't always part of SRE activities.
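One way to express the fact that not every service tracks every metric is to keep a shared baseline and layer optional categories on top of it. The sketch below is illustrative only: the baseline categories come from the list above, while the service types, the extras assigned to them and the exemptions are assumptions.

```python
# Baseline categories from the core checklist above.
BASELINE = [
    "security",
    "scalability",
    "reliability",
    "observability",
    "ownership",
    "incident management",
]

# Hypothetical extras per service type; each organization would define its own.
EXTRAS_BY_TYPE = {
    "kubernetes-workload": ["kubernetes standards", "finops"],
    "public-api": ["application security"],
}

# Hypothetical per-type exemptions from the baseline.
EXEMPT_BY_TYPE = {
    "internal-batch-job": ["scalability"],
}


def checklist_for(service_type: str) -> list[str]:
    """Build the list of categories that apply to a given service type."""
    exempt = set(EXEMPT_BY_TYPE.get(service_type, []))
    categories = [c for c in BASELINE if c not in exempt]
    categories += EXTRAS_BY_TYPE.get(service_type, [])
    return categories


if __name__ == "__main__":
    for kind in ("public-api", "internal-batch-job"):
        print(kind, "->", checklist_for(kind))
```

Keeping the baseline and the per-type adjustments in separate tables makes it easier to see which standards are universal and which are deliberate exceptions.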

Where to store the production readiness checklist

Where you store your checklist matters because it affects how easy it is to find, use, update and even delete (by mistake!). Companies often store the checklist in the GitHub repo as a Markdown (.md) file. The benefit is that it lives alongside the code and won’t get lost; the downside is that it might not be as easily accessible. Alternatives include spreadsheets, which, just like manual checks, can be painstaking to use and update.
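If the checklist does live in the repo as a Markdown file, it can still be read by tooling. The sketch below assumes the file uses GitHub-style task list items (`- [ ]` and `- [x]`) and extracts the items that are not yet ticked off; the sample content is made up for illustration.

```python
import re

# Matches GitHub-style task list items such as "- [ ] text" or "* [x] text".
TASK_ITEM = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*)$")


def parse_checklist(markdown: str) -> list[tuple[bool, str]]:
    """Return (done, text) pairs for every task list item in the document."""
    items = []
    for line in markdown.splitlines():
        match = TASK_ITEM.match(line)
        if match:
            done = match.group(1).lower() == "x"
            items.append((done, match.group(2).strip()))
    return items


def unchecked(markdown: str) -> list[str]:
    """Return the text of items that are not yet ticked off."""
    return [text for done, text in parse_checklist(markdown) if not done]


if __name__ == "__main__":
    # Made-up sample content standing in for a checklist file in the repo.
    sample = """\
Security
- [x] Secrets are stored in a vault
- [ ] Dependencies scanned for vulnerabilities

Observability
- [ ] Dashboards and alerts configured
"""
    for item in unchecked(sample):
        print("TODO:", item)
```

A CI job could, for example, fail the build while unchecked items remain, although ticking a box by hand still relies on someone keeping the file honest.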

Key takeaways

In conclusion, a production readiness checklist is essential for guaranteeing that your services are secure, scalable, reliable, and observable. It also plays a critical role in implementing continuous integration and deployment (CI/CD), setting service level objectives (SLOs), and establishing robust disaster recovery and rollback plans. Incorporating these elements from the initial launch and throughout subsequent updates ensures the ongoing health and effectiveness of your services.
