Production readiness checklist: ensuring smooth deployments

Production readiness is the process of ensuring that specific software components are secure, reliable, and able to perform at the expected level. Achieving it can reduce the chance of downtime, minimize the number of critical incidents or failures, and give users a better experience.

The idea stems from the production readiness review process described in Google’s SRE book. The process relies on numerous factors that all play a role in making software production-ready, but these factors look different in different engineering organizations.

Production readiness is difficult to achieve, partly because it requires applying a number of standards across the software development lifecycle, such as code review, testing, monitoring, security and access controls, documentation, and deployment workflows. The process touches everything from code to post-production operations, and the organization’s production engineering requirements can also change over time.

In simpler terms, production readiness is similar to the ‘definition of done’, an idea born from the product management world.

A ‘definition of done’ says that each stage must be completed in its entirety before the work counts as ‘done’. But nothing in the software world is ever truly ‘done’: it has to be continually reviewed, monitored and maintained. For example, a service that was production-ready when it was scaffolded won’t necessarily remain that way, because requirements change and services (and their components) can degrade.

There are different ways to make sure the production readiness checklist actually gets reviewed:

By the DevOps engineer who is performing the action (for instance, when scaffolding a service)
By the developer
Using manual lists, stored in Excel sheets or Jira
Using automated checks, such as scorecards or self-service actions in an internal developer portal

It is important that reviewing and maintaining these lists is easy and doesn’t involve manual work, even more so when services that have already been deployed need to be reviewed for production readiness.

Importance of a production readiness checklist

A production readiness checklist is exactly what it sounds like - a ready-made list of everything that you need to check about your software for production readiness.

Ensuring that software is production-ready is closely tied to software standardization; both encompass the steps needed to ensure smooth operation in a live environment.

The checklist is important because it confirms that, from an engineering viewpoint, the service has the appropriate resilience, security and performance. A slowdown, shutdown or breach of your software could have a hugely detrimental impact on your business’s reputation and bottom line, and it could also have negative consequences internally for the engineering team. On the flip side, using a checklist can improve the user experience, and as a byproduct you can retain (and grow) customer trust and revenue.

Getting started with a production readiness checklist

Before we get into what you have to include in your list, there are some things you should bear in mind when planning your production readiness checklist:

The checklist shouldn’t be static

The software development life cycle is continually evolving, so new frameworks, dependencies and technologies should be factored into any checks. Because of this evolution, the checklist, just like the ‘definition of done’, should be treated as the first step of your production readiness checks, not the final one. Even if you’ve ensured there are no bugs or vulnerabilities before deployment, you still need a way to check that new vulnerabilities have not appeared, and an approach to resolving them that is embedded in the organization’s standards. This is where ongoing review of readiness comes into play.

Automated production readiness checks are vital

While a checklist sounds like a manual approach in itself, it should evolve into automated checks to improve efficiency and accuracy. The only part that really has to be manual is compiling the list and verifying with all stakeholders that it makes sense.

When it comes to the checks themselves, each organization will have its own approach, but it’s clear that manual checks, whether in spreadsheets, project management software or Configuration Management Databases (CMDBs), are inefficient and may not be up to date, which can undermine the trust that engineers have in the process. Automated checks, which rely on production readiness scorecards in an internal developer portal, can monitor and validate readiness criteria continuously and perform checks consistently, without the human error of manual review. Automation also lets these checks go a step further: providing alerts when issues arise, enforcing policies, triggering tests and validating configurations, all of which makes the process more efficient and reliable.
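
As a rough illustration of what an automated check can look like, the sketch below evaluates a handful of readiness rules against a service record and produces a simple scorecard. The `Service` fields, the rule names, and the `evaluate`, `failed_rules` and `score` functions are all hypothetical; they are not taken from any particular developer portal, which would normally pull this data from its own catalog.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Service:
    # Hypothetical service metadata; the field names are assumptions for illustration.
    name: str
    owner: Optional[str] = None
    has_runbook: bool = False
    on_call_assigned: bool = False
    open_critical_vulns: int = 0


# Each rule maps a human-readable name to a predicate over the service.
RULES: Dict[str, Callable[[Service], bool]] = {
    "has an owner": lambda s: bool(s.owner),
    "runbook documented": lambda s: s.has_runbook,
    "on-call assigned": lambda s: s.on_call_assigned,
    "no open critical vulnerabilities": lambda s: s.open_critical_vulns == 0,
}


def evaluate(service: Service) -> Dict[str, bool]:
    """Run every rule and return a pass/fail result per rule."""
    return {name: rule(service) for name, rule in RULES.items()}


def failed_rules(service: Service) -> List[str]:
    """Names of the rules the service does not satisfy, in rule order."""
    return [name for name, passed in evaluate(service).items() if not passed]


def score(service: Service) -> int:
    """Percentage of rules passed, rounded to the nearest integer."""
    results = evaluate(service)
    return round(100 * sum(results.values()) / len(results))


if __name__ == "__main__":
    svc = Service(name="payments", owner="team-payments", has_runbook=True)
    print(f"{svc.name}: {score(svc)}% ready")
    for rule_name in failed_rules(svc):
        print(f"  FAILED: {rule_name}")
```

Keeping the rules in a plain dictionary means the list agreed with stakeholders stays in one place, and the same rules can be re-run on a schedule against services that are already deployed.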

Checklists vary greatly and are difficult to put together

Creating a production readiness checklist is challenging due to the diverse requirements of different software components (e.g., APIs, microservices). These standards vary based on numerous factors, including the infrastructure, underlying technology, and the role of each component within the overall engineering ecosystem.

Each organization requires its own set of production readiness metrics and checklists tailored to its unique:

Business needs (e.g., highly regulated industries handling sensitive data); and
Technical environments (e.g., externally exposed services needing robust security measures, or adherence to specific Kubernetes standards).

Core components of a production readiness checklist

A comprehensive production readiness checklist for a service addresses multiple factors to ensure it’s good to go. These include:

Security

Conduct vulnerability scans (are you connected to relevant scanners?)
Identify vulnerabilities through a security audit
Ensure you have SLOs and maximums set for vulnerabilities
Put role-based access controls in place
Ensure authentication and authorization methods are in place for each service
Run static application security testing (SAST) on code in the CI/CD pipeline, using tools like Snyk
Make sure secrets are properly managed
Perform penetration tests and dynamic application security testing (DAST) at the appropriate times
Use scanning tools to check that all dependencies are on the correct versions
Implement data encryption for both data at rest and in transit
Verify compliance with industry security standards
Check for other common malicious activities
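
One way to make the item about maximums for vulnerabilities checkable is to compare scanner findings against a per-severity budget. The sketch below assumes findings arrive as a plain list of severity strings; the severity names and the limits in `MAX_ALLOWED` are illustrative placeholders, not recommended values and not the output format of any specific scanner.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical per-severity maximums; choose values that match your own policy.
MAX_ALLOWED: Dict[str, int] = {"critical": 0, "high": 2, "medium": 10}


def over_budget(severities: Iterable[str]) -> Dict[str, int]:
    """Return, per severity, how many findings exceed the allowed maximum.

    Severities within their limit are omitted, so an empty dict means "within budget".
    """
    counts = Counter(s.lower() for s in severities)
    return {
        severity: counts[severity] - limit
        for severity, limit in MAX_ALLOWED.items()
        if counts[severity] > limit
    }


if __name__ == "__main__":
    findings = ["high", "high", "critical", "medium", "high"]
    excess = over_budget(findings)
    print("within budget" if not excess else f"over budget: {excess}")
```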

Scalability

Ensure the architecture is designed to handle increased loads efficiently
Stress test the application’s components to check their limits
Check whether your application can handle user or data growth
Use performance monitoring for SLOs
Establish performance benchmarks and then check these are met
Automate the CI/CD release process to enhance scalability
Run automated unit and integration tests, and require them to pass
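
To show how a performance benchmark might be turned into a pass/fail check, here is a minimal sketch that compares the 95th-percentile latency of a sample against a budget. It uses a simple nearest-rank percentile; the function names, the choice of p95 and the sample numbers are assumptions for illustration, and a monitoring system may compute percentiles differently.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_benchmark(latencies_ms: Sequence[float], p95_budget_ms: float) -> bool:
    """True when the 95th-percentile latency is within the budget."""
    return percentile(latencies_ms, 95) <= p95_budget_ms


if __name__ == "__main__":
    # Made-up latency measurements in milliseconds.
    latencies = [120, 80, 95, 110, 300, 90, 100, 85, 105, 115]
    print("benchmark met" if meets_benchmark(latencies, 250) else "benchmark missed")
```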

Reliability

Define and monitor compliance with service-level objectives (SLOs), service-level indicators (SLIs) and service-level agreements (SLAs)
Ensure disaster recovery plans are documented and tested
Keep regular backups of data
Ensure redundancy mechanisms are in place
Include automated rollback capabilities to revert to a stable version if needed
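
Monitoring SLO compliance often comes down to tracking an error budget. The sketch below shows that arithmetic for a request-based availability SLO; the function name, the 99.9% target and the request counts are assumptions chosen for the example.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative when the SLO is breached).

    slo_target is the required success ratio, e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total <= 0:
        raise ValueError("total must be positive")
    # The budget is the number of failures the SLO tolerates over this many requests.
    allowed_failures = (1 - slo_target) * total
    return 1 - failed / allowed_failures


if __name__ == "__main__":
    remaining = error_budget_remaining(slo_target=0.999, total=1_000_000, failed=250)
    print(f"error budget remaining: {remaining:.0%}")
```

With a 99.9% target over one million requests, roughly 1,000 failures are tolerated, so 250 failures leaves about three quarters of the budget unspent.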

Observability

Implement monitoring with comprehensive KPI and health metrics, logging, and tracing
Ensure you are alerted via your preferred method (Slack, email, etc.) if the status of your services changes (through broken thresholds or inconsistencies)
Use dashboards for real-time status
Use logging for incidents and errors

Ownership

Identify owners of services and components, and include easily discoverable contact information and methods
Map upstream and downstream dependencies
Identify and make discoverable related teams, stakeholders, and team members
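
A dependency map becomes more useful once it can answer "what is affected if this service breaks?". The sketch below assumes a hypothetical catalog in which each service lists the services it calls, then walks the inverted map to find everything that depends on a given service, directly or transitively. The service names and the `DEPENDS_ON` structure are invented for the example.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set

# Hypothetical catalog: each service lists the services it calls.
DEPENDS_ON: Dict[str, List[str]] = {
    "checkout": ["payments", "inventory"],
    "payments": ["ledger"],
    "inventory": [],
    "ledger": [],
}


def invert(depends_on: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each service to the set of services that call it directly."""
    dependents: Dict[str, Set[str]] = defaultdict(set)
    for service, called in depends_on.items():
        for target in called:
            dependents[target].add(service)
    return dependents


def impacted_by(service: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """All services that directly or transitively depend on `service`."""
    dependents = invert(depends_on)
    seen: Set[str] = set()
    queue = deque([service])
    while queue:  # breadth-first walk over the inverted graph
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


if __name__ == "__main__":
    print(sorted(impacted_by("ledger", DEPENDS_ON)))
```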

Incident Management

Ensure runbooks have been documented and are accessible
Assign on-call responsibilities for incidents
Designate owning teams for each service
Establish escalation policies
Test incident response process with a drill
Ensure on-call engineers can easily find the information they need during resolution

Addressing these areas ensures software is production-ready, capable of meeting user demands and maintaining reliability throughout its lifecycle.

Not all services need to track every metric listed, and some will need additional ones, such as FinOps metrics, specific Kubernetes standards, or application security standards, which aren't always part of SRE activities.

Where to store the production readiness checklist

Where you store your checklist matters because it affects how easy the list is to find, use, update and even delete (by mistake!). Often, companies store the checklist inside the GitHub repo as a Markdown (.md) file. The benefit is that it lives alongside the code and won’t get lost; the downside is that it might not be as easily accessible. Alternatives include spreadsheets, which, like manual checks, are painstaking to use and update by hand.
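
If the checklist does live in the repository as a Markdown file, even a small script can report which items are still open. The sketch below assumes the file uses GitHub-style task list syntax (`- [ ]` for open and `- [x]` for done); the sample checklist text and the function names are hypothetical.

```python
import re
from typing import List, Tuple

# Matches task list lines such as "- [ ] item" or "* [x] item".
TASK_PATTERN = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*\S)\s*$")


def parse_checklist(markdown: str) -> List[Tuple[str, bool]]:
    """Return (item text, is_done) for every task list line in the document."""
    items = []
    for line in markdown.splitlines():
        match = TASK_PATTERN.match(line)
        if match:
            items.append((match.group(2), match.group(1).lower() == "x"))
    return items


def open_items(markdown: str) -> List[str]:
    """Text of every item that has not been ticked off yet."""
    return [text for text, done in parse_checklist(markdown) if not done]


# Made-up checklist content for the example.
SAMPLE = """\
Security
- [x] Secrets are managed outside the repo
- [ ] Role-based access controls in place

Reliability
- [ ] Disaster recovery plan tested
"""

if __name__ == "__main__":
    for item in open_items(SAMPLE):
        print(f"TODO: {item}")
```

A script like this could run in CI so that open items are visible on every change, which addresses part of the accessibility downside of keeping the list in the repo.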

Key takeaways

In conclusion, a production readiness checklist is essential for making sure that your services are secure, scalable, reliable, and observable. It also plays a critical role in implementing continuous integration and deployment (CI/CD), setting service level objectives (SLOs), and establishing robust disaster recovery and rollback plans. Incorporating these elements from the initial launch and throughout subsequent updates helps maintain the ongoing health and effectiveness of your services.
