DevOps Defect Catalog — from 1988’s Cem Kaner

Nov 1, 2020

Good ideas in software testing endure, despite changes in technology and development approaches. In 1988, Cem Kaner first described a ‘defect catalog’ for software in his landmark book, Testing Computer Software. In this blog post I describe how to create a defect catalog for Infrastructure as Code as part of DevOps.

Kaner’s defect catalog was a list of common errors in software. He listed the different ways you can use such a list:

Evaluate test materials developed for you by someone else.
Develop your own tests.
Generate hypotheses for bugs which are difficult to reproduce.
Generate related ideas for unexpected bugs.

Here are a few examples of defects in Kaner’s defect catalog:

Abbreviating delete to del but list to ls and grep to grep makes no sense.
— Cem Kaner
Undo is desirable. Undelete is essential.
— Cem Kaner

A defect in a defect catalog is more general than a defect logged against a particular piece of software. A customer or a tester (scrum developer) might complain that they can’t quit midway through importing data into a spreadsheet. The logged defect would include the specific steps to reproduce the problem and the workflow. In a defect catalog, this becomes a more generic lesson: ‘Can’t stop mid-command’. This generic idea applies to many different situations, whether it’s a wizard or another workflow.

A defect catalog is not meant to be used like a checklist. You can use it to generate ideas on what might go wrong (#2 and #3 from the preceding list). You still need to do the hard work of determining whether the defect is relevant to your context.

Defect catalogs depend on context. When you are creating a bar chart, you may want to sort the axis categories manually rather than alphabetically, e.g., High, Medium, Low (instead of High, Low, Medium). This might translate to a defect in the catalog, ‘Sorting depends on the type of data’. When you are looking at usernames in a password manager, sorting the list alphabetically is sufficient. The sorting defect may not be of value for the password manager.

Infrastructure as Code in DevOps Defect Catalog

In an earlier article I wrote about how you can anticipate problems in Infrastructure as Code as part of DevOps. Although the word DevOps emphasizes collaboration and organizational change, DevOps also relies on a significant amount of technology. Below I have listed a few defects, with their descriptions, in the form of a defect catalog. The focus is on potential problems when creating Infrastructure as Code.

Risk of using defaults

When you work with a service, could you be using any defaults which pose a security risk? Is it possible that you are using some default components without thinking about them?

When you create new EC2 instances in AWS, a default security group is used. You might be relying on the default group instead of creating groups with custom, specific permissions.

Given the large number of services in a platform like AWS, you may be using defaults which can pose a risk.

This may be a bigger problem with service providers or teams working with multiple vendors, or those working on multiple projects. Each project may have a different setup and different set of services. In that case, you may use defaults without realizing it.
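
A check like this can be sketched in a few lines. The example below is only an illustration: it assumes the instance data has the shape returned by the EC2 DescribeInstances API (Reservations, Instances, SecurityGroups with a GroupName), and the function name and sample data are hypothetical. Fetching the data from AWS is left out.

```python
def instances_using_default_group(describe_instances_response):
    """Return the IDs of instances attached to a security group named 'default'."""
    flagged = []
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            group_names = [g.get("GroupName") for g in instance.get("SecurityGroups", [])]
            if "default" in group_names:
                flagged.append(instance["InstanceId"])
    return flagged


# Hypothetical sample data, shaped like a DescribeInstances response.
sample = {
    "Reservations": [
        {
            "Instances": [
                {"InstanceId": "i-aaa", "SecurityGroups": [{"GroupName": "default", "GroupId": "sg-1"}]},
                {"InstanceId": "i-bbb", "SecurityGroups": [{"GroupName": "web-only", "GroupId": "sg-2"}]},
            ]
        }
    ]
}

print(instances_using_default_group(sample))  # ['i-aaa']
```

The same pattern (walk the inventory, flag anything still on a default) could be applied to other services where defaults exist.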

Packaged components which are not up to date

Are there plug-ins, libraries or other components which are out of date? If they are not actively maintained, do they pose a security risk?

An AMI which has not been updated for months could be a security risk. Can we detect AMIs which are older than a certain number of days or months?

An outdated AMI may be missing operating system patches, which leaves instances launched from it vulnerable. Containers may also be vulnerable if their images have not been updated.
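
Detecting old images can be sketched as a simple age check. This is a minimal example under stated assumptions: each image is a dictionary with ImageId and CreationDate fields, with the date formatted as in EC2 DescribeImages output (for example 2020-01-01T00:00:00.000Z). The function name, the 90-day threshold and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_images(images, max_age_days, now=None):
    """Return the IDs of images created more than max_age_days before 'now'."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for image in images:
        # Assumed date format: 2020-01-01T00:00:00.000Z (UTC).
        created = datetime.strptime(
            image["CreationDate"], "%Y-%m-%dT%H:%M:%S.%fZ"
        ).replace(tzinfo=timezone.utc)
        if created < cutoff:
            stale.append(image["ImageId"])
    return stale


# Hypothetical sample data for illustration.
images = [
    {"ImageId": "ami-old", "CreationDate": "2020-01-01T00:00:00.000Z"},
    {"ImageId": "ami-new", "CreationDate": "2020-10-15T00:00:00.000Z"},
]
as_of = datetime(2020, 11, 1, tzinfo=timezone.utc)

print(stale_images(images, max_age_days=90, now=as_of))  # ['ami-old']
```

Passing the reference date explicitly keeps the check repeatable; the right threshold depends on how often your base images are rebuilt.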

Prevent resources from being inadvertently deleted

How do we make sure important resources are not inadvertently deleted? We can prevent a database from being deleted by setting a flag. We can prevent storage from being deleted by requiring MFA. Are there other resources which we want to protect from deletion? Are there resources for which it doesn’t matter whether they are deleted?

Enabling MFA Delete on a versioned S3 bucket requires multi-factor authentication before an object version can be permanently deleted.
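
As a rough sketch, both protections can be checked from configuration data. The example assumes database entries shaped like RDS DescribeDBInstances output (DBInstanceIdentifier, DeletionProtection) and bucket entries shaped like S3 GetBucketVersioning output (MFADelete). The function name and the sample data are hypothetical.

```python
def unprotected_resources(db_instances, versioning_by_bucket):
    """Return (kind, name) pairs for resources that lack deletion protection."""
    findings = []
    for db in db_instances:
        # A missing flag is treated the same as a disabled one.
        if not db.get("DeletionProtection", False):
            findings.append(("rds", db["DBInstanceIdentifier"]))
    for bucket_name, versioning in versioning_by_bucket.items():
        if versioning.get("MFADelete") != "Enabled":
            findings.append(("s3", bucket_name))
    return findings


# Hypothetical sample data for illustration.
dbs = [
    {"DBInstanceIdentifier": "orders-db", "DeletionProtection": True},
    {"DBInstanceIdentifier": "scratch-db", "DeletionProtection": False},
]
buckets = {
    "audit-logs": {"Status": "Enabled", "MFADelete": "Enabled"},
    "tmp-uploads": {"Status": "Enabled"},
}

print(unprotected_resources(dbs, buckets))
# [('rds', 'scratch-db'), ('s3', 'tmp-uploads')]
```

Not every finding is a problem. For a scratch database or a temporary bucket, the answer to "does it matter if this is deleted?" may well be no.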

Unused resources

Are there any resources which are not used? Unused security credentials may be a security risk. Are there access keys which are not being used? Can we detect unused resources?
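
Unused access keys can be found by looking at when each key was last used. The sketch below assumes a mapping from access key ID to a dictionary shaped like the AccessKeyLastUsed part of the IAM GetAccessKeyLastUsed response, where LastUsedDate is absent for a key that has never been used. The function name, the idle threshold and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def unused_keys(last_used_by_key, max_idle_days, now=None):
    """Return key IDs never used, or not used within max_idle_days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_idle_days)
    unused = []
    for key_id, last_used in last_used_by_key.items():
        last_used_date = last_used.get("LastUsedDate")
        # No LastUsedDate is taken to mean the key was never used.
        if last_used_date is None or last_used_date < cutoff:
            unused.append(key_id)
    return sorted(unused)


# Hypothetical sample data for illustration.
as_of = datetime(2020, 11, 1, tzinfo=timezone.utc)
keys = {
    "KEY-RECENT": {"LastUsedDate": datetime(2020, 10, 30, tzinfo=timezone.utc)},
    "KEY-IDLE": {"LastUsedDate": datetime(2020, 1, 5, tzinfo=timezone.utc)},
    "KEY-NEVER": {},
}

print(unused_keys(keys, max_idle_days=90, now=as_of))  # ['KEY-IDLE', 'KEY-NEVER']
```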

How can you convert risks into a defect catalog?

I created these defects, as part of a defect catalog, from the rules used by a tool for Infrastructure security and compliance, Cloud Conformity (from Trend Micro). If you are unsure about what type of defects you can expect from IaC, Cloud Conformity’s rules provide great insight. While security and performance are obvious risks for IaC, Cloud Conformity provides insights into the five pillars of AWS’s well-architected framework — Operational Excellence, Security, Reliability, Performance Efficiency and Cost Optimization.

I analyzed each rule and converted it into a generic version. For example, while we want to prevent AWS S3 buckets from being accidentally deleted, we also want to make sure that other resources are not accidentally deleted.
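
One way to record this conversion is to store the generic defect together with the specific rule it came from and the resources it might apply to. The sketch below is only one possible representation: the CatalogEntry class and its fields are hypothetical, and the rule identifier is taken from the Cloud Conformity rule URL listed in the notes.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """A generic defect, plus the specific rules it was generalized from."""
    title: str
    question: str
    source_rules: list = field(default_factory=list)
    applies_to: list = field(default_factory=list)

    def prompts(self):
        """Turn the generic defect into one review question per resource type."""
        return [f"{self.title}: does this apply to {resource}?" for resource in self.applies_to]


entry = CatalogEntry(
    title="Prevent resources from being inadvertently deleted",
    question="How do we make sure important resources are not inadvertently deleted?",
    source_rules=["s3-bucket-mfa-delete-enabled"],
    applies_to=["S3 buckets", "databases"],
)

for prompt in entry.prompts():
    print(prompt)
```

Keeping the source rule next to the generic title makes it easy to trace a catalog entry back to the concrete check that inspired it, and to add new resource types as they come up in discussion.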

Cem Kaner’s original defect catalog provides many ideas for thinking about defects. Although the catalog was published long before current software development practices and technology, it still contains many useful ideas about what can go wrong.

Approach to anticipating risk

You could automate the detection of specific instances of the defects described. In fact, Cloud Conformity does automate the detection of these problems. Beyond automation, however, the value of a defect catalog is that it keeps you thinking about new risks. Adding a defect to a defect catalog also generates new discussions and ideas related to risk:

Is this catalog defect description sufficiently generic? How do you make it generic?
Which resources might this apply to?
How do different cloud platforms handle this?
You create new models of thinking. In the second example from Kaner’s catalog, you realize that ‘Undelete’ is a special case of ‘Undo’. For your cloud-based infrastructure, when do you need an ‘Undelete’ instead of an ‘Undo’?

Once you discover new risks, you can create new rules which can then be automated. Agile retrospectives and incident postmortems can generate new defects for a defect catalog. Generating new defects for a catalog is not optional; it is part of agile’s feedback loop. Thinking about the defects in a catalog, and automating their detection once you discover them pre-release or in postmortems, is a powerful combination for anticipating problems and addressing risk with Infrastructure as Code.

Notes

Cem Kaner, Jack Falk, & Hung Quoc Nguyen, Testing Computer Software (2nd Ed.), International Thomson Computer Press, 1993. The first edition was published in 1988.

The defect catalog was published in the Appendix of Cem Kaner’s book and is available for download.

Cem Kaner used the word ‘taxonomy’ instead of ‘defect catalog’ in later publications: Giri Vijayaraghavan & Cem Kaner, “Bug taxonomies: Use them to generate better tests.” [SLIDES] Software Testing, Analysis & Review Conference (Star East), Orlando, FL, May 12–16, 2003

Anticipating problems with Infrastructure as code in DevOps

Cloud Conformity Rules referenced

Unused resources: https://www.cloudconformity.com/conformity-rules/IAM/credentials-last-used.html

Redundant resources: https://www.cloudconformity.com/conformity-rules/IAM/unnecessary-access-keys.html

Exposure due to multiple functions: https://www.cloudconformity.com/conformity-rules/S3/buckets-with-website-configurations.html

Logging access requests for resources to audit unauthorized access: https://www.cloudconformity.com/conformity-rules/S3/s3-bucket-logging-enabled.html

Prevent resources from being inadvertently deleted: https://www.cloudconformity.com/conformity-rules/S3/s3-bucket-mfa-delete-enabled.html

Implications of automatically launching resources: https://www.cloudconformity.com/conformity-rules/AutoScaling/cooldown-period.html

— — — — — — — — — — — — — -

Some more rules converted to a catalog

Detect configuration changes in resources

Can we detect configuration changes in resources? Can we detect changes which are unauthorized?

An example of this is changes to S3 (AWS Simple Storage Service) configuration.
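
A common way to detect such changes is to compare a fingerprint of the current configuration with a stored baseline. The sketch below assumes configurations are available as JSON-serializable dictionaries; the function names, the bucket names and the sample configurations are hypothetical. It detects that something changed, not whether the change was authorized.

```python
import hashlib
import json


def fingerprint(config):
    """Hash a configuration so that key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_resources(baseline, current):
    """Compare current configs (name -> config) with baseline fingerprints (name -> hash)."""
    changed = []
    for name, config in current.items():
        # A resource missing from the baseline is also reported.
        if baseline.get(name) != fingerprint(config):
            changed.append(name)
    return sorted(changed)


# Hypothetical sample data for illustration.
baseline = {
    "reports-bucket": fingerprint({"versioning": "Enabled", "logging": True}),
    "assets-bucket": fingerprint({"versioning": "Enabled", "logging": True}),
}
current = {
    "reports-bucket": {"logging": True, "versioning": "Enabled"},
    "assets-bucket": {"versioning": "Suspended", "logging": True},
}

print(changed_resources(baseline, current))  # ['assets-bucket']
```

Deciding whether a detected change was authorized still needs a separate source of truth, such as the reviewed IaC templates or a change log.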

Components with multiple functions with a less used function creating an unnecessary exposure

When you have a single component with multiple functions, you might expose the component due to misconfiguration of the less used function.

AWS S3 allows you to configure a bucket to host a website. Enabling buckets to host a website might expose data. We should periodically review buckets which are configured to host a website.

The more common use of S3 is for secure storage. It is handy to host simple static websites on S3. It’s possible to overlook buckets which host websites and create an exposure.

Logging access requests for resources to audit unauthorized access

Are we logging all access requests for resources? Is access logging enabled for the service by default? AWS S3 server access logging should be enabled to record the requests made to a bucket. It is disabled by default.
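
A check for this can be sketched as follows. It assumes a mapping from bucket name to a dictionary shaped like the S3 GetBucketLogging response, which contains a LoggingEnabled element only when logging is turned on. The function name and the sample data are hypothetical.

```python
def buckets_without_access_logging(logging_by_bucket):
    """Return the names of buckets whose logging response has no LoggingEnabled element."""
    return sorted(
        name
        for name, response in logging_by_bucket.items()
        if "LoggingEnabled" not in response
    )


# Hypothetical sample data for illustration.
logging_status = {
    "customer-data": {
        "LoggingEnabled": {"TargetBucket": "access-logs", "TargetPrefix": "customer-data/"}
    },
    "static-assets": {},
}

print(buckets_without_access_logging(logging_status))  # ['static-assets']
```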

Redundant resources

Resources may be created for special purposes. A second IAM access key may be created as part of key rotation. Once rotation is complete, it is a good idea to delete the old key. Redundant resources carry the same risk as unused resources. Can we detect such redundant resources?
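
As a minimal sketch, redundant keys can be spotted by counting active keys per user. The example assumes a list of entries shaped like the AccessKeyMetadata items returned by IAM ListAccessKeys (UserName, AccessKeyId, Status). The function name, user names and key IDs are hypothetical.

```python
from collections import defaultdict


def users_with_multiple_active_keys(access_key_metadata):
    """Return {user: [key IDs]} for users holding more than one active key."""
    active = defaultdict(list)
    for key in access_key_metadata:
        if key.get("Status") == "Active":
            active[key["UserName"]].append(key["AccessKeyId"])
    return {user: key_ids for user, key_ids in active.items() if len(key_ids) > 1}


# Hypothetical sample data for illustration.
metadata = [
    {"UserName": "deploy-bot", "AccessKeyId": "KEY-1", "Status": "Active"},
    {"UserName": "deploy-bot", "AccessKeyId": "KEY-2", "Status": "Active"},
    {"UserName": "report-job", "AccessKeyId": "KEY-3", "Status": "Active"},
    {"UserName": "report-job", "AccessKeyId": "KEY-4", "Status": "Inactive"},
]

print(users_with_multiple_active_keys(metadata))  # {'deploy-bot': ['KEY-1', 'KEY-2']}
```

A user with two active keys may simply be in the middle of a rotation, so a finding here is a prompt to check, not proof of a problem.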

Implications of automatically launching resources

An important part of IaC is the ability to launch or re-launch resources automatically. Instances may be launched to handle increased load, as part of auto-scaling. There are additional considerations when launching resources this way. Do newly launched instances need a cooldown period so that they have time to start handling traffic before more instances are launched?

Originally published at https://www.linkedin.com.