Anticipating Problems with Infrastructure as Code in DevOps

Nilanjan
6 min read · Oct 28, 2020

There are different interpretations of the term ‘DevOps’. However, among the DevOps values known as CAMS (Culture, Automation, Measurement, Sharing), culture is considered the most important.

The methodology begins by looking for the right people or person to fulfill a needed role. Secondly, ensure that role’s process is correct and optimal. Lastly, improve the process’ efficiency by reducing steps through some tool based automation. Notice, tools and automation are last!

People Over Process over Tools, Alex Honor.

Despite the focus on culture, collaboration and organizational change, a critical part of DevOps is the enabling technology. There are different aspects of the technology. Continuous delivery automates the entire process of delivery, including building the software, validating the build and progressing through the remaining steps in the form of a pipeline. A much more elaborate part of the technology is infrastructure as code (IaC). Infrastructure as code was pioneered by Amazon Web Services (AWS). There are newer entrants such as Microsoft Azure and Google Cloud.

The main concept of an IaC service, like AWS, is to access infrastructure such as servers, databases and networks as virtual resources. It’s critical to be able to create and access the infrastructure programmatically. You can then treat the scripts that interact with the infrastructure as software and apply software development practices, such as source code control and continuous delivery (of infrastructure), to develop and manage your code and infrastructure. IaC lets you scale components up and down at will, which greatly improves cost efficiency while still meeting customer and market demand.
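To make the idea of treating infrastructure as software concrete, here is a minimal sketch in Python. It is not any provider's API: the `Resource` class, its fields and the `plan` function are hypothetical names invented for illustration. The sketch describes the desired infrastructure as data and compares it with what currently exists, which is roughly the kind of comparison an IaC tool performs before it changes anything.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A hypothetical, provider-neutral description of one virtual resource."""
    name: str
    kind: str   # e.g. "server", "database", "network"
    size: str   # e.g. "small", "large"


def plan(desired: list[Resource], actual: list[Resource]) -> list[str]:
    """Return the actions needed to move the actual state to the desired state."""
    want = {r.name: r for r in desired}
    have = {r.name: r for r in actual}
    actions = []
    # Resources that are declared but do not exist yet.
    for name in sorted(want.keys() - have.keys()):
        actions.append(f"create {want[name].kind} {name} ({want[name].size})")
    # Resources that exist but no longer match their declaration.
    for name in sorted(want.keys() & have.keys()):
        if want[name] != have[name]:
            old, new = have[name], want[name]
            actions.append(
                f"update {name}: {old.kind}/{old.size} -> {new.kind}/{new.size}"
            )
    # Resources that exist but are no longer declared.
    for name in sorted(have.keys() - want.keys()):
        actions.append(f"delete {have[name].kind} {name}")
    return actions


desired = [Resource("web", "server", "large"), Resource("orders-db", "database", "small")]
actual = [Resource("web", "server", "small"), Resource("old-cache", "server", "small")]
for action in plan(desired, actual):
    print(action)
```

Because the desired state is plain data, it can live in source control and be reviewed like any other change, which is the point the paragraph above makes.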

DevOps is often introduced with examples from Netflix and Facebook. However, the software industry is vast. It includes software product companies like Netflix, enterprise software development in banks, telecom and government, and service providers. Given the universal availability of cloud-based infrastructure, and its accessibility thanks to low cost and the pay-as-you-go model, the size of teams adopting IaC can vary widely. The two-person team creating a mobile app, or the government department creating a health portal, is very different from a team at Netflix. The challenge for all of these teams is preventing problems when deploying IaC as part of DevOps.

How can you anticipate problems?

One way to anticipate problems is to ask questions.

A database as a service, like AWS RDS, is very appealing from both a technical and a business point of view. However, what happens when there is downtime, even for a few seconds, due to routine maintenance? For non-critical applications it may not matter (though you may still want to find out whether the application is affected). For critical applications, you may want to create a highly available solution.
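One way to explore the question of a few seconds of downtime is to check how the application behaves when a call fails briefly and then recovers. The sketch below is an illustration under assumptions, not a recommendation for any particular driver: `TransientDatabaseError`, `run_with_retry` and `flaky_query` are hypothetical names, and the outage is simulated rather than real.

```python
import time


class TransientDatabaseError(Exception):
    """Hypothetical stand-in for the connection error a real driver would raise."""


def run_with_retry(operation, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on a transient error, wait and retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientDatabaseError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)


# Simulate a database that is unreachable for the first two calls.
calls = {"count": 0}


def flaky_query():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise TransientDatabaseError("connection refused")
    return "42 rows"


delays = []  # record the waits instead of actually sleeping
result = run_with_retry(flaky_query, sleep=delays.append)
print(result, delays)
```

Whether retrying is acceptable at all depends on the application. An operation that is not safe to repeat would need a different approach, which is exactly the kind of follow-up question this exercise is meant to surface.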

If you are working on an existing application, how do you replace the single instance of the database with a highly available solution? Are there any maintenance windows? What is the nature of the business? Is the business seasonal?

Here are some questions we might ask:

  • For a database as a service, what happens when there is downtime due to maintenance?
  • Does the business have a maintenance window for replacing the database?
  • What is the nature of the business? Can they afford downtime?
  • Is the business seasonal? Are there peak times when they can’t afford downtime?

You could generalize the first question as, ‘For a cloud-based service, what happens when there is downtime?’ You could also ask, ‘For a cloud-based service, what happens when you need to upgrade the version?’

Here is a question for you to practice (my response given below):

What happens when you need to upgrade the language version (e.g., Node.js) of a function as a service?

As an aside: Is this a problem specific to cloud-based services and infrastructure as code? Would this not have been a problem if you controlled the server and environment, i.e., with physical servers?

How do you ask questions?

You want to ask questions that are open-ended. You could ask, ‘What can cause the storage to become full on an EC2 instance?’ or ‘What could cause the instance to become unresponsive?’ The answer will depend on your application and specific circumstances. (You will probably set up alerts for disk utilization reaching a threshold. However, alerts don’t replace thinking about what might go wrong.)
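For reference, a disk utilization alert of the kind mentioned above can be as small as the following sketch. It uses Python's standard `shutil.disk_usage`. The function name `disk_alert`, the 80% default threshold and the optional `usage` argument are assumptions made for illustration.

```python
import shutil


def disk_alert(path=".", threshold=0.8, usage=None):
    """Return a warning string when used space reaches the threshold, else None."""
    if usage is None:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    fraction_used = usage.used / usage.total
    if fraction_used >= threshold:
        return f"{path} is {fraction_used:.0%} full (threshold {threshold:.0%})"
    return None


print(disk_alert(".") or "disk usage below threshold")
```

The check reports that the disk is filling up but not why. The open-ended question (logs, temporary files, uploads, a runaway job?) still has to be asked by a person.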

You want to take open-ended questions a step further. What you are really looking for are the questions you hadn’t thought to ask.

The idea of asking questions might seem very unstructured. You can structure your thinking by using an approach to generate ideas. One such approach is to use the mnemonic SFDiPOT to structure your questions. SFDiPOT stands for structure, functions, data, interfaces, platform, operations, time. You could use S (structure) to think about the different components of the system, such as the database, lambdas and instances, and how they might pose a risk. In the case of IaC, the thing you are examining isn’t a physical product, but an entire platform.
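As a rough illustration of using the mnemonic to structure questions, the sketch below pairs each component with each SFDiPOT category to produce a list of prompts. The prompt wording and the `question_prompts` function are assumptions of this example and are not part of the Heuristic Test Strategy Model itself.

```python
from itertools import product

SFDIPOT = ["Structure", "Functions", "Data", "Interfaces", "Platform", "Operations", "Time"]


def question_prompts(components, categories=SFDIPOT):
    """Pair every component with every category to seed a risk discussion."""
    return [
        f"{category}: what could go wrong with the {component}?"
        for component, category in product(components, categories)
    ]


prompts = question_prompts(["database", "lambdas", "instances"])
print(len(prompts))
print(prompts[0])
```

The prompts are only starting points. The value comes from the specific questions a team writes under each one.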

You can use another mnemonic, CIDTESTD (Customers, Information, Developer relations, Test Team, Equipment & Tools, Schedules, Test Items, Deliverables), to generate ideas. If you are working with an enterprise application like Adobe AEM, do your DevOps engineers have enough knowledge of the product? If not, does that represent a risk? (The first T stands for (Test) Team and prompts you to think about risks related to the nature of the team.)

When you ask questions, you want to hold yourself and your team accountable and incorporate feedback. Retrospectives and post-mortems can identify questions that weren’t asked. You also want to identify how you could have generated ideas which would have led to those questions.

Here is what I would ask when thinking about upgrades to the environment (e.g., Node.js) for a function as a service, like AWS Lambda:

  • When we look at the Lambda code a few years from now, will it make sense to a new person/team?
  • Do we have automated tests for the larger functions?
  • Are there functions used for infrastructure, as opposed to being used by the application?
  • Do we have any automated tests for the Lambda functions?
  • Are there differences in the quality of code written by the infrastructure team as opposed to the application team?
  • Will we be able to roll back the functions after upgrade, in case there are problems?
  • Is there a development environment for the Lambda functions, i.e., does the dev environment include the Lambda functions, especially the ones used for infrastructure?
  • Will the development and production environments be kept in sync (for the next few years)?
  • Can we track differences in the development and production environment?
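The last question, tracking differences between development and production, lends itself to a small check. The sketch below assumes the settings of each function have already been exported into plain dictionaries. The function names, the settings and the `config_drift` helper are hypothetical, and the runtime strings are illustrative values only.

```python
def config_drift(dev: dict, prod: dict) -> dict:
    """Map each function name to the settings that differ between dev and prod."""
    drift = {}
    for name in sorted(dev.keys() | prod.keys()):
        if name not in dev or name not in prod:
            drift[name] = "missing in " + ("dev" if name not in dev else "prod")
            continue
        # Collect (dev value, prod value) for every setting that differs.
        changed = {
            key: (dev[name].get(key), prod[name].get(key))
            for key in sorted(dev[name].keys() | prod[name].keys())
            if dev[name].get(key) != prod[name].get(key)
        }
        if changed:
            drift[name] = changed
    return drift


# Hypothetical inventories; in practice these would be exported from each environment.
dev = {
    "resize-image": {"runtime": "nodejs18.x", "memory_mb": 256},
    "rotate-logs": {"runtime": "nodejs18.x", "memory_mb": 128},
}
prod = {
    "resize-image": {"runtime": "nodejs16.x", "memory_mb": 256},
    "nightly-backup": {"runtime": "nodejs16.x", "memory_mb": 128},
}
for name, difference in config_drift(dev, prod).items():
    print(name, difference)
```

A report like this, run regularly, would show when an infrastructure function exists only in production or when the two environments are on different language versions before an upgrade is attempted.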

Here are a few more general questions you could think about when working with infrastructure as code:

  • What can cause a component to not respond?
  • When can an instance not respond?
  • When can a component run out of capacity, e.g., storage?
  • Can you upgrade a component to a higher capacity?
  • Are higher-capacity components always available in the region where you are working?
  • Does the region you are working in affect your available options? Does that affect your choices for increasing capacity in the future?

Beyond questions

There are other activities that can generate ideas about risk, for example, stress testing a component or creating a realistic customer environment that includes real-time data. These activities can also be used to generate questions. Asking questions is part of the much broader activity of critical thinking.

For a start, I recommend beginning with asking questions in a work situation (see example on functions as a service above). You should accompany that by using retrospectives to evaluate questions that weren’t asked and identifying the ideas or activities which could have led to those questions.

The idea of asking questions related to IaC can form a foundation for thinking about your model for anticipating problems/risk in DevOps. It may also lead the team to make inroads into critical thinking.

Notes

  1. SFDiPOT and CIDTESTD are part of the Heuristic Test Strategy Model created by James Bach (https://www.satisfice.com/download/heuristic-test-strategy-model)
  2. The idea of using questions to identify risk has been used earlier, e.g., ‘a test case is a question that you ask of the program’, in What is a good test case?, Cem Kaner, Stareast Conference, May 2003
  3. The DevOps Foundations course on LinkedIn Learning is a great introduction to DevOps.

Originally published at https://www.linkedin.com.
