Monitoring has long been an important part of network and infrastructure operations. It has received much more attention with DevOps, where it forms part of the feedback loop from operations.
I happened to sit in on two DevOps/Ops engineers talking about monitoring to, among others, a junior engineer. My own focus was on testing.
How do you test monitoring? What does that even mean?!!
If you are new to monitoring, you may realise there are tools available for it, like New Relic. Can we just deploy New Relic and be done with it? When deploying any software, it’s common to think, “Of course, there may be some glitches — that is expected”. Is it really that simple?
In these notes I don’t focus much on the construction part of monitoring. I imagine someone whose mission is to test monitoring, if that is even possible. What questions should that person ask? You could also imagine a junior DevOps engineer, inexperienced with monitoring, asking the same type of questions when tasked with deploying it.
So, what is monitoring? I won’t get into complex definitions. If my website is slow, I need to know. This could mean temporary slowness or that the entire website is down. I need to get an email. I then need to take action to fix the problem.
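That simple definition can be sketched as code. Below is a minimal, hypothetical uptime probe in Python; the URL, the two-second threshold, and the status names are all my assumptions, and a real deployment would run this on a schedule and send email (e.g. via smtplib) rather than return a string.

```python
import time
import urllib.request

# Hypothetical threshold -- tune it for your own site.
SLOW_THRESHOLD_SECONDS = 2.0

def classify(elapsed, error=None, threshold=SLOW_THRESHOLD_SECONDS):
    """Turn one probe result into an alert status."""
    if error is not None:
        return "down"   # couldn't reach the site at all
    if elapsed > threshold:
        return "slow"   # reachable, but users are waiting
    return "ok"

def probe(url, timeout=10):
    """Fetch the URL once, time it, and classify the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            pass
        return classify(time.monotonic() - start)
    except OSError as exc:
        return classify(0.0, error=exc)

# A real monitor would loop: schedule probe("https://example.com")
# every minute and alert on "slow" or "down".
```

Even this toy raises real questions: probing from one machine only tells you about one network path, and a single threshold ignores variance, which is exactly why the questions below matter.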
I’ve shared links on the construction aspect of monitoring at the end of this article.
Process of deploying and validating monitoring
Deploying monitoring is unlike deploying a software application. You first decide which tools you want to use. There may not be much value in deploying to a non-production environment, and deployment is not a go/no-go decision: you can deploy to production, start collecting data, and refine the process over time. You depend on data generated by real users in a real production environment, and you may not be able to simulate real user behavior.
Monitoring is often deployed by one or two engineers, though teams like Netflix involve much larger groups. Monitoring alerts may go to many team members across the larger development team.
Monitoring requires deep knowledge of many aspects of software and infrastructure. It’s natural to keep learning about the technology in order to ask relevant questions. When testing, you start from your current state of knowledge, then ask questions and perform experiments as part of learning. That is a different approach from deployment or construction.
In the rest of this article, I give examples of questions you might ask when deploying or using monitoring. Some may arise from experience after deployment; others during planning or deployment. I captured these from the dialog between the two DevOps engineers.
I’ve chosen questions that are open-ended. These aren’t questions you will see in a certification exam; they are specific to your application and context.
- What kind of workflows should I model?
- Does the tool support browser interactions? What is the level of scripting supported?
- Will synthetic monitoring affect the application performance?
- Can I add synthetic checks to every part of the app?
- How do the tools support web UI as well as API?
- Can we use a different tool for monitoring API?
- What is the value of system related metrics — CPU, memory, disk, network?
- I know that my sysadmin is obsessed with CPU and memory graphs. Red alert: what is he missing? What are his biases?
- Can I use these (system metrics) for alerts?
- When I am working with applications, is the server commonly the bottleneck?
- If I am working with a Java VM or a Docker container, does the host memory really matter? Does the VM’s available memory matter more?
- How much do I need to know about the VM or about Docker?
- When I use a VM or Docker, do I have a bias towards treating it like a black box?
- Should I get the metrics from each level, i.e., container, host?
- When services are slow, I know that the first impulse is to get new hardware. How do I know if we really need new hardware or if something else is the problem?
- When working with systems on EC2, I notice that the memory graphs of the system and that of EC2 are completely different. Why are they different?
- What do I need to be careful about when looking at statistics?
- What does a CPU average of 50% mean? How does that work with multiple cores?
- Should I include a maximum along with averages?
- Do I need to know stats?
- When a Linux system is performing virtual memory allocation and reclamation, does running out of memory matter?
- How much do I need to understand about Linux memory management?
- Where can I find information about processes being killed because of out-of-memory conditions?
- What disk related metrics are important?
- If the disk I/O is low, can I conclude the database is down? If the app is locked, does that mean the CPU is pegged? What metrics should I really be tracking to know what is going on? Is there a bias towards jumping to conclusions based on metrics?
- Do I need to know what happens with the end user?
- What are the key metrics to track end user experience for a web site?
- Does it matter if the users are from different parts of the world? How does that affect performance?
- If I do create a chart of delays in response time, can I isolate how much is due to the front end and how much to the back end?
- Can I drill down into delays in the back end?
- What kind of metadata can I collect from users?
- How does the browser version or the client OS help me?
- How does the data from clients compare to data from back end monitoring?
- How do I make sense of a large amount of data?
- Do I need to collect data related to cookies and headers?
- Can performance be a problem with particular headers? What if a client doesn’t accept a particular type of encoding? Will that force the server to send uncompressed data affecting performance?
- Is there a problem if I capture a large amount of metadata?
- Do I need to figure out all possible types of headers and other metadata from the client?
- Would it make more sense to get metrics from the software application?
- What kind of metrics do Nginx, Postgres or even the JVM provide?
- How are software metrics different from OS level metrics?
- What more do software metrics provide compared to DTrace?
- Does Nagios have plugins for different software?
- Do any of these tools have paid monitoring capabilities?
- Do database vendors have specific tooling?
- Can I emit specific metrics from the application?
- How can I get metrics specific to my application?
- Can I convince the developers to instrument the application? I know this will result in whining.
- Are there any libraries that can allow me to push metrics into something like StatsD?
- Can I get business related metrics, like dollars sold or number of users signed up?
- Can I aggregate these metrics in a monitoring tool?
- Why do we need to monitor applications?
- Can’t we just look at CPU graphs?
- Can application logic problems cause performance issues?
- Have applications been tested with production workloads?
- Can dependencies cause performance problems?
- Can we have too many metrics? How do we draw the line?
- In addition to monitoring, can we track what the application is doing throughout its use?
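Several of the questions above concern pushing custom or business metrics into something like StatsD. The StatsD wire format really is just plain text, `name:value|type`, sent over UDP, where the type is `c` (counter), `g` (gauge), or `ms` (timer). The client class below is my own minimal sketch, not an official library; the host, port, and metric names are assumptions.

```python
import socket

def format_metric(name, value, metric_type):
    """Render one metric in the plain-text StatsD wire format:
    'name:value|type', e.g. 'signups:1|c'."""
    return f"{name}:{value}|{metric_type}"

class StatsdClient:
    """Fire-and-forget UDP sender. StatsD deliberately uses UDP so a
    down or slow collector never blocks the application itself."""
    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, name, count=1):
        """Increment a counter (e.g. number of users signed up)."""
        self.sock.sendto(format_metric(name, count, "c").encode(), self.addr)

    def gauge(self, name, value):
        """Set a gauge (e.g. dollars in a cart right now)."""
        self.sock.sendto(format_metric(name, value, "g").encode(), self.addr)

    def timing(self, name, millis):
        """Record a timer in milliseconds (e.g. checkout latency)."""
        self.sock.sendto(format_metric(name, millis, "ms").encode(), self.addr)
```

Note the design choice the questions hint at: because the send is fire-and-forget, instrumenting the application costs developers almost nothing at runtime, which is one answer to the “developers will whine about performance” objection.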
“Network monitoring includes getting visibility into the health of networking devices and what transpires on the wire.”
- Why do we need network monitoring?
- What tools can we use for network monitoring?
- Should network monitoring be handled by the network operations team?
- Can we just offload the work to the network monitoring team? How would we do that?
- What should be measured for a network?
- How do we scale network monitoring?
- What tools can we use to diagnose a network?
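One concrete, low-level measurement behind several of these questions is simple reachability: can we open a TCP connection to a service port at all, and how long does the connection take? The sketch below is a minimal illustration; which hosts and ports you probe, and what counts as “slow”, are your own decisions.

```python
import socket
import time

def tcp_check(host, port, timeout=3.0):
    """Attempt one TCP connection; return (reachable, seconds).
    The connect time approximates network round trip plus handshake,
    a crude but useful health signal for a remote service port."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        return False, time.monotonic() - start
```

This only answers “is the port open”; it says nothing about what transpires on the wire once connected, which is where the dedicated network monitoring tools and teams in the questions above come in.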
- Do we need custom logging for our application?
- How can logging help with production problems?
- How is logging different from tracing?
- What should be the structure of log files: text, JSON, or XML?
- How do we find the information we are looking for?
- How will it scale?
- What are the costs of logging? Do commercial vendors charge for log volume?
- How do you choose between Splunk and ELK?
- How do you convince developers who argue that logging will affect performance?
- What types of log levels should we use?
- How do we manage disk space used by logs?
- What are the qualities of a good log file?
- Should we look at sources of events other than logs?
- How do we prevent sensitive information in logs?
- Can we have so much data that it becomes a problem?
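Two of the logging questions above, structured (JSON) logs and keeping sensitive information out of them, can be illustrated together. The sketch below uses Python’s standard logging module; the field names and the deny-list of sensitive keys are assumptions chosen for illustration, not a standard.

```python
import json
import logging

# Hypothetical deny-list; real systems need a reviewed, maintained one.
SENSITIVE_KEYS = {"password", "token", "ssn"}

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line, so tools like ELK or Splunk can
    parse fields instead of grepping free text."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured context attached via logging's `extra` mechanism;
        # sensitive keys are redacted before they ever hit disk.
        context = getattr(record, "context", {})
        payload.update({k: ("[REDACTED]" if k in SENSITIVE_KEYS else v)
                        for k, v in context.items()})
        return json.dumps(payload)

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# logger.info("login", extra={"context": {"user": "alice", "password": "x"}})
# -> {"level": "INFO", "logger": "app", "message": "login",
#     "user": "alice", "password": "[REDACTED]"}
```

Redacting at the formatter is one answer to “how do we prevent sensitive information in logs”: it is enforced in one place rather than relying on every call site to remember.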
If you are wondering, the two engineers I watched were Ernest Mueller and Peco Karayanev in their online course, “DevOps Foundations: Monitoring and Observability”. I derived the questions from their course; they don’t ask them explicitly, but you can watch the course and find the answers. The course covers many topics beyond those listed here that are valuable for testing or finding problems. A good example is the introductory sections on ‘What is monitoring’ and ‘Math is required’. Note that this isn’t a promotion of their course; I simply found it very informative.
When writing this article, I wanted to play around with the idea of testing different aspects of DevOps. I tried to avoid questions like, ‘What is synthetic monitoring?’. Instead, I rephrased questions or asked questions which are not obvious. The course is a good reference on the construction aspects of monitoring.
Monitoring is a very technical topic. I don’t know if you can just ‘test’ monitoring without extensive prior experience. However, if testing is critical thinking about risk or failure, I’d like to think that asking non-intuitive questions will leave you better off than not asking questions. I’d like to think that an Ernest Mueller or a Dan Ashby or a Lisa Crispin as part of a whole team can ask questions to determine if something can go wrong with monitoring. On the other hand, I don’t think the network operations team will ever be part of DevNetOps (reference from the course)!!
Originally published at https://www.linkedin.com.