This is a multi-part post that drills into one of the three pillars of Node.js cloud enterprise development. Let’s review what I consider the three pillars to be. The first pillar is that you have quality code being developed by people, processes and tools. The second pillar is that you have a Continuous Integration and Delivery system to produce a deployed application. The third pillar is having Application Performance Management in place. This last pillar is what this series will focus on, and it consists of the monitoring and alerting that are part of your DevOps strategy.
What is the goal of APM Monitoring and Alerting?
You might want to first think about what the goal of your application is. Let me help you there. The goal of your application should be to serve the needs of the end customer and provide ROI to your business. Ok, with that out of the way, we can discuss what the goal of monitoring and alerting should be.
It should be obvious that if your application is not Available, Reliable and Performant, then your customer and your business will suffer. Thus, the goal of monitoring and alerting is to maintain application Availability, Reliability and Performance. You do this by implementing APM (Application Performance Management). This allows you to discover problems before anyone else does and then resolve them in the least amount of time.
In a professional IT organization, you define an SLA (Service Level Agreement) that defines what is acceptable as far as meeting the customer needs. The SLA defines values that you would be monitoring and alerting on so you would know if the agreement is being met. SLA values are arrived at from multiple directions. The business must decide what is reasonable, you need to consider what the customer requires and finally, you need to arrive at the number through actual test measurements made to prove the breaking points and peak operating levels that can be sustained.
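To make the SLA idea concrete, here is a minimal sketch of checking request measurements against SLA targets. The target numbers, the `evaluateSla` function and its field names are all hypothetical; real values come from the business, customer requirements and load-test measurements as described above.

```javascript
// Hypothetical SLA targets -- real numbers come from business needs,
// customer requirements, and load-test measurements.
const sla = {
  availability: 0.999,   // 99.9% of requests must succeed
  p95ResponseMs: 500     // 95th-percentile response time under 500 ms
};

// Evaluate a batch of request measurements against the SLA.
function evaluateSla(requests, sla) {
  const total = requests.length;
  const ok = requests.filter(r => r.success).length;
  const availability = total === 0 ? 1 : ok / total;

  // 95th percentile of response times
  const times = requests.map(r => r.ms).sort((a, b) => a - b);
  const p95 = times[Math.min(times.length - 1, Math.floor(times.length * 0.95))];

  return {
    availability,
    p95,
    met: availability >= sla.availability && p95 <= sla.p95ResponseMs
  };
}

// Example: 1000 requests, one failure, all responses fast
const sample = Array.from({ length: 1000 }, (_, i) => ({
  success: i !== 0,
  ms: 100 + (i % 50)
}));
console.log(evaluateSla(sample, sla).met); // true
```

In practice the APM tool computes these aggregates for you; the point is that the SLA becomes concrete numbers you can monitor and alert on.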
Let me present a visual representation of the APM components that connect to your Node.js application. Here is a good visualization of the components of APM:
You can see at the top we have the exercising of our Node.js application through normal customers as well as through some proactive testing we keep continually running.
In a PaaS environment, there is a Web App wrapping our Node.js instance that sees the Web requests first and is actually collecting metrics on its usage, such as response times and CPU usage. We may mention a bit about that in this post, but in subsequent ones we leave that out and move on to separate, complete APM systems.
Inside of the actual Node.js application we have code that generates the telemetry. This comes from the inclusion of a special module in your Node.js code for standard telemetry, and the customized use of it for sending up your own telemetry.
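As a hedged sketch of what such a telemetry module looks like from your code's point of view, here is a minimal in-memory client. The class and method names (`TelemetryClient`, `trackEvent`, `trackMetric`, `trackTrace`) are illustrative, loosely modeled on the shape of real APM SDKs; an actual module would batch the buffer and transmit it to the APM backend rather than just returning it.

```javascript
// A minimal, hypothetical telemetry client sketching the shape of a real
// APM module: standard "track" calls that buffer telemetry for transmission.
class TelemetryClient {
  constructor() {
    this.buffer = [];
  }
  trackEvent(name, properties = {}) {
    this.buffer.push({ type: 'event', name, properties, time: Date.now() });
  }
  trackMetric(name, value) {
    this.buffer.push({ type: 'metric', name, value, time: Date.now() });
  }
  trackTrace(message, severity = 'info') {
    this.buffer.push({ type: 'trace', message, severity, time: Date.now() });
  }
  // A real client would batch and POST the buffer to the APM service;
  // here we just drain it so the caller can see what would be sent.
  flush() {
    const batch = this.buffer;
    this.buffer = [];
    return batch;
  }
}

const client = new TelemetryClient();
client.trackEvent('userLogin', { plan: 'free' });
client.trackMetric('queueLength', 12);
console.log(client.flush().length); // 2
```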
Let me introduce the term Real User Monitoring (RUM). This is the idea that your APM is watching everything that is transpiring through normal customer channel interactions. You use RUM to report on whether you are meeting the SLA for the customer. APM/RUM is extremely useful to have in place when new code is released to production. You can immediately be informed of any service degradation and roll back if necessary.
To augment the RUM, you also need to have “synthetic” transactions that you run to continually test your production environment to be proactive about finding issues.
A Console log does not constitute APM
If you have written even the most basic Node.js application, you probably have made use of the console.log() function to spit out strings in the console window that tell you what is happening. These strings might be important information that can help you track down errors in your Node application, or provide a view into what is currently going on.
Another interesting way to log information at runtime is through the use of the “morgan” module. It can be set up with two simple lines and then logs each and every HTTP request, along with information such as the response time of each.
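To show what that request logging amounts to, here is a dependency-free sketch of a morgan-style middleware. With Express the real setup is roughly `app.use(require('morgan')('combined'))`; the function below is an illustrative stand-in, not morgan itself.

```javascript
// A hypothetical, dependency-free sketch of what morgan-style middleware
// does: log each HTTP request with its status and response time.
function requestLogger(req, res, next) {
  const start = Date.now();
  // 'finish' fires when the response has been handed to the OS.
  res.on('finish', () => {
    console.log(`${req.method} ${req.url} ${res.statusCode} ${Date.now() - start}ms`);
  });
  next();
}
```

With Express you would register it once with `app.use(requestLogger)` and every request would produce a log line.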
With these two logging capabilities, you could view the streaming output in production to see what is happening on AWS or Microsoft Azure. Of course, it might be fun for a few minutes to watch the information spew by. You could take that output, write it to Azure blob storage, and then use HDInsight to sort through the data. There are a myriad of ways you can implement logging and monitoring. But what if there were some way to capture the logging output and do things like provide reports on common/custom metrics in a tool made especially for monitoring and alerting on telemetry?
Let’s discuss the topic first before diving into the usage of specific tools and technologies. To start with, we can discuss what it means to gather telemetry data. With that gathered, we can report on it and also put in place alerting over it.
Telemetry is just a word that means the gathering of remote data and transmitting it for monitoring to take action. From our Node.js application, it would take the form of events, logs and metrics. Some tools may have many more categories than these, but these are the three we will concentrate on.
Event: An event is something taking place in your application that you want to track to get an idea of what customers are doing. I’ll give a few examples. In my NewsWatcher application, I could send an event for actions such as a user login, a news filter update, or a news story comment entry. This way I can know how often any given feature is being used. The aggregate numbers for these events could then be viewed in a report. You could know the count, average, minimum and maximum over a given time window. You would be able to tell what hour of the day you had the most people log in. It is up to you to define the events that you wish to track.
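As a sketch of the kind of rollup an APM report does with events, here is a function that aggregates hypothetical login events into hourly counts. The event names and record fields are illustrative, not from any particular APM tool.

```javascript
// A hedged sketch: roll raw events up into counts per hour of day,
// the kind of aggregate an APM report would show.
function countByHour(events, name) {
  const counts = {};
  for (const e of events) {
    if (e.name !== name) continue;
    const hour = new Date(e.time).getUTCHours();
    counts[hour] = (counts[hour] || 0) + 1;
  }
  return counts;
}

// Hypothetical raw telemetry, as a real client might have buffered it.
const events = [
  { name: 'login', time: Date.UTC(2024, 0, 1, 9, 5) },
  { name: 'login', time: Date.UTC(2024, 0, 1, 9, 42) },
  { name: 'login', time: Date.UTC(2024, 0, 1, 14, 3) },
  { name: 'commentAdded', time: Date.UTC(2024, 0, 1, 9, 10) }
];
console.log(countByHour(events, 'login')); // { '9': 2, '14': 1 }
```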
Log: A log can be used to save away the history of what is transpiring in the application. The main reason you do this is to detect errors and to be able to go back and review the log for clues as to what led up to a problem. For example, you would want a log of when background batch operations begin and complete, and some data about what they processed. You would definitely want to log code exceptions as they occur. Make sure to leave yourself what is called a “breadcrumb trail” of details that you can go back and trace through issues with. Logging is thus just pertinent information that you may not actually need unless there is an issue to debug. Every remote call to a dependency should be logged with an entry time and an exit time so you can proactively catch any slowdowns or outages that might affect you.
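The entry/exit logging of dependency calls can be sketched as a small wrapper. The `timedCall` name is hypothetical; you would wrap each remote call (database query, outbound HTTP request) in something like it so the breadcrumb trail is always written.

```javascript
// A sketch of breadcrumb logging around a remote dependency call:
// log entry, exit, elapsed time, and any exception, then rethrow.
async function timedCall(label, fn, log = console.log) {
  const start = Date.now();
  log(`ENTER ${label}`);
  try {
    const result = await fn();
    log(`EXIT ${label} ok ${Date.now() - start}ms`);
    return result;
  } catch (err) {
    log(`EXIT ${label} error ${Date.now() - start}ms: ${err.message}`);
    throw err; // keep the breadcrumb, but let the caller handle the failure
  }
}
```

Usage would look something like `await timedCall('mongo.find', () => collection.find(query).toArray())`, giving you timing data for every dependency without scattering timestamps through the code.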
Metric: These are telemetry values sampled over time, each with a range of possible values. For example, you would track the CPU usage of a machine and use that to determine if you need to take action to scale. There are some obvious metrics that are used to validate your SLAs (availability, response time, requests per second, etc.). There are also custom metrics that only your application would have. In my NewsWatcher application, there is a queue of waiting database operations whose length grows and shrinks. I could have a timer go off every few minutes that sends that length as a metric data point. This helps me know if I need to change my code or take some other steps to have it scale better.
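The custom-metric idea can be sketched as a small aggregator: record samples of the (hypothetical) database queue length, then emit a min/avg/max data point when the reporting window closes. The `makeMetric` helper and metric name are illustrative; a real APM SDK provides this for you.

```javascript
// A sketch of a custom metric: aggregate raw samples into the
// min/avg/max data point that would be sent as telemetry per window.
function makeMetric(name) {
  let count = 0, sum = 0, min = Infinity, max = -Infinity;
  return {
    record(value) {
      count++; sum += value;
      min = Math.min(min, value);
      max = Math.max(max, value);
    },
    // Called when the reporting window closes; returns the aggregate
    // data point, then resets for the next window.
    snapshot() {
      const point = { name, count, avg: count ? sum / count : 0, min, max };
      count = 0; sum = 0; min = Infinity; max = -Infinity;
      return point;
    }
  };
}

const queueDepth = makeMetric('dbQueueLength');
// In the real app a setInterval timer would sample the queue; here by hand.
[3, 7, 5].forEach(v => queueDepth.record(v));
console.log(queueDepth.snapshot()); // { name: 'dbQueueLength', count: 3, avg: 5, min: 3, max: 7 }
```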
Reports and alerting on the telemetry data
Once you have the telemetry identified, you can then report and alert on it. Log traces, visual graphs and charts are handy to put into a dashboard that people can access to give them reassurance that all is peaceful.
If you are the one assigned to run DevOps for the day, you of course don’t want to sit there and stare at a dashboard all day. This is where you put alerts in place that trigger and notify you of any anomalies. You can, for example, set up an alert to fire whenever any exception log telemetry is seen.
Metrics can have threshold boundaries set that, if crossed, cause alerts. Not all metric telemetry should be a cause for an alert, though. Some metrics are fed into systems that do things like automatically kick in scaling and don’t really need to raise alerts.
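The threshold idea above can be sketched in a few lines: compare a metric data point against a configured boundary and notify only when it is crossed. The threshold table and `notify` callback are hypothetical; in practice you configure this in the APM tool rather than writing it yourself.

```javascript
// A sketch of threshold-based alerting. Metrics without an entry in the
// table (e.g. ones that only feed autoscaling) never raise an alert.
const thresholds = { cpuPercent: 85, p95ResponseMs: 500 };

function checkThreshold(name, value, notify) {
  const limit = thresholds[name];
  if (limit !== undefined && value > limit) {
    notify(`ALERT: ${name}=${value} exceeds threshold ${limit}`);
    return true;
  }
  return false; // within bounds, or no alert rule configured
}

const alerts = [];
checkThreshold('cpuPercent', 92, m => alerts.push(m)); // fires
checkThreshold('cpuPercent', 40, m => alerts.push(m)); // quiet
console.log(alerts.length); // 1
```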
To be realistic, let me state that APM is not going to solve all of your DevOps requirements in an IT organization. There is much more than this that needs to be going on. For example, you might need other tools to provide security threat analysis, user account rights management, change tracking, audit trails, backup, configuration management and much more. APM is the bare minimum that you can get by with.
This wraps up the introduction to the topic. You will want to go on to the rest of the posts in the series to learn about actual tools and techniques you can use to implement APM.