I read Datadog's "Modern Infrastructure Monitoring"

Monitoring

Since I have to use Datadog at work, I, a monitoring amateur, read "Modern Infrastructure Monitoring", which you can download from Datadog's website.

Wataro:

I'm not good at reading documents like this carefully, but it contained a lot of things that seem important when using Datadog.

This article is mostly a summary of the document above, which is also the source of the images. I've added my personal impressions after reading it.

  1. Chapter 1: Constant Change
  2. Chapter 2: Collect better data
    1. Metrics - “system-related values at a particular point in time”
    2. Events
    3. Tagging - "metadata that declares all the different scopes to which a data point belongs"
    4. key:value tags
    5. What is better data for monitoring?
  3. Chapter 3: Alert on issues that really matter
    1. Alert severity level
    2. Data for alerts and data for diagnostics
  4. Chapter 4: Investigating Performance Issues
    1. Start with work metrics
    2. Deep dive into your resources – When looking at top-level work metrics doesn’t reveal the cause
    3. Did you change something?
    4. Fix it (and remember to record it) - Once you know the cause, fix it and think of ways to avoid similar problems.
    5. How to track metrics
  5. Chapter 5: Visualizing Metrics with Time Series Graphs
    1. Aggregation across space
    2. Line graphs
    3. Stacked Area Chart – Metric values are displayed as two-dimensional bands instead of lines
  6. Chapter 6: Visualizing Metrics with Summary Graphs
    1. Aggregation across time – Example: Maximum value reported by each host over the past 60 minutes
    2. Aggregation across space – example: Show Redis latency for each service
    3. Single value summary – Example: Current number of hosts in OK state
    4. Toplist – Example: Maximum Redis latency for each AZ in the past hour
    5. Trend graph – Example: Compare the last 4 hours of data for each login method with yesterday’s 4 hours of data.
    6. Host Map – Resource usage list
    7. Distribution – e.g. web latency per host etc.
  7. Chapter 7: Centralize all your information: How to monitor your ELB
    1. Key ELB performance metrics
  8. Chapter 8: Centralize everything: Monitoring Docker
    1. Why use containers?
    2. Container challenges
    3. Key Docker resource metrics
    4. How iHeartRadio monitors Docker
  9. Chapter 9: Datadog is a dynamic, cloud-scale monitoring solution
  10. Summary

Chapter 1: Constant Change

It felt like a simple analysis of the current situation.

  • The difference between pets and cattle – a metaphor that gets used a lot
  • DevOps
  • Modern approaches to monitoring – What caught my attention was "advanced alert creation". Alerting when a fixed threshold is exceeded is a common feature, but automatic detection of outliers is said to be something advanced monitoring systems provide.
  • How modern monitoring frameworks work

Chapter 2: Collect better data

Wataro:

Around here I thought, "If you read this part carefully, the terms Datadog uses will become easier to understand at work."

Metrics - “system-related values at a particular point in time”

  • Work metrics – "useful output", "the top-level health of the system" – example: number of requests per second
  • Resource metrics – "CPU", "memory"; "a database is also a resource if other systems require the use of its components"

Events

  • "Events that occur individually and irregularly" "Important context for understanding changes in system behavior" "What happened at a certain point in time"
  • "Scaling Event: Adding or Removing Hosts or Containers"
  • Example: the nightly data rollup failed (a small sketch of posting such an event via the API is below)
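Just to make the idea concrete, here is a minimal sketch (my own, not from the book) of posting such an event to Datadog with the datadogpy library; the API keys, title, text, and tags are illustrative assumptions.

```python
# Hedged sketch: posting a "nightly data rollup failed" event to Datadog with datadogpy.
# The API/app keys, title, text, and tags are placeholder assumptions.
from datadog import initialize, api

initialize(api_key="YOUR_API_KEY", app_key="YOUR_APP_KEY")

api.Event.create(
    title="Nightly data rollup failed",
    text="The 02:00 rollup job exited with a non-zero status.",  # hypothetical detail
    alert_type="error",
    tags=["job:nightly-rollup", "team:data"],
)
```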

Tagging - "metadata that declares all the different ranges to which a data point belongs"

Tags seem to indicate the scope a data point belongs to and where it occurred. Judging from the example image, it means you can search by something like file-server, I suppose.

key:value tags

This explanation seems to be easier to understand if you have actually used it.

It says "key: value tag",Something like key:value tagI guess. (Feeling...)

If you declare instance-type:m3.xlarge, then instance-type is the key and m3.xlarge is the value.

In this case, it seems you can later add instance-type:m3.medium and so on, and then slice and combine metrics along any of those dimensions.
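As a concrete (hypothetical) illustration of key:value tags, here is a minimal sketch of submitting a tagged metric via DogStatsD with the datadogpy library; the metric name and tag values are assumptions, and a local Datadog Agent is assumed to be listening.

```python
# Hedged sketch: submitting a metric with key:value tags via DogStatsD (datadogpy).
# The metric name and tag values are made up for illustration.
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)  # assumes a local Agent

# The same metric can later be sliced and recombined by any of these tag keys
# (instance-type, availability-zone, role, ...).
statsd.gauge(
    "app.request.latency",
    0.245,
    tags=["instance-type:m3.xlarge", "availability-zone:us-east-1a", "role:web"],
)
```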

What is better data for monitoring?

  • Easy to understand – metrics and events are named so that it's easy to understand what they mean
  • Granularity – a good balance in collection frequency, period, etc.
  • Tagged by scope – can be examined in any combination of scopes (region, etc.)
  • Long retention – keep data as long as possible; with at least a year or more you can see differences between seasons

Chapter 3: Alert on issues that really matter

As the title says, when you get too many alerts you can't tell what's important, so this chapter is about making sure the necessary alerts are sent when they're needed.

Alert severity level

  • Alerts to record – low severity 😅
  • Alerts to notify – medium severity – e.g. insufficient disk space on a data store (when immediate action isn't required) 😓
  • Alerts to send urgent messages (pages) – high severity – e.g. web application response time; contact the person on call immediately 😭 (a rough sketch of creating such a monitor follows this list)
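As a rough sketch of what a high-severity alert might look like in practice (my own example, not the book's), this creates a latency monitor via the datadogpy API; the metric name, threshold, and notification handle are assumptions.

```python
# Hedged sketch: creating a high-severity latency monitor with datadogpy.
# The metric, threshold, and @-notification handle are placeholder assumptions.
from datadog import initialize, api

initialize(api_key="YOUR_API_KEY", app_key="YOUR_APP_KEY")

api.Monitor.create(
    type="metric alert",
    query="avg(last_5m):avg:web.request.latency{env:prod} > 0.5",  # hypothetical metric
    name="[P1] Web latency above 500 ms",
    message="Response time is degraded. @pagerduty-web-oncall",     # page the on-call
    tags=["team:web", "severity:high"],
)
```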

Data for alerts and data for diagnostics

Specific examples were given.

Let's look at how to set alert urgency from the following perspectives:

  • Is this really a problem? – Test-environment metrics probably don't need alerts, and planned events such as periodic restarts are of course not a problem.
  • Does this problem require attention?
  • Is this issue urgent? – The response differs between "it needs to be fixed, but it's not an emergency" and "a critical system is operating beyond acceptable limits".
Wataro:

This is something that may be related to SLO.

Other topics included:

  • Urgent message regarding symptoms
  • Defining long-lasting alerts
  • early warning signs

Reading these, I felt they were well aware that it is humans who take care of the system.

Chapter 4: Investigating Performance Issues

The following methods are often used to diagnose the root cause.

  • rely on intuition
  • guess and verify

Well, it certainly does feel that way; I'd like to know "an approach with a clear direction", so I tried to summarize this chapter in some detail.

Start with work metrics

Be able to answer these questions properly:

  • Are you experiencing a problem?
  • What characteristics does the problem have?

Deep dive into your resources – When looking at top-level work metrics doesn’t reveal the cause

  • Physical resources (CPU and memory)
  • Make each system's resources easy to view on a dashboard (it seems important to prepare this before a problem occurs) – the recommendation is one dashboard for high-level application metrics and one dashboard per subsystem
[Dashboard example on p. 23 of the book]
  • Is it available? Is utilization high? Is it saturated?

Did you change something?

  • Code release before the problem occurred
  • Internal alerts before a problem occurs
  • Events before the problem occurs

Fix it (and remember to record it) - Once you know the cause, fix it and think of ways to avoid similar problems.

How to track metrics

  • Set up a dashboard that displays all key metrics for each system in your infrastructure and overlay related events.
  • Investigate from the top-level system and verify work metrics, resource metrics, and related events.
  • If a problem is found, recursively investigate the root cause using the same method as above.

Chapter 5: Visualizing Metrics with Time Series Graphs

Aggregation across space

[Image: requests per host vs. requests per availability zone]

I can't really tell what's going on when I look at the requests for each host, but I can clearly see it for each availability zone.
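To make "aggregation across space" concrete, here is a hedged sketch of querying a request metric grouped by availability zone with the datadogpy API; the metric name and tag key are assumptions for illustration.

```python
# Hedged sketch: one time series per availability zone instead of one per host.
# The metric name and tag key are assumptions.
import time
from datadog import initialize, api

initialize(api_key="YOUR_API_KEY", app_key="YOUR_APP_KEY")

now = int(time.time())
series = api.Metric.query(
    start=now - 3600,  # the last hour
    end=now,
    query="sum:web.requests{*} by {availability-zone}",
)
```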

Line graphs


The intended use and examples are as follows.

  • Understand outliers at a glance – CPU idle for each host in the cluster
  • Clearly communicate the evolution of important metrics over time – average latency across all web services
  • Determine whether individual outliers are unacceptable – Disk space usage by database node

This part is quite practical.

It was very easy to understand because it explained what to display, why, and gave examples. You could also use it as a reference when building dashboards for your own incident handling and monitoring preparations.

It also described, with images, improper uses of each graph type and which graphs to use instead.

Stacked Area Chart – Metric values are displayed as two-dimensional bands instead of lines

I thought about posting all of them here, but there were too many, so I gave up... (tears)

For now, it might help when you're wondering, "What kind of dashboard do I need when I want to monitor something?"

Chapter 6: Visualizing Metrics with Summary Graphs

Aggregation across time – Example: Maximum value reported by each host over the past 60 minutes

Aggregation across space – example: Show Redis latency for each service

Single value summary – Example: Current number of hosts in OK state

Toplist – Example: Maximum Redis latency for each AZ in the past hour

Trend graph – Example: Compare the last 4 hours of data for each login method with yesterday’s 4 hours of data.

Host Map – Resource usage list

"Datadog's unique visualization method", apparently. It looks like a honeycomb and makes it easy to get an overview of many hosts.

Wataro:

Futuristic!

Distribution – e.g. web latency per host etc.

Chapter 7: Centralize all your information: How to monitor your ELB

Key ELB performance metrics

A brief explanation of ELB and its key metrics was given. When actually using them, it seems necessary to understand each metric's exact definition.

  • Load balancer metrics
  • Backend-related metrics
  • Useful metrics that can be obtained from CloudWatch:
    • RequestCount
    • SurgeQueueLength – requests buffered on the ELB side (for example, when backend capacity is insufficient and requests are temporarily held in the ELB)
    • SpilloverCount – the number of requests rejected with an error because the SurgeQueueLength limit above was exceeded
    • HTTPCode_ELB_4XX*
    • HTTPCode_ELB_5XX*

Metrics whose meaning is obvious at a glance seem fine, but I think you need to be careful with the ones whose definitions aren't obvious.
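For reference, here is a hedged sketch of pulling one of the CloudWatch metrics above (SurgeQueueLength) for a Classic ELB with boto3; the load balancer name and region are assumptions.

```python
# Hedged sketch: fetching SurgeQueueLength for a Classic ELB from CloudWatch with boto3.
# The load balancer name and region are placeholder assumptions.
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ELB",
    MetricName="SurgeQueueLength",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "my-elb"}],  # hypothetical ELB
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Maximum"],  # peak queue depth in each 5-minute window
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```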

  • Metrics to alert on – Latency
  • Metrics to watch – BackendConnectionErrors
[Image: EC2 Datadog dashboard template]
  • Native metrics – get EC2 metrics directly from the server into Datadog instead of via CloudWatch
    • It looks like you need to install the Agent
    • By installing the Agent, you can also collect application metrics from MySQL, Nginx, Redis, etc.

Chapter 8: Centralize everything: Monitoring Docker

  • Monitoring Docker is extremely important (but the challenges of monitoring containers are not well understood)
  • The boundary between application performance monitoring and infrastructure monitoring tends to be a blind spot

Why use containers?

  • Ease of scaling
  • Break away from dependencies – less management effort for engineers

Container challenges

  • Complexity of operating at scale – basically the same best practices as for hosts can be applied, but the number of containers tends to be far larger than anything dealt with before
  • Rather than monitoring from a host-centric perspective, monitor from a layer- and tag-centric perspective
  • Complex conditions can be specified by issuing queries
  • Monitor all layers of your stack together
  • Tag containers – treat them as entities that can be queried

Key Docker resource metrics

  • System CPU
  • User CPU
  • Throttling
  • Memory
  • RSS
  • Cache memory
  • Swap
  • I/O
  • Network bytes
  • Number of network packets
  • etc. (a small sketch of reading these from the Docker API is below)
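As a rough illustration of where these numbers come from (my own sketch, not the book's), the Docker API exposes the raw counters per container; this uses the docker-py SDK, and the exact field names can vary by Docker/cgroup version.

```python
# Hedged sketch: reading per-container CPU and memory counters with docker-py.
# Field names follow Docker's stats API and may differ between versions.
import docker

client = docker.from_env()

for container in client.containers.list():
    stats = container.stats(stream=False)  # one-shot stats snapshot
    cpu = stats.get("cpu_stats", {}).get("cpu_usage", {})
    mem = stats.get("memory_stats", {})
    print(
        container.name,
        "user_cpu:", cpu.get("usage_in_usermode"),
        "system_cpu:", cpu.get("usage_in_kernelmode"),
        "memory_usage:", mem.get("usage"),
    )
```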

How iHeartRadio monitors Docker

  • Operate various services
  • small group of engineers
  • Disadvantages of Docker – No visualization of resource usage at the container level
  • Using Datadog – Easily start Agent containers with the docker run command
  • The metrics that can be collected are different depending on whether the agent is installed directly on the host or in a container.
  • Service Discovery – Each time a container is created and started, the Agent identifies which services are running in the new container.

Chapter 9: Datadog is a dynamic, cloud-scale monitoring solution

  • Comprehensive monitoring – can be integrated with various services
  • Flexible aggregation – Easy to aggregate various metrics and events by tagging
  • Easy scaling
  • Advanced alert creation
  • Enhanced collaboration – team collaboration is easy (it is certainly convenient that things like who created a metric or monitor are recorded)

Summary

So, since I use Datadog at work, I tried to read through Modern Infrastructure Monitoring quickly, but it took me quite a while just to summarize it...

Personally, I got the impression that, as the number of containers scales up, they are aiming for a shift from "detailed monitoring of each host" to "monitoring of abstracted groups".

Wataro:

I thought I'd read it quickly, but since I'm a monitoring amateur, it took longer than I expected...
