I read Datadog's "Modern Infrastructure Monitoring"

Monitoring

Since I have to use Datadog at work, I, a monitoring amateur, read "Modern Infrastructure Monitoring", which you can download from Datadog's website.

Wataro:

I'm not good at reading documents like this carefully, but it contained a lot of things that seem important when using Datadog.

This article is mostly a summary of the document above, which is also the source of the images. I've added my personal impressions after reading it.

  1. Chapter 1: Constant Change
  2. Chapter 2: Collect better data
    1. Metrics - “system-related values at a particular point in time”
    2. Events
    3. Tagging - "metadata that declares all the different scopes to which a data point belongs"
    4. key:value tags
    5. What is better data for monitoring?
  3. Chapter 3: Alert on issues that really matter
    1. Alert severity level
    2. Data for alerts and data for diagnostics
  4. Chapter 4: Investigating Performance Issues
    1. Start with work metrics
    2. Deep dive into your resources – When looking at top-level work metrics doesn’t reveal the cause
    3. Did you change something?
    4. Fix it (and remember to record it) - Once you know the cause, fix it and think of ways to avoid similar problems.
    5. How to track metrics
  5. Chapter 5: Visualizing Metrics with Time Series Graphs
    1. Aggregation across space
    2. Line graphs
    3. Stacked Area Chart – Metric values are displayed as two-dimensional bands instead of lines
  6. Chapter 6: Visualizing Metrics with Summary Graphs
    1. Aggregation across time – Example: Maximum value reported by each host over the past 60 minutes
    2. Aggregation across space – example: Show Redis latency for each service
    3. Single value summary – Example: Current number of hosts in OK state
    4. Toplist – Example: Maximum Redis latency for each AZ in the past hour
    5. Trend graph – Example: Compare the last 4 hours of data for each login method with yesterday’s 4 hours of data.
    6. Host Map – Resource usage list
    7. Distribution – e.g. web latency per host etc.
  7. Chapter 7: Centralize all your information: How to monitor your ELB
    1. Key ELB performance metrics
  8. Chapter 8: Centralize everything: Monitoring Docker
    1. Why use containers?
    2. Container challenges
    3. Key Docker resource metrics
    4. How iHeartRadio monitors Docker
  9. Chapter 9: Datadog is a dynamic, cloud-scale monitoring solution
  10. Summary

Chapter 1: Constant Change

It felt like a simple analysis of the current situation.

  • The difference between pets and cattle – a metaphor that gets used a lot
  • DevOps
  • Modern approaches to monitoring – What caught my attention was "advanced alert creation". Alerting when a fixed threshold is exceeded is a common feature, but automatic detection of outliers is said to be something advanced monitoring systems provide.
  • How modern monitoring frameworks work

Chapter 2: Collect better data

Wataro:

Around here I thought, "If you read this part carefully, the terms Datadog uses will become easier to understand at work."

Metrics - “system-related values at a particular point in time”

  • Work metrics – "useful output", "the top-level health of the system" – example: number of requests per second
  • Resource metrics – "CPU", "memory"; "a database is also a resource if other systems require the use of its components"

Events

  • "Events that occur individually and irregularly" "Important context for understanding changes in system behavior" "What happened at a certain point in time"
  • "Scaling Event: Adding or Removing Hosts or Containers"
  • Example: the nightly data rollup failed (a small sketch of posting such an event via the API is below)
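Just to make the idea concrete, here is a minimal sketch (my own, not from the book) of posting such an event to Datadog with the datadogpy library; the API keys, title, text, and tags are illustrative assumptions.

```python
# Hedged sketch: posting a "nightly data rollup failed" event to Datadog with datadogpy.
# The API/app keys, title, text, and tags are placeholder assumptions.
from datadog import initialize, api

initialize(api_key="YOUR_API_KEY", app_key="YOUR_APP_KEY")

api.Event.create(
    title="Nightly data rollup failed",
    text="The 02:00 rollup job exited with a non-zero status.",  # hypothetical detail
    alert_type="error",
    tags=["job:nightly-rollup", "team:data"],
)
```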

Tagging - "metadata that declares all the different ranges to which a data point belongs"

Tags seem to indicate the scope a data point belongs to and where it occurred. Judging from the example image, it means you can search by something like file-server, I suppose.

key:value tags

This explanation seems to be easier to understand if you have actually used it.

It says "key: value tag",Something like key:value tagI guess. (Feeling...)

If you declare instance-type:m3.xlarge, then instance-type is the key and m3.xlarge is the value.

In this case, it seems you can later add instance-type:m3.medium and so on, and then slice and combine metrics along any of those dimensions.
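As a concrete (hypothetical) illustration of key:value tags, here is a minimal sketch of submitting a tagged metric via DogStatsD with the datadogpy library; the metric name and tag values are assumptions, and a local Datadog Agent is assumed to be listening.

```python
# Hedged sketch: submitting a metric with key:value tags via DogStatsD (datadogpy).
# The metric name and tag values are made up for illustration.
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)  # assumes a local Agent

# The same metric can later be sliced and recombined by any of these tag keys
# (instance-type, availability-zone, role, ...).
statsd.gauge(
    "app.request.latency",
    0.245,
    tags=["instance-type:m3.xlarge", "availability-zone:us-east-1a", "role:web"],
)
```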

What is better data for monitoring?

  • Easy to understand – metrics and events are named so that it's easy to understand what they mean
  • Granularity – a good balance in collection frequency, period, etc.
  • Tagged by scope – can be examined in any combination of scopes (region, etc.)
  • Long retention – keep data as long as possible; with at least a year or more you can see differences between seasons

Chapter 3: Alert on issues that really matter

As the title says, when you get too many alerts you can't tell what's important, so this chapter is about making sure the necessary alerts are sent when they're needed.

Alert severity level

  • Alerts to record – low severity 😅
  • Alerts to notify – medium severity – e.g. insufficient disk space on a data store (when immediate action isn't required) 😓
  • Alerts to send urgent messages (pages) – high severity – e.g. web application response time; contact the person on call immediately 😭 (a rough sketch of creating such a monitor follows this list)
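As a rough sketch of what a high-severity alert might look like in practice (my own example, not the book's), this creates a latency monitor via the datadogpy API; the metric name, threshold, and notification handle are assumptions.

```python
# Hedged sketch: creating a high-severity latency monitor with datadogpy.
# The metric, threshold, and @-notification handle are placeholder assumptions.
from datadog import initialize, api

initialize(api_key="YOUR_API_KEY", app_key="YOUR_APP_KEY")

api.Monitor.create(
    type="metric alert",
    query="avg(last_5m):avg:web.request.latency{env:prod} > 0.5",  # hypothetical metric
    name="[P1] Web latency above 500 ms",
    message="Response time is degraded. @pagerduty-web-oncall",     # page the on-call
    tags=["team:web", "severity:high"],
)
```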

Data for alerts and data for diagnostics

Specific examples were given.

Let's look at how to set alert urgency from the following perspectives:

  • Is this really a problem? – Test-environment metrics probably don't need alerts, and planned events such as periodic restarts are of course not a problem.
  • Does this problem require attention?
  • Is this issue urgent? – The response differs between "it needs to be fixed, but it's not an emergency" and "a critical system is operating beyond acceptable limits".
Wataro:

This is something that may be related to SLO.

Other topics included:

  • Urgent message regarding symptoms
  • Defining long-lasting alerts
  • early warning signs

Reading these, I felt they were well aware that it is humans who take care of the system.

Chapter 4: Investigating Performance Issues

The following methods are often used to diagnose the root cause.

  • rely on intuition
  • guess and verify

Well, it certainly does feel that way; I'd like to know "an approach with a clear direction", so I tried to summarize this chapter in some detail.

Start with work metrics

Be able to answer these questions properly:

  • Are you experiencing a problem?
  • What characteristics does the problem have?

Deep dive into your resources – When looking at top-level work metrics doesn’t reveal the cause

  • Physical resources (CPU and memory)
  • Make each system's resources easy to view on a dashboard (it seems important to prepare this before a problem occurs) – the recommendation is one dashboard for high-level application metrics and one dashboard per subsystem
[Dashboard example on p. 23 of the book]
  • Is it available? Is utilization high? Is it saturated?

Did you change something?

  • Code release before the problem occurred
  • Internal alerts before a problem occurs
  • Events before the problem occurs

Fix it (and remember to record it) - Once you know the cause, fix it and think of ways to avoid similar problems.

How to track metrics

  • Set up a dashboard that displays all key metrics for each system in your infrastructure and overlay related events.
  • Investigate from the top-level system and verify work metrics, resource metrics, and related events.
  • If a problem is found, recursively investigate the root cause using the same method as above.

Chapter 5: Visualizing Metrics with Time Series Graphs

Aggregation across space

[Image: requests per host vs. requests per availability zone]

I can't really tell what's going on when I look at the requests for each host, but I can clearly see it for each availability zone.
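To make "aggregation across space" concrete, here is a hedged sketch of querying a request metric grouped by availability zone with the datadogpy API; the metric name and tag key are assumptions for illustration.

```python
# Hedged sketch: one time series per availability zone instead of one per host.
# The metric name and tag key are assumptions.
import time
from datadog import initialize, api

initialize(api_key="YOUR_API_KEY", app_key="YOUR_APP_KEY")

now = int(time.time())
series = api.Metric.query(
    start=now - 3600,  # the last hour
    end=now,
    query="sum:web.requests{*} by {availability-zone}",
)
```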

Line graphs


The intended use and examples are as follows.

  • Understand outliers at a glance – CPU idle for each host in the cluster
  • Clearly communicate the evolution of important metrics over time – average latency across all web services
  • Determine whether individual outliers are unacceptable – Disk space usage by database node

This part is quite practical.

It was very easy to understand because it explained what to display, why, and gave examples. You could also use it as a reference when building dashboards for your own incident handling and monitoring preparations.

It also described, with images, improper uses of each graph type and which graphs to use instead.

Stacked Area Chart – Metric values are displayed as two-dimensional bands instead of lines

I thought about posting all of them here, but there were too many, so I gave up... (tears)

For now, it might help when you're wondering, "What kind of dashboard do I need when I want to monitor something?"

Chapter 6: Visualizing Metrics with Summary Graphs

Aggregation across time – Example: Maximum value reported by each host over the past 60 minutes

Aggregation across space – example: Show Redis latency for each service

Single value summary – Example: Current number of hosts in OK state

Toplist – Example: Maximum Redis latency for each AZ in the past hour

Trend graph – Example: Compare the last 4 hours of data for each login method with yesterday’s 4 hours of data.

Host Map – Resource usage list

"Datadog's unique visualization method", apparently. It looks like a honeycomb and makes it easy to get an overview of many hosts.

Wataro:

Futuristic!

Distribution – e.g. web latency per host etc.

Chapter 7: Centralize all your information: How to monitor your ELB

Key ELB performance metrics

A brief explanation of ELB and its key metrics was given. When actually using them, it seems necessary to understand each metric's exact definition.

  • Load balancer metrics
  • Backend-related metrics
  • Useful metrics that can be obtained from CloudWatch:
    • RequestCount
    • SurgeQueueLength – requests buffered on the ELB side (for example, when backend capacity is insufficient and requests are temporarily held in the ELB)
    • SpilloverCount – the number of requests rejected with an error because the SurgeQueueLength limit above was exceeded
    • HTTPCode_ELB_4XX*
    • HTTPCode_ELB_5XX*

Metrics whose meaning is obvious at a glance seem fine, but I think you need to be careful with the ones whose definitions aren't obvious.
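For reference, here is a hedged sketch of pulling one of the CloudWatch metrics above (SurgeQueueLength) for a Classic ELB with boto3; the load balancer name and region are assumptions.

```python
# Hedged sketch: fetching SurgeQueueLength for a Classic ELB from CloudWatch with boto3.
# The load balancer name and region are placeholder assumptions.
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ELB",
    MetricName="SurgeQueueLength",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "my-elb"}],  # hypothetical ELB
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Maximum"],  # peak queue depth in each 5-minute window
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```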

  • Metrics to alert on – Latency
  • Metrics to watch – BackendConnectionErrors
[Image: EC2 Datadog dashboard template]
  • Native metrics – get EC2 metrics directly from the server into Datadog instead of via CloudWatch
    • It looks like you need to install the Agent
    • By installing the Agent, you can also collect application metrics from MySQL, Nginx, Redis, etc.

Chapter 8: Centralize everything: Monitoring Docker

  • Monitoring Docker is extremely important (but the challenges of monitoring containers are not well understood)
  • The boundary between application performance monitoring and infrastructure monitoring tends to be a blind spot

Why use containers?

  • Ease of scaling
  • Break away from dependencies – less management effort for engineers

Container challenges

  • Complexity of operating at scale – basically the same best practices as for hosts can be applied, but the number of containers tends to be far larger than anything dealt with before
  • Rather than monitoring from a host-centric perspective, monitor from a layer- and tag-centric perspective
  • Complex conditions can be specified by issuing queries
  • Monitor all layers of your stack together
  • Tag containers – treat them as entities that can be queried

Key Docker resource metrics

  • System CPU
  • User CPU
  • Throttling
  • Memory
  • RSS
  • Cache memory
  • Swap
  • I/O
  • Network bytes
  • Number of network packets
  • etc. (a small sketch of reading these from the Docker API is below)
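As a rough illustration of where these numbers come from (my own sketch, not the book's), the Docker API exposes the raw counters per container; this uses the docker-py SDK, and the exact field names can vary by Docker/cgroup version.

```python
# Hedged sketch: reading per-container CPU and memory counters with docker-py.
# Field names follow Docker's stats API and may differ between versions.
import docker

client = docker.from_env()

for container in client.containers.list():
    stats = container.stats(stream=False)  # one-shot stats snapshot
    cpu = stats.get("cpu_stats", {}).get("cpu_usage", {})
    mem = stats.get("memory_stats", {})
    print(
        container.name,
        "user_cpu:", cpu.get("usage_in_usermode"),
        "system_cpu:", cpu.get("usage_in_kernelmode"),
        "memory_usage:", mem.get("usage"),
    )
```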

How iHeartRadio monitors Docker

  • Operate various services
  • small group of engineers
  • Disadvantages of Docker – No visualization of resource usage at the container level
  • Using Datadog – Easily start Agent containers with the docker run command
  • The metrics that can be collected are different depending on whether the agent is installed directly on the host or in a container.
  • Service Discovery – Each time a container is created and started, the Agent identifies which services are running in the new container.

Chapter 9: Datadog is a dynamic, cloud-scale monitoring solution

  • Comprehensive monitoring – can be integrated with various services
  • Flexible aggregation – Easy to aggregate various metrics and events by tagging
  • Easy scaling
  • Advanced alert creation
  • Enhanced collaboration – team collaboration is easy (it is certainly convenient that things like who created a metric or monitor are recorded)

Summary

So, since I use Datadog at work, I tried to read through Modern Infrastructure Monitoring quickly, but it took me quite a while just to summarize it...

Personally, I got the impression that, as the number of containers scales up, they are aiming for a shift from "detailed monitoring of each host" to "monitoring of abstracted groups".

Wataro:

I thought I'd read it quickly, but since I'm a monitoring amateur, it took longer than I expected...
