Basics: Understanding how Sysdig Monitor aggregates data

 

In the ideal monitoring system, you have:

  • The entire history of your data
  • At full resolution
  • And the ability to visualize it at full granularity

 

In reality, limits on performance, cost, and even a person’s ability to process data make that ideal neither realistic nor particularly valuable. Sysdig has tuned its system to automatically manage data at the appropriate resolution, striking a balance without requiring users to perform any additional maintenance work around data retention.

We do this through automatic aggregation of data stored for long periods of time, and through on-the-fly aggregation based on what you are trying to visualize in your browser. In a general sense, aggregation means presenting data in summary form, for example representing many samples of data with a single point. Let’s discuss the two types of aggregation Sysdig performs and how you can work with them.

Two Types of Aggregation: Time aggregation and Group aggregation

Aggregations in Sysdig Monitor fall into two major categories: across time and across groups. These aggregation methods apply to metrics displayed in charts as well as metrics used to calculate alerts.

 

Screen_Shot_2017-05-16_at_12.13.31_PM.png

In the above example, we see both kinds of aggregation in one image. First, we’re displaying data over a long period of time (2 weeks in this case), so the data is aggregated over time, reducing the number of datapoints required to represent the entire timeframe. We’re also displaying one line per deployment (service), and while it is not evident here, each of those services could be made up of any number of containers. So each of those lines represents a group aggregation.

Let’s give you the details to help you understand how these aggregations work and how you can best leverage them to your advantage.

 

Methods of aggregating data

In a generic sense, any consistent mathematical formula could be used to aggregate data. Sysdig Monitor has many built-in aggregation methods that can be applied to your data. We’ve found that these are the most common aggregations that our users want to apply when looking at historical data. The fact that we have pre-built these into our system simplifies any data management on your part and also makes the system very fast and responsive to your queries.

 

 

| Method  | Time Aggregation | Group Aggregation |
|---------|------------------|-------------------|
| Average | x                | x                 |
| Max     | x                | x                 |
| Min     | x                | x                 |
| Rate    | x                |                   |

 



Time Aggregation

Sysdig agents collect raw data at 1-second resolution. Given Sysdig Monitor’s long data retention periods (see our pricing page for retention per tier), Sysdig natively aggregates this data into fewer datapoints over long periods of time for both speed and data efficiency. (This process is sometimes called a “roll up”.) In fact, the first aggregation happens before data is ever stored: the backend process that first receives 1-second points from Sysdig agents aggregates them to 10-second granularity before committing them to long-term storage.

 

The backend stores four native forms of data, out of which the aggregation methods described above are created:

  • Sum
  • Count
  • Max
  • Min
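As a rough illustration of how the four stored forms can yield the aggregation methods, here is a minimal Python sketch. The `Rollup` class and the `rollup_from_samples`, `merge` and `rate_per_second` helpers are hypothetical names, not part of any Sysdig API, and the rate formula (sum divided by interval length) is an assumption, since this article does not say how rates are computed.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Rollup:
    """Hypothetical stored datapoint holding the four native forms."""
    total: float    # Sum
    count: int      # Count
    maximum: float  # Max
    minimum: float  # Min

    @property
    def average(self) -> float:
        # Average is derived from Sum and Count rather than stored.
        return self.total / self.count


def rollup_from_samples(samples: Iterable[float]) -> Rollup:
    """Summarize raw samples (e.g. ten 1-second points) as one datapoint."""
    values = [float(v) for v in samples]
    if not values:
        raise ValueError("need at least one sample")
    return Rollup(sum(values), len(values), max(values), min(values))


def merge(rollups: Iterable[Rollup]) -> Rollup:
    """Combine finer datapoints into one coarser datapoint."""
    items = list(rollups)
    if not items:
        raise ValueError("need at least one rollup")
    return Rollup(
        total=sum(r.total for r in items),
        count=sum(r.count for r in items),
        maximum=max(r.maximum for r in items),
        minimum=min(r.minimum for r in items),
    )


def rate_per_second(rollup: Rollup, interval_seconds: float) -> float:
    """ASSUMPTION: rate is taken as Sum divided by the interval length."""
    return rollup.total / interval_seconds


first = rollup_from_samples(range(1, 11))    # ten samples: 1..10
second = rollup_from_samples(range(11, 21))  # ten samples: 11..20
merged = merge([first, second])              # one 20-second datapoint
```

Because Sum and Count are kept separately, the average of a merged datapoint stays exact even when the finer datapoints hold different numbers of samples, which would not be true if only averages were stored.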

 

Sysdig Monitor also automatically determines what resolution data to display to you. Unlike systems that require you to manually roll up data into different metric names, Sysdig dynamically manages the computations behind the scenes for you. This table summarizes the resolution (frequency) of data that you’ll see:

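| If you request data in... | You’ll see one data point every... |
|---------------------------|------------------------------------|
| The last 10 minutes       | 10 seconds                         |
| The last hour             | 1 minute                           |
| The last day              | 10 minutes                         |
| The last 2 weeks          | 1 hour                             |
| Older than two weeks      | 1 day                              |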
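The table above can be read as a simple lookup. The following Python sketch is one way to express it, assuming the resolution is chosen from the age of the oldest point requested and that boundaries are inclusive; both are assumptions, and `resolution_for` and `expected_points` are hypothetical helper names rather than Sysdig functions.

```python
from datetime import timedelta

# (maximum age of the oldest requested point, resolution), from the table.
RESOLUTION_TABLE = [
    (timedelta(minutes=10), timedelta(seconds=10)),
    (timedelta(hours=1), timedelta(minutes=1)),
    (timedelta(days=1), timedelta(minutes=10)),
    (timedelta(weeks=2), timedelta(hours=1)),
]
OLDEST_RESOLUTION = timedelta(days=1)  # anything older than two weeks


def resolution_for(age_of_oldest_point: timedelta) -> timedelta:
    """Return the spacing between datapoints for data of the given age."""
    if age_of_oldest_point < timedelta(0):
        raise ValueError("age must not be negative")
    for max_age, resolution in RESOLUTION_TABLE:
        if age_of_oldest_point <= max_age:  # ASSUMPTION: inclusive boundary
            return resolution
    return OLDEST_RESOLUTION


def expected_points(span: timedelta, resolution: timedelta) -> int:
    """Number of datapoints needed to cover a span at a resolution."""
    return span // resolution


recent = expected_points(timedelta(weeks=2), resolution_for(timedelta(weeks=2)))
old = expected_points(timedelta(weeks=2), resolution_for(timedelta(days=90)))
```

Under these assumptions, a two-week window ending now is covered by 336 hourly points, while the same two-week span from three months ago is covered by only 14 daily points.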

 

For example, if we examine a metric over the last 10 minutes, we get 10-second resolution by default. Hovering over a data point shows the timestamp down to 10-second granularity:

Screen_Shot_2017-05-16_at_12.14.36_PM.png

If we look at the last two weeks for the same metric, we get 1-hour resolution.

Screen_Shot_2017-05-16_at_12.15.16_PM.png

If we look at 2 weeks’ worth of data, but do it from 3 months ago, historical data points are fewer, since that data has already been aggregated into coarser resolution data.

Screen_Shot_2017-05-16_at_12.16.09_PM.png

 

If you use a very specific time frame, for example “Show me data from 5:07 to 6:25 yesterday,” Sysdig will auto-align your query to fit as many data points as possible at the given resolution. Sysdig alerts you to this as you’re adjusting your timeframe. You’ll see this in the example below.

Screen_Shot_2017-05-16_at_12.17.17_PM.png
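To make the idea of auto-alignment concrete, here is a minimal Python sketch that snaps a requested time range outward to the nearest resolution boundaries. Whether Sysdig widens or narrows the range is not stated in this article, so the outward snapping is an assumption, and `align_range` is a hypothetical helper name.

```python
from datetime import datetime, timedelta
from typing import Tuple

EPOCH = datetime(1970, 1, 1)  # reference point for resolution boundaries


def align_range(
    start: datetime, end: datetime, resolution: timedelta
) -> Tuple[datetime, datetime]:
    """Snap start down and end up to multiples of the resolution.

    ASSUMPTION: the range is widened, never narrowed, so every datapoint
    that overlaps the requested range is included.
    """
    if end < start:
        raise ValueError("end must not be before start")
    start_offset = (start - EPOCH) % resolution
    end_offset = (end - EPOCH) % resolution
    aligned_start = start - start_offset
    if end_offset == timedelta(0):
        aligned_end = end  # already on a boundary
    else:
        aligned_end = end + (resolution - end_offset)
    return aligned_start, aligned_end


# The 5:07 to 6:25 request from the text, at 10-minute resolution.
aligned = align_range(
    datetime(2017, 5, 15, 5, 7),
    datetime(2017, 5, 15, 6, 25),
    timedelta(minutes=10),
)
```

With these inputs the range becomes 5:00 to 6:30, so it covers nine whole 10-minute datapoints instead of cutting two of them partway through.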


When a container or host is no longer monitored (for example, when the container is killed or an agent is uninstalled), historical data is not removed and will continue to be stored according to the policy described above. You will always be able to see your metrics in the views for the time period during which the agent was installed and reporting.

Finally, when displaying data at any resolution, you choose which aggregation you’d like to see. “Average” is displayed by default, but you can change that by selecting the aggregation setting in Explore or in a Dashboard Panel.

Screen_Shot_2017-05-16_at_12.35.22_PM.png

Group Aggregation

When you are examining a metric across a group of items (e.g., hosts or containers), metrics are averaged across the members of the group by default. It’s important to note that group aggregation always takes place after time aggregation.

For example, we have a Java App deployment made up of three Kubernetes pods.

Screen_Shot_2017-05-16_at_12.56.46_PM.png

Each pod reports its own CPU usage for each sample interval. The three values are aggregated and shown on the chart as a single datapoint for that metric. Using the Group Aggregation menu, you can change group aggregation between Rate, Minimum, Maximum and Sum.
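The order of operations (time aggregation first, then group aggregation) can be sketched in a few lines of Python. The pod names and CPU values below are made up for illustration, and `aggregate` and the `AGGREGATORS` table are hypothetical, not Sysdig functions.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical CPU % samples for three pods over the same time window.
pod_samples: Dict[str, List[float]] = {
    "java-app-1": [20.0, 30.0, 40.0],
    "java-app-2": [10.0, 10.0, 10.0],
    "java-app-3": [60.0, 80.0, 70.0],
}

AGGREGATORS = {"avg": mean, "max": max, "min": min, "sum": sum}


def aggregate(
    samples: Dict[str, List[float]], time_agg: str = "avg", group_agg: str = "avg"
) -> float:
    """Collapse each member over time, then collapse across the group."""
    # Step 1: time aggregation, one value per pod.
    per_member = [AGGREGATORS[time_agg](values) for values in samples.values()]
    # Step 2: group aggregation, one value for the whole deployment.
    return AGGREGATORS[group_agg](per_member)


group_average = aggregate(pod_samples, "avg", "avg")  # default on both axes
group_maximum = aggregate(pod_samples, "avg", "max")  # busiest pod on average
group_sum = aggregate(pod_samples, "avg", "sum")      # total across pods
peak_of_peaks = aggregate(pod_samples, "max", "max")  # highest single sample
```

The same raw samples give quite different single datapoints depending on the two choices, which is why it helps to know both the time aggregation and the group aggregation behind any chart.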

 

We can start by using the “segment by” feature to segment the deployment by Pods, to see the value of each pod:

Screen_Shot_2017-05-16_at_12.57.35_PM.png

Then we can instead look at the aggregate across the entire deployment, which by default is an average across the group:

Screen_Shot_2017-05-16_at_12.57.50_PM.png

 

Finally, we could switch the aggregation to see the Maximum value at any given point:

Screen_Shot_2017-05-16_at_9.12.22_PM.png



Conclusion

Every chart and alert in Sysdig Monitor is built on aggregated data, so understanding aggregation gives you the context you need to troubleshoot your system. When looking at any data, ask yourself the following:

  • What timeframe am I looking at, and therefore what resolution has the data been aggregated to?
  • What type of aggregation is it? Average, Sum, Min, Max, Rate?
  • What's the group aggregation? What type of aggregation?