Customers choose the Sysdig Monitor Software offering in order to deploy the Sysdig Monitor backend within their own data center or VPC. As with all software this gives you fine-grained control of the backend, but the administrator will need to make decisions that affect performance, reliability and cost.
These are important decisions as they will impact not only your performance but your users’ experience with Sysdig and your monitoring data. We want to help you make sure the experience is as flawless as it can be.
This sizing guide will help you understand the components of the backend and allow you to make the right decisions based on your environment and your needs. We’ve boiled down our basic recommendations so that you can get started quickly; we’ve also outlined how to scale up as you grow your deployment.
Understanding the Sysdig Monitor backend architecture
There are a number of components that make up the Sysdig Monitor Backend. Whether you are working with the Sysdig Monitor single-node deployment, or the horizontally scalable, multi-node deployment, your software will contain the same components. The diagram below shows the relationship between various services.
Figure 1. A conceptual diagram of all the software components that make up Sysdig Monitor.
The table below describes the purpose of each of these components.
|Component||Stateful / Stateless||Purpose|
|Agent||N/A||The agent runs on each monitored host and collects the appropriate metrics and events|
|API Endpoint||Stateless||Browsers interface with the API to make data requests. You may write scripts to interact directly with the API.|
|Collector||Stateless||Agents connect here to deliver data to the Sysdig Monitor backend|
|Cassandra||Stateful||Stores all metrics|
|Elasticsearch||Stateful||Stores all events & metadata|
|Worker||Stateless||Processes data aggregations and alerts|
|Redis||Semi-Stateful||Serves as an intra-service cache|
|MySQL||Stateful||Stores user credentials & environmental data|
At a high level there are three models of deployment for Sysdig Monitor. You can think of them as small, medium, and large. They each have different implications for the maximum number of agents that can be monitored as well as the resiliency of the deployment.
|Deployment model||Maximum number of hosts monitored||Minimum nodes required for Sysdig Monitor||HA||Notes|
|Small||20||1||None||Only recommended for testing or Dev deployments|
|Medium||150||3||Some||Recommended for production and provides for future scalability|
|Large||10,000+||10+||Yes||Recommended for production and provides for future scalability|
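As a rough illustration, the table above can be read as a simple lookup from host count to deployment model. The sketch below is not part of the Sysdig product; the function name `pick_deployment_model` is hypothetical, and the Large row is treated as open-ended because the table lists it as "10,000+".

```python
def pick_deployment_model(hosts: int) -> dict:
    """Return the smallest model in the table whose host limit covers `hosts`."""
    if hosts < 0:
        raise ValueError("hosts must be non-negative")
    # (name, max hosts, minimum backend nodes, HA level) copied from the table.
    # The Large row is listed as "10,000+", so it is treated as open-ended here.
    models = [
        ("Small", 20, 1, "None"),
        ("Medium", 150, 3, "Some"),
        ("Large", None, 10, "Yes"),
    ]
    for name, max_hosts, min_nodes, ha in models:
        if max_hosts is None or hosts <= max_hosts:
            return {"model": name, "min_nodes": min_nodes, "ha": ha}


print(pick_deployment_model(120))  # {'model': 'Medium', 'min_nodes': 3, 'ha': 'Some'}
```

Keep in mind that the Small model is only recommended for testing or dev deployments, so a production environment with fewer than 20 hosts would still start at Medium.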
Detailed Sizing for Small, Medium, and Large Deployments
The smallest deployment of Sysdig Software requires a single backend node to run all components. Installation requires a single 2-core, 8GB RAM machine with 30GB disk space. Given that all components are running on one machine, it cannot sustain heavy loads and has no HA.
This makes it ideal for feature/functionality trials or dev/test environments where reliability isn’t as high a concern.
We do not recommend this for production environments.
The medium deployment provides for additional scale-out and some reliability through the use of 3 nodes, but it still relies heavily on shared physical resources. Each node is assumed to be a machine with 8 cores and 16GB of RAM. These recommendations scale roughly linearly - that is, two 4-core, 8GB RAM machines can be used in place of one of the larger recommended machines.
This model works well in production because each stateful service functions across multiple nodes, while stateless services are spread throughout.
Figure 2. A medium-sized Sysdig Monitor deployment that provides for some reliability while still sharing physical resources.
We recommend that most customers start with this medium sized deployment. From here, you can grow any component of your system as needed.
As different components have different roles and sizing characteristics, starting with this deployment model lets you see what is taking on the most load in your system. For example:
|If you are experiencing...||What to do|
|Starting to scale number of agents or collect more metrics per agent||Scale Cassandra|
|Increasing API utilization||Scale API servers|
|Increasing collection of events||Scale Elasticsearch|
Tip: For the majority of customers who start with the Medium deployment and then need to scale up, we typically recommend a next step of three additional servers which are shared by Elasticsearch and Cassandra. This gives your datastores greater reliability, while providing more resources for all components.
The large model is a scale-out design that accepts high physical resource usage in exchange for a very high number of supported hosts, high availability, and the easiest path to further scaling individual components as needed.
You will notice that even the 250-agent deployment has a large number of servers, because it is designed with the highest level of scale and reliability. We don’t recommend that customers start here - the Medium model is preferred.
Instead, use these as guidelines to help you understand where you are headed once you’ve adopted Sysdig Monitor at very high scale.
|For a deployment up to this many agents||250||500||1000|
|Plan on this many nodes per component (assumes suggested HA - see notes below on how to tune/reduce components)|
|Cassandra||3 / 250GB||3 / 500GB||3 / 1TB|
|Elasticsearch||3 / 250GB||3 / 500GB||3 / 1TB|
|Redis||1 (on shared host)||1 (on shared host)||1 (on shared host)|
|MySQL||1 (on shared host)||1 (on shared host)||1 (on shared host)|
For simplicity each node is assumed to be a machine with 8 cores and 16GB of RAM. These recommendations can scale roughly linearly - that is, two 4-core, 8 GB RAM machines can be used in place of one of the larger recommended machines.
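The "roughly linear" substitution can be sketched as a small calculation. This is only an approximation under the guide's stated assumption of an 8-core, 16GB reference node; the function name `equivalent_machines` is hypothetical, and the sketch does not account for per-component constraints such as replica counts for stateful services.

```python
import math

REF_CORES = 8      # reference node size stated in the guide
REF_RAM_GB = 16


def equivalent_machines(ref_nodes: int, cores: int, ram_gb: float) -> int:
    """Machines of the given size needed to match `ref_nodes` reference nodes."""
    if cores <= 0 or ram_gb <= 0:
        raise ValueError("cores and ram_gb must be positive")
    by_cpu = ref_nodes * REF_CORES / cores
    by_ram = ref_nodes * REF_RAM_GB / ram_gb
    # Whichever resource runs out first decides the machine count.
    return math.ceil(max(by_cpu, by_ram))


print(equivalent_machines(3, cores=4, ram_gb=8))  # 6
```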
Component-by-component sizing considerations
Each component has particular dynamics that will drive sizing. For the system as a whole, sizing will be driven by the following characteristics of your system:
- Number of hosts monitored
- Containers per host (container density)
- Container churn (how quickly containers come and go)
- Number of metrics collected per host or container
- Number of users logging in
- Your tolerance for downtime (to shape HA requirements)
Not every component in the system is impacted in the same way (or impacted at all) by every characteristic. We can step through each component and better understand how to size it.
The API servers serve out API requests to the front-end. The API server also serves out the front-end resources (JS, CSS, images etc) to the browser.
The API is impacted first and foremost by the number of concurrent browser sessions, and secondarily by the amount of data each request returns. Thus, if you have a large environment with many containers, you might find you are at the low end of the numbers below. If that’s not the case, or if most of your users are scoped to small teams, you may find you can support more sessions than we suggest. This is a stateless component, so API servers can easily be spun up and down as your load requires.
We recommend one API server per 5-10 concurrent user sessions. Additional sessions can be supported depending on the characteristics above, though response time may suffer in some cases. We highly recommend that you observe your own environment here and adjust accordingly.
We recommend N+1 for high availability.
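A minimal sketch of the API sizing rule, assuming the 5-10 sessions-per-server range and the N+1 recommendation above. The function name `api_servers` and the choice of 5 as the conservative default are illustrative, not Sysdig-provided values.

```python
import math


def api_servers(concurrent_sessions: int, sessions_per_server: int = 5) -> int:
    """API servers for the load, plus one spare for N+1 availability."""
    if concurrent_sessions < 0 or sessions_per_server <= 0:
        raise ValueError("invalid input")
    # Keep at least one serving instance even when nobody is logged in.
    needed = max(1, math.ceil(concurrent_sessions / sessions_per_server))
    return needed + 1


# Compare the conservative and optimistic ends of the 5-10 range.
for per_server in (5, 10):
    print(per_server, api_servers(40, per_server))  # 5 -> 9, 10 -> 5
```

Running both ends of the range gives a bracket to start from; observed response times in your own environment should decide where you land within it.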
The collectors are responsible for receiving data from the agents running on the monitored hosts and persisting it in the Sysdig Monitor backend. This component is stateless and can be safely destroyed and a new one created in its place.
Collectors typically scale with the number of agents in the environment.
Our rough guideline is 500 agents per collector.
We recommend N+1 for high availability.
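The collector guideline can be expressed the same way. This sketch assumes the rough figure of 500 agents per collector plus one spare for N+1; `collectors` is a hypothetical helper name.

```python
import math

AGENTS_PER_COLLECTOR = 500  # rough guideline from this guide


def collectors(agents: int) -> int:
    """Collectors needed for `agents`, plus one spare for N+1 availability."""
    if agents < 0:
        raise ValueError("agents must be non-negative")
    # Always keep at least one collector for agents to connect to.
    needed = max(1, math.ceil(agents / AGENTS_PER_COLLECTOR))
    return needed + 1


print(collectors(250))   # 2
print(collectors(1000))  # 3
```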
Workers are potentially CPU intensive and are used for doing 3 categories of work: Rollups, Alerting, and polling AWS Metrics. This component is stateless and can be safely destroyed and recreated.
We recommend 2 workers as the default install for high availability purposes.
Cassandra is used as the metrics store for Sysdig agents. It is the most dynamic component of the system, and requires additional attention to ensure that your system is performing well and highly responsive.
This component is stateful, and should be treated more carefully than stateless components. Cassandra sizing is based on a minimum replication factor as well as the number of agents writing data.
A minimum replication factor of 3 is recommended for our app, which allows the cluster to survive the failure of 1 Cassandra instance.
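To see why a replication factor of 3 tolerates one failure, it helps to look at quorum arithmetic. The sketch below assumes reads and writes use Cassandra's QUORUM consistency level, where a quorum is `floor(RF / 2) + 1` replicas; this guide does not state which consistency level Sysdig Monitor uses, so treat that as an assumption.

```python
def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while a quorum of replicas is still reachable."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    # Assumption: QUORUM consistency, i.e. a majority of replicas must respond.
    quorum = replication_factor // 2 + 1
    return replication_factor - quorum


for rf in (1, 2, 3, 5):
    print(rf, tolerated_failures(rf))  # 1 -> 0, 2 -> 0, 3 -> 1, 5 -> 2
```

Under that assumption, a replication factor of 2 gives no more failure tolerance than 1, which is one reason 3 is a common minimum.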
Each agent consumes anywhere from 500MB to 2GB of Cassandra storage, with average sizing at 1GB/agent. Because of our data aggregation model, this storage should comfortably handle multi-year history. This figure then needs to be multiplied by the replication factor to determine the total disk space required. A rough calculation works out like this:
100 agents = 100GB raw, with replication factor of 3, hence 300GB total
To be safe we recommend that you size some additional disk space as buffer (say 25-50%) on top of that.
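The calculation above, including the suggested buffer, can be sketched as follows. The defaults (1GB per agent, replication factor 3, 25% buffer) come from this guide; the function name `cassandra_disk_gb` is hypothetical.

```python
def cassandra_disk_gb(
    agents: int,
    gb_per_agent: float = 1.0,
    replication_factor: int = 3,
    buffer: float = 0.25,
) -> float:
    """Total Cassandra disk (GB) across the cluster, including safety buffer."""
    if agents < 0 or gb_per_agent < 0 or replication_factor < 1 or buffer < 0:
        raise ValueError("invalid input")
    raw = agents * gb_per_agent            # data before replication
    replicated = raw * replication_factor  # every row stored RF times
    return replicated * (1 + buffer)       # headroom on top


print(cassandra_disk_gb(100, buffer=0.0))                    # 300.0
print(cassandra_disk_gb(100, buffer=0.25))                   # 375.0
print(cassandra_disk_gb(100, gb_per_agent=2.0, buffer=0.5))  # 900.0
```

The last call shows the pessimistic end of the range: 2GB per agent with a 50% buffer triples the baseline estimate, so it is worth measuring actual per-agent usage once the deployment is running.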
Elasticsearch is used to store event data and quickly provide powerful searches. It also serves to store metadata on which Cassandra relies.
The largest driver of Elasticsearch storage is the number of events stored. Typically an event takes 1KB of data. Our very conservative estimates assume you’ll need 1 GB of storage per agent, spread across the number of nodes you’re using. This is again something you can observe within your environment and then adjust your storage accordingly. However, we've found that most Sysdig Monitor customers find it easier to over-provision storage to be safe.
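As a rough sketch of the Elasticsearch estimate, the snippet below spreads the conservative 1GB-per-agent figure across the nodes and converts it to an approximate event count at 1KB per event. Both helper names are hypothetical, decimal units (1GB = 1,000,000KB) are assumed, and replica shards and index overhead are ignored.

```python
def es_storage_per_node_gb(agents: int, nodes: int, gb_per_agent: float = 1.0) -> float:
    """Storage (GB) each Elasticsearch node needs if data is spread evenly."""
    if agents < 0 or nodes < 1:
        raise ValueError("invalid input")
    return agents * gb_per_agent / nodes


def events_per_agent(gb_per_agent: float = 1.0, kb_per_event: float = 1.0) -> int:
    """Approximate events that fit in one agent's storage allowance."""
    if kb_per_event <= 0:
        raise ValueError("kb_per_event must be positive")
    # Assumption: decimal units, so 1 GB = 1,000,000 KB.
    return int(gb_per_agent * 1_000_000 / kb_per_event)


print(es_storage_per_node_gb(300, nodes=3))  # 100.0
print(events_per_agent())                    # 1000000
```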
While event data is important, it may not be as critical to your usage of the product as metric data stored in Cassandra. As a result, you may choose not to run Elasticsearch in HA mode.
As with Cassandra, we recommend a minimum of 3 nodes to enable scale/HA. Otherwise, 1 node suffices.
Redis is used to exchange messages between the different components of our application.
This component is stateless and can be destroyed and recreated but needs an application restart to do it safely.
It is low volume / low resource utilization, and as such we recommend that Redis share a node with MySQL.
MySQL is used to store configuration data (user accounts, dashboard configurations, etc). This component is stateful, and also low volume like Redis.
While commercial versions of MySQL can be deployed in active-active HA mode, open source versions cannot. As such, we typically recommend backing up MySQL rather than running it in HA mode.
We recommend that MySQL share a node with Redis.
Questions, Comments, or Need Advice?
Please visit sysdig.com/support for more data and documentation. Also, do not hesitate to open a support ticket and we can assist you through any sizing challenges or migrations. We’re happy to help.