Challenges of Monitoring Containers in Multicloud Environments
Containers can deliver flexibility and faster development cycles, but enterprises rarely deploy only one cloud type; most run containers across multiple environments. In such hybrid environments, how can enterprises monitor everything in a consistent and efficient manner so they can analyze and act on the data?
Prevalence of Multicloud Environments and Monitoring Challenges
Now that cloud computing has changed how businesses build IT systems, creating an effective hybrid cloud architecture has become increasingly critical. Hybrid clouds offer opportunities for innovation, simplicity, protection of data, performance, customization and the chance to avoid vendor lock-in. By offering flexibility and high availability, hybrid cloud services—whether seen as combining public and private clouds or integrating on-premises software with multiple cloud services—are becoming the future of web infrastructure.
The ubiquity of multicloud environments raises the question of how to orchestrate consistent and flexible monitoring across multiple clouds. Realizing the benefits of multiple providers requires a view into all of their environments that resolves incompatibilities in a uniform manner. Cloud providers solve the hybrid problem only for their own infrastructure, and service providers solve it only for the services they provide. Multicloud environments result in incompatible APIs and systems, different dashboards and siloed monitoring, all adding up to inconsistent or poor visibility into the hybrid system, especially when it comes to the time series data it produces.
The Need for Consolidated Monitoring Servers
Implementing a consistent data store to accumulate, analyze and act on the myriad metrics and events generated by thousands of containers and applications is the only prudent way to solve this problem. It provides operational flexibility: every service and container writes its event and metrics data to a consistent API, and that data can then be replicated across the environment and secured.
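As a concrete illustration, the sketch below shows what "writing to a consistent API" can look like in practice: every service or container serializes a point and sends it to one shared write endpoint, no matter which cloud it runs in. The endpoint, database name and InfluxDB-style line protocol are assumptions chosen for illustration, not a prescription.

```python
# Minimal sketch: each container pushes its metrics to a single, consistent
# write API. WRITE_URL and DATABASE are hypothetical; the payload uses an
# InfluxDB-style line protocol as one example of a consistent format.
import time
import requests

WRITE_URL = "https://metrics.example.internal/write"  # hypothetical consolidated endpoint
DATABASE = "container_metrics"                        # hypothetical database name

def write_metric(measurement, tags, fields):
    """Serialize one point as line protocol and POST it to the shared store."""
    tag_str = ",".join(f"{k}={v}" for k, v in tags.items())
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    timestamp_ns = int(time.time() * 1e9)  # nanosecond-precision timestamp
    line = f"{measurement},{tag_str} {field_str} {timestamp_ns}"
    resp = requests.post(WRITE_URL, params={"db": DATABASE}, data=line, timeout=5)
    resp.raise_for_status()

# The same call works regardless of which cloud the container runs in:
write_metric("cpu", {"cloud": "aws", "container_id": "abc123"}, {"usage_percent": 42.5})
```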
Understanding the Data Problem
Metrics and events are really just time series data: data that carries a timestamp. The analysis is all about looking at change over some time boundary. A time series data platform is the best architecture to consider for the metrics and events store. A time series database (TSDB) is built specifically for handling time-stamped metrics, events and measurements, and it is optimized for measuring change over time, even across multicloud environments.
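To make the "change over a time boundary" idea concrete, here is a minimal, self-contained sketch with made-up values that turns raw timestamped samples into a per-minute rate of change, which is exactly the kind of question a TSDB is optimized to answer at scale.

```python
# Illustrative only: given timestamped samples of a cumulative counter,
# compute the per-minute rate of change between consecutive samples.
from datetime import datetime, timedelta

samples = [  # (timestamp, cumulative network bytes) -- made-up values
    (datetime(2023, 1, 1, 12, 0, 0), 1_000_000),
    (datetime(2023, 1, 1, 12, 1, 0), 1_600_000),
    (datetime(2023, 1, 1, 12, 2, 0), 2_500_000),
]

for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
    minutes = (t1 - t0) / timedelta(minutes=1)
    print(f"{t1.isoformat()}  rate = {(v1 - v0) / minutes:,.0f} bytes/min")
```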
Looking into a TSDB for the Metrics and Event Store
Metrics and event data are time series data, which has unique properties that make it very different from other data workloads: high ingestion rates, real-time queries and time-based analytics. These properties are described in further detail below.
- Needs fast ingestion of data: Time series data operates at a significantly different scale than many traditional or relational databases. In a containerized environment, a single container could generate a thousand points per second, depending on how frequently you poll for information such as CPU usage, memory usage and disk usage. Trying to fit all that data into a traditional database architecture results in locking issues (the database, while trying to ingest all this data, locks out other reads and writes) and performance problems, since the internal database engine is not optimized for timestamp storage and indexing.
The more advanced time series databases can handle hundreds of thousands of data points per second, absorbing a large volume of writes while keeping the data available for querying, all without blocking. They allow for high-throughput ingest, compression and real-time querying of time series data (a batched-write sketch appears after this list).
- High and low precision data have different retention policies: Time series data is stored based on the precision of the data (CPU usage in milliseconds, container utilization in microseconds, etc.). Which time intervals matter most in terms of business requirements? It is impossible to meet a 99.999 percent SLA if you cannot store performance data at millisecond precision; by the time you detect a problem, it will be too late to fix. How long you store this data, and at what precision, are also important architectural considerations. For example, developers may be interested in the last 15 minutes or hour of data, but over time the value of that data declines rapidly as it becomes useless for operational purposes. Time series databases should be able to expire this data or move it to historical analysis stores once it becomes stale, and roll it up into summaries if it is needed after it has expired.
To resolve storage concerns from keeping that much data over a long period of time, more advanced time series platforms offer data downsampling: keeping the high-precision raw data for a limited time and storing the lower-precision, summarized data for much longer, or forever. Organizations should look for solutions that automate downsampling and the expiration of old data (see the downsampling example after this list).
- Time-based analytics: Stream processing analyzes and performs actions on real-time data. Streaming analytics connects to external data sources, enabling applications to integrate data into the application flow or update an external database with processed information. Analyzing data in real time makes it possible to take operational decisions and apply them to business processes and transactions on an ongoing basis.
An approach to consider is leveraging Kapacitor, a popular open source native data processing engine that can process both stream and batch data in real time. It lets organizations plug in their own custom logic or user-defined functions across containers and hybrid environments to consistently process alerts with dynamic thresholds, match metrics for patterns, compute statistical anomalies and perform specific actions based on these alerts (such as dynamic load rebalancing); a simplified stream-alerting sketch follows this list. Kapacitor also integrates with HipChat, OpsGenie, Alerta, Sensu, PagerDuty, Slack and more.
- Unique and changing time queries: In relational data stores, it is tricky to run queries that group by a given hour of the day and apply functions related specifically to time. The most advanced time series databases handle a wide range of such time-based queries; an example time-bucketed query appears below. They provide a high-performance write and query HTTP(S) API and support ingestion through agents such as Telegraf and protocols such as Graphite, collectd and OpenTSDB.
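Extending the earlier write sketch, the batching pattern below is a rough illustration of how high-throughput ingest is usually achieved on the client side: points are buffered in memory and flushed in batches, so thousands of points per second never turn into thousands of tiny writes. The endpoint, database name and batch size are illustrative assumptions.

```python
# Sketch of client-side batching for high-throughput ingest. Points are kept
# in a buffer and flushed as one newline-separated line-protocol payload.
import time
import requests

WRITE_URL = "https://metrics.example.internal/write"  # hypothetical endpoint
DATABASE = "container_metrics"                        # hypothetical database
BATCH_SIZE = 5000

_buffer = []

def record(measurement, tags, fields):
    """Buffer one point in line-protocol form instead of writing it immediately."""
    tag_str = ",".join(f"{k}={v}" for k, v in tags.items())
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    _buffer.append(f"{measurement},{tag_str} {field_str} {int(time.time() * 1e9)}")
    if len(_buffer) >= BATCH_SIZE:
        flush()

def flush():
    """Send every buffered point in a single request."""
    if _buffer:
        body = "\n".join(_buffer)  # newline-separated points form one batch
        requests.post(WRITE_URL, params={"db": DATABASE}, data=body, timeout=10).raise_for_status()
        _buffer.clear()
```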
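The retention and downsampling behavior described above can be automated. The sketch below assumes an InfluxDB-style TSDB, issuing InfluxQL 1.x statements over its HTTP query API; the database, policy names and durations are illustrative assumptions rather than recommendations.

```python
# Sketch: define retention policies and a continuous query so the database
# itself downsamples and expires data, assuming an InfluxDB-style TSDB.
import requests

QUERY_URL = "https://metrics.example.internal/query"  # hypothetical query endpoint

statements = [
    # Keep the raw, high-precision data for one day only (default write target).
    'CREATE RETENTION POLICY "raw" ON "container_metrics" DURATION 24h REPLICATION 1 DEFAULT',
    # Keep the lower-precision summaries for a year.
    'CREATE RETENTION POLICY "downsampled" ON "container_metrics" DURATION 52w REPLICATION 1',
    # Continuously roll raw CPU points up into five-minute means, preserving tags.
    'CREATE CONTINUOUS QUERY "cq_cpu_5m" ON "container_metrics" BEGIN '
    'SELECT mean("usage_percent") AS "usage_percent" '
    'INTO "container_metrics"."downsampled"."cpu_5m" FROM "cpu" '
    'GROUP BY time(5m), * END',
]

for stmt in statements:
    requests.post(QUERY_URL, params={"q": stmt}, timeout=10).raise_for_status()
```

With something like this in place, raw points expire automatically after a day while the five-minute summaries stay queryable for a year.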
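The following is not Kapacitor itself (Kapacitor defines its logic in its own TICKscript language); it is a minimal Python sketch of the same idea: process points as they stream in, maintain a dynamic threshold from recent history and trigger an action when a point looks anomalous. The webhook URL, window size and threshold are assumptions for illustration.

```python
# Sketch of stream processing with a dynamic, data-driven threshold.
import statistics
from collections import deque

import requests

ALERT_WEBHOOK = "https://hooks.example.internal/alerts"  # hypothetical webhook
WINDOW = deque(maxlen=300)  # roughly the last five minutes at one point per second

def on_point(container_id, cpu_usage):
    """Called for every incoming metric point in the stream."""
    if len(WINDOW) >= 30:  # wait until there is enough history for a baseline
        mean = statistics.mean(WINDOW)
        stdev = statistics.pstdev(WINDOW) or 1e-9
        if cpu_usage > mean + 3 * stdev:  # statistical anomaly vs. recent history
            alert(container_id, cpu_usage, mean)
    WINDOW.append(cpu_usage)

def alert(container_id, value, baseline):
    """Notify an external system; the follow-up action (e.g. rebalancing) lives elsewhere."""
    requests.post(ALERT_WEBHOOK, json={
        "container": container_id,
        "cpu_usage": value,
        "baseline": baseline,
    }, timeout=5)
```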
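For the time-based query case, the sketch below asks the hypothetical store for the average CPU per container over the last hour, grouped into five-minute buckets. It again assumes an InfluxDB-style /query endpoint and InfluxQL; another TSDB would differ in syntax but not in spirit.

```python
# Sketch of a time-bucketed query against an InfluxDB-style HTTP query API.
import requests

QUERY_URL = "https://metrics.example.internal/query"  # hypothetical query endpoint
q = ('SELECT mean("usage_percent") FROM "cpu" '
     'WHERE time > now() - 1h '
     'GROUP BY time(5m), "container_id"')

resp = requests.get(QUERY_URL, params={"db": "container_metrics", "q": q}, timeout=10)
resp.raise_for_status()
for series in resp.json()["results"][0].get("series", []):
    # Print each container's first few 5-minute averages.
    print(series["tags"]["container_id"], series["values"][:3])
```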
Using Time Series Databases in Multicloud Environments
A time series use case typically involves multiple clouds and multiple deployments across different clusters and environments. For example:
- Each server deployed potentially has hundreds of containers and generates a huge data load.
- Containers come out of service and new containers come in.
- Each container has a different ID, and organizations want to track which operations and other stats are associated with which container.
Time series databases are a way to gather metrics and events from apps and microservices and centrally view and manage all components. They provide services and functionality to accumulate, analyze, and act on time series data across multicloud environments.
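One common way this plays out in practice is to tag every point with the cloud, host and container ID, so that container churn simply shows up as new tag values in the central store. The tag names and metadata in the sketch below are assumptions for illustration.

```python
# Sketch: tag each point with its cloud, host and container ID so metrics from
# every environment land in one measurement and remain traceable per container.
import time

def to_line_protocol(cloud, host, container_id, cpu_percent, mem_bytes):
    tags = f"cloud={cloud},host={host},container_id={container_id}"
    fields = f"cpu_percent={cpu_percent},mem_bytes={mem_bytes}i"  # 'i' marks an integer field
    return f"container_stats,{tags} {fields} {int(time.time() * 1e9)}"

# Points from different clouds become directly comparable in the central store:
print(to_line_protocol("aws", "ip-10-0-0-12", "abc123", 41.7, 512 * 1024 * 1024))
print(to_line_protocol("gcp", "gke-node-7", "def456", 12.3, 256 * 1024 * 1024))
```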