Splunk O11y Cloud Blog - The Strengths of No Sampling -

Introduction

In recent years, large-scale applications have had to evolve rapidly and continuously in response to change, which has driven a shift to microservices architectures, in which an application is built as a collection of small services. This trend has created a need for observability (O11y): the ability to check the state of the entire system, from the front end that users see down to the infrastructure, so that dozens or hundreds of microservices can be monitored from a bird's-eye view and the root cause can be identified when a failure occurs.

To achieve a high level of observability, you need to collect and visualize large volumes of data from applications, including logs, metrics, and distributed traces from the front end, back end, and infrastructure. However, because of limits in data storage and processing tools, it is difficult to use all of that data, so most tools sample it and build observability from only a portion.

Note: Distributed trace
A record that ties together the series of operations (a transaction) triggered by a user request, such as calls to various services and accesses to data stores, using shared parameters so that the request can be tracked and managed end to end.

Left: Example of a transaction, Right: Example of a distributed trace
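
To make the definition above concrete, here is a minimal sketch of how a distributed trace might be represented and rendered as a call hierarchy. The field names (`trace_id`, `span_id`, `parent_span_id`) follow a common tracing convention, and the services and timings are hypothetical; none of this is taken from Splunk O11y Cloud's actual data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    """One operation inside a transaction (all fields are illustrative)."""
    trace_id: str                  # shared by every span of the same request
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    service: str
    operation: str
    start_ms: int
    duration_ms: int


def build_tree(spans: List[Span]) -> Dict[Optional[str], List[Span]]:
    """Group spans by parent id so the call hierarchy can be walked."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_span_id, []).append(span)
    for siblings in children.values():
        siblings.sort(key=lambda s: s.start_ms)
    return children


def render(children, parent_id=None, depth=0) -> List[str]:
    """Return one indented line per span, parents before their children."""
    lines = []
    for span in children.get(parent_id, []):
        indent = "  " * depth
        lines.append(f"{indent}{span.service}: {span.operation} ({span.duration_ms} ms)")
        lines.extend(render(children, span.span_id, depth + 1))
    return lines


# Hypothetical checkout request that touches four services.
TRACE = [
    Span("t1", "a", None, "frontend", "GET /checkout", 0, 120),
    Span("t1", "b", "a", "cart-service", "load_cart", 5, 30),
    Span("t1", "c", "a", "payment-service", "charge", 40, 70),
    Span("t1", "d", "c", "payment-db", "INSERT payment", 55, 20),
]

if __name__ == "__main__":
    print("\n".join(render(build_tree(TRACE))))
```

Because every span carries the same `trace_id`, the whole request can be reassembled even though each service reported its part separately.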

Impact of Sampling on Troubleshooting

1. Delayed response when an incident occurs

When a single root cause produces several different patterns of events, sampled data quickly reveals only some of those patterns, and confirming all of them takes time. Reasoning from a partial set of patterns slows down the path to the root cause.

Delayed response when an incident occurs
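
As a rough illustration of why only some patterns show up, the sketch below estimates how many distinct failure patterns a sampled data set would be expected to contain. It assumes each trace is kept independently with a fixed probability (simple probabilistic sampling); the pattern names and counts are made up for the example.

```python
from typing import Dict


def probability_seen(occurrences: int, sample_rate: float) -> float:
    """Chance that at least one of `occurrences` traces survives sampling."""
    return 1.0 - (1.0 - sample_rate) ** occurrences


def expected_patterns_seen(pattern_counts: Dict[str, int], sample_rate: float) -> float:
    """Expected number of distinct patterns present in the sampled data."""
    return sum(probability_seen(n, sample_rate) for n in pattern_counts.values())


# Hypothetical incident: one root cause, three symptom patterns of very
# different frequency.
PATTERNS = {
    "checkout timeout": 500,
    "payment 502": 20,
    "corrupted cart payload": 3,
}

if __name__ == "__main__":
    for rate in (0.05, 0.25, 1.0):
        seen = expected_patterns_seen(PATTERNS, rate)
        print(f"sample rate {rate:.0%}: about {seen:.2f} of {len(PATTERNS)} patterns visible")
```

Under these assumptions the frequent pattern is almost always visible, while the rare ones are often missing, which is exactly the situation where an investigation starts from an incomplete picture.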

2. Overlooking sporadic anomalies

If a sporadic anomaly occurs and is not captured in the sample, it cannot be confirmed until it happens again, and time passes in the meantime. A sporadic anomaly can still cause serious disruption to your business, and delays in identifying and responding to it can lead to lost opportunities and reduced reliability.

Overlooking sporadic anomalies
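
The waiting time described above can be estimated with a small calculation. This is a sketch under simplifying assumptions: the anomaly recurs at a roughly fixed interval, and each occurrence is captured independently with probability equal to the sample rate, so the number of occurrences until the first capture follows a geometric distribution. The function names and numbers are illustrative.

```python
def expected_occurrences_until_captured(sample_rate: float) -> float:
    """Mean number of occurrences needed before one lands in the sample."""
    if not 0.0 < sample_rate <= 1.0:
        raise ValueError("sample_rate must be in (0, 1]")
    return 1.0 / sample_rate


def expected_detection_delay_hours(sample_rate: float, recurrence_interval_hours: float) -> float:
    """Mean time between the first occurrence and the first captured one."""
    missed_before_capture = expected_occurrences_until_captured(sample_rate) - 1.0
    return missed_before_capture * recurrence_interval_hours


if __name__ == "__main__":
    # Hypothetical anomaly that recurs about every 6 hours.
    for rate in (0.01, 0.1, 1.0):
        delay = expected_detection_delay_hours(rate, 6.0)
        print(f"sample rate {rate:.0%}: expected delay of about {delay:.0f} hours")
```

With no sampling (a rate of 1.0) the delay in this model is zero, because the first occurrence is always captured.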

3. Inability to determine when an anomaly occurred

If sampling fails to capture the moment an event first occurs, the time of occurrence cannot be confirmed, which makes it hard to infer the root cause from surrounding activity such as system changes. Root cause identification is delayed, and the response slows down.

Unable to determine when an anomaly occurred
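
The following sketch shows how a sampled view can point at the wrong system change. It uses a deterministic "keep every n-th trace" sampler as a simple stand-in for a real sampling policy, and the traces and change log are entirely hypothetical.

```python
from datetime import datetime, timedelta


def keep_one_in_n(traces, n):
    """Deterministic stand-in for a sampler: keep every n-th trace."""
    return [t for i, t in enumerate(traces) if i % n == 0]


def first_error_time(traces):
    """Earliest timestamp among error traces, or None if there are none."""
    errors = [t["time"] for t in traces if t["error"]]
    return min(errors) if errors else None


def latest_change_before(changes, moment):
    """Name of the most recent change at or before `moment`, if any."""
    prior = [c for c in changes if c["time"] <= moment]
    return max(prior, key=lambda c: c["time"])["name"] if prior else None


START = datetime(2024, 1, 1, 10, 0)

# One trace per minute for two hours. Errors begin 5 minutes in and then
# affect one request in seven.
TRACES = [
    {"time": START + timedelta(minutes=m), "error": m >= 5 and m % 7 == 5}
    for m in range(120)
]

# Hypothetical change log for the same period.
CHANGES = [
    {"name": "deploy payment-service v2", "time": START + timedelta(minutes=3)},
    {"name": "rotate database credentials", "time": START + timedelta(minutes=30)},
]

if __name__ == "__main__":
    for label, data in (("all traces", TRACES), ("1-in-10 sample", keep_one_in_n(TRACES, 10))):
        onset = first_error_time(data)
        print(f"{label}: first error at {onset:%H:%M}, "
              f"nearest prior change: {latest_change_before(CHANGES, onset)}")
```

In this constructed example the full data set places the first error shortly after the deployment, while the sampled data first shows an error only after the later credentials rotation, so the two views suggest different suspects.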

No sampling in Splunk O11y Cloud

Splunk O11y Cloud ingests all logs, metrics, and distributed traces without sampling. As a result, the data set carries no sampling bias and anomalies are not overlooked, and you can review every event from the moment it first occurred to the present. This enables more efficient monitoring and root cause identification than traditional O11y and APM products.

No sampling + real-time performance for quick troubleshooting

Splunk O11y Cloud's real-time streaming architecture can stream tens of thousands of transactions per second, so ingested data can be analyzed and visualized within seconds rather than minutes. Combining unsampled data with stream processing lets you see all events in real time and quickly identify the cause.

In conclusion

As Splunk says, sampling is an approach that contradicts observability. Ignoring even a portion of your transaction data means losing full visibility into the state of your system, which could lead to serious business problems.

To avoid such problems and achieve a high level of observability, we hope you will consider using Splunk O11y Cloud.

Inquiry/Document request

Macnica Splunk team

Weekdays: 9:00-17:00