Observability of Distributed System

Pillars of Observability

Logs

Logs are structured messages or unstructured lines of text generated by a system when certain code runs. Logs are comprehensive records of events in a system. They provide details of a system event such as an error, why it occurred, where it occurred, and the specific time the event happened. Especially in a microservices environment, logs help to uncover the details of unknown faults or emergent behaviors exhibited,

By analyzing the details of log data, you can debug and troubleshoot where, why, and the time an error in the system occurred.

Metrics

Metrics are collective values represented as a measure of the aggregate performance of a system over a period. Unlike logs, metrics give a holistic view of the events and performance of a system over a period.

You can gather metrics such as system uptime, number of requests, response time, failure rate, memory, and other resource usages over time. DevOps engineers typically use metrics to trigger alerts or certain actions when the metric value goes above or below a specified threshold.

Metrics are also easy to correlate across multiple systems to observe trends in the performance and identify issues in the system.

Tracing

The third telemetry data that makes up the observability pillars, tracing, refers to tracking the root source of a fault, especially in distributed systems. Tracing records the journey of a request or action as it moves from one service to another in a microservice architecture. This enables professionals to identify the system bottlenecks and resolve issues faster.

Tracing is especially useful when debugging complex applications because it allows us to understand a request's journey from its starting point and identify which service a fault originated from in a microservice architecture. Even though the first two pillars, logs, and metrics, provide adequate information about the behavior and performance of a system, tracing enhances this information by providing helpful information about the lifecycle of requests in the system.

From these three data, you can estimate or determine the current state of an IT system without further investigation, which makes the system observable.

However, these pillars are only components of observability and are not actionable enough.

To implement observability in your system, the Twitter engineering team highlighted four actionable steps: collection, storage, query, and visualization.

The collection involves aggregating telemetry data, logs, metrics, and tracing, with their unique identifiers and timestamps from various endpoints in the system. After collecting data, you need to store it in a database responsible for filtering, validating, indexing, and reliably storing the data for future use. To use the data collected and stored in a database, you need to query relevant information from the storage system. "While collecting and storing the data is important, it is of no use to our engineers unless it is visualized in a way that can immediately tell a relevant story." Visualization is the last step where the stored data is queried and displayed in charts and dashboards for analysis purpose. By analyzing the visualized telemetry data, you can achieve observability in any system.

Monitoring vs Observability: What's the difference?

Monitoring is a prerequisite for observability. While monitoring deals with collecting data, observability collects, stores, queries, and visualizes these data to grant professionals an easy way of understanding the reasons behind every system's behaviour.

Monitoring gives you information about a problem or failure in your system, while observability lets you understand what caused the failure, where, and why it happened. A system that is not monitored is not observable.