


Monitoring
This section gives a high-level overview of the PrivateStorageio monitoring efforts.

Goals

Alerting
Something might break soon, so somebody needs to do something.
Comparing over time or experiment groups
Is our service slower than it was last week? Does database B answer queries faster than database A?
Analyzing long-term trends
How big is my database and how fast is it growing? How quickly is my daily-active user count growing?


Introduction to our dashboards
We have two groups of dashboards: Requests (external view, RED method) and Resources (internal view, USE method).
Services and their dependencies can be visualized as a tree from external-facing to internal systems.
We order our dashboards like a breadth-first-search of that tree.
This makes dependencies easier to understand and speeds up troubleshooting when a high-latency problem in a low-level service bubbles up.
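The breadth-first ordering described above can be sketched in a few lines. This is only an illustration: the service names and the `DEPENDENCIES` mapping are hypothetical and do not describe the actual PrivateStorageio deployment.

```python
from collections import deque

# Hypothetical dependency tree: each service maps to the services it calls.
DEPENDENCIES = {
    "frontend": ["payments", "storage"],
    "payments": ["database"],
    "storage": ["disk"],
    "database": [],
    "disk": [],
}


def dashboard_order(root, dependencies):
    """Return services in breadth-first order, starting at the external-facing root."""
    order = []
    seen = {root}
    queue = deque([root])
    while queue:
        service = queue.popleft()
        order.append(service)
        for child in dependencies.get(service, []):
            if child not in seen:  # a shared dependency is listed only once
                seen.add(child)
                queue.append(child)
    return order


print(dashboard_order("frontend", DEPENDENCIES))
# ['frontend', 'payments', 'storage', 'database', 'disk']
```

Reading dashboards in this order means every service is inspected before the services it depends on, so a latency spike can be followed downwards one level at a time.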

Meaning of our metrics
The "Monitoring Distributed Systems" chapter of Google's Site Reliability Engineering book explains and defines a useful set of metrics that it calls the Four Golden Signals:

Latency
Successful requests take time, but so do errors, so don't discard the latency of failed requests.
Traffic
What constitutes "Traffic" depends on the nature of your system.
Errors
The rate of failed requests.
Saturation
How "full" your service is. Take action (e.g. page a human) before the service degrades.
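One way to act on these four signals is to compare each of them against a threshold, with saturation checked against a lower warning level so that action is taken before the service degrades. The sketch below assumes that approach. The `GoldenSignals` type, the function name and all threshold values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class GoldenSignals:
    latency_seconds: float  # how long requests take, errors included
    traffic_rps: float      # requests per second (depends on the system)
    error_rate: float       # failed requests / all requests, 0.0 to 1.0
    saturation: float       # how "full" the service is, 0.0 to 1.0


# Hypothetical thresholds, for illustration only.
MAX_LATENCY_SECONDS = 1.0
MAX_TRAFFIC_RPS = 500.0
MAX_ERROR_RATE = 0.01
SATURATION_WARNING = 0.8  # deliberately below 1.0: act before degradation


def problematic_signals(signals):
    """Return the names of the signals that warrant paging a human."""
    reasons = []
    if signals.latency_seconds > MAX_LATENCY_SECONDS:
        reasons.append("latency")
    if signals.traffic_rps > MAX_TRAFFIC_RPS:
        reasons.append("traffic")
    if signals.error_rate > MAX_ERROR_RATE:
        reasons.append("errors")
    if signals.saturation >= SATURATION_WARNING:
        reasons.append("saturation")
    return reasons


current = GoldenSignals(latency_seconds=0.2, traffic_rps=120.0,
                        error_rate=0.002, saturation=0.85)
print(problematic_signals(current))  # ['saturation']
```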

"If you measure all four golden signals and page a human when one signal is problematic (or, in the case of saturation, nearly problematic), your service will be at least decently covered by monitoring."

RED method for services ("request-scoped", "external view")
Request rate, Errors, Duration (+ Saturation?)


Instrument everything that takes time and could fail


"In contrast to logging, services should instrument every meaningful number available for capture." (Peter Bourgon)


Plot 99th percentile, 50th percentile and average


The 50th percentile and the average should be close; if they are not, something is wrong
Averages sum neatly: a service's average latency should be the sum of its child services' average latencies
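The two points above, instrumenting anything that takes time and could fail and then looking at the 99th percentile, the 50th percentile and the average, can be sketched with the Python standard library alone. This is a minimal illustration, not the instrumentation PrivateStorageio actually uses. The `Metrics` class, `observe` and `summarize` are hypothetical names, and a real deployment would more likely export these values to a metrics system than keep them in a list.

```python
import statistics
import time
from contextlib import contextmanager


class Metrics:
    """Hypothetical in-memory recorder for request durations and errors."""

    def __init__(self):
        self.durations = []  # seconds, failed requests included
        self.errors = 0

    @contextmanager
    def observe(self):
        """Time the wrapped block and count it as an error if it raises."""
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.errors += 1
            raise
        finally:
            # Recorded on success and on failure: errors take time too.
            self.durations.append(time.perf_counter() - start)


def summarize(durations):
    """Return the 50th percentile, 99th percentile and mean (needs >= 2 samples)."""
    cuts = statistics.quantiles(durations, n=100, method="inclusive")
    return {
        "p50": statistics.median(durations),
        "p99": cuts[98],  # the 99th of the 99 cut points
        "mean": statistics.fmean(durations),
    }


metrics = Metrics()
with metrics.observe():
    time.sleep(0.01)  # stands in for real work

# Hypothetical sample with one slow outlier: the mean drifts away from p50.
print(summarize([0.1, 0.1, 0.1, 0.1, 5.0]))
```

In the sample data a single slow request pulls the mean well above the median, which is the kind of gap that signals something worth investigating.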


USE method for resources ("resource-scoped", "internal view")