Florian Sesser authored
Monitoring
This section gives a high-level overview of the PrivateStorageio monitoring efforts.
Goals
- Alerting
- Something might break soon, so somebody needs to do something.
- Comparing over time or experiment groups
- Is our service slower than it was last week? Does database B answer queries faster than database A?
- Analyzing long-term trends
- How big is my database and how fast is it growing? How quickly is my daily-active user count growing?
Introduction to our dashboards
We have two groups of dashboards: Requests (external view, RED method) and Resources (internal view, USE method).
Services and their dependencies can be visualized as a tree from external-facing to internal systems. We order our dashboards like a breadth-first search of that tree. This makes it easier to understand dependencies and faster to troubleshoot when a high-latency problem on a low-level service bubbles up.
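As an illustration of that ordering, here is a minimal sketch of a breadth-first traversal over a dependency tree. The service names and the ``DEPENDENCIES`` mapping are hypothetical and only stand in for a real deployment; the point is that external-facing services come first and their dependencies follow level by level.

```python
from collections import deque

# Hypothetical dependency tree: each service maps to the services it calls.
DEPENDENCIES = {
    "web-frontend": ["payments", "storage-api"],
    "payments": ["database"],
    "storage-api": ["storage-node"],
    "database": [],
    "storage-node": [],
}


def dashboard_order(root, dependencies):
    """Return services in breadth-first order, starting at the external-facing root."""
    order = []
    seen = {root}
    queue = deque([root])
    while queue:
        service = queue.popleft()
        order.append(service)
        for child in dependencies.get(service, []):
            # Skip services already queued via another parent.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


print(dashboard_order("web-frontend", DEPENDENCIES))
# ['web-frontend', 'payments', 'storage-api', 'database', 'storage-node']
```

Reading dashboards in this order means a latency spike seen on the first dashboard can be followed downward one level at a time until the service that causes it is found.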
Meaning of our metrics
The "Monitoring Distributed Systems" chapter of Google's Site Reliability Engineering book describes what they call the Four Golden Signals and has a great explanation and definition of useful metrics:
- Latency
- Requests take time, but so do errors, so don't discard failed requests when measuring latency.
- Traffic
- What constitutes "Traffic" depends on the nature of your system.
- Errors
- The rate of failed requests.
- Saturation
- How "full" your service is. Take action (e.g. page a human) before the service degrades.
"If you measure all four golden signals and page a human when one signal is problematic (or, in the case of saturation, nearly problematic), your service will be at least decently covered by monitoring."
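The following is a rough sketch of how the four signals could be derived from a window of request records and turned into a paging decision. The ``Request`` record, the function names and all thresholds are assumptions made for illustration and do not describe our actual alerting rules. Note that the saturation threshold sits below 100% so that it fires when the signal is only "nearly problematic".

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # how long the request took, in seconds
    ok: bool           # False if the request failed


def golden_signals(requests, window_s, in_flight, capacity):
    """Compute the four golden signals for one observation window."""
    total = len(requests)
    failed = sum(1 for r in requests if not r.ok)
    return {
        # Failed requests are included: errors take time too.
        "latency_avg_s": sum(r.duration_s for r in requests) / total if total else 0.0,
        "traffic_rps": total / window_s,
        "error_rate": failed / total if total else 0.0,
        "saturation": in_flight / capacity,
    }


def should_page(signals, max_latency_s=1.0, max_error_rate=0.05, max_saturation=0.8):
    """Return the names of the signals that exceed their (hypothetical) thresholds."""
    reasons = []
    if signals["latency_avg_s"] > max_latency_s:
        reasons.append("latency")
    if signals["error_rate"] > max_error_rate:
        reasons.append("errors")
    if signals["saturation"] > max_saturation:
        reasons.append("saturation")
    return reasons


window = [Request(0.2, True), Request(0.4, True), Request(1.5, False), Request(0.3, True)]
signals = golden_signals(window, window_s=2.0, in_flight=9, capacity=10)
print(should_page(signals))  # ['errors', 'saturation']
```

Traffic is reported but not alerted on here, since what counts as problematic traffic depends on the nature of the system.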
RED method for services ("request-scoped", "external view")
Request rate, Errors, Duration (+ Saturation?)
- Instrument everything that takes time and could fail
- "In contrast to logging, services should instrument every meaningful number available for capture." (Peter Bourgon)
- Plot 99th percentile, 50th percentile and average
- The 50th percentile and the average should be close; otherwise something is wrong.
- Averages sum neatly: the average latency of a service should be the sum of the average latencies of its child services.
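The two observations about percentiles and averages can be tried out with a small sketch. It uses the nearest-rank percentile definition (one of several common definitions) and made-up latency samples in milliseconds. The summing example additionally assumes that the parent service calls its children one after another and adds no latency of its own.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies):
    """Return the three values suggested for plotting."""
    return {
        "p99": percentile(latencies, 99),
        "p50": percentile(latencies, 50),
        "avg": mean(latencies),
    }


# Hypothetical samples: mostly fast requests plus one slow outlier.
latencies_ms = [10, 12, 11, 13, 12, 11, 10, 12, 11, 500]
summary = summarize(latencies_ms)
print(summary["p50"], summary["p99"])  # 11 500
# The average is pulled far above the 50th percentile by the single outlier.
print(summary["avg"] > 5 * summary["p50"])  # True

# Hypothetical per-request latencies of two child services called in sequence.
child_a_ms = [5, 6, 7]
child_b_ms = [10, 10, 13]
parent_ms = [a + b for a, b in zip(child_a_ms, child_b_ms)]
# Averages add up across children; percentiles in general do not.
print(mean(parent_ms) == mean(child_a_ms) + mean(child_b_ms))  # True
```

A large gap between the average and the 50th percentile, as in the first sample set, is the kind of discrepancy that warrants a closer look.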