    Monitoring

    This section gives a high-level overview of the PrivateStorageio monitoring efforts.

    Goals

    Alerting
    Something might break soon, so somebody needs to do something.
    Comparing over time or experiment groups
    Is our service slower than it was last week? Does database B answer queries faster than database A?
    Analyzing long-term trends
    How big is my database and how fast is it growing? How quickly is my daily-active user count growing?

    Introduction to our dashboards

    We have two groups of dashboards: Requests (external view, RED method) and Resources (internal view, USE method).

    Services and their dependencies can be visualized as a tree from external-facing to internal systems. We order our dashboards like a breadth-first search of that tree. This makes dependencies easier to understand and speeds up troubleshooting when a high-latency problem in a low-level service bubbles up.
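As a rough illustration of that ordering, the sketch below walks a dependency tree breadth-first. The service names and the `DEPENDENCIES` mapping are hypothetical and do not describe the actual PrivateStorageio deployment.

```python
from collections import deque

# Hypothetical dependency tree: each service maps to the services it calls.
DEPENDENCIES = {
    "web-frontend": ["payment-api", "storage-api"],
    "payment-api": ["database"],
    "storage-api": ["disk"],
    "database": [],
    "disk": [],
}


def dashboard_order(root, dependencies):
    """Return services in breadth-first order, starting at the external-facing root."""
    order = []
    seen = {root}
    queue = deque([root])
    while queue:
        service = queue.popleft()
        order.append(service)
        for child in dependencies.get(service, []):
            if child not in seen:  # list a shared dependency only once
                seen.add(child)
                queue.append(child)
    return order


print(dashboard_order("web-frontend", DEPENDENCIES))
# ['web-frontend', 'payment-api', 'storage-api', 'database', 'disk']
```

External-facing services come first and the lowest-level resources last, so reading the dashboards top to bottom follows a request from the outside in.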

    Meaning of our metrics

    The "Monitoring Distributed Systems" chapter of Google's Site Reliability Engineering book defines what it calls the Four Golden Signals and gives a great explanation of useful metrics:

    Latency
    The time it takes to serve a request. Failed requests take time too, so don't discard their latency.
    Traffic
    What constitutes "Traffic" depends on the nature of your system.
    Errors
    The rate of failed requests.
    Saturation
    How "full" your service is. Take action (e.g. page a human) before the service degrades.

    "If you measure all four golden signals and page a human when one signal is problematic (or, in the case of saturation, nearly problematic), your service will be at least decently covered by monitoring."
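To make the four signals concrete, here is a minimal sketch that derives them from a window of request records and reports which ones would warrant a page. The `Request` record, the `golden_signals` and `signals_to_page_on` helpers, and all thresholds are assumptions made for illustration, not part of our monitoring stack.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # time spent serving the request, successful or not
    ok: bool           # False for failed requests


def golden_signals(requests, window_s, in_flight, capacity):
    """Compute the four golden signals for one observation window."""
    total = len(requests)
    failed = sum(1 for r in requests if not r.ok)
    return {
        # Failed requests are included in the latency on purpose.
        "latency_avg_s": sum(r.duration_s for r in requests) / total if total else 0.0,
        "traffic_rps": total / window_s,
        "error_rate": failed / total if total else 0.0,
        "saturation": in_flight / capacity,
    }


def signals_to_page_on(signals, latency_limit_s=1.0, error_limit=0.05,
                       saturation_limit=0.8):
    """Return the names of problematic signals (hypothetical thresholds).

    Saturation alerts at 80% so that a human is paged before the service is full.
    """
    problems = []
    if signals["latency_avg_s"] > latency_limit_s:
        problems.append("latency")
    if signals["error_rate"] > error_limit:
        problems.append("errors")
    if signals["saturation"] >= saturation_limit:
        problems.append("saturation")
    return problems


window = [Request(0.1, True), Request(0.3, True), Request(2.0, False), Request(0.2, True)]
signals = golden_signals(window, window_s=2.0, in_flight=9, capacity=10)
print(signals_to_page_on(signals))  # ['errors', 'saturation']
```

Traffic carries no threshold here because what counts as abnormal traffic depends on the nature of the system.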

    RED method for services ("request-scoped", "external view")

    Request rate, Errors, Duration (+ Saturation?)

    • Instrument everything that takes time and could fail
      • "In contrast to logging, services should instrument every meaningful number available for capture." (Peter Bourgon)
    • Plot the 99th percentile, the 50th percentile and the average
      • The 50th percentile and the average should be close; if they are not, something is wrong
      • Averages sum neatly: the average latency of a service should be the sum of the average latencies of its child services
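The percentile advice above can be sketched in a few lines. This example uses a simple nearest-rank percentile; the 20% tolerance in `median_and_average_diverge` and the child service names are hypothetical, and the "averages sum" part assumes the parent calls its children one after another with no overhead of its own.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(durations):
    """The three series worth plotting for a request duration."""
    return {
        "p99": percentile(durations, 99),
        "p50": percentile(durations, 50),
        "avg": mean(durations),
    }


def median_and_average_diverge(summary, tolerance=0.2):
    """Flag when the average is more than `tolerance` (relative) away from the median."""
    return abs(summary["avg"] - summary["p50"]) > tolerance * summary["p50"]


# 98 fast requests, one slower one and one extreme outlier (seconds).
durations = [0.1] * 98 + [0.2, 5.0]
summary = summarize(durations)
print(summary["p50"], summary["p99"])       # 0.1 0.2
print(median_and_average_diverge(summary))  # True: the outlier pulls the average up

# Averages sum neatly: per-request parent latency is the sum of its children.
auth = [0.1, 0.2, 0.3]       # hypothetical child service
storage = [0.4, 0.4, 1.0]    # hypothetical child service
parent = [a + s for a, s in zip(auth, storage)]
print(math.isclose(mean(parent), mean(auth) + mean(storage)))  # True
```

Percentiles do not add up this way, which is one reason to plot the average alongside them.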

    USE method for resources ("resource-scoped", "internal view")