Commit 9aa7ac7a authored by Florian Sesser
Add forgotten files

parent af50fe86
1 merge request: !58 Docs: Ops: Add preliminary monitoring documentation
<svg version="1.1" baseProfile="full" width="213.5" height="294" viewBox="0 0 213.5 294" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:ev="http://www.w3.org/2001/xml-events" style="font:bold 12pt Helvetica, Helvetica, sans-serif;;stroke-linejoin:round;stroke-linecap:round">
<title >DAG of services</title>
<desc >#title: DAG of services
[&lt;actor&gt; start]-&gt;[1]
[1]--&gt;[2]
[1]--&gt;[3]
[1]--&gt;[4]
[4]--&gt;[5]
[4]--&gt;[6]
[4]--&gt;[7]</desc>
<rect x="0" y="0" height="294" width="213.5" style="stroke:none; fill:transparent;"></rect>
<path d="M58.5 68.5 L58.5 88.5 L58.5 108.5 L58.5 108.5 " style="stroke:#33322E;fill:none;stroke-dasharray:none;stroke-width:3;"></path>
<path d="M53.2 95.2 L58.5 101.8 L63.8 95.2 L58.5 108.5 Z" style="stroke:#33322E;fill:#33322E;stroke-dasharray:none;stroke-width:3;"></path>
<path d="M46 137.7 L26 159.5 L26 179.5 L26 179.5 " style="stroke:#33322E;fill:none;stroke-dasharray:6 6;stroke-width:3;"></path>
<path d="M20.7 166.2 L26 172.8 L31.3 166.2 L26 179.5 Z" style="stroke:#33322E;fill:#33322E;stroke-dasharray:none;stroke-width:3;"></path>
<path d="M71 137.7 L91 159.5 L91 179.5 L91 179.5 " style="stroke:#33322E;fill:none;stroke-dasharray:6 6;stroke-width:3;"></path>
<path d="M85.7 166.2 L91 172.8 L96.3 166.2 L91 179.5 Z" style="stroke:#33322E;fill:#33322E;stroke-dasharray:none;stroke-width:3;"></path>
<path d="M71 128.6 L156 159.5 L156 179.5 L156 179.5 " style="stroke:#33322E;fill:none;stroke-dasharray:6 6;stroke-width:3;"></path>
<path d="M150.7 166.2 L156 172.8 L161.3 166.2 L156 179.5 Z" style="stroke:#33322E;fill:#33322E;stroke-dasharray:none;stroke-width:3;"></path>
<path d="M143.5 199.6 L58.5 230.5 L58.5 250.5 L58.5 250.5 " style="stroke:#33322E;fill:none;stroke-dasharray:6 6;stroke-width:3;"></path>
<path d="M53.2 237.2 L58.5 243.8 L63.8 237.2 L58.5 250.5 Z" style="stroke:#33322E;fill:#33322E;stroke-dasharray:none;stroke-width:3;"></path>
<path d="M143.5 208.7 L123.5 230.5 L123.5 250.5 L123.5 250.5 " style="stroke:#33322E;fill:none;stroke-dasharray:6 6;stroke-width:3;"></path>
<path d="M118.2 237.2 L123.5 243.8 L128.8 237.2 L123.5 250.5 Z" style="stroke:#33322E;fill:#33322E;stroke-dasharray:none;stroke-width:3;"></path>
<path d="M168.5 208.7 L188.5 230.5 L188.5 250.5 L188.5 250.5 " style="stroke:#33322E;fill:none;stroke-dasharray:6 6;stroke-width:3;"></path>
<path d="M183.2 237.2 L188.5 243.8 L193.8 237.2 L188.5 250.5 Z" style="stroke:#33322E;fill:#33322E;stroke-dasharray:none;stroke-width:3;"></path>
<circle r="4" cx="58.5" cy="25.5" data-name="start" style="stroke:#33322E;fill:#eee8d5;stroke-dasharray:none;stroke-width:3;"></circle>
<path d="M58.5 29.5 L58.5 37.5" data-name="start" style="stroke:#33322E;fill:none;stroke-dasharray:none;stroke-width:3;"></path>
<path d="M54.5 33.5 L62.5 33.5" data-name="start" style="stroke:#33322E;fill:none;stroke-dasharray:none;stroke-width:3;"></path>
<path d="M54.5 41.5 L58.5 37.5 L62.5 41.5" data-name="start" style="stroke:#33322E;fill:none;stroke-dasharray:none;stroke-width:3;"></path>
<text x="59" y="59" style="fill: #33322E;font:normal 12pt Helvetica, Helvetica, sans-serif;text-anchor: middle;" data-name="start">start</text>
<rect x="46.5" y="108.5" height="31" width="25" data-name="1" style="stroke:#33322E;fill:#eee8d5;stroke-dasharray:none;stroke-width:3;"></rect>
<text x="59" y="130" style="fill: #33322E;font:bold 12pt Helvetica, Helvetica, sans-serif;text-anchor: middle;" data-name="1">1</text>
<rect x="13.5" y="179.5" height="31" width="25" data-name="2" style="stroke:#33322E;fill:#eee8d5;stroke-dasharray:none;stroke-width:3;"></rect>
<text x="26" y="201" style="fill: #33322E;font:bold 12pt Helvetica, Helvetica, sans-serif;text-anchor: middle;" data-name="2">2</text>
<rect x="78.5" y="179.5" height="31" width="25" data-name="3" style="stroke:#33322E;fill:#eee8d5;stroke-dasharray:none;stroke-width:3;"></rect>
<text x="91" y="201" style="fill: #33322E;font:bold 12pt Helvetica, Helvetica, sans-serif;text-anchor: middle;" data-name="3">3</text>
<rect x="143.5" y="179.5" height="31" width="25" data-name="4" style="stroke:#33322E;fill:#eee8d5;stroke-dasharray:none;stroke-width:3;"></rect>
<text x="156" y="201" style="fill: #33322E;font:bold 12pt Helvetica, Helvetica, sans-serif;text-anchor: middle;" data-name="4">4</text>
<rect x="46.5" y="250.5" height="31" width="25" data-name="5" style="stroke:#33322E;fill:#eee8d5;stroke-dasharray:none;stroke-width:3;"></rect>
<text x="59" y="272" style="fill: #33322E;font:bold 12pt Helvetica, Helvetica, sans-serif;text-anchor: middle;" data-name="5">5</text>
<rect x="111.5" y="250.5" height="31" width="25" data-name="6" style="stroke:#33322E;fill:#eee8d5;stroke-dasharray:none;stroke-width:3;"></rect>
<text x="124" y="272" style="fill: #33322E;font:bold 12pt Helvetica, Helvetica, sans-serif;text-anchor: middle;" data-name="6">6</text>
<rect x="176.5" y="250.5" height="31" width="25" data-name="7" style="stroke:#33322E;fill:#eee8d5;stroke-dasharray:none;stroke-width:3;"></rect>
<text x="189" y="272" style="fill: #33322E;font:bold 12pt Helvetica, Helvetica, sans-serif;text-anchor: middle;" data-name="7">7</text>
</svg>
\ No newline at end of file
Monitoring
==========
This section gives a high-level overview of the PrivateStorageIO monitoring efforts.
Goals
`````
Alerting
  Something might break soon, so somebody needs to do something.
Comparing over time or experiment groups
  Is our service slower than it was last week? Does database B answer queries faster than database A?
Analyzing long-term trends
  How big is my database and how fast is it growing? How quickly is my daily-active user count growing?
Introduction to our dashboards
``````````````````````````````
We have two groups of dashboards: Requests (external view, RED method) and Resources (internal view, USE method).
Services and their dependencies can be visualized as a tree from external-facing to internal systems.

.. figure:: DAG-of-services.svg
   :width: 200px
   :align: center
   :alt: DAG of services, created using https://www.nomnoml.com/
   :figclass: align-center

   DAG of services

We order our dashboards like a breadth-first search of that tree.
This makes it easier to understand dependencies and faster to troubleshoot when a high-latency problem on a low-level service bubbles up.
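
As a rough illustration of that ordering, here is a minimal Python sketch that walks the example graph from the figure breadth-first. The node names are the placeholders "1" to "7" from the figure, and ``dashboard_order`` is a hypothetical helper, not part of any existing tooling.

```python
from collections import deque

# Edges copied from the figure: service 1 depends on 2, 3 and 4;
# service 4 depends on 5, 6 and 7.
DEPENDENCIES = {
    "1": ["2", "3", "4"],
    "4": ["5", "6", "7"],
}


def dashboard_order(root, dependencies):
    """Return services in breadth-first order, starting at the external-facing root."""
    order = []
    seen = {root}
    queue = deque([root])
    while queue:
        service = queue.popleft()
        order.append(service)
        for child in dependencies.get(service, []):
            if child not in seen:  # a DAG may reach a node via several paths
                seen.add(child)
                queue.append(child)
    return order


print(dashboard_order("1", DEPENDENCIES))
```

With this ordering, every external-facing service appears before the services it depends on, so reading the dashboards from first to last follows a request inward.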
Meaning of our metrics
``````````````````````
The *Monitoring Distributed Systems* chapter of Google's SRE book introduces what it calls the `Four Golden Signals <https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals>`_ and gives a great explanation and definition of useful metrics:
Latency
  Successful requests take time, but so do errors, so don't discard the latency of failed requests.
Traffic
  What constitutes "Traffic" depends on the nature of your system.
Errors
  (The *rate* of) failed requests.
Saturation
  How "full" your service is. Take action (e.g. page a human) before service degrades.
"If you measure all four golden signals and page a human when one signal is problematic (or, in the case of saturation, nearly problematic), your service will be at least decently covered by monitoring."
RED method for services ("request-scoped", "external view")
```````````````````````````````````````````````````````````
Request rate, Errors, Duration (+ Saturation?)
* Instrument everything that takes time and could fail
* "In contrast to logging, services should instrument every meaningful number available for capture." (Peter Bourgon)
* Plot 99th percentile, 50th percentile and average
* The 50th percentile and the average should be close; if they are not, something is wrong
* Averages sum neatly: the average latency of a service should be the sum of the average latencies of its child services
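
The RED numbers above could be derived from raw request samples roughly as follows. This is only a sketch using Python's standard library; the ``(duration_seconds, succeeded)`` sample format, the ``red_summary`` name and the example values are assumptions made for illustration, and a real setup would more likely compute these in the metrics backend.

```python
import statistics


def red_summary(requests, window_seconds):
    """Summarise (duration_seconds, succeeded) samples from one time window.

    Needs at least two samples, because statistics.quantiles requires that.
    """
    durations = [duration for duration, _ in requests]
    failures = sum(1 for _, succeeded in requests if not succeeded)
    # quantiles(n=100) returns the 99 cut points for percentiles 1 to 99.
    percentiles = statistics.quantiles(durations, n=100, method="inclusive")
    return {
        "rate_per_second": len(requests) / window_seconds,
        "error_ratio": failures / len(requests),
        "p50": percentiles[49],
        "p99": percentiles[98],
        "mean": statistics.fmean(durations),
    }


# Hypothetical window: 98 fast successful requests and 2 slow failures in 10 s.
samples = [(0.1, True)] * 98 + [(2.0, False)] * 2
print(red_summary(samples, window_seconds=10))
```

In this made-up window the two slow failures pull the mean away from the 50th percentile, which is the kind of gap the bullet above suggests watching for.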
USE method for resources ("resource-scoped", "internal view")
`````````````````````````````````````````````````````````````
Utilization, Saturation, Errors:
* CPUs

  * Saturation (Per core? So we catch maxed out single cores? Would that work?)

* Memory

  * Capacity

* Network
* Disks

  * Capacity
  * I/O

* Interconnects (Really? Hard to measure & shouldn't be the problem most of the time...)
* Software resources

  * File descriptors?
  * Mutex locks?
  * Process / thread capacity?
  * Thread pools?
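
For a feel of what two of these checks can look like, here is a small Python sketch using only the standard library. It treats the one-minute load average divided by the core count as a rough proxy for CPU saturation and the used fraction of a filesystem as disk capacity utilization. Both function names are hypothetical, and the load-average proxy is an assumption of this sketch; it does not answer the per-core question raised above.

```python
import os
import shutil


def cpu_saturation(load_average, cpu_count):
    """Load per core; values above 1.0 suggest that work is queueing."""
    return load_average / cpu_count


def disk_utilization(path="/"):
    """Fraction of the filesystem at `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


if hasattr(os, "getloadavg"):  # not available on every platform
    try:
        one_minute_load, _, _ = os.getloadavg()
        print("cpu saturation:", cpu_saturation(one_minute_load, os.cpu_count() or 1))
    except OSError:
        print("load average unavailable")
print("disk utilization:", disk_utilization("/"))
```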
Logging
```````
Peter Bourgon has a lot of wise things to say about logging in `his brilliant article Logging v. instrumentation <https://peter.bourgon.org/blog/2016/02/07/logging-v-instrumentation.html#:~:text=Instrumentation%20is%20for%20all%20remaining,meaningful%20number%20available%20for%20capture.>`_.
* "[S]ervices should only log actionable information. That includes serious, panic-level errors that need to be consumed by humans, or structured data that needs to be consumed by machines."
* "Logs read by humans should be sparse, ideally silent if nothing is going wrong. Logs read by machines should be well-defined, ideally with a versioned schema."
* "A (service) never concerns itself with routing or storage of its output stream. It should not attempt to write to or manage logfiles. Instead, each running process writes its event stream, unbuffered, to stdout."
* "Finally, understand that logging is expensive." and "Resist the urge to log any information that doesn’t meet the above criteria. As a concrete example, logging each incoming HTTP request is almost certainly a mistake."
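
A minimal sketch of what machine-readable logging along these lines might look like in Python: one JSON object per line, carrying a schema version, written to stdout and flushed immediately. The field names, the ``SCHEMA_VERSION`` value and the example event are all hypothetical.

```python
import json
import sys
import time

SCHEMA_VERSION = 1  # hypothetical; bump whenever the record layout changes


def format_event(event, **fields):
    """Serialise one event as a single JSON line with a versioned schema."""
    record = {
        "schema_version": SCHEMA_VERSION,
        "timestamp": time.time(),
        "event": event,
        **fields,
    }
    return json.dumps(record, sort_keys=True)


def emit(event, **fields):
    """Write the event to stdout and flush, leaving routing to the environment."""
    print(format_event(event, **fields), file=sys.stdout, flush=True)


# Hypothetical actionable event; routine requests are deliberately not logged.
emit("payment_failed", reason="upstream timeout")
```

The process never opens or rotates a logfile itself; whatever supervises it decides where the stream goes.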
Alerts
``````
Nobody likes being alerted needlessly.
Don't give *Alert Fatigue* a chance!
In Google's `Monitoring Distributed Systems <https://sre.google/sre-book/monitoring-distributed-systems/#tying-these-principles-together-nqsJfw>`_, Rob Ewaschuk offers "a good starting point for writing or reviewing a new alert":
- Only alert on actionable and urgent events that negatively affect users consistently and that cannot wait & cannot be automated.
- Page one person at a time.
- Only page on novel problems.
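
When reviewing a proposed alert, the checklist above can be read as a simple conjunction. The following Python sketch encodes it that way; the ``Condition`` fields and the ``should_page`` helper are hypothetical names chosen for illustration, and deciding each flag remains a human judgment.

```python
from dataclasses import dataclass


@dataclass
class Condition:
    actionable: bool        # a human can do something about it
    urgent: bool            # it cannot wait
    affects_users: bool     # users are consistently and negatively affected
    can_be_automated: bool  # the response could be scripted instead
    novel: bool             # not a known, already-handled problem


def should_page(condition):
    """Page only when every criterion from the checklist holds."""
    return (
        condition.actionable
        and condition.urgent
        and condition.affects_users
        and not condition.can_be_automated
        and condition.novel
    )


print(should_page(Condition(True, True, True, False, True)))
```

Anything that fails one of the checks is a candidate for a ticket, a dashboard or automation rather than a page.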
See also
````````
This methodology was inspired by (inter alia)
* `Brendan Gregg: The Utilization Saturation and Errors (USE) Method. 2007. <http://www.brendangregg.com/usemethod.html>`_
* `Rob Ewaschuk, Betsy Beyer: Monitoring Distributed Systems. 2017. <https://sre.google/sre-book/monitoring-distributed-systems/>`_ The Four Golden Signals chapter of Google's SRE book.
* `Tom Wilkie (Kausal): The RED method. How To Instrument Your Services. Feb 2018. <https://www.youtube.com/watch?v=9dRSYjBPaZM>`_
* `Steve Mushero: How to Monitor the SRE Golden Signals. Nov 10, 2017. <https://steve-mushero.medium.com/linuxs-sre-golden-signals-af5aaa26ebae>`_
* `Cindy Sridharan: Logs and Metrics. Apr 30, 2017. <https://copyconstruct.medium.com/logs-and-metrics-6d34d3026e38>`_
* `Peter Bourgon: Logging v. instrumentation. 2016 02 07. <https://peter.bourgon.org/blog/2016/02/07/logging-v-instrumentation.html#:~:text=Instrumentation%20is%20for%20all%20remaining,meaningful%20number%20available%20for%20capture.>`_ What not to log.