diff --git a/docs/source/index.rst b/docs/source/index.rst
index aa38cfd14a28a782f219d95847ac099c9e6e8fb5..6ae149f7ce9845d877c953f3e095edba02455765 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -16,9 +16,16 @@ Howdy! We separated the documentation into parts addressing different audiences
     Developers <dev/README>
 
+Naming
+------
+
+The name of this ops software project is "PrivateStorageio".
+There is another ops project that deals with the surrounding/supporting infrastructure called "PrivateStorageOps".
+The domain name used for the website and other network services associated with this deployment is "privatestorage.io" (but may change to "private.storage" at some point).
+
 Indices and tables
-==================
+------------------
 
 * :ref:`genindex`
 * :ref:`modindex`
diff --git a/docs/source/ops/README.rst b/docs/source/ops/README.rst
index 22b53e6590d9564a4635c03dc3cd6fd8b982c5cc..8007d8dfbb8a42d27aed4fb014423b7f91742252 100644
--- a/docs/source/ops/README.rst
+++ b/docs/source/ops/README.rst
@@ -1,9 +1,11 @@
-Adminstrator documentation
-==========================
+Administrator documentation
+###########################
 
-This contains documentation regarding running PrivateStorageIo.
+This contains documentation regarding running PrivateStorageio.
 
 .. include::
     ../../../morph/README.rst
-    :start-line: 9
+
+.. include::
+    monitoring.rst
diff --git a/docs/source/ops/monitoring.rst b/docs/source/ops/monitoring.rst
new file mode 100644
index 0000000000000000000000000000000000000000..e30831ade4ef71a0abd101e383065f706f586b63
--- /dev/null
+++ b/docs/source/ops/monitoring.rst
@@ -0,0 +1,129 @@
+Monitoring
+==========
+
+This section gives a high-level overview of the PrivateStorageio monitoring efforts.
+
+
+Goals
+`````
+
+Alerting
+    Something might break soon, so somebody needs to do something.
+
+Comparing over time or experiment groups
+    Is our service slower than it was last week? Does database B answer queries faster than database A?
+
+Analyzing long-term trends
+    How big is my database and how fast is it growing? How quickly is my daily-active user count growing?
+
+
+Introduction to our dashboards
+``````````````````````````````
+
+We have two groups of dashboards: Requests (external view, RED method) and Resources (internal view, USE method).
+
+Resources like CPU and memory exist independently of one another (at least in theory) and their corresponding dashboards are listed in arbitrary order.
+
+Services, on the other hand, often directly depend on other services:
+A request might cause sub-requests, which in turn might call other services.
+These dependencies can be visualized as a DAG (directed acyclic graph: like a tree, except that a node may have more than one parent) from external-facing to internal systems.
+
+When a service fails and triggers an Alert, the services that depend on it often fail and trigger Alerts as well.
+This can cause confusion and cost valuable time, especially when the on-call staff is not familiar with the inner workings of a particular piece of machinery.
+
+To mitigate this problem, we order our dashboards to resemble these dependencies according to a `breadth-first search <https://en.wikipedia.org/wiki/Breadth-first_search>`_ of the service dependency DAG:
+
+.. graphviz:: service-dag-to-dashboard-order.dot
+    :caption: DAG of services to resulting order of corresponding dashboards
+
+This makes finding the first failing link, and thus the cause of the problem, quicker:
+Problems in a failing service low in the DAG bubble "upwards".
+Therefore, the "lowest" dashboard that indicates a problem has a high probability of highlighting the origin of the cascading failures.
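The dashboard ordering described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the PrivateStorageio tooling: the function name ``dashboard_order`` and the ``dependencies`` mapping are hypothetical, and the node numbers mirror the example graph in ``service-dag-to-dashboard-order.dot``.

```python
from collections import deque


def dashboard_order(dependencies, root):
    """Return services in breadth-first order, starting at the external-facing root.

    `dependencies` maps each service to the services it calls directly
    (a hypothetical input format, chosen for this sketch only).
    """
    order = []
    seen = {root}
    queue = deque([root])
    while queue:
        service = queue.popleft()
        order.append(service)
        for child in dependencies.get(service, []):
            # A DAG node may be reachable through several parents; list it once.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


# Example graph from service-dag-to-dashboard-order.dot: 1->2, 1->3, 3->4, 3->5
example_dag = {1: [2, 3], 3: [4, 5]}
print(dashboard_order(example_dag, root=1))  # prints [1, 2, 3, 4, 5]
```

During a cascade of Alerts, one would read the resulting list from the end: the last dashboard in this order that shows a problem is the most likely origin.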
+
+
+Meaning of our metrics
+``````````````````````
+
+The *Monitoring Distributed Systems* chapter of Google's SRE book gives a great explanation and definition of useful metrics, which it calls the `Four Golden Signals <https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals>`_:
+
+Latency
+    Successful requests take time, but so do errors, so don't discard failed requests when measuring latency.
+
+Traffic
+    What constitutes "Traffic" depends on the nature of your system.
+
+Errors
+    (The *rate* of) failed requests.
+
+Saturation
+    How "full" your service is. Take action (i.e. page a human) before the service degrades.
+
+"If you measure all four golden signals and page a human when one signal is problematic (or, in the case of saturation, nearly problematic), your service will be at least decently covered by monitoring."
+
+
+RED method for services ("request-scoped", "external view")
+```````````````````````````````````````````````````````````
+
+Request rate, Errors, Duration (+ Saturation?)
+
+* Instrument everything that takes time and could fail
+    * "In contrast to logging, services should instrument every meaningful number available for capture." (Peter Bourgon)
+
+* Plot 99th percentile, 50th percentile and average
+    * The 50th percentile and the average should be close - otherwise something is wrong
+    * Averages sum neatly - a service's average latency should be the sum of its child services' average latencies
+
+
+USE method for resources ("resource-scoped", "internal view")
+`````````````````````````````````````````````````````````````
+
+Utilization, Saturation, Errors:
+
+* CPU saturation (Idea: max saturation value per machine, since our load is mostly single-core)
+* Memory saturation
+* Network saturation
+* Disks
+    * Storage capacity
+    * I/O saturation
+
+* Software resources
+    * File descriptors
+
+
+Logging
+```````
+
+Peter Bourgon has a lot of wise things to say about logging in `his brilliant article Logging v. instrumentation <https://peter.bourgon.org/blog/2016/02/07/logging-v-instrumentation.html#:~:text=Instrumentation%20is%20for%20all%20remaining,meaningful%20number%20available%20for%20capture.>`_.
+
+* "[S]ervices should only log actionable information. That includes serious, panic-level errors that need to be consumed by humans, or structured data that needs to be consumed by machines."
+* "Logs read by humans should be sparse, ideally silent if nothing is going wrong. Logs read by machines should be well-defined, ideally with a versioned schema."
+* "A (service) never concerns itself with routing or storage of its output stream. It should not attempt to write to or manage logfiles. Instead, each running process writes its event stream, unbuffered, to stdout."
+* "Finally, understand that logging is expensive." and "Resist the urge to log any information that doesn’t meet the above criteria. As a concrete example, logging each incoming HTTP request is almost certainly a mistake."
+
+
+Alerts
+``````
+
+Nobody likes being alerted needlessly.
+Don't give *Alert Fatigue* a chance!
+
+Rob Ewaschuk gives some great advice in Google's `Monitoring Distributed Systems <https://sre.google/sre-book/monitoring-distributed-systems/#tying-these-principles-together-nqsJfw>`_, offered as "a good starting point for writing or reviewing a new alert":
+
+- Only alert on actionable, urgent events that consistently and negatively affect users, cannot wait, and cannot be automated.
+- Page one person at a time.
+- Only page on novel problems.
+
+
+See also
+````````
+
+This methodology was inspired by, among others:
+
+* `Brendan Gregg: The Utilization Saturation and Errors (USE) Method. 2007. <http://www.brendangregg.com/usemethod.html>`_
+* `Rob Ewaschuk, Betsy Beyer: Monitoring Distributed Systems. 2017. <https://sre.google/sre-book/monitoring-distributed-systems/>`_ The Four Golden Signals, from Google's SRE book.
+* `Tom Wilkie (Kausal): The RED method. How To Instrument Your Services. Feb 2018. <https://www.youtube.com/watch?v=9dRSYjBPaZM>`_
+* `Steve Mushero: How to Monitor the SRE Golden Signals. Nov 10, 2017. <https://steve-mushero.medium.com/linuxs-sre-golden-signals-af5aaa26ebae>`_
+
+* `Cindy Sridharan: Logs and Metrics. Apr 30, 2017. <https://copyconstruct.medium.com/logs-and-metrics-6d34d3026e38>`_
+* `Peter Bourgon: Logging v. instrumentation. 2016 02 07. <https://peter.bourgon.org/blog/2016/02/07/logging-v-instrumentation.html#:~:text=Instrumentation%20is%20for%20all%20remaining,meaningful%20number%20available%20for%20capture.>`_ What not to log.
+
diff --git a/docs/source/ops/service-dag-to-dashboard-order.dot b/docs/source/ops/service-dag-to-dashboard-order.dot
new file mode 100644
index 0000000000000000000000000000000000000000..e957a8aaa7d5da2ee4136e755eec547f227b9f7f
--- /dev/null
+++ b/docs/source/ops/service-dag-to-dashboard-order.dot
@@ -0,0 +1,29 @@
+digraph {
+    subgraph cluster01 {
+        label = "DAG of service dependencies";
+
+        1->2;
+        1->3;
+
+        3->4;
+        3->5;
+    }
+
+    subgraph cluster02 {
+        label = "Resulting order of dashboards";
+        node [ shape = box ];
+        edge [ style = invis ];
+
+        d1 [ label = 1 ];
+        d2 [ label = 2 ];
+        d3 [ label = 3 ];
+        d4 [ label = 4 ];
+        d5 [ label = 5 ];
+
+        d1->d2;
+        d2->d3;
+        d3->d4;
+        d4->d5;
+    }
+}
+