Production outage: tahoe service down on four of five storage servers, 2022-09-29

What happened?

Four of our five current production storage servers were down. @chris noticed this and posted in the Slack monitoring channel at 2022-09-29 01:59 UTC:

It looks like 4 of the production servers are currently down. Is this known/expected?
(Screenshot)

It took me more than 12 hours to even notice this (cf. my Slack message at 13:09 UTC):

Yeah, seems like we deployed my last recent breaking change and not yet the fix

So this is two issues:

Our production storage servers stopped being available (bad MTTF).
We didn't notice early, so the fix took over 12 hours (bad MTTR).

Why did this happen?

We deployed broken software.
We didn't notice that for way too long.

What did we do about it?

After a video call with @jcalderone, he and I decided that the best way forward was to let CI deploy the latest development branch, which already contained a fix for my breaking change and nothing else. I did this by merging develop into production in !361 (merged) at 13:28 UTC and letting CI deploy production.

What do we do about it?

(WIP)

The current regression breaks tahoe, but none of the monitoring:
- We need to monitor storage server availability / have application level monitoring (privatestorageops#272)
- Don't accept stale data from textfile collector (!365 (merged))
  - Alert when we receive stale data https://github.com/prometheus/node_exporter/issues/713
  - https://github.com/prometheus-community/node-exporter-textfile-collector-scripts
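To illustrate the stale-data item above, here is a minimal Python sketch of a freshness check for textfile collector output. It is not what !365 implements. It assumes the collector scripts write `*.prom` files into a single directory; the function name and the threshold are hypothetical. node_exporter also exports file modification times as `node_textfile_mtime_seconds`, so the same comparison could likely be written as an alert rule instead.

```python
import time
from pathlib import Path


def stale_textfiles(directory, max_age_seconds, now=None):
    """Return the sorted names of *.prom files older than max_age_seconds.

    `directory` is assumed to be the textfile collector directory;
    `now` can be injected to make the check testable.
    """
    now = time.time() if now is None else now
    stale = []
    for path in Path(directory).glob("*.prom"):
        # Age is measured from the file's last modification time.
        age = now - path.stat().st_mtime
        if age > max_age_seconds:
            stale.append(path.name)
    return sorted(stale)
```

Calling `stale_textfiles(some_directory, 600)` would list every metrics file not rewritten in the last ten minutes. A non-empty result means the numbers on the dashboard may no longer reflect the actual state of the server.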
For the generic case: Can we roll back?
- Maybe even automatically?
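
One possible shape for an automatic rollback decision is a post-deploy probe of every storage server, sketched below in Python. The host names, ports, and the `max_unreachable` threshold are hypothetical, and the rollback action itself is deliberately left out. A successful TCP connect only shows that the port accepts connections, not that tahoe is serving requests correctly, so a real check would need to go further.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def should_roll_back(servers, max_unreachable=0, probe=is_reachable):
    """Probe each (host, port) pair after a deploy.

    Returns (decision, down): decision is True when more than
    max_unreachable servers fail the probe, and down lists the
    servers that failed.
    """
    down = [server for server in servers if not probe(*server)]
    return len(down) > max_unreachable, down
```

With `max_unreachable=0`, a single unreachable server after a deploy is enough to signal a rollback. In this incident, four failing probes out of five would have triggered it right after the broken deploy.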
