Production outage: Four of five storage servers' tahoe service 2022-09-29
What happened?
Four of the currently five production storage servers were down.
@chris noted this and posted in the Slack monitoring
channel at 2022-09-29 01:59 UTC:
It looks like 4 of the production servers are currently down. Is this known/expected?
(Screenshot)
It took more than 12 hours for me to even notice that (cf my Slack message at 13:09 UTC):
Yeah, seems like we deployed my last recent breaking change and not yet the fix
😞
So this is two issues:
- Our production storage servers stopped being available (bad MTTF).
- The fix took over 12 hours We didn't notice early (bad MTTR).
Why did this happen?
- We deployed broken software.
- We didn't notice that for way too long.
What did we do about it?
After consulting with @jcalderone in a video call he and I decided that letting CI deploy latest development
branch, which already had a fix for my breaking change and nothing else in it, would be a good way to go.
I did this by merging develop
into production
in !361 (merged) on 13:28 UTC and let CI deploy production
.
What do we do about it?
(WIP)
- The current regression breaks tahoe, but not any of the monitoring
-
We need to monitor storage server availability / have application level monitoring (privatestorageops#272) -
Don't accept stale data from textfile collector (!365 (merged)) - Alert when we receive stale data https://github.com/prometheus/node_exporter/issues/713
- https://github.com/prometheus-community/node-exporter-textfile-collector-scripts
-
-
For the generic case: Can we roll back? - Maybe even automatically?