Server-side metric for spent ZKAPs for Prometheus
Fixes #259 (closed)
This is broken down into a histogram by number of ZKAPs spent, which correlates reasonably well with actual size. It's also simpler than trying to fit two dimensions (file size and ZKAPs spent) into one histogram. If this provides insufficient visibility we can add another metric, but let's see how it works in practice first.
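Concretely, here is a minimal standalone sketch of the metric being added. The metric name and bucket bounds are taken from the diff below; the module-level scaffolding around them is illustrative, not the MR's actual code:

```python
from prometheus_client import CollectorRegistry, Histogram

registry = CollectorRegistry()

# Powers-of-two upper bounds (1, 2, 4, ..., 1024, +Inf), as in the diff below.
buckets = list(2 ** n for n in range(11)) + [float("inf")]

metric_spending_successes = Histogram(
    "zkapauthorizer_server_spending_successes",
    "ZKAP Spending Successes histogram",
    registry=registry,
    buckets=buckets,
)

# Record one successful spend of 7 ZKAPs; the observation lands in the
# first bucket whose upper bound covers it (le="8.0").
metric_spending_successes.observe(7)
```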
Created by: codecov[bot]
Codecov Report
Merging #264 (378cf912) into main (4dbab153) will decrease coverage by 2.37%. The diff coverage is 33.81%.

```
@@            Coverage Diff             @@
##             main     #264      +/-   ##
==========================================
- Coverage   94.36%   91.99%   -2.38%
==========================================
  Files          50       50
  Lines        4226     4360     +134
  Branches      540      558      +18
==========================================
+ Hits         3988     4011      +23
- Misses        195      306     +111
  Partials       43       43
```
| Impacted Files | Coverage Δ |
| --- | --- |
| src/_zkapauthorizer/tests/test_storage_server.py | 32.58% <11.94%> (-12.47%) |
| src/_zkapauthorizer/_plugin.py | 68.33% <29.62%> (-18.91%) |
| src/_zkapauthorizer/tests/test_plugin.py | 85.79% <43.47%> (-14.21%) |
| src/_zkapauthorizer/_storage_server.py | 92.05% <95.23%> (+0.26%) |
| src/_zkapauthorizer/storage_common.py | 96.92% <100.00%> (ø) |
| src/_zkapauthorizer/tests/foolscap.py | 96.92% <0.00%> (-1.54%) |
| src/_zkapauthorizer/tests/strategies.py | 95.49% <0.00%> (-0.46%) |
| src/_zkapauthorizer/model.py | 94.21% <0.00%> (+0.68%) |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
```diff
 Tahoe-LAFS.
 """

-from __future__ import absolute_import
-
 import random
 from datetime import datetime
 from functools import partial
 from weakref import WeakValueDictionary

+try:
+    from typing import Callable
```

```python
        return list(2 ** n for n in range(11)) + [float("inf")]

    def __attrs_post_init__(self):
        self._metric_spending_successes = Histogram(
            "zkapauthorizer_server_spending_successes",
            "ZKAP Spending Successes histogram",
            registry=self._registry,
            buckets=self._get_buckets(),
        )

    def _clear_metrics(self):
        """
        Forget all recorded metrics.
        """
        # There is also a `clear` method but it raises an AttributeError.
        self._metric_spending_successes._metric_init()
```

Created by: tomprince
The docstring for `clear` says "Remove all labelsets from the metric", which somewhat suggests it is only intended to be used with metrics that have labels. That said, it seems reasonable to want to use it with metrics that don't have labels as well. Is there a bug upstream for this?
```python
                allocated_size,
            )

            # XXX Some shares may exist already so those passes aren't necessarily
            # spent...
            self._metric_spending_successes.observe(len(validation.valid))
```

```diff
-    :return: ``None`` if the given number of passes is sufficient.
+    :return: A mapping from share number to share size on the server if the
+        number of passes given is sufficient.
     """
     allocated_sizes = dict(
         get_share_sizes(
             storage_server,
             storage_index,
             list(get_all_share_numbers(storage_server, storage_index)),
         ),
-    ).values()
-    check_pass_quantity(pass_value, validation, allocated_sizes)
+    )
+    check_pass_quantity(pass_value, validation, allocated_sizes.values())
+    return allocated_sizes
```

```python
def read_count(storage_server):
    buckets = storage_server._metric_spending_successes._buckets
    return sum(b.get() for b in buckets)


def read_bucket(storage_server, size):
    bounds = storage_server._get_buckets()
    for bucket_number, upper_bound in enumerate(bounds):
        if size <= upper_bound:
            break

    note("bucket_number {}".format(bucket_number))
    buckets = storage_server._metric_spending_successes._buckets
```

Created by: tomprince
We could use `.collect()` here to avoid looking at implementation details, but it's probably reasonable to use them here. It'd definitely be more work to extract the necessary details from the result of `.collect()`.

We should maybe also file a bug upstream, about providing better tools for testing.
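For comparison, a sketch (not from the MR) of what going through the public `collect()` API would look like. It also shows the extra work involved: the exposed bucket samples are cumulative, unlike the internal per-bucket `_buckets` counters:

```python
def read_buckets_via_collect(histogram):
    """
    Pull bucket counts out of a histogram using only the public
    ``collect()`` API instead of the private ``_buckets`` attribute.
    """
    results = {}
    for metric in histogram.collect():
        for sample in metric.samples:
            # Histogram bucket samples are named ``<metric>_bucket`` and
            # carry the upper bound in the ``le`` label.  The values are
            # cumulative, so recovering per-bucket counts takes another pass.
            if sample.name.endswith("_bucket"):
                results[sample.labels["le"]] = sample.value
    return results
```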
> We should maybe also file a bug upstream, about providing better tools for testing.
Oh yea. This. Filed https://github.com/prometheus/client_python/issues/736 and linked to it from the `_bucket`-touching code.
```python
                allocated_size=size,
                canary=LocalReferenceable(None),
            ),
        )

        after_count = read_count(self.storage_server)
        after_bucket = read_bucket(self.storage_server, num_passes)

        self.expectThat(
            after_count,
            Equals(1),
            "Unexpected histogram sum value",
        )
        self.assertThat(
            after_bucket,
            Equals(1),
```

That's reasonable. I started to do this and realized I gave an incorrect explanation: `read_count` does return the sum, but it's the sum we calculate by looking at all of the buckets, not the sum prometheus_client calculates. So the other buckets must be 0 or `read_count` (now `read_spending_success_histogram_total`) is broken.
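To spell out the invariant being relied on, using the `read_count` and `read_bucket` helpers from the diff (this wrapper itself is illustrative, not MR code):

```python
def assert_exactly_one_spend_recorded(storage_server, num_passes):
    """
    ``read_count`` computes the total by summing every bucket itself, so a
    total of 1 combined with a count of 1 in the bucket covering
    ``num_passes`` implies that every other bucket is zero.
    """
    assert read_count(storage_server) == 1
    assert read_bucket(storage_server, num_passes) == 1
```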
```python
            actual_sizes,
            Equals(expected_sizes),
        )

    @given(
        storage_index=storage_indexes(),
        secrets=tuples(
            write_enabler_secrets(),
            lease_renew_secrets(),
            lease_cancel_secrets(),
        ),
        test_and_write_vectors_for_shares=slot_test_and_write_vectors_for_shares(),
    )
    def test_mutable_spending_metrics(
```

Created by: tomprince
Review: Approved

This looks good, with a couple of changes (two minor and one medium):

- We are working around some prometheus-client bugs; we should link to an upstream bug for them.
- There are a few names that should be more descriptive, given that we have tahoe-lafs buckets and now histogram buckets.
- I'd like to avoid duplicating the happy-path test cases that already exist in `test_storage_protocol`. I think the new storage server tests here should be fairly easy to fold into the existing tests there.
Good stuff.
I agree with Tom's comments about naming - we could have been more concrete.
I'm available for another pairing session on this if you like, @exarkun!
Great. I hope the last rev I pushed passes CI, and I think we should merge if so. We still have the negative-case metrics (invalid signatures and whatever) to implement in a follow-up PR.