Welcome to hashdial’s documentation!

Implements a hash dial for hash based decision making.

Implements, through hashing, decision making that is deterministic on input, but probabilistic across a set of inputs.

For example, suppose the components of a distributed system each want to emit a log entry for 1% of requests, and every component should log the same 1% of requests. They could do so as follows:

if hashdial.decide(request.id, 0.01):
    log_request(request)
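
To illustrate the idea, here is a minimal sketch of how a hash-based decision of this kind can work. It is not hashdial's actual implementation: the function name `sketch_decide`, the choice of SHA-256, the seed-then-key concatenation, and the use of the top 53 bits of the digest are all assumptions made for this example.

```python
import hashlib


def sketch_decide(key: bytes, probability: float, *, seed: bytes = b"") -> bool:
    """Hypothetical stand-in for a hash dial; not hashdial's implementation."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in range [0, 1]")
    # Hash the seed followed by the key (assumed ordering for this sketch).
    digest = hashlib.sha256(seed + key).digest()
    # Keep the top 53 bits so the fraction is an exact float in [0, 1).
    fraction = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return fraction < probability


request_ids = [f"request-{i}".encode("utf-8") for i in range(10_000)]

# Two components evaluating independently reach identical decisions.
component_a = [sketch_decide(rid, 0.01) for rid in request_ids]
component_b = [sketch_decide(rid, 0.01) for rid in request_ids]
assert component_a == component_b

# Across many distinct keys, roughly 1% are selected.
print(sum(component_a) / len(request_ids))
```

Because the decision depends only on the key, the probability, and the seed, no coordination between components is needed beyond agreeing on those three values.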

Seeds

All functions take an optional seed keyword argument. It is intended to be used in cases where different uses of the library require orthogonal decision making, or it is desirable to make the decision making unpredictable. In particular:

  • Preventing untrusted input from being tailored to bias the outcome of the hashing algorithm requires a seed that is not known to the untrusted source.

  • Filtering data that is the output of a previous filtering step using the same mechanism requires a different seed in order to get correct behavior.

For example, filtering to keep 1% of lines in a file and then applying the same filter again produces the same output as filtering once, since any line that was kept the first time will also be kept the second time.
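
The effect can be reproduced with a small stand-in for the library. In the sketch below, `sketch_decide` is a hypothetical helper (SHA-256 over seed plus key, an assumption rather than hashdial's actual algorithm), the seed values `b"stage-1"` and `b"stage-2"` are arbitrary, and 10% is used instead of 1% only to keep the sample small.

```python
import hashlib


def sketch_decide(key: bytes, probability: float, *, seed: bytes = b"") -> bool:
    """Hypothetical stand-in for a hash dial; not hashdial's implementation."""
    digest = hashlib.sha256(seed + key).digest()
    # Keep the top 53 bits so the fraction is an exact float in [0, 1).
    fraction = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return fraction < probability


lines = [f"line {i}".encode("utf-8") for i in range(20_000)]

# First pass keeps roughly 10% of the lines.
first = [line for line in lines if sketch_decide(line, 0.1, seed=b"stage-1")]

# Same seed again: every surviving line is kept, so nothing changes.
same_seed = [line for line in first if sketch_decide(line, 0.1, seed=b"stage-1")]

# Different seed: the second pass keeps roughly 10% of the survivors.
other_seed = [line for line in first if sketch_decide(line, 0.1, seed=b"stage-2")]

print(len(first), len(same_seed), len(other_seed))
```

For the untrusted-input case, one option is to generate the seed once with `secrets.token_bytes` and share it only among the cooperating components, so that they still agree with each other while the outcome stays unpredictable to outsiders.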

Determinism across versions

Any change to an existing function (including its default seed and choice of hashing algorithm) that would alter the function's output for the same input will not be made without a major version bump to the library.

API

hashdial.decide(key: bytes, probability: float, *, seed: bytes = b'') → bool

Decide between True and False based on key such that the probability of True for a given input over a large set of unique inputs is probability.

For example, to retain 25% of lines read from stdin:

import sys
import hashdial

for line in sys.stdin:
    if hashdial.decide(line.encode('utf-8'), 0.25):
        sys.stdout.write(line)

Parameters
  • key – The bytes to hash.

  • probability – The probability of a given key returning True. Must be in range [0, 1].

  • seed – Seed to hash prior to hashing key.

Returns

Whether to take the action.

hashdial.range(key: bytes, stop: int, *, start: int = 0, seed: bytes = b'') → int

Select an integer in range [start, stop) by hashing key.

Example partitioned filtering of a workload on stdin assuming this is partition 3 out of 10:

import sys
import hashdial

for line in sys.stdin:
    if hashdial.range(line.encode('utf-8'), 10) == 3:
        sys.stdout.write(line)

The difference between stop and start must be sufficiently small to be exactly representable as a float (no larger than 2**(sys.float_info.mant_dig) - 1).
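
The limit comes from floating-point precision. Where Python floats are IEEE 754 doubles, `sys.float_info.mant_dig` is 53, so the limit is 2**53 - 1, and integers beyond 2**53 are no longer all exactly representable. The following check illustrates this; the variable names `max_range_width` and `beyond` are introduced only for this example.

```python
import sys

# Largest width (stop - start) allowed by the documented constraint.
max_range_width = 2 ** sys.float_info.mant_dig - 1

# The limit itself survives a round trip through float exactly.
assert int(float(max_range_width)) == max_range_width

# Just past 2**mant_dig, distinct integers collapse onto the same float.
beyond = 2 ** sys.float_info.mant_dig
assert float(beyond + 1) == float(beyond)

print(max_range_width)
```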

Parameters
  • key – The bytes to hash.

  • stop – The exclusive end of the range of integers among which to select.

  • start – The inclusive start of the range of integers among which to select.

  • seed – Seed to hash prior to hashing key.

Returns

The selected integer.
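
As a rough illustration of partitioning by hash, the sketch below scales a hash-derived fraction onto the range. The function `sketch_range`, the use of SHA-256, and the float-based scaling are assumptions for this example; whether hashdial works this way internally is not stated here. A float-based scaling is, however, one approach in which a limit on the width of the range would matter.

```python
import hashlib


def sketch_range(key: bytes, stop: int, *, start: int = 0, seed: bytes = b"") -> int:
    """Hypothetical stand-in for hash-based range selection; not hashdial's implementation."""
    if stop <= start:
        raise ValueError("stop must be greater than start")
    digest = hashlib.sha256(seed + key).digest()
    # Top 53 bits of the digest as an exact float fraction in [0, 1).
    fraction = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    # Scale the fraction onto the width of the range and shift by start.
    return start + int(fraction * (stop - start))


keys = [f"job-{i}".encode("utf-8") for i in range(1_000)]
partitions = {p: [k for k in keys if sketch_range(k, 10) == p] for p in range(10)}

# Each key lands in exactly one partition, so ten workers cover the input once.
assert sum(len(v) for v in partitions.values()) == len(keys)
```

Each worker only needs to know its own partition number and the total number of partitions; no shared state is required to split the workload.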

hashdial.select(key: bytes, seq: Sequence[BucketType], *, seed: bytes = b'') → BucketType

Select one of the elements in seq based on the hash of key.

Example partitioning of input on stdin into buckets:

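import sys
import hashdial
bucketed_lines = {}  # maps bucket number to the list of lines assigned to it
for line in sys.stdin:
    bucket = hashdial.select(line.encode('utf-8'), [0, 1, 2, 3, 4, 5])
    bucketed_lines.setdefault(bucket, []).append(line)

Parameters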
  • key – The bytes to hash.

  • seq – The sequence from which to select an element. Must be non-empty.

  • seed – Seed to hash prior to hashing key.

Raises

ValueError – If seq is empty.

Returns

One of the elements in seq.
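
The documented behavior, including the ValueError for an empty sequence, can be sketched as follows. The function `sketch_select`, the modulo-based index, and the server names are hypothetical and chosen for illustration; this is not hashdial's implementation.

```python
import hashlib
from typing import Dict, List, Sequence, TypeVar

BucketType = TypeVar("BucketType")


def sketch_select(key: bytes, seq: Sequence[BucketType], *, seed: bytes = b"") -> BucketType:
    """Hypothetical stand-in for hash-based selection; not hashdial's implementation."""
    if not seq:
        raise ValueError("seq must be non-empty")
    digest = hashlib.sha256(seed + key).digest()
    # Reduce the hash to an index into seq (assumed scheme for this sketch).
    index = int.from_bytes(digest[:8], "big") % len(seq)
    return seq[index]


servers = ["alpha", "beta", "gamma"]  # hypothetical bucket names
assignments: Dict[str, List[bytes]] = {name: [] for name in servers}
for user_id in (f"user-{i}".encode("utf-8") for i in range(300)):
    assignments[sketch_select(user_id, servers)].append(user_id)

# The same key always maps to the same element.
assert sketch_select(b"user-7", servers) == sketch_select(b"user-7", servers)
```

Note that the result depends on the length and order of seq, so changing the sequence will generally change which element a given key maps to.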