I guess a funny randomized algorithm is just "if snapshots.size() > capacity, delete a random snapshot" (depending on the complexity of random deletions). If they're in a B-tree, that's not bad. Snapshots then have a half-life, so they fall off exponentially.
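Something like this, as a toy sketch (list-based, so the random deletion itself is linear; the B-tree remark is about making that part cheap):

    import random

    def add_snapshot(snapshots, snap, capacity):
        # Toy version of the random-eviction idea: append the new snapshot,
        # then evict uniformly random snapshots while we're over capacity.
        snapshots.append(snap)
        while len(snapshots) > capacity:
            snapshots.pop(random.randrange(len(snapshots)))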
Maybe not hard to get a deterministically good distribution (haven't properly read TFA) but I guess you'd be sad if you didn't do better than the most obvious thing.
I guess an algorithm like "delete the (i mod n)th oldest element once you hit capacity" also deletes every second element every n iterations, though maybe a bit unevenly, as it prefers to thin things out in quite local ways.
Reading again, I don't think the distribution in TFA ends up logarithmically spaced. The "first zero bit" function is biased low, so elements don't have an even half-life: elements towards one end are preferentially deleted.
Given uniformly spaced elements, that preference might initially move the distribution in the direction of log-spacing, but it won't track it closely. (I think...)
I solved this problem in a similar way some time ago. It was impractical to assign numbers to snapshots, but each snapshot had a date assigned to it. So I computed the number of days since the Unix epoch for each snapshot, n. Then the largest power of 2 which divides n: m = n & ~(n - 1), I think. Then I computed the date after which I wanted to delete a snapshot: n + f*m. The factor f controls the density. I chose f = 2.
Every day there is exactly one date for which a snapshot needs to be deleted and it can be computed directly. If there is no snapshot for this date nothing needs to be done.
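In Python it would look roughly like this (forward direction only, i.e. computing each snapshot's expiry date; the names and epoch handling are my own):

    from datetime import date, timedelta

    EPOCH = date(1970, 1, 1)
    F = 2  # density factor f

    def expiry(snapshot_date):
        # Date after which the snapshot taken on snapshot_date gets deleted.
        n = (snapshot_date - EPOCH).days
        m = n & ~(n - 1)                  # largest power of two dividing n
        return EPOCH + timedelta(days=n + F * m)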
At first glance this looks to me like it is the so-called Tower of Hanoi method of maintaining tape backups. If you have x sets of tapes, you can keep backups weighted towards more recent copies while also having some old ones by using set 1 every other day, set 2 every 4 days, set 3 every 8 days, etc.
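If I'm reading the rotation right, choosing the set for a given day is just counting how many times 2 divides the day number, capped at the number of sets; a rough sketch:

    def tape_set(day, num_sets):
        # Tower-of-Hanoi rotation: set 1 on odd days, set 2 on days = 2 mod 4,
        # set 3 on days = 4 mod 8, ... with the last set absorbing the rest.
        k = 1
        while day % 2 == 0 and k < num_sets:
            day //= 2
            k += 1
        return k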
I've often wondered whether you could effectively use this strategy to optimize a (braces for impact) blockchain. Presumably there are some relevant time spans:
- how long ago you need blocks for preventing alternate-history shenanigans
- how long ago you need blocks for nodes that may go offline for a while and come back confused
- how long ago you need fine granularity of records for accounting reasons
If none of those timespans are "forever" then maybe there could be agreement between nodes that at some point we start summarizing. So instead of deleting snapshots, you're merging adjacent blocks and purging transitional states that are not required to tell a consistent story.
Is there a similarly simple scheme that works even if you decouple the snapshot phase from the GC phase?
Use case: I already have a list of equally spaced snapshots that I'd like to logarithmically space.
Also, it occurs to me that this (and other grandfather snapshot schemes) ultimately derives from the generational hypothesis: users want files for either a short time (transient files, tarballs, PDFs, checkouts) or a long time (photos).
You mean, you want to get rid of some (most) of your linearly spaced snapshots, so that the remainder are logarithmically spaced? Not too difficult:
- Start with the most recent snapshot; keep it
- Go on to the next snapshot; keep it, and then drop the next 2^1 - 1 = 1 snapshot
- Go on to the next snapshot; keep it, and then drop the next 2^2 - 1 = 3 snapshots
- Go on to the next snapshot; keep it, and then drop the next 2^3 - 1 = 7 snapshots
Etc.
Alternatively (if it's easier to model or implement it that way), the snapshots that you keep are the ones at power-of-two indices, plus index 0 (again counting from the most recent snapshot): snapshot 0, snapshot 1, snapshot 2, snapshot 4...
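In code the second formulation is only a few lines (assuming the list is ordered newest-first):

    def log_spaced(snapshots):
        # Keep snapshot 0 plus every snapshot at a power-of-two index.
        if not snapshots:
            return []
        keep = [snapshots[0]]
        i = 1
        while i < len(snapshots):
            keep.append(snapshots[i])
            i *= 2
        return keep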
I've used a similar approach to save debug images in a video stream processing system that basically scans vehicles going near the camera. Since the length and speed of vehicles varies, memory is constrained, and keeping too many images of each vehicle isn't useful, a progressively decimated buffer (along with images of entry and exit events) worked well.
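I don't know the details of that system, but the general "progressively decimated buffer" shape could be as simple as a bounded buffer that drops every other frame whenever it overflows, so older frames thin out while recent ones stay dense:

    class DecimatedBuffer:
        def __init__(self, capacity):
            self.capacity = capacity
            self.frames = []

        def push(self, frame):
            self.frames.append(frame)
            if len(self.frames) > self.capacity:
                # Halve the buffer; frames that survive several halvings
                # end up progressively sparser the older they are.
                self.frames = self.frames[::2]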
I had a similar problem with log retention in the past and came up with a shell script[1] which deletes half of the least recent files if disk usage hits a high-water limit.
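Roughly the same idea in Python (this is my own sketch, not the linked script; the directory and the 90% threshold are placeholders):

    import os

    LOG_DIR = "/var/log/myapp"
    HIGH_WATER = 0.90

    def prune_logs():
        st = os.statvfs(LOG_DIR)
        used = 1 - st.f_bavail / st.f_blocks
        if used < HIGH_WATER:
            return
        files = sorted((os.path.join(LOG_DIR, f) for f in os.listdir(LOG_DIR)),
                       key=os.path.getmtime)
        for path in files[:len(files) // 2]:   # least recent half
            os.remove(path)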
this is a very nice family of structures; perhaps you could use it to choose which previous line in a poem to rhyme with, or which previous day to reflect on
to avoid the collapse from transitive equality, I think rhyming would need to work in pairs, rather than in lines, something like AABBAACCAABBAADD... (giving an ostinato effect, similar to a ghazal?)
(...GCAABBCDEEFFDG... might be interesting to write with the quatrains displayed block inline but the longer-range rhymes just set as normal prose; it would be up to the reader to notice the maintenance of metre and rhyme in the nominally "prose" sections)