Skip to content

feat: Expose Sketch Memory Footprint#136

Open
tabac wants to merge 2 commits into
apache:mainfrom
tabac:sketch-mem-size
Open

feat: Expose Sketch Memory Footprint#136
tabac wants to merge 2 commits into
apache:mainfrom
tabac:sketch-mem-size

Conversation

@tabac
Copy link
Copy Markdown
Contributor

@tabac tabac commented May 28, 2026

Introduce a size() method that returns the memory footprint of a sketch. Add implementations for HLL, CPC.

If the approach is what we're looking for I can add implementations for all sketches, either here or in a follow up PR.

Addresses #135.

@tabac tabac changed the title Expose Sketch Memory Footprint feat: Expose Sketch Memory Footprint May 28, 2026
@tabac
Copy link
Copy Markdown
Contributor Author

tabac commented May 28, 2026

@tisonkun does this approach look OK to you? If so I can update the remaining sketches.

Copy link
Copy Markdown
Member

@tisonkun tisonkun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM.

I have two comments:

  1. Naming is important. We can spend a few more time on considering names. So far, I may prefer estimated_bytes because (1) it clearly describes the unit is bytes (2) we may not always have the accurate size so leaving it as a hint should be better; this is what this info is typically used IMO. (Disclose: the name estimated_bytes is used by ScopeDB for similar cases.)

  2. I'd ping @leerho @AlexanderSaydakov @proost @freakyzoidberg for information because this API can be helpful for other (lang) datasketches impls.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a public size(&self) -> usize method that reports the in-memory footprint (stack + heap) of HllSketch and CpcSketch, addressing issue #135. Internal heap_size() helpers are added on the backing storage types (HLL Container, AuxMap, Array4/6/8, and CPC PairTable) and combined with std::mem::size_of::<Self>() at the sketch level.

Changes:

  • Add HllSketch::size() that dispatches on Mode and sums backing-storage heap with size_of::<Self>().
  • Add CpcSketch::size() that sums sliding_window + optional PairTable heap with size_of::<Self>().
  • Add heap_size() helpers on Container, AuxMap, Array4, Array6, Array8, and PairTable.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
datasketches/src/hll/sketch.rs Adds public HllSketch::size() that aggregates per-mode heap size with size_of::<Self>().
datasketches/src/hll/container.rs Adds Container::heap_size() counting backing coupon slice.
datasketches/src/hll/aux_map.rs Adds AuxMap::heap_size() counting backing entries slice.
datasketches/src/hll/array4.rs Adds Array4::heap_size() summing main bytes and optional aux map.
datasketches/src/hll/array6.rs Adds Array6::heap_size() returning packed-bytes length.
datasketches/src/hll/array8.rs Adds Array8::heap_size() returning per-slot bytes length.
datasketches/src/cpc/sketch.rs Adds public CpcSketch::size() summing sliding window and surprising-value-table heap.
datasketches/src/cpc/pair_table.rs Adds PairTable::heap_size() counting slots storage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +456 to +461
let heap_size = self.sliding_window.len()
+ self
.surprising_value_table
.as_ref()
.map(|t| t.heap_size())
.unwrap_or(0);
Comment on lines +250 to +253
/// Returns the size of the heap allocations in bytes
pub fn heap_size(&self) -> usize {
self.slots.len() * std::mem::size_of::<u32>()
}
@tisonkun
Copy link
Copy Markdown
Member

Call it footprint is also possible, but it can require too much cabal knowledge to get the idiom.

heap_size fails to (1) specify the unit (2) speak too many details (heap).

@tabac
Copy link
Copy Markdown
Contributor Author

tabac commented May 29, 2026

Generally LGTM.

I have two comments:

  1. Naming is important. We can spend a few more time on considering names. So far, I may prefer estimated_bytes because (1) it clearly describes the unit is bytes (2) we may not always have the accurate size so leaving it as a hint should be better; this is what this info is typically used IMO. (Disclose: the name estimated_bytes is used by ScopeDB for similar cases.)

Makes sense to me, just to make a couple of naming suggestions: estimated_size_bytes is one, the other being mem_footprint but I prefer the first one.

@tisonkun
Copy link
Copy Markdown
Member

tisonkun commented May 29, 2026

estimated_size_bytes

A bit repeated? estimated_bytes generally implies "size" or actually estimated_size with a comment saying it's in bytes should work. Just heap_size sounds to me too technical :P

@tabac
Copy link
Copy Markdown
Contributor Author

tabac commented May 29, 2026

estimated_size_bytes

A bit repeated? estimated_bytes generally implies "size" or actually estimated_size with a comment saying it's in bytes should work. Just heap_size sounds to me too technical :P

OK, I can push a commit with the rename. Just to make sure: we want both size() and heap_size() to be renamed to estimated_size() right?

@tisonkun
Copy link
Copy Markdown
Member

estimated_size_bytes

A bit repeated? estimated_bytes generally implies "size" or actually estimated_size with a comment saying it's in bytes should work. Just heap_size sounds to me too technical :P

OK, I can push a commit with the rename. Just to make sure: we want both size() and heap_size() to be renamed to estimated_size() right?

That's correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants