Skip to content

feat: add frequencies sketches#44

Merged
leerho merged 8 commits into
apache:mainfrom
PsiACE:freq
Dec 30, 2025
Merged

feat: add frequencies sketches#44
leerho merged 8 commits into
apache:mainfrom
PsiACE:freq

Conversation

@PsiACE
Copy link
Copy Markdown
Member

@PsiACE PsiACE commented Dec 26, 2025

  • support FrequentItemsSketch
  • add baisc test cases and serde tests

@PsiACE
Copy link
Copy Markdown
Member Author

PsiACE commented Dec 26, 2025

only 1 serde case failed on windows, ... on windows, long is 32‑bit......

@jmalkin
Copy link
Copy Markdown

jmalkin commented Dec 26, 2025

Probably don’t need to bother with a dedicated implementation for longs. That was mostly for initial testing, I think. And if rust supports templates/generics with primitives then there’s no reason for a separate implementation for longs. We didn’t do one for c++, just provided a default serde for a few basic types.

@PsiACE
Copy link
Copy Markdown
Member Author

PsiACE commented Dec 26, 2025

you're right.

just provided a default serde for a few basic types.

I'll mark this pr as a draft for now. Later, I'll remove longsketch, update the tests, and skip the serde errors on Windows.

@PsiACE PsiACE marked this pull request as draft December 26, 2025 18:20
@PsiACE PsiACE marked this pull request as ready for review December 26, 2025 18:37
@leerho
Copy link
Copy Markdown
Member

leerho commented Dec 26, 2025

I need to see some Rust reviewers before I approve & merge this.

@tisonkun
Copy link
Copy Markdown
Member

I'll take a look here before the new year :P

Comment thread datasketches/src/frequencies/mod.rs
Comment thread datasketches/src/frequencies/reverse_purge_item_hash_map.rs Outdated
Comment thread datasketches/src/frequencies/sketch.rs
Comment thread datasketches/src/frequencies/sketch.rs Outdated
Comment thread datasketches/src/frequencies/sketch.rs Outdated
Comment thread datasketches/src/frequencies/mod.rs
Comment thread datasketches/src/frequencies/serde.rs Outdated
Comment thread datasketches/src/frequencies/serde.rs Outdated
Comment thread datasketches/src/frequencies/serde.rs Outdated
Comment thread datasketches/src/frequencies/reverse_purge_item_hash_map.rs Outdated
Comment thread datasketches/src/frequencies/sketch.rs Outdated
Comment thread datasketches/src/frequencies/sketch.rs Outdated
Copy link
Copy Markdown
Member

@tisonkun tisonkun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM. Mostly style comments to keep our codebase consistent.

@leerho leerho requested a review from tisonkun December 30, 2025 01:47
Copy link
Copy Markdown
Member

@tisonkun tisonkun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leerho There are several further potential improvements. But those can be follow-ups.

I'd leave it to @PsiACE to decide whether to include it in this patch or in a follow-up PR.

Comment thread datasketches/src/frequencies/sketch.rs Outdated
Copy link
Copy Markdown
Member

@tisonkun tisonkun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PsiACE The main branch now includes the new general error type. So you need to merge the main branch to resolve logic conflicts.

@leerho
Copy link
Copy Markdown
Member

leerho commented Dec 30, 2025

Waiting for @PsiACE

@PsiACE
Copy link
Copy Markdown
Member Author

PsiACE commented Dec 30, 2025

I have now completed the updates. This PR should be safe to merge. cc @tisonkun @leerho

Copy link
Copy Markdown
Member

@tisonkun tisonkun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. You may leverage the new codec utils in a follow-up PR.

if let Err(err) = sketch {
assert!(matches!(err, SerdeError::InsufficientData(_)));
assert_eq!(err.kind(), ErrorKind::InvalidData);
assert!(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: googletest has a contains matcher that can provide a better error message without manual construct.

https://docs.rs/googletest/latest/googletest/matchers/fn.contains.html

Comment on lines +144 to +145
let value = self.hash_map.get(item);
value.max(0) as u64
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the cpp impl uses u64 for reverse_purge_hash_map's value. But let's review this point as a follow-up.

Read https://github.com/apache/datasketches-cpp/blob/7bb979d3ef8929e235bcd22d67579e1f695f6ecd/fi/include/reverse_purge_hash_map_impl.hpp#L346-L364 especially for how subtract_and_keep_positive_only replaces:

        self.adjust_all_values_by(-median);
        self.keep_only_positive_counts();

This is the only usage where i64 is needed in the current impl.

@leerho leerho merged commit 38fad36 into apache:main Dec 30, 2025
9 checks passed
@tisonkun
Copy link
Copy Markdown
Member

Cross-reference to #35.

I don't know whether this sketch works well in approx_topk scenario. I'll try.

@PsiACE PsiACE deleted the freq branch January 5, 2026 00:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants