Skip to content

feat: add theta sketch (part 1)#45

Merged
leerho merged 4 commits into
apache:mainfrom
ZENOTME:theta_sketch_p1
Dec 30, 2025
Merged

feat: add theta sketch (part 1)#45
leerho merged 4 commits into
apache:mainfrom
ZENOTME:theta_sketch_p1

Conversation

@ZENOTME
Copy link
Copy Markdown
Contributor

@ZENOTME ZENOTME commented Dec 28, 2025

@ZENOTME ZENOTME force-pushed the theta_sketch_p1 branch 2 times, most recently from 8f1e59f to 38c61a0 Compare December 29, 2025 15:00
Comment on lines +199 to +208
/// Canonicalize double value for compatibility with Java
fn canonical_double(value: f64) -> i64 {
if value == 0.0 {
0i64
} else if value.is_nan() {
0x7ff8000000000000i64 // Java's Double.doubleToLongBits() NaN value
} else {
value.to_bits() as i64
}
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread datasketches/src/theta/mod.rs Outdated
Comment on lines +35 to +39
//! # Serialization
//!
//! Sketches can be serialized and deserialized while preserving all state, enabling
//! cross-platform sketch exchange with other Apache DataSketches implementations
//! (Java, C++, Python).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to remove this section until we implement it.

Comment thread datasketches/src/theta/mod.rs Outdated
Comment on lines +32 to +33
//! - **ThetaUnion**: For computing union of multiple theta sketches
//! - **ThetaIntersection**: For computing intersection of multiple theta sketches
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

Comment thread datasketches/src/theta/sketch.rs Outdated
}

/// Get cardinality estimate
pub fn get_estimate(&self) -> f64 {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn get_estimate(&self) -> f64 {
pub fn estimate(&self) -> f64 {

Similar to this comment - #44 (comment)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will these resolve the errors?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you retry the workflows? Or shall I.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. The style check should be resolved by running cargo x lint --fix locally. It's a formatting issue where there are extra blank lines.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that something you can do?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you retry the workflows? Or shall I.

No need. It's a permanent failure before a new commit applying the formatter.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or does @ZENOTME have to do it?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. I think with a write commit I can push a new commit to this PR but I'm yet to have one.

Currently, @ZENOTME would handle these comments and commit for updates and fixes.

Comment thread datasketches/src/theta/sketch.rs Outdated
///
/// # Panics
///
/// If lg_k is not in range [MIN_LG_K, MAX_LG_K]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// If lg_k is not in range [MIN_LG_K, MAX_LG_K]
/// If lg_k is not in range [5, 26]

Make it clear in the docs.

Comment thread datasketches/src/theta/sketch.rs Outdated
/// # Panics
///
/// If lg_k is not in range [MIN_LG_K, MAX_LG_K]
pub fn set_lg_k(mut self, lg_k: u8) -> Self {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn set_lg_k(mut self, lg_k: u8) -> Self {
pub fn lg_k(mut self, lg_k: u8) -> Self {

I suppose we can also drop the set_ prefix in a builder.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto other methods.

Comment thread datasketches/src/theta/hash_table.rs Outdated
Comment on lines +22 to +23
/// Default seed for hash function
pub const DEFAULT_SEED: u64 = 9001;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once #48 merged you can leverage the shared DEFAULT_UPDATE_SEED as other impls do.

@tisonkun
Copy link
Copy Markdown
Member

@ZENOTME To be clear, you make some design decisions to mapping:

  • C++'s theta_update_sketch_base
  • Java's UpdatableThetaSketch

.. to this ThetaHashTable and wrap it into a ThetaSketch.

Since it's not a 1:1 mapping, may you describe the concept and struct mapping details?

I'm not quite familiar with ThetaSketch so I'd count on @leerho on verifying the design correctness.

@ZENOTME
Copy link
Copy Markdown
Contributor Author

ZENOTME commented Dec 30, 2025

Since it's not a 1:1 mapping, may you describe the concept and struct mapping details?

In C++ and Java, the class hierarchy of the theta sketch looks like this:

- ThetaSketch
 |--- UpdateThetaSketch (extend with update method)
   |--- In Java, it has more implementation classes
 |--- CompactThetaSketch (extend with serialize/deserialize method)
   |--- ...

In Rust it doesn't have inheritance. So we only have two types:

- ThetaSketch: mutable
- CompactThetaSketch: mutable

For implementation, in C++, update_theta_sketch_alloc uses theta_update_sketch_base as a hash table. I don't know why it's called xxx_base, it looks like it doesn't have inheritance. And that's why I refer to this class and rename it as ThetaHashTable.

For implementation in Java, it has more than one implementation inheriting from UpdateThetaSketch: DirectQuickSelectSketch, ConcurrentDirectQuickSelectSketch, and HeapAlphaSketch... There are different ways to abstract the implementation: ThetaSketch or wrap it as Box internally. I'm not sure we should consider this now, and is there any good practice for this? @tisonkun

@leerho
Copy link
Copy Markdown
Member

leerho commented Dec 30, 2025

Is this ready to merge?

@ZENOTME
Copy link
Copy Markdown
Contributor Author

ZENOTME commented Dec 30, 2025

Is this ready to merge?

I have fixed the review above, and it's mergeable. cc @tisonkun

Comment thread datasketches/src/theta/hash_table.rs Outdated
return false;
}

assert!(self.entries[index] == 0, "Entry should be empty");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert!(self.entries[index] == 0, "Entry should be empty");
assert_eq!(self.entries[index], 0, "Entry should be empty");

Comment thread datasketches/src/theta/hash_table.rs Outdated
Comment on lines +261 to +264
assert!(
num_inserted == k as usize,
"Number of inserted entries should be equal to k."
);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert!(
num_inserted == k as usize,
"Number of inserted entries should be equal to k."
);
assert_eq!(
num_inserted, k as usize,
"Number of inserted entries should be equal to k."
);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto others in tests.

Comment thread datasketches/src/theta/hash_table.rs Outdated
Comment on lines +68 to +74
/// Specific hash table for theta sketch
///
/// It maintain a array capacity max to [2 ** (lg_max_size)]:
/// - Before it reach the max capacity, it will extend the array based on resize_factor.
/// - After it reach the capacity bigger than 2 ** (lg_nom_size), every time the number of entries
/// exceeds the threshold, it will rebuild the table: only keep the min 2 ** lg_nom_size entries
/// and update the theta to the k-th smallest entry.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Specific hash table for theta sketch
///
/// It maintain a array capacity max to [2 ** (lg_max_size)]:
/// - Before it reach the max capacity, it will extend the array based on resize_factor.
/// - After it reach the capacity bigger than 2 ** (lg_nom_size), every time the number of entries
/// exceeds the threshold, it will rebuild the table: only keep the min 2 ** lg_nom_size entries
/// and update the theta to the k-th smallest entry.
/// Specific hash table for theta sketch
///
/// It maintains an array capacity max to 2^lg_max_size:
/// * Before it reaches the max capacity, it will extend the array based on resize_factor.
/// * After it reaches the capacity bigger than 2^lg_nom_size, every time the number of entries
/// exceeds the threshold, it will rebuild the table: only keep the min 2^lg_nom_size entries
/// and update the theta to the k-th smallest entry.

Comment thread datasketches/src/theta/hash_table.rs Outdated

/// Find an entry in the hash table.
///
/// Returns the index of the entry if found, otherwise None. The entry may has been inserted or
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Returns the index of the entry if found, otherwise None. The entry may has been inserted or
/// Returns the index of the entry if found, otherwise None. The entry may have been inserted or

Comment thread datasketches/src/theta/hash_table.rs Outdated

/// Find index in a given entries.
///
/// Returns the index of the entry if found, otherwise None. The entry may has been inserted or
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Returns the index of the entry if found, otherwise None. The entry may has been inserted or
/// Returns the index of the entry if found, otherwise None. The entry may have been inserted or

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ZENOTME seems the final missing one here.

Copy link
Copy Markdown
Member

@tisonkun tisonkun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Some style issues can be resolved here or as a follow-up.

@ZENOTME
Copy link
Copy Markdown
Contributor Author

ZENOTME commented Dec 30, 2025

LGTM.

Some style issues can be resolved here or as a follow-up.

resolved

@leerho leerho merged commit 0a8543e into apache:main Dec 30, 2025
9 checks passed
@leerho
Copy link
Copy Markdown
Member

leerho commented Dec 30, 2025

@ZENOTME, I saw this comment above:

In Rust it doesn't have inheritance. So we only have two types:

  • ThetaSketch: mutable //not really, but not sure how you implement it in Rust
  • CompactThetaSketch: mutable //this is not correct.

See the Quick Summary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants