Qualify improvements #39

Open
Vedin wants to merge 10 commits into main from qualify-improvments

Vedin (Contributor) commented May 30, 2026

This change fixes the performance problem: the improvement is at least 60% relative to the current state. The big drawback is that it adds more memory pressure. We can still merge it as is, but the current DataFusion path is somewhat more durable thanks to GreedyMemoryPool and DiskManager. Meanwhile, I am trying to add some kind of guard or spill integration here as well.

Vedin and others added 10 commits May 29, 2026 18:21
Adds snowplow_dedup_topk_bench: runs the canonical Snowplow event
deduplication query (QUALIFY ROW_NUMBER() OVER (PARTITION BY event_id
ORDER BY collector_tstamp, dvce_created_tstamp) <= 1) over a synthetic
snowplow-shaped table, asserts the grouped top-K rewrite fires and the
row count is exact, and emits OPTIMIZED_MS (best-of-N) for benchmarking.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
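
To make the semantics of the dedup query concrete, here is a minimal Rust sketch of what `QUALIFY ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY collector_tstamp, dvce_created_tstamp) <= 1` keeps: one row per `event_id`, the one that sorts first by the two timestamps. The `Event` struct, its integer timestamps, and the `dedup_first_per_event` function are hypothetical names for illustration. This is not the `GroupedTopKExec` implementation from this PR, only the K = 1 idea of holding a single candidate per group instead of sorting each partition.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified event row (assumption: real Snowplow events
/// carry many more columns and use proper timestamp types).
#[derive(Debug, Clone, PartialEq)]
struct Event {
    event_id: String,
    collector_tstamp: i64,
    dvce_created_tstamp: i64,
}

/// For each event_id, keep the row that sorts first by
/// (collector_tstamp, dvce_created_tstamp), i.e. the row that
/// ROW_NUMBER() would number 1 within its partition.
fn dedup_first_per_event(events: &[Event]) -> Vec<Event> {
    // One candidate per group: memory grows with the number of distinct keys.
    let mut best: HashMap<&str, &Event> = HashMap::new();
    for ev in events {
        let key = (ev.collector_tstamp, ev.dvce_created_tstamp);
        best.entry(ev.event_id.as_str())
            .and_modify(|cur| {
                // Strict comparison: on a full tie the first-seen row is kept.
                if key < (cur.collector_tstamp, cur.dvce_created_tstamp) {
                    *cur = ev;
                }
            })
            .or_insert(ev);
    }
    let mut out: Vec<Event> = best.into_values().cloned().collect();
    // Sort by event_id only so the output order is deterministic.
    out.sort_by(|a, b| a.event_id.cmp(&b.event_id));
    out
}
```

The output has exactly one row per distinct `event_id`, which is the property the benchmark's row-count assertion relies on. The per-group map also hints at where the extra memory pressure can come from when the number of distinct keys is large.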
…ial timing

Cross-run machine noise (~20%, correlated within a single process) made raw
best-of-N milliseconds unusable for keep/discard decisions. Replace the OPTIMIZED_MS
metric with OPTIMIZED_RATIO: time the native DataFusion window path (a fixed reference
that never touches in-scope code) and the GroupedTopKExec fast path back-to-back each
iteration, and report the median optimized/reference ratio. The ratio cancels machine
noise. Lower is better.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
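
The paired-timing idea from this commit message can be sketched as follows. If machine noise scales both measurements in an iteration by roughly the same factor s, the ratio (s · optimized) / (s · reference) is unaffected, and taking the median across iterations discards outliers. The function names `median` and `median_ratio` and the closure-based interface are assumptions for illustration; the actual benchmark in this PR may be structured differently.

```rust
use std::time::Instant;

/// Median of a non-empty sample; averages the two middle values
/// when the sample length is even.
fn median(mut xs: Vec<f64>) -> f64 {
    assert!(!xs.is_empty(), "median of an empty sample is undefined");
    xs.sort_by(|a, b| a.total_cmp(b));
    let mid = xs.len() / 2;
    if xs.len() % 2 == 1 {
        xs[mid]
    } else {
        (xs[mid - 1] + xs[mid]) / 2.0
    }
}

/// Run `reference` and `optimized` back-to-back for `iters` iterations
/// and return the median of the per-iteration optimized/reference ratios.
/// Lower is better; a value below 1.0 means the optimized path was faster.
fn median_ratio<R, O>(iters: usize, mut reference: R, mut optimized: O) -> f64
where
    R: FnMut(),
    O: FnMut(),
{
    let mut ratios = Vec::with_capacity(iters);
    for _ in 0..iters {
        let t0 = Instant::now();
        reference();
        let ref_secs = t0.elapsed().as_secs_f64();

        let t1 = Instant::now();
        optimized();
        let opt_secs = t1.elapsed().as_secs_f64();

        // Guard against division by zero if the clock is too coarse.
        ratios.push(opt_secs / ref_secs.max(f64::MIN_POSITIVE));
    }
    median(ratios)
}
```

A caller would pass two closures that each run the same query, one through the reference path and one through the optimized path, for example `median_ratio(11, || run_reference(), || run_optimized())`, where both closure bodies are placeholders.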