Snowplow results by JanKaul · Pull Request #37 · Embucket/rustice

JanKaul · 2026-05-29T04:49:30Z

add snowplow tests

Align the sqllogictest output with the snowflake-connector-python output that the bronze_scope .slt fixtures were generated against:

* booleans as TRUE/FALSE
* empty strings as '' (not (empty))
* floats with lowercase nan/inf/-inf and .0 preserved on whole values
* binary as x'<lowercase-hex>'
* date/time/timestamp/list/struct/map cells wrapped in single quotes
* Utf8 cells whose content parses as a JSON array or object (VARIANT / ARRAY / OBJECT in Rustice, all stored as Utf8) re-emitted as compact JSON wrapped in single quotes

Workspace sqllogictest pass count goes from 58 to 116 with no regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
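A minimal sketch of the simpler cell renderings listed above, using only the standard library. The function names (`render_bool`, `render_str`, `render_binary`, `quote_cell`) are hypothetical and are not taken from the Rustice harness; the real code works on Arrow arrays rather than plain Rust values.

```rust
/// Render a boolean as TRUE / FALSE.
fn render_bool(value: bool) -> &'static str {
    if value { "TRUE" } else { "FALSE" }
}

/// Render a string cell; an empty string becomes '' instead of "(empty)".
fn render_str(value: &str) -> String {
    if value.is_empty() {
        "''".to_string()
    } else {
        value.to_string()
    }
}

/// Render binary data as x'<lowercase-hex>'.
fn render_binary(bytes: &[u8]) -> String {
    let hex: String = bytes.iter().map(|b| format!("{b:02x}")).collect();
    format!("x'{hex}'")
}

/// Wrap an already-formatted date/time/list/struct/map cell in single quotes.
fn quote_cell(text: &str) -> String {
    format!("'{text}'")
}
```

For example, `render_binary(&[0xDE, 0xAD])` yields `x'dead'` and `quote_cell("2026-05-29")` yields `'2026-05-29'`.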

* Decimals: preserve the column's declared scale (don't strip trailing zeros via BigDecimal normalize), so `DECIMAL(p,6)` values render as `77500.000000`.
* Floats: use Rust's shortest-round-trip `{:?}` (same algorithm Python's `repr(float)` uses), so values like `1.23e-10` keep scientific notation. Fix the exponent sign to Python's `e+N` form. NaN / Infinity / -Infinity rendered as lowercase `nan` / `inf` / `-inf`.
* Time / Timestamp: pad the subsecond fraction to 6 digits and truncate nanosecond precision, matching Python `datetime.isoformat()`.
* VARIANT JSON: sort object keys alphabetically (serde_json's Map has `preserve_order` on transitively; emit through a recursive sort).
* Strings: escape embedded `\n` as the literal `\n` (matches the python runner's `value.replace('\n', '\\n')`).
* Validator: normalize the expected line too (and join actual cells with `" "` instead of `\t`), so the literal `\t` the .slt parser embeds between columns equates to the whitespace runs the upstream normalizer collapses.

Workspace pass count: 116 -> 155.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
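The float, subsecond, and newline rules above can be sketched with the standard library alone. This is an illustration under assumptions, not the harness code: the function names are hypothetical, only the exponent sign is adjusted (exponent digit padding is not attempted), and the zero-fraction case of `datetime.isoformat()` is not handled.

```rust
/// Format an f64 using Rust's shortest round-trip Debug output,
/// then adjust it toward the Python-style form described above.
fn format_float(value: f64) -> String {
    if value.is_nan() {
        return "nan".to_string();
    }
    if value.is_infinite() {
        let text = if value > 0.0 { "inf" } else { "-inf" };
        return text.to_string();
    }
    let text = format!("{value:?}");
    match text.split_once('e') {
        // Positive exponents get an explicit '+' sign.
        Some((mantissa, exp)) if !exp.starts_with('-') => format!("{mantissa}e+{exp}"),
        // Negative exponents and plain decimals pass through unchanged.
        _ => text,
    }
}

/// Truncate a nanosecond count to microseconds and pad to 6 digits.
fn micros_fraction(nanos: u32) -> String {
    format!("{:06}", nanos / 1_000)
}

/// Escape embedded newlines as the two-character sequence `\n`.
fn escape_newlines(value: &str) -> String {
    value.replace('\n', "\\n")
}
```

With these definitions, `format_float(1.23e-10)` stays `1.23e-10`, `format_float(1e300)` becomes `1e+300`, and `micros_fraction(123_456_789)` gives `123456`.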

* `array_value_to_json` walks the Arrow value at (col, row) and produces a `serde_json::Value`, honouring primitive types, strings, nested lists, struct fields, and map entries.
* Struct fields and Map entries are emitted in alphabetical key order to match Snowflake's VARIANT JSON output (which the bronze_scope fixtures were captured against).
* Arrow's default formatter previously rendered List<Utf8> as `[a, b, c]` (debug form), a far cry from `["a","b","c"]`. This lights up SPLIT, STRTOK_TO_ARRAY, and other UDFs that return array/object cells.
* Tighten the validator's whitespace handling: a cell that's only whitespace was normalising to an empty string but still emitting a separator on join, so a `[" ", "0"]` row produced `" 0"` against an expected `"0"`. Re-normalise the joined row to collapse the stray spaces.

Workspace pass count: 155 -> 157.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
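The whitespace fix in the last bullet can be illustrated as follows. This is a sketch, assuming normalisation means collapsing whitespace runs to single spaces and trimming the ends; `normalize` and `join_row` are hypothetical names rather than the validator's actual functions.

```rust
/// Collapse every run of whitespace to one space and trim both ends.
fn normalize(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Normalise each cell, join with a single space, then normalise the
/// joined row again so that empty cells leave no stray separator behind.
fn join_row(cells: &[&str]) -> String {
    let joined = cells
        .iter()
        .map(|cell| normalize(cell))
        .collect::<Vec<_>>()
        .join(" ");
    normalize(&joined)
}
```

Without the second `normalize` call, `join_row(&[" ", "0"])` would return `" 0"`; with it, the result is `"0"`.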

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

…REGEX>:

Three transforms the Python slt_runner applies that the Rust harness was missing:

1. `varchar_to_str` now substitutes `undefined` → `null` before JSON-parsing a VARIANT-shaped cell, mirroring `slt_runner/result.py:550`.
2. The `<REGEX>:` branch in `embucket_validator` now uses fullmatch (\A..\z) and DOTALL ((?s)) semantics, matching `slt_runner/result.py:422-424`.
3. `<!REGEX>:` (negative-match) is now recognised alongside `<REGEX>:`.

Unit tests cover all three. Net effect on suite: pass 157→158, fail 220→219.
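Items 2 and 3 can be sketched as plain string handling. The `Expectation` enum and both function names are hypothetical; the non-capturing group around the pattern is an added assumption so that alternations stay inside the anchors, and compiling the resulting pattern (for instance with the `regex` crate) is left out to keep the sketch dependency-free.

```rust
/// How an expected line should be compared against the actual output.
#[derive(Debug, PartialEq)]
enum Expectation<'a> {
    Literal(&'a str),
    Regex { pattern: String, negate: bool },
}

/// Wrap a pattern so it must match the whole input (\A..\z) and so that
/// `.` also matches newlines ((?s)).
fn anchor_fullmatch(pattern: &str) -> String {
    format!(r"(?s)\A(?:{pattern})\z")
}

/// Recognise the `<REGEX>:` and `<!REGEX>:` prefixes on an expected line.
fn parse_expectation(line: &str) -> Expectation<'_> {
    if let Some(rest) = line.strip_prefix("<!REGEX>:") {
        Expectation::Regex { pattern: anchor_fullmatch(rest), negate: true }
    } else if let Some(rest) = line.strip_prefix("<REGEX>:") {
        Expectation::Regex { pattern: anchor_fullmatch(rest), negate: false }
    } else {
        Expectation::Literal(line)
    }
}
```

A negated expectation passes when the anchored pattern does not match the actual value; any other line falls back to a literal comparison.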

…elies on

Six focused .slt files, one per DDL operation:

- create_schema_idempotent.slt: CREATE SCHEMA IF NOT EXISTS reruns cleanly
- create_or_replace_table.slt: CREATE OR REPLACE TABLE (typed and AS SELECT) actually drops & recreates, no residual rows
- create_table_as_then_insert.slt: Phase A pattern: CTAS + INSERT INTO. The trailing case documents that when CTAS body and INSERT body both return the same rows you get 2x duplicates (engine-correct).
- drop_then_create.slt: Phase B pattern: DROP + fresh CREATE TABLE AS
- copy_into_parquet.slt: COPY INTO appends; events1 + events2 = 400
- merge_upsert.slt: MERGE INTO with key + window predicate; both WHEN MATCHED (update in place) and WHEN NOT MATCHED (insert once) branches verified

All 6 pass. The DDL primitives all behave correctly, including CREATE OR REPLACE TABLE and MERGE INTO, which scopes the snowplow duplicate-row failures to the data-shape (CTAS body + INSERT body overlap) rather than the DDL itself.

The regen script bundled run-1 (sentinel-gated CTAS) and run-2 (incremental INSERT or __dbt_tmp+MERGE) into a single Phase A block. This populated events_this_run mid-Phase A, which then cascaded into 2x duplicates in every downstream scratch table (their CTAS bodies have no date filter and pull from events_this_run; their accompanying Phase A INSERTs doubled the same rows). Derived `+materialized: incremental` models accumulated a third layer because the Phase A MERGE ran against a stale April-2026 window predicate on a target the same Phase A had just pre-populated, so every source row fell through `WHEN NOT MATCHED → INSERT`.

The setup now emits TWO files, one per simulated dbt run state, so both events1-only and events1+events2 verifications run independently against captured Snowflake reference values:

setup.full_refresh.slt = header + Phase A + Phase B → state after `load events1.csv; dbt run`. Phase A lays down 18 empty schemas; Phase B rebuilds scratch via CREATE OR REPLACE on the events1-only events table and upserts derived via __dbt_tmp + MERGE on the empty target (every row → WHEN NOT MATCHED → INSERT). Validated by 18 leaves under full_refresh/ against captured Snowflake values in slt_results.full_refresh.txt.

setup.slt = header + Phase A + events2 COPY + Phase B → state after `load events2.csv; dbt run` on top of the prior state. Same Phase A + Phase B shape, but events2 is COPYed into the events table before Phase B runs. Validated by 18 leaves under incremental/ against slt_results.incremental.txt.

Single MERGE pass on an empty target is the only path that reproduces the captured Snowflake values without depending on a compile-time-aligned window: the dbt-compiled MERGE predicate is a 2-minute slice baked in at compile time, so a literal two-cycle simulation (cycle 1 MERGE on empty target then cycle 2 re-MERGE on populated target) would produce duplicate derived rows.

Regen-script details:

- emit_phase_a stripped to CTAS-only.
- emit_phase_b switched from DROP+CREATE pairs to CREATE OR REPLACE.
- MERGE source piped through sed to map SNOWPLOW_JAN.snowplow_manifest_* catalog refs to embucket.public_snowplow_manifest_* (compiled dbt SQL hard-codes the upstream Snowflake catalog) and lowercase the Snowflake-style "QUOTED_UPPERCASE_IDENT" column refs (DataFusion stores columns lowercase).

Test surface:

- 18 full_refresh leaves restored, then events1-only verification blocks pasted from test-dbt-snowplow-web/loader/slt_results.full_refresh.txt via the loader's apply_slt_results.py script.
- 18 incremental leaves keep their events1+events2 verification blocks from slt_results.incremental.txt.
- snowplow_web_incremental_manifest expected count 12 → 0 in both variants; the package's post-hook (which appends a manifest row per successful model run) isn't part of our simulation.

dbt_snowplow_web: pass=36 / fail=0.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
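The sed rewrite of the MERGE source can be approximated in Rust to show what the two substitutions do. This is a sketch under assumptions: the regen script uses sed, the function name `rewrite_merge_sql` is hypothetical, and keeping the double quotes around the lowercased identifier is a guess about the sed expression rather than something the commit message states.

```rust
/// Map the upstream catalog prefix to the local one and lowercase
/// double-quoted identifiers that consist only of A-Z, 0-9 and '_'.
fn rewrite_merge_sql(sql: &str) -> String {
    let mapped = sql.replace(
        "SNOWPLOW_JAN.snowplow_manifest_",
        "embucket.public_snowplow_manifest_",
    );
    let mut out = String::with_capacity(mapped.len());
    let mut rest = mapped.as_str();
    while let Some(start) = rest.find('"') {
        out.push_str(&rest[..start]);
        let after = &rest[start + 1..];
        match after.find('"') {
            Some(end) => {
                let ident = &after[..end];
                let is_upper = !ident.is_empty()
                    && ident
                        .chars()
                        .all(|c| c.is_ascii_uppercase() || c.is_ascii_digit() || c == '_');
                out.push('"');
                if is_upper {
                    out.push_str(&ident.to_ascii_lowercase());
                } else {
                    // Mixed-case or other quoted text is left untouched.
                    out.push_str(ident);
                }
                out.push('"');
                rest = &after[end + 1..];
            }
            None => {
                // Unbalanced quote: copy the remainder verbatim.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

Applied to a fragment such as `ON t."EVENT_ID" = s."EVENT_ID"`, the quoted column refs come out as `"event_id"`, which lines up with the lowercase column names DataFusion stores.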

Collapse the nested if/let chain in varchar_to_str's JSON path into a single let-chain to satisfy clippy::collapsible-if (denied by clippy::all in CI).

JanKaul and others added 10 commits May 28, 2026 16:18

add snowplow results

e00d577

cargo fmt sqllogictest

95dffd6

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

clippy

1c0967e

clippy

eabfbb5


JanKaul merged commit b62e7c6 into main May 29, 2026
5 checks passed

JanKaul deleted the snowplow-results branch May 29, 2026 21:44

JanKaul merged 10 commits into main from snowplow-results
