
Snowplow results #37

Merged
JanKaul merged 10 commits into main from snowplow-results on May 29, 2026

@JanKaul JanKaul commented May 29, 2026

add snowplow tests

JanKaul and others added 10 commits May 28, 2026 16:18
Align the sqllogictest output with the snowflake-connector-python output
that the bronze_scope .slt fixtures were generated against:

* booleans as TRUE/FALSE
* empty strings as '' (not (empty))
* floats with lowercase nan/inf/-inf and .0 preserved on whole values
* binary as x'<lowercase-hex>'
* date/time/timestamp/list/struct/map cells wrapped in single quotes
* Utf8 cells whose content parses as a JSON array or object (VARIANT /
  ARRAY / OBJECT in Rustice, all stored as Utf8) re-emitted as compact
  JSON wrapped in single quotes
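
A few of the rules above can be sketched in Python, the language the fixtures were generated from. This is a minimal illustration, not the harness code: `render_cell` is a hypothetical helper, and it covers only the boolean, empty-string, binary, and JSON-shaped-string cases.

```python
import json


def render_cell(value):
    """Render a bool, bytes, or str cell (hypothetical helper, not the harness API)."""
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (bytes, bytearray)):
        # bytes.hex() already yields lowercase hex digits.
        return "x'" + bytes(value).hex() + "'"
    if isinstance(value, str):
        if value == "":
            return "''"
        stripped = value.strip()
        if stripped[:1] in ("[", "{"):
            try:
                parsed = json.loads(stripped)
            except ValueError:
                # Looks like JSON but is not: leave the text untouched.
                return value
            # Compact separators drop the spaces json.dumps adds by default.
            return "'" + json.dumps(parsed, separators=(",", ":")) + "'"
        return value
    raise TypeError(f"unsupported cell type: {type(value).__name__}")
```

For example, `render_cell(b"\xde\xad")` gives `x'dead'`, and a string cell holding `{"a": [1, 2]}` comes back as the compact `{"a":[1,2]}` wrapped in single quotes.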

Workspace sqllogictest pass count goes from 58 to 116 with no regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* Decimals: preserve the column's declared scale (don't strip trailing
  zeros via BigDecimal normalize), so `DECIMAL(p,6)` values render as
  `77500.000000`.
* Floats: use Rust's shortest-round-trip `{:?}` (the same shortest
  representation Python's `repr(float)` produces), so values like
  `1.23e-10` keep scientific notation. Add the explicit `+` to positive
  exponents to match Python's `e+N` form (Rust prints `1e16`, Python
  `1e+16`). NaN / Infinity / -Infinity are rendered as lowercase `nan` /
  `inf` / `-inf`.
* Time / Timestamp: pad the subsecond fraction to 6 digits and truncate
  nanosecond precision, matching Python `datetime.isoformat()`.
* VARIANT JSON: sort object keys alphabetically. serde_json's
  `preserve_order` feature is enabled transitively, so its Map keeps
  insertion order; emit through a recursive sort instead.
* Strings: escape embedded `\n` as the literal `\n` (matches the python
  runner's `value.replace('\n', '\\n')`).
* Validator: normalize the expected line too (and join actual cells
  with `" "` instead of `\t`), so the literal `\t` the .slt parser
  embeds between columns equates to the whitespace runs the upstream
  normalizer collapses.
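
For reference, the Python-side behaviour these rules target can be sketched as follows. The helper names are hypothetical and the timestamp value in the usage note is made up for illustration; the sketch shows the target formats only, not the Rust implementation.

```python
import math
from datetime import datetime
from decimal import Decimal


def render_decimal(value: Decimal, scale: int) -> str:
    # Fixed-point formatting keeps the declared scale, trailing zeros included.
    return f"{value:.{scale}f}"


def render_float(value: float) -> str:
    if math.isnan(value):
        return "nan"
    if math.isinf(value):
        return "inf" if value > 0 else "-inf"
    # repr() yields the shortest string that round-trips, with e+N / e-N exponents.
    return repr(value)


def render_timestamp(base: datetime, nanos: int) -> str:
    # datetime stores microseconds only, so the nanosecond digits are truncated.
    return base.replace(microsecond=nanos // 1_000).isoformat()


def escape_newlines(value: str) -> str:
    # An embedded newline becomes the two characters backslash + n.
    return value.replace("\n", "\\n")
```

With these, `render_decimal(Decimal("77500"), 6)` gives `77500.000000`, `render_float(1e16)` gives `1e+16`, and `render_timestamp(datetime(2026, 4, 1, 12, 0, 0), 500_000_999)` gives `2026-04-01T12:00:00.500000`.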

Workspace pass count: 116 -> 155.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* `array_value_to_json` walks the Arrow value at (col, row) and
  produces a `serde_json::Value`, honouring primitive types, strings,
  nested lists, struct fields, and map entries.
* Struct fields and Map entries are emitted in alphabetical key order
  to match Snowflake's VARIANT JSON output (which the bronze_scope
  fixtures were captured against).
* Arrow's default formatter previously rendered List<Utf8> as
  `[a, b, c]` (debug form) rather than the JSON form `["a","b","c"]`.
  Emitting JSON fixes the output of SPLIT, STRTOK_TO_ARRAY, and other
  UDFs that return array/object cells.
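
The difference between the debug form and the JSON form, and the alphabetical key ordering, can be illustrated with plain Python values standing in for Arrow lists, structs, and maps. `sort_nested` and `to_variant_json` are hypothetical names, not functions from the harness.

```python
import json


def sort_nested(value):
    """Recursively rebuild dicts with alphabetically ordered keys."""
    if isinstance(value, dict):
        return {key: sort_nested(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [sort_nested(item) for item in value]
    return value


def to_variant_json(value) -> str:
    # Compact separators give ["a","b","c"] rather than ["a", "b", "c"].
    return json.dumps(sort_nested(value), separators=(",", ":"))


def debug_form(items) -> str:
    # Roughly what an unquoted list formatter produces.
    return "[" + ", ".join(items) + "]"
```

Here `debug_form(["a", "b", "c"])` yields `[a, b, c]`, while `to_variant_json(["a", "b", "c"])` yields `["a","b","c"]`, and nested objects come out with keys sorted at every level.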

* Tighten the validator's whitespace handling: a cell that's only
  whitespace was normalising to an empty string but still emitting a
  separator on join, so a `["   ", "0"]` row produced `" 0"` against
  an expected `"0"`. Re-normalise the joined row to collapse the stray
  spaces.
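
The whitespace issue is easy to reproduce in a few lines. The sketch below assumes the normalizer collapses whitespace runs to a single space and trims the ends; `normalize` and `rows_match` are illustrative names, not the validator's actual functions.

```python
def normalize(text: str) -> str:
    # Collapse every whitespace run (spaces, tabs) to one space and trim the ends.
    return " ".join(text.split())


def rows_match(actual_cells, expected_line: str) -> bool:
    # Joining normalized cells still emits a separator for an empty cell,
    # so the joined row is normalized once more before comparing.
    joined = " ".join(normalize(cell) for cell in actual_cells)
    return normalize(joined) == normalize(expected_line)
```

Without the second `normalize`, the row `["   ", "0"]` joins to `" 0"` and fails against an expected `"0"`; with it, the two compare equal. Normalizing the expected line as well lets a tab-separated `"a\tb"` match the cells `["a", "b"]`.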

Workspace pass count: 155 -> 157.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…REGEX>:

Three transforms the Python slt_runner applies that the Rust harness was
missing:

1. `varchar_to_str` now substitutes `undefined` → `null` before JSON-parsing
   a VARIANT-shaped cell, mirroring `slt_runner/result.py:550`.

2. The `<REGEX>:` branch in `embucket_validator` now uses fullmatch (\A..\z)
   and DOTALL ((?s)) semantics, matching `slt_runner/result.py:422-424`.

3. `<!REGEX>:` (negative-match) is now recognised alongside `<REGEX>:`.
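
The three transforms can be sketched in Python as below. This is an illustration of the described semantics, not the code in `slt_runner/result.py`: the function names are hypothetical, and the `undefined` substitution is shown as a plain string replace, which is an assumption about how it is done.

```python
import json
import re


def parse_variant(cell: str):
    # Assumption: a naive textual substitution before parsing.
    return json.loads(cell.replace("undefined", "null"))


def check_expected(expected: str, actual: str) -> bool:
    """Compare one expected line against the actual output."""
    for prefix, negate in (("<!REGEX>:", True), ("<REGEX>:", False)):
        if expected.startswith(prefix):
            pattern = expected[len(prefix):]
            # fullmatch anchors both ends; DOTALL lets "." cross newlines.
            matched = re.fullmatch(pattern, actual, flags=re.DOTALL) is not None
            return matched != negate
    return expected == actual
```

With full-match semantics, `check_expected("<REGEX>:a", "ab")` is false even though a search would succeed, and `check_expected("<!REGEX>:a", "ab")` is true for the same reason.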

Unit tests cover all three. Net effect on suite: pass 157→158, fail 220→219.
…elies on

Six focused .slt files, one per DDL operation:

  - create_schema_idempotent.slt — CREATE SCHEMA IF NOT EXISTS reruns cleanly
  - create_or_replace_table.slt  — CREATE OR REPLACE TABLE (typed and AS SELECT)
                                   actually drops & recreates, no residual rows
  - create_table_as_then_insert.slt — Phase A pattern: CTAS + INSERT INTO. The
                                   trailing case documents that when CTAS body
                                   and INSERT body both return the same rows
                                   you get 2x duplicates (engine-correct).
  - drop_then_create.slt         — Phase B pattern: DROP + fresh CREATE TABLE AS
  - copy_into_parquet.slt        — COPY INTO appends; events1 + events2 = 400
  - merge_upsert.slt             — MERGE INTO with key + window predicate; both
                                   WHEN MATCHED (update in place) and WHEN NOT
                                   MATCHED (insert once) branches verified
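
The 2x-duplicate case from create_table_as_then_insert.slt can be reproduced in miniature. The sketch uses SQLite from the Python standard library as a stand-in engine, with made-up table names and rows; it only illustrates why identical CTAS and INSERT bodies double the rows.

```python
import sqlite3


def ctas_then_insert_counts():
    """Return (total rows, distinct rows) after CTAS followed by the same INSERT."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE events (id INTEGER);
        INSERT INTO events VALUES (1), (2), (3);
        -- CTAS already copies every row from the source...
        CREATE TABLE scratch AS SELECT id FROM events;
        -- ...so an INSERT with the same body appends them a second time.
        INSERT INTO scratch SELECT id FROM events;
        """
    )
    total = conn.execute("SELECT COUNT(*) FROM scratch").fetchone()[0]
    distinct = conn.execute("SELECT COUNT(DISTINCT id) FROM scratch").fetchone()[0]
    conn.close()
    return total, distinct
```

Calling `ctas_then_insert_counts()` returns `(6, 3)`: three distinct ids, each present twice, which is correct engine behaviour for that statement sequence.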

All 6 pass. The DDL primitives all behave correctly — including
CREATE OR REPLACE TABLE and MERGE INTO — which scopes the snowplow
duplicate-row failures to the data-shape (CTAS body + INSERT body overlap)
rather than the DDL itself.

The regen script bundled run-1 (sentinel-gated CTAS) and run-2
(incremental INSERT or __dbt_tmp+MERGE) into a single Phase A block.
This populated events_this_run mid-Phase A, which then cascaded into
2x duplicates in every downstream scratch table (their CTAS bodies have
no date filter and pull from events_this_run; their accompanying
Phase A INSERTs doubled the same rows). Derived `+materialized:
incremental` models accumulated a third layer because the Phase A MERGE
ran against a stale April-2026 window predicate on a target the same
Phase A had just pre-populated, so every source row fell through
`WHEN NOT MATCHED → INSERT`.

The setup now emits TWO files, one per simulated dbt run state, so
both events1-only and events1+events2 verifications run independently
against captured Snowflake reference values:

  setup.full_refresh.slt = header + Phase A + Phase B
                           → state after `load events1.csv; dbt run`.
                             Phase A lays down 18 empty schemas; Phase B
                             rebuilds scratch via CREATE OR REPLACE on
                             the events1-only events table and upserts
                             derived via __dbt_tmp + MERGE on the empty
                             target (every row → WHEN NOT MATCHED →
                             INSERT). Validated by 18 leaves under
                             full_refresh/ against captured Snowflake
                             values in slt_results.full_refresh.txt.
  setup.slt              = header + Phase A + events2 COPY + Phase B
                           → state after `load events2.csv; dbt run`
                             on top of the prior state. Same Phase A +
                             Phase B shape, but events2 is COPYed into
                             the events table before Phase B runs.
                             Validated by 18 leaves under incremental/
                             against slt_results.incremental.txt.

Single MERGE pass on an empty target is the only path that reproduces
the captured Snowflake values without depending on a compile-time-
aligned window: the dbt-compiled MERGE predicate is a 2-minute slice
baked in at compile time, so a literal two-cycle simulation (cycle 1
MERGE on empty target then cycle 2 re-MERGE on populated target) would
produce duplicate derived rows.
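
The duplicate-row mechanism can be modelled with a toy MERGE. The sketch below is a simplified model under stated assumptions: rows are dicts with hypothetical `id` and `ts` fields, and the window predicate is assumed to restrict which target rows are eligible to match. The real dbt-compiled predicate may be shaped differently.

```python
def merge(target, source, window):
    """Upsert source rows into target; only target rows inside window can match."""
    low, high = window
    candidates = {row["id"]: row for row in target if low <= row["ts"] < high}
    for src in source:
        match = candidates.get(src["id"])
        if match is not None:
            match.update(src)          # WHEN MATCHED: update in place
        else:
            target.append(dict(src))   # WHEN NOT MATCHED: insert
    return target


def two_cycle_row_count(window):
    """Run the same MERGE twice, starting from an empty target."""
    source = [{"id": 1, "ts": 100}, {"id": 2, "ts": 101}]
    target = merge([], source, window)
    return len(merge(target, source, window))
```

With a window that covers the data, `two_cycle_row_count((100, 102))` stays at 2 rows. With a stale window such as `(0, 10)`, the second pass finds no eligible target rows, every source row is inserted again, and the count is 4.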

Regen-script details:
- emit_phase_a stripped to CTAS-only.
- emit_phase_b switched from DROP+CREATE pairs to CREATE OR REPLACE.
- MERGE source piped through sed to map SNOWPLOW_JAN.snowplow_manifest_*
  catalog refs to embucket.public_snowplow_manifest_* (compiled dbt SQL
  hard-codes the upstream Snowflake catalog) and lowercase the
  Snowflake-style "QUOTED_UPPERCASE_IDENT" column refs (DataFusion
  stores columns lowercase).
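
The two sed rewrites could be expressed in Python roughly as follows. The regular expressions are assumptions about what the sed script matches, the table suffix in the usage note is hypothetical, and the sketch keeps the double quotes around lowercased identifiers, which the original script may or may not do.

```python
import re

# Assumed patterns; the real sed expressions are not shown in this PR text.
CATALOG_REF = re.compile(r"SNOWPLOW_JAN\.snowplow_manifest_")
QUOTED_UPPER = re.compile(r'"([A-Z][A-Z0-9_]*)"')


def rewrite_merge_sql(sql: str) -> str:
    # Point the hard-coded upstream catalog at the local schema prefix.
    sql = CATALOG_REF.sub("embucket.public_snowplow_manifest_", sql)
    # Lowercase "QUOTED_UPPERCASE_IDENT" column references.
    return QUOTED_UPPER.sub(lambda m: '"' + m.group(1).lower() + '"', sql)
```

For instance, a fragment like `SNOWPLOW_JAN.snowplow_manifest_x t ON t."EVENT_ID" = s."EVENT_ID"` becomes `embucket.public_snowplow_manifest_x t ON t."event_id" = s."event_id"`.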

Test surface:
- 18 full_refresh leaves restored, then events1-only verification blocks
  pasted from test-dbt-snowplow-web/loader/slt_results.full_refresh.txt
  via the loader's apply_slt_results.py script.
- 18 incremental leaves keep their events1+events2 verification blocks
  from slt_results.incremental.txt.
- snowplow_web_incremental_manifest expected count 12 → 0 in both
  variants; the package's post-hook (which appends a manifest row per
  successful model run) isn't part of our simulation.

dbt_snowplow_web: pass=36 / fail=0.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Collapse the nested if/let chain in varchar_to_str's JSON path into a
single let-chain to satisfy clippy::collapsible-if (denied by clippy::all
in CI).

@JanKaul JanKaul merged commit b62e7c6 into main May 29, 2026
5 checks passed
@JanKaul JanKaul deleted the snowplow-results branch May 29, 2026 21:44