feat(dataframe): expose the executed physical plan with per-operator metrics #97
Open
LantaoJin wants to merge 1 commit into
Which issue does this PR close?
Rationale for this change
`df.explain(true, true)` already runs the plan and attaches per-operator metrics, but the result is a `DataFrame` of text rows. Programmatic consumers (query-shape regression tests, operational audit feeds, build-time benchmarks) have to scrape `output_rows=12345, elapsed_compute=4.2ms` strings out of those rows. That is brittle to upstream wording changes and ergonomically painful, and the typed metric values (`Count`, `Time`, `Gauge`) lose their type along the way.

This PR adds a typed accessor `df.executedPlan()` that returns an immutable `ExecutedPlan` tree once the DataFrame has been executed via `collect()` / `executeStream()`. Each node carries the operator name, a one-line display rendering, child nodes, and an `OperatorMetrics` POJO with `OptionalLong` fields for the well-known metric variants plus a `Map<String, Long>` for custom counters.

The contract is post-mortem: `executedPlan()` requires a prior `collect` / `executeStream` and rejects with `IllegalStateException("call collect() or executeStream() first")` if called pre-execution. A future PR can extend the surface to make pre-execution structure inspection available too; that follow-up is intentionally out of scope here to keep this PR focused on the metric-snapshot surface.

What changes are included in this PR?
- `ExecutedPlan` and `OperatorMetrics`.
- `DataFrame.executedPlan()` method.
- `proto/executed_plan.proto` (`ExecutedPlanNodeProto`).
- `executed_plan.rs`.
- `final long planId` field assigned at construction.

Out of scope (deferred to follow-up PRs):

- `Time`/`Gauge`-shaped custom metrics; v1 surfaces `Count`-shaped customs only.

Are these changes tested?
Yes. 10 new tests in `ExecutedPlanTest`.

Are there any user-facing changes?
Yes, additive only -- no behavior changes for existing callers.
- `ExecutedPlan` and `OperatorMetrics` (records).
- `DataFrame.executedPlan()` method.
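For reviewers who want a feel for the surface described above, here is a rough, self-contained sketch of how a caller might consume it. It is not the code in this PR. The record component names (`outputRows`, `elapsedComputeNanos`, `customCounters`, `display`), the `FakeDataFrame` stand-in, the `samplePlan()` helper, and the operator and counter names in the sample data are all assumptions made for illustration. Only the type names, the `OptionalLong` / `Map<String, Long>` split, and the `IllegalStateException` message come from the description.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;

public class ExecutedPlanSketch {

    // Assumed shape: OptionalLong for well-known metrics, a map for
    // Count-shaped custom counters. Component names are hypothetical.
    record OperatorMetrics(OptionalLong outputRows,
                           OptionalLong elapsedComputeNanos,
                           Map<String, Long> customCounters) {
        OperatorMetrics {
            customCounters = Map.copyOf(customCounters); // immutable copy
        }
    }

    // Assumed shape of one node in the immutable plan tree.
    record ExecutedPlan(String operatorName,
                        String display,
                        List<ExecutedPlan> children,
                        OperatorMetrics metrics) {
        ExecutedPlan {
            children = List.copyOf(children); // immutable copy
        }
    }

    // Hypothetical stand-in for DataFrame; it only models the post-mortem contract.
    static final class FakeDataFrame {
        private ExecutedPlan executed; // stays null until collect() has run

        void collect() {
            executed = samplePlan();
        }

        ExecutedPlan executedPlan() {
            if (executed == null) {
                throw new IllegalStateException("call collect() or executeStream() first");
            }
            return executed;
        }
    }

    // Made-up two-node plan used as sample data.
    static ExecutedPlan samplePlan() {
        ExecutedPlan scan = new ExecutedPlan(
                "ScanExec", "ScanExec: t", List.of(),
                new OperatorMetrics(OptionalLong.of(100), OptionalLong.of(2_000_000L),
                        Map.of("files_scanned", 3L)));
        return new ExecutedPlan(
                "FilterExec", "FilterExec: a > 1", List.of(scan),
                new OperatorMetrics(OptionalLong.of(40), OptionalLong.empty(), Map.of()));
    }

    // Sums output_rows across the tree, treating an absent metric as zero.
    static long totalOutputRows(ExecutedPlan node) {
        long total = node.metrics().outputRows().orElse(0L);
        for (ExecutedPlan child : node.children()) {
            total += totalOutputRows(child);
        }
        return total;
    }

    public static void main(String[] args) {
        FakeDataFrame df = new FakeDataFrame();
        try {
            df.executedPlan(); // called before execution, so it is rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        df.collect();
        ExecutedPlan plan = df.executedPlan();
        System.out.println(plan.operatorName() + " total output rows: " + totalOutputRows(plan));
    }
}
```

The point of the sketch is that a regression test can assert on `OptionalLong` values and walk `children()` directly, with no parsing of `explain` text rows.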