[Parquet][C++] Add experimental VECTOR repetition level for Arrow FixedSizeList#50
[Parquet][C++] Add experimental VECTOR repetition level for Arrow FixedSizeList#50rok wants to merge 3 commits into
Conversation
e12309e to
d660076
Compare
470692a to
ada6b22
Compare
| /// \brief The fixed number of values per parent record contributed by this | ||
| /// column, taken from the nearest VECTOR-repeated ancestor (or the leaf | ||
| /// itself if it is VECTOR-repeated). Returns -1 when the column is not part | ||
| /// of a VECTOR-repeated subtree. |
There was a problem hiding this comment.
Would it be valid at some point to have a sechema like vector<vector<int>>, and then the effective length would be the multiple of the two vector lengths?
| auto table = | ||
| ::arrow::Table::Make(::arrow::schema({field("root", type, false)}), {array}); |
There was a problem hiding this comment.
Should this and other similar tests be re-using the MakeVectorFixedSizeListTable helper above?
| std::vector<NodePtr> parquet_fields; | ||
| std::vector<std::shared_ptr<Field>> arrow_fields; | ||
|
|
||
| auto element = PrimitiveNode::Make("element", Repetition::VECTOR, ParquetType::FLOAT, |
There was a problem hiding this comment.
Rather than support vector repetition directly on a primitive, should we enforce that it should always use a three-level schema and require use of a new VECTOR or FixedSizeList logical type? The Parquet format allows using a single node with repetition::repeated to represent a list for backwards compatibility but encourages use of the three-level schema. We don't have to worry about backwards compatibility here though.
| ASSERT_RAISES( | ||
| NotImplemented, | ||
| ConvertSchema({::arrow::field("embedding", arrow_list, true)}, builder.build())); |
There was a problem hiding this comment.
What if we have a table with an Arrow fixed-size list of string and a fixed-size list of int? We should allow using the normal list encoding for the string list and vector for the int list rather than requiring that all fixed size lists can use the vector encoding if enable_experimental_vector_encoding is used.
| records_to_read += | ||
| reader_->metadata()->RowGroup(row_group)->ColumnChunk(i)->num_values(); | ||
| if (is_vector) { | ||
| records_to_read += reader_->metadata()->RowGroup(row_group)->num_rows(); |
There was a problem hiding this comment.
This change is a bit surprising to me, I would have thought that the vector path would still use the same code as in the else branch where records_to_read is the total number of elements (rows * vector length), which would match the behaviour of variable size list data.
This PR prototypes a new experimental Parquet repetition type, VECTOR, mainly for Arrow's
FixedSizeList<T, N>as proposed in Option B here.Today this is written through Parquet LIST, which means we pay repetition/definition-level costs for an inner shape that is already fixed. VECTOR stores the fixed length in the Parquet schema as
vector_lengthand avoids the inner repetition level.Writing
FixedSizeList<T, N>as VECTOR is opt-in viaArrowWriterProperties::Builder::enable_experimental_vector_encoding()defaulting to LIST otherwise.Implemented in this prototype
FixedSizeList<struct<x: float, y: int32>, N>Deferred for now:
Compatibility
We expect non-VECTOR-aware readers to fail when encountering VECTOR repetition level.
Benchmark snapshot
Both VECTOR and LIST use identical WriterProperties (dictionary, statistics and page-index disabled, PLAIN encoding) — the only intended difference is enable_experimental_vector_encoding.
Numbers below are from a local release build on m4 mac, so directional only.
Required
FixedSizeList<float, N>: FixedSizeList encoded as LIST vs VECTORAll with
FixedSizeList<float, {80,768,10k}>.Realistic embedding row:
(int64 id, timestamp ts, int32 category, float score, string label, FSL<float, N> embedding)