-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Kernel][Writes] Add support for writing data file stats #3342
base: master
Are you sure you want to change the base?
[Kernel][Writes] Add support for writing data file stats #3342
Conversation
25f689b
to
b580d41
Compare
Will fix the tests. |
b580d41
to
dd88aa2
Compare
a8e9652
to
2594130
Compare
2594130
to
2c73a99
Compare
f558796
to
1f61a59
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
Also it would be good if we can add an integration tests and make sure it is readable/usable in Kernel/Delta-Spark read tests.
* @param name the column name to append | ||
* @return the new column | ||
*/ | ||
public Column append(String name) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is subField
or nestedField
a good name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, makes sense, updated to appendNestedField
.collect(Collectors.toSet()); | ||
|
||
// For now, only support the first numIndexedCols columns | ||
return TransactionStateRow.getLogicalSchema(engine, transactionState).fields().stream() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the fields
traverse through the nested struct fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the DATA_SKIPPING_NUM_INDEXED_COLS refers to the number of columns at the leaf level or top level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the top level columns, looking at the code, fields
does seem to only traverse top level columns but i'll double check
public static final String DATA_SKIPPING_NUM_INDEXED_COLS = "delta.dataSkippingNumIndexedCols"; | ||
public static final int DEFAULT_DATA_SKIPPING_NUM_INDEXED_COLS = 32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be defined in the TableConfig
return; | ||
} | ||
for (StructField field : schema.fields()) { | ||
Column colPath = parentColPath.append(field.getName()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can actually just mantain the path as a list and we can avoid the extra API on the Column
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah felt like that'd just be extra state to maintain and tradeoff readability a bit; the API could come in handy in other places as well perhaps. Lmk if you feel strongly and I can update
generator.writeEndObject(); | ||
} else { | ||
T value = values.get(colPath); | ||
if (value != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we may want to type check the Literal matches the data type to make sure to not write incorrect stats
} else if (type instanceof FloatType) { | ||
generator.writeNumber(((Number) value).floatValue()); | ||
} else if (type instanceof DoubleType) { | ||
generator.writeNumber(((Number) value).doubleValue()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets double check the NaN and infinity are written correctly.
@@ -0,0 +1,66 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be an internal right? I think there is one already in the internal/JsonUtils.java
} | ||
|
||
@FunctionalInterface | ||
public interface ToJson { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this interface for?
new Column("IntegerType") -> Literal.ofInt(1), | ||
new Column("LongType") -> Literal.ofLong(1L), | ||
new Column("FloatType") -> Literal.ofFloat(0.1f), | ||
new Column("DoubleType") -> Literal.ofDouble(0.1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add all variants of the double: NaN, infinity. Same for float.
new Column("TimestampNTZType") -> Literal.ofTimestampNtz(1L), | ||
new Column("BinaryType") -> Literal.ofBinary("a".getBytes), | ||
new Column(Array("NestedStruct", "aa")) -> Literal.ofString("a"), | ||
new Column(Array("NestedStruct", "ac", "aca")) -> Literal.ofInt(1)).asJava |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null as stats values, null literal
Which Delta project/connector is this regarding?
Description
Serializes stats to the data file on writes.
How was this patch tested?
Unit tests.
Does this PR introduce any user-facing changes?
No