[Kernel][Writes] Add support for writing data file stats #3342

Open
wants to merge 2 commits into
base: master

raveeram-db
Collaborator

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Serializes stats to the data file on writes.

How was this patch tested?

Unit tests.

Does this PR introduce any user-facing changes?

No

@raveeram-db raveeram-db force-pushed the AddStatsCollectionDuringWrite branch from 25f689b to b580d41 Compare August 5, 2024 06:07
@raveeram-db
Collaborator Author

Will fix the tests.

@raveeram-db raveeram-db force-pushed the AddStatsCollectionDuringWrite branch from b580d41 to dd88aa2 Compare September 18, 2024 06:20
@raveeram-db raveeram-db force-pushed the AddStatsCollectionDuringWrite branch from a8e9652 to 2594130 Compare February 14, 2025 17:05
@raveeram-db raveeram-db force-pushed the AddStatsCollectionDuringWrite branch from 2594130 to 2c73a99 Compare February 24, 2025 07:34
@raveeram-db raveeram-db force-pushed the AddStatsCollectionDuringWrite branch from f558796 to 1f61a59 Compare February 24, 2025 07:56
Collaborator

@vkorukanti vkorukanti left a comment

Looks great!

Also, it would be good to add integration tests and make sure the written stats are readable/usable in the Kernel and Delta-Spark read tests.

* @param name the column name to append
* @return the new column
*/
public Column append(String name) {
Collaborator

Would subField or nestedField be a better name?

Collaborator Author

Thanks, makes sense. Updated to appendNestedField.

.collect(Collectors.toSet());

// For now, only support the first numIndexedCols columns
return TransactionStateRow.getLogicalSchema(engine, transactionState).fields().stream()
Collaborator

Does fields() traverse the nested struct fields as well?

Collaborator

Does DATA_SKIPPING_NUM_INDEXED_COLS refer to the number of columns at the leaf level or at the top level?

Collaborator Author

Just the top-level columns. Looking at the code, fields() does seem to traverse only top-level columns, but I'll double-check.
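
To make the two readings in this thread concrete, here is a minimal sketch in plain Java. It is not Kernel code: `SketchField`, `firstTopLevel`, and `firstLeaves` are hypothetical stand-ins for a schema field and its traversal, used only to show how "first N top-level columns" differs from "first N leaf columns".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for a schema field; not the Kernel StructField API.
final class SketchField {
  final String name;
  final List<SketchField> children; // empty for a leaf column

  SketchField(String name, SketchField... children) {
    this.name = name;
    this.children = Arrays.asList(children);
  }
}

final class IndexedColsSketch {
  // Reading A: take the first n top-level fields, whatever they contain.
  static List<String> firstTopLevel(List<SketchField> schema, int n) {
    return schema.stream().limit(n).map(f -> f.name).collect(Collectors.toList());
  }

  // Reading B: flatten to leaf paths in schema order, then take the first n.
  static List<String> firstLeaves(List<SketchField> schema, int n) {
    List<String> out = new ArrayList<>();
    for (SketchField f : schema) {
      collectLeaves(f, "", out);
    }
    return out.stream().limit(n).collect(Collectors.toList());
  }

  private static void collectLeaves(SketchField f, String prefix, List<String> out) {
    String path = prefix.isEmpty() ? f.name : prefix + "." + f.name;
    if (f.children.isEmpty()) {
      out.add(path); // leaf column: record its full dotted path
    } else {
      for (SketchField child : f.children) {
        collectLeaves(child, path, out);
      }
    }
  }

  public static void main(String[] args) {
    List<SketchField> schema = Arrays.asList(
        new SketchField("id"),
        new SketchField("address", new SketchField("city"), new SketchField("zip")),
        new SketchField("ts"));
    System.out.println(firstTopLevel(schema, 2)); // [id, address]
    System.out.println(firstLeaves(schema, 2));   // [id, address.city]
  }
}
```

With a limit of 2, reading A covers `id` and the whole `address` struct, while reading B stops after `address.city`, so the two readings can select different sets of stats columns for the same table property value.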

Comment on lines +37 to +38
public static final String DATA_SKIPPING_NUM_INDEXED_COLS = "delta.dataSkippingNumIndexedCols";
public static final int DEFAULT_DATA_SKIPPING_NUM_INDEXED_COLS = 32;
Collaborator

This should be defined in TableConfig.

return;
}
for (StructField field : schema.fields()) {
Column colPath = parentColPath.append(field.getName());
Collaborator

You could just maintain the path as a list, which would avoid adding the extra API to Column.

Collaborator Author

Yeah, I felt that would just be extra state to maintain and would trade off readability a bit; the API could also come in handy in other places. Let me know if you feel strongly and I can update it.
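
For reference, the list-based alternative suggested above could look roughly like the sketch below. It is plain Java rather than Kernel code: the schema is modeled as nested maps (an assumption made only to keep the example self-contained), and `PathListSketch`, `leafPaths`, and `walk` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PathListSketch {
  // Schema modeled as nested maps: a Map value is a struct, any other value is a leaf type name.
  static List<String> leafPaths(Map<String, Object> schema) {
    List<String> out = new ArrayList<>();
    walk(schema, new ArrayDeque<>(), out);
    return out;
  }

  @SuppressWarnings("unchecked")
  private static void walk(Map<String, Object> struct, Deque<String> path, List<String> out) {
    for (Map.Entry<String, Object> e : struct.entrySet()) {
      path.addLast(e.getKey()); // descend: extend the shared path
      if (e.getValue() instanceof Map) {
        walk((Map<String, Object>) e.getValue(), path, out);
      } else {
        out.add(String.join(".", path)); // leaf: snapshot the current path
      }
      path.removeLast(); // backtrack: restore the path for the next sibling
    }
  }

  public static void main(String[] args) {
    Map<String, Object> inner = new LinkedHashMap<>();
    inner.put("aca", "integer");
    Map<String, Object> nested = new LinkedHashMap<>();
    nested.put("aa", "string");
    nested.put("ac", inner);
    Map<String, Object> schema = new LinkedHashMap<>();
    schema.put("IntegerType", "integer");
    schema.put("NestedStruct", nested);
    System.out.println(leafPaths(schema));
    // [IntegerType, NestedStruct.aa, NestedStruct.ac.aca]
  }
}
```

The trade-off discussed in the thread shows up in the `addLast`/`removeLast` pair: the mutable path avoids allocating a new object per field, but every recursive call has to restore it correctly, whereas an immutable append on the column object keeps each call self-contained.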

generator.writeEndObject();
} else {
T value = values.get(colPath);
if (value != null) {
Collaborator

We may want to check that the Literal's type matches the column's data type, so that we don't write incorrect stats.
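
A minimal sketch of such a check, in plain Java and independent of the Kernel types: the type names, the `EXPECTED` table, and `checkStatValue` are all hypothetical and only illustrate failing fast when a stat value does not match the declared column type.

```java
import java.util.HashMap;
import java.util.Map;

final class StatTypeCheckSketch {
  // Hypothetical mapping from a column type name to the Java class its stat value must have.
  private static final Map<String, Class<?>> EXPECTED = new HashMap<>();

  static {
    EXPECTED.put("integer", Integer.class);
    EXPECTED.put("long", Long.class);
    EXPECTED.put("float", Float.class);
    EXPECTED.put("double", Double.class);
    EXPECTED.put("string", String.class);
  }

  // Throws if the value's runtime class does not match the declared column type.
  static void checkStatValue(String columnPath, String typeName, Object value) {
    if (value == null) {
      return; // no stat recorded for this column; nothing to validate
    }
    Class<?> expected = EXPECTED.get(typeName);
    if (expected == null) {
      throw new IllegalArgumentException(
          "Unsupported stats type for column " + columnPath + ": " + typeName);
    }
    if (!expected.isInstance(value)) {
      throw new IllegalArgumentException(
          "Stat for column " + columnPath + " has type " + value.getClass().getSimpleName()
              + " but the column is declared as " + typeName);
    }
  }

  public static void main(String[] args) {
    checkStatValue("a.b", "long", 1L); // passes silently
    try {
      checkStatValue("a.b", "long", 1); // Integer supplied for a long column
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Validating before serialization means a mismatched value is rejected at write time instead of producing stats that a reader might later misinterpret.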

} else if (type instanceof FloatType) {
generator.writeNumber(((Number) value).floatValue());
} else if (type instanceof DoubleType) {
generator.writeNumber(((Number) value).doubleValue());
Collaborator

Let's double-check that NaN and infinity are written correctly.
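
The concern here is that the JSON grammar has no literal for NaN or infinity, so these values need explicit handling. The sketch below uses only the Java standard library and shows one possible convention, writing non-finite values as quoted strings; `NonFiniteJsonSketch` and `toJsonToken` are hypothetical names, and which representation Delta readers actually expect is an assumption that still needs to be confirmed.

```java
final class NonFiniteJsonSketch {
  // Returns the JSON token for a double; non-finite values are quoted
  // because bare NaN/Infinity are not valid JSON numbers.
  static String toJsonToken(double v) {
    if (Double.isNaN(v) || Double.isInfinite(v)) {
      // Double.toString yields "NaN", "Infinity" or "-Infinity" for these values.
      return "\"" + Double.toString(v) + "\"";
    }
    return Double.toString(v);
  }

  // Same convention for float, using Float.toString to avoid widening artifacts.
  static String toJsonToken(float v) {
    if (Float.isNaN(v) || Float.isInfinite(v)) {
      return "\"" + Float.toString(v) + "\"";
    }
    return Float.toString(v);
  }

  public static void main(String[] args) {
    System.out.println(toJsonToken(0.1));                      // 0.1
    System.out.println(toJsonToken(Double.NaN));               // "NaN"
    System.out.println(toJsonToken(Double.NEGATIVE_INFINITY)); // "-Infinity"
    System.out.println(toJsonToken(Float.POSITIVE_INFINITY));  // "Infinity"
  }
}
```

Whatever convention is chosen, the read path has to parse the same representation back, which is why a round-trip test for these values is worth having.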

@@ -0,0 +1,66 @@
/*
Collaborator

This can be internal, right? I think there is already one in internal/JsonUtils.java.

}

@FunctionalInterface
public interface ToJson {
Collaborator

What is this interface for?

new Column("IntegerType") -> Literal.ofInt(1),
new Column("LongType") -> Literal.ofLong(1L),
new Column("FloatType") -> Literal.ofFloat(0.1f),
new Column("DoubleType") -> Literal.ofDouble(0.1),
Collaborator

Please add all the special double values: NaN and positive/negative infinity. Same for float.

new Column("TimestampNTZType") -> Literal.ofTimestampNtz(1L),
new Column("BinaryType") -> Literal.ofBinary("a".getBytes),
new Column(Array("NestedStruct", "aa")) -> Literal.ofString("a"),
new Column(Array("NestedStruct", "ac", "aca")) -> Literal.ofInt(1)).asJava
Collaborator

Please also cover null as a stats value, i.e. a null literal.
