Add DataFrame fill_null #14769
datafusion/core/src/dataframe/mod.rs
Outdated
.iter()
.map(|name| {
    schema.field_with_name(None, name).cloned().map_err(|_| {
        DataFusionError::Plan(format!("Column '{}' not found", name))
Suggested change:
- DataFusionError::Plan(format!("Column '{}' not found", name))
+ plan_datafusion_err!("Column '{}' not found", name)
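The suggestion above swaps a hand-built error for a macro that formats the message and wraps it in one step. A minimal sketch of that pattern, using only the standard library: `SketchError`, `plan_err_sketch!`, and `lookup` are hypothetical names invented for illustration and are not DataFusion's actual definitions.

```rust
// Hypothetical stand-in for an error type with a planning variant.
#[derive(Debug, PartialEq)]
enum SketchError {
    Plan(String),
}

// Hypothetical macro: forwards its arguments to format! and wraps the
// resulting String in the Plan variant.
macro_rules! plan_err_sketch {
    ($($arg:tt)*) => {
        SketchError::Plan(format!($($arg)*))
    };
}

// Looks up a column name and reports a planning error if it is missing.
fn lookup<'a>(known: &[&'a str], name: &str) -> Result<&'a str, SketchError> {
    known
        .iter()
        .copied()
        .find(|k| *k == name)
        .ok_or_else(|| plan_err_sketch!("Column '{}' not found", name))
}
```

In this sketch the macro call and the explicit `SketchError::Plan(format!(...))` form build the same value, so the change is purely about brevity at the call site.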
datafusion/core/src/dataframe/mod.rs
Outdated
pub fn fill_null(
    &self,
    value: ScalarValue,
    columns: Option<Vec<String>>,
Should we wrap this in an Option here? An empty Vec already serves as None.
Fixed.
Thanks for spotting this.
datafusion/core/src/dataframe/mod.rs
Outdated
) -> Result<DataFrame> {
    let cols = match columns {
        Some(names) => self.find_columns(&names)?,
        None => self
If no columns are set, should we just no-op?
I was following the pandas convention, where not limiting the columns means fill_null applies to all columns:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
I could change this to a no-op instead if that is the preferred convention.
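The two conventions discussed in this thread (an empty column list meaning "all columns" versus a no-op) differ only in how the column list is resolved. A minimal sketch of the "all columns" reading follows. `resolve_columns` is a hypothetical helper written for illustration, with the schema modeled as a plain slice of names rather than DataFusion's schema type.

```rust
// Hypothetical helper: resolves which columns a fill operation should touch.
// Assumption: an empty `requested` slice means "every column in the schema".
fn resolve_columns(schema: &[&str], requested: &[&str]) -> Result<Vec<String>, String> {
    if requested.is_empty() {
        return Ok(schema.iter().map(|s| s.to_string()).collect());
    }
    requested
        .iter()
        .map(|name| {
            // Validate each requested name against the schema.
            if schema.iter().any(|col| col == name) {
                Ok(name.to_string())
            } else {
                Err(format!("Column '{}' not found", name))
            }
        })
        .collect()
}
```

Under the no-op convention, the early return would instead yield an empty list, and the caller would leave the DataFrame untouched.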
Thanks @kosiew, looks great. Some minor comments.
I'm surprised, though, that we don't have documentation for the DataFrame API.
It's documented at https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html, not in the code itself.
I added a fill_null usage example in dataframe/mod.rs so that rustdoc updates https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html
Which issue does this PR close?
Rationale for this change
The fill_null operation is a common requirement in data processing frameworks like PySpark, where users need to replace null values across multiple columns efficiently. Adding a fill_null function to DataFusion and datafusion-python provides a convenient way to perform this operation without requiring complex expressions such as coalesce or manual conditional statements.
This change improves usability and aligns DataFusion's feature set more closely with other popular data processing frameworks.
What changes are included in this PR?
Introduced a new fill_null function in DataFrame that replaces null values in the selected columns, or in all columns if none are specified.
Ensured type safety by only allowing replacements that can be cast to the respective column's type.
Implemented a fallback mechanism where columns remain unchanged if the provided value cannot be cast to their type.
Added helper function find_columns to validate column existence.
Included comprehensive test cases for fill_null, verifying behavior for both single-column and all-column replacements.
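The cast-or-leave-unchanged behavior listed above can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the PR's implementation: `Column`, `Fill`, and `fill_column` are hypothetical types and functions standing in for Arrow arrays and ScalarValue, and "casting" is modeled as string parsing and integer formatting.

```rust
// Simplified, hypothetical column model (not Arrow arrays).
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Int(Vec<Option<i64>>),
    Text(Vec<Option<String>>),
}

// Hypothetical fill value, standing in for a scalar.
#[derive(Debug, Clone)]
enum Fill {
    Int(i64),
    Text(String),
}

// Tries to cast the fill value to the column's type and replace nulls.
// If the cast is not possible, the column is returned unchanged.
fn fill_column(col: &Column, value: &Fill) -> Column {
    match col {
        Column::Int(vals) => {
            // Assumption: text casts to an integer only if it parses as one.
            let cast: Option<i64> = match value {
                Fill::Int(v) => Some(*v),
                Fill::Text(s) => s.parse().ok(),
            };
            match cast {
                Some(v) => Column::Int(vals.iter().map(|x| Some(x.unwrap_or(v))).collect()),
                None => col.clone(),
            }
        }
        Column::Text(vals) => {
            // Assumption: any fill value can be rendered as text.
            let v = match value {
                Fill::Int(i) => i.to_string(),
                Fill::Text(s) => s.clone(),
            };
            Column::Text(
                vals.iter()
                    .map(|x| Some(x.clone().unwrap_or_else(|| v.clone())))
                    .collect(),
            )
        }
    }
}
```

In this model, filling an integer column with the text "n/a" leaves it untouched, while filling a text column with the same value replaces its nulls, which mirrors the fallback the description lists.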
Are these changes tested?
Yes, the following test cases have been added:
test_fill_null: Verifies the ability to replace null values in specific columns with the provided values.
test_fill_null_all_columns: Ensures the function works correctly when no column list is provided, replacing nulls in all columns where casting is possible.
Tests confirm that invalid casts do not modify the original column values.
Are there any user-facing changes?
Yes, this PR introduces a new fill_null method for DataFrames, allowing users to efficiently replace null values in their datasets. This enhances usability and streamlines null handling within DataFusion.
There are no breaking changes to existing APIs.