Add DataFrame fill_null #14769

Open
wants to merge 9 commits into
base: main

@kosiew kosiew commented Feb 19, 2025

Which issue does this PR close?

Rationale for this change

The fill_null operation is a common requirement in data processing frameworks like PySpark, where users need to replace null values across multiple columns efficiently. Adding a fill_null function to DataFusion and datafusion-python provides a convenient way to perform this operation without requiring complex expressions such as coalesce or manual conditional statements.

This change improves usability and aligns DataFusion's feature set more closely with other popular data processing frameworks.

What changes are included in this PR?

Introduced a new fill_null function in DataFrame that replaces null values in the selected columns, or in all columns if none are specified.

Ensured type safety by only allowing replacements that can be cast to the respective column's type.

Implemented a fallback mechanism where columns remain unchanged if the provided value cannot be cast to their type.

Added helper function find_columns to validate column existence.

Included comprehensive test cases for fill_null, verifying behavior for both single-column and all-column replacements.
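
The casting fallback described above can be sketched in plain Rust. This is only an illustration under stated assumptions: `Value`, `ColumnType`, `Column`, `try_cast`, and `fill_null_sketch` are hypothetical stand-ins, not DataFusion types and not the code in this PR.

```rust
// Stand-in types for illustration only; these are not DataFusion's types.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Text(String),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Int,
    Text,
}

#[derive(Debug, Clone, PartialEq)]
struct Column {
    name: String,
    ty: ColumnType,
    // `None` models a null cell.
    cells: Vec<Option<Value>>,
}

/// Try to cast `value` to `ty`; `None` means the cast is not possible.
fn try_cast(value: &Value, ty: ColumnType) -> Option<Value> {
    match (value, ty) {
        (Value::Int(i), ColumnType::Int) => Some(Value::Int(*i)),
        (Value::Int(i), ColumnType::Text) => Some(Value::Text(i.to_string())),
        (Value::Text(s), ColumnType::Text) => Some(Value::Text(s.clone())),
        (Value::Text(s), ColumnType::Int) => s.parse::<i64>().ok().map(Value::Int),
    }
}

/// Replace nulls in every column the value can be cast to;
/// columns where the cast fails are returned unchanged.
fn fill_null_sketch(columns: &[Column], value: &Value) -> Vec<Column> {
    columns
        .iter()
        .map(|col| match try_cast(value, col.ty) {
            Some(fill) => Column {
                name: col.name.clone(),
                ty: col.ty,
                cells: col
                    .cells
                    .iter()
                    .map(|cell| cell.clone().or_else(|| Some(fill.clone())))
                    .collect(),
            },
            // Fallback: the value cannot be cast, so keep the column as-is.
            None => col.clone(),
        })
        .collect()
}
```

In this sketch, a text fill value such as "n/a" leaves an integer column untouched, while an integer fill value such as 0 is cast to "0" before filling a text column.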

Are these changes tested?

Yes, the following test cases have been added:

test_fill_null: Verifies the ability to replace null values in specific columns with the provided values.

test_fill_null_all_columns: Ensures the function works correctly when no column list is provided, replacing nulls in all columns where casting is possible.

Tests confirm that invalid casts do not modify the original column values.

Are there any user-facing changes?

Yes, this PR introduces a new fill_null method for DataFrames, allowing users to efficiently replace null values in their datasets. This enhances usability and streamlines null handling within DataFusion.

There are no breaking changes to existing APIs.

@github-actions github-actions bot added the core Core DataFusion crate label Feb 19, 2025
    .iter()
    .map(|name| {
        schema.field_with_name(None, name).cloned().map_err(|_| {
            DataFusionError::Plan(format!("Column '{}' not found", name))

Suggested change
-            DataFusionError::Plan(format!("Column '{}' not found", name))
+            plan_datafusion_err!("Column '{}' not found", name)
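
To illustrate why the macro form reads more compactly, here is a minimal std-only sketch. `SketchError`, `plan_err_sketch!`, and `find_columns_sketch` are hypothetical names introduced for this example; they only imitate the pattern of wrapping a formatted message in a planning error and are not DataFusion's implementation.

```rust
// Stand-in error type with a planning variant; not DataFusion's actual type.
#[derive(Debug, PartialEq)]
enum SketchError {
    Plan(String),
}

// Hypothetical helper macro: formats its arguments and wraps the
// resulting message in `SketchError::Plan`.
macro_rules! plan_err_sketch {
    ($($arg:tt)*) => {
        SketchError::Plan(format!($($arg)*))
    };
}

/// Return the requested names if they all exist in `schema`;
/// otherwise return an error for the first missing name.
fn find_columns_sketch(schema: &[&str], names: &[&str]) -> Result<Vec<String>, SketchError> {
    names
        .iter()
        .map(|name| {
            if schema.contains(name) {
                Ok((*name).to_string())
            } else {
                Err(plan_err_sketch!("Column '{}' not found", name))
            }
        })
        .collect()
}
```

Collecting an iterator of `Result` values into `Result<Vec<_>, _>` stops at the first error, so the caller sees the first missing column name.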

pub fn fill_null(
    &self,
    value: ScalarValue,
    columns: Option<Vec<String>>,

Should we wrap this in an Option here? An empty Vec already serves as None.

@kosiew kosiew Feb 25, 2025

Fixed.
Thanks for spotting this.

) -> Result<DataFrame> {
    let cols = match columns {
        Some(names) => self.find_columns(&names)?,
        None => self

If no columns are set, should this just be a no-op?

@kosiew kosiew Feb 24, 2025

I was following the pandas convention, where not restricting the columns means nulls are filled in all columns:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html

I could change this to a no-op if that is the preferred convention.
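
For reference, the "empty selection means all columns" convention discussed here can be sketched as follows. `resolve_targets` and `from_optional` are hypothetical helper names used only for this illustration, and the schema is modelled as a plain slice of column names rather than a DataFusion schema.

```rust
/// Resolve the columns to fill: an empty selection means
/// "every column in the schema" rather than "do nothing".
fn resolve_targets(schema: &[&str], selected: &[&str]) -> Vec<String> {
    let source = if selected.is_empty() { schema } else { selected };
    source.iter().map(|name| name.to_string()).collect()
}

/// Under this convention an `Option` wrapper carries no extra information:
/// `None` and an empty `Vec` both collapse to the same empty selection.
fn from_optional(columns: Option<Vec<String>>) -> Vec<String> {
    columns.unwrap_or_default()
}
```

Switching to the no-op convention would only change the empty branch of `resolve_targets` to return an empty list.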

@comphead comphead left a comment

Thanks @kosiew, this looks great. Some minor comments.

I'm surprised, though, that we don't have documentation for the DataFrame API.

@comphead
It's documented at https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html, not in the code itself.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Feb 24, 2025
@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Feb 25, 2025
@kosiew
kosiew commented Feb 25, 2025
