Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve codebase readability and error messages by and consistently handle downcasting #3152

Closed
alamb opened this issue Aug 15, 2022 · 7 comments · Fixed by #4588
Closed
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@alamb
Copy link
Contributor

alamb commented Aug 15, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
It is a common code pattern in DataFusion to downcast an ArrayRef to a concrete arrow array type.

There are actually at least three patterns to do so in the datafusion codebase https://github.com/apache/arrow-datafusion/search?l=Rust&q=downcast

I think this leads to

  1. Poor user experience (as a DataFusion bug might panic and quit the process, which is impolite)
  2. less readable code
  3. less efficient feature development and review (as people have to figure out which of the three patterns to use and they can't just look at what convention is used in the rest of the codebase

Pattern 1: downcast_ref()

let array = array.as_any().downcast_ref::<Date32Array>().unwrap();

panics on error

Pattern 2: use as_XXX_array functions from arrow

let array = as_boolean_array(&array);

panics on error

Pattern 3: check type and return an error:

let array = array.as_any().downcast_ref::<Date32Array>().ok_or(
    DataFusionError::Execution("Cast Date32Array error".to_string()),
)?;

Returns an error to the user (a better behavior I think)

Describe the solution you'd like
I would like to use one consistent pattern throughout the codebase. Specifically I propose functions that do the error checking / error message generation:

/// Return `array` cast to `Int32Array` otherwise an error
pub fn as_int32_array(array: &dyn Array) -> Result<&Int32Array, DataFusionError> {
  array.as_any().downcast_ref::<Int32Array>().ok_or(
    DataFusionError::Execution(format!("Expected a Int32Array, got: ", array.data_type()),
  )
}

/// And similar functions for the remaining array types

And then change all dynamic casts in the codebase:

  1. If they are fallable, to use the new functions that make a nice error message
  2. If they are infallible / somewhere were passing out the result is hard, use arrow functions

Describe alternatives you've considered
Leave the status quo

Additional context
Inspired by @avantgardnerio 's suggestion https://github.com/apache/arrow-datafusion/pull/3110/files#r945847871 (but this is not the first time it has come up)

@alamb alamb added enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed labels Aug 15, 2022
@alamb
Copy link
Contributor Author

alamb commented Aug 15, 2022

cc @avantgardnerio and @jimexist and @Dandandan

@alamb alamb changed the title Improve codebase readability and consistently handle downcasting Improve codebase readability and error messages by and consistently handle downcasting Aug 15, 2022
@alamb
Copy link
Contributor Author

alamb commented Aug 15, 2022

I think this is a good first issue because it will be largely mechanical and will give potential contributors wide exposure to the codebase

@waitingkuo
Copy link
Contributor

I think this is a good first issue because it will be largely mechanical and will give potential contributors wide exposure to the codebase

this is attractive 👍 let me try this

@retikulum
Copy link
Contributor

Hi. While this issue is getting old, I might help but have some questions. . I will give as_int32_array as a generic example so you can think of those questions for all other types.

  • In which file we should implement as_int32_array function? I have implemented it in datafusion\common\src\lib.rs but I am not quite sure about the architecture.
  • Before opening pr, I want to be sure that this sample is implemented correctly. I added this function:
pub fn as_int32_array(array: &dyn Array) -> Result<&Int32Array> {
    array.as_any().downcast_ref::<Int32Array>().ok_or(
      DataFusionError::Execution(format!("Expected a Int32Array, got: {}", array.data_type()),
    ))
}

And change this:
https://github.com/apache/arrow-datafusion/blob/10e64dc013ba210ab1f6c2a3c02c66aef4a0e802/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L851-L854

To this:

let arg0 = as_int32_array(&args[0])?;

Is this the correct way? simplify_expressions::test_evaluator_udfs test runs successfully after implementation.

@alamb
Copy link
Contributor Author

alamb commented Oct 26, 2022

In which file we should implement as_int32_array function? I have implemented it in datafusion\common\src\lib.rs but I am not quite sure about the architecture.

I think that is reasonable. Or perhaps something like datafusion\common\src\cast.rs ?

@alamb
Copy link
Contributor Author

alamb commented Oct 26, 2022

Before opening pr, I want to be sure that this sample is implemented correctly. I added this function:

Thanks @retikulum -- that looks good to me

What I would suggest is doing this in a series of smaller PRs -- perhaps add a few functions like as_int32_array, as_uint32_array, etc and consolidate some of the uses.

That way we can then get feedback from the wider community before you spend a large amount of your time changing the entire codebase

@retikulum
Copy link
Contributor

In which file we should implement as_int32_array function? I have implemented it in datafusion\common\src\lib.rs but I am not quite sure about the architecture.

I think that is reasonable. Or perhaps something like datafusion\common\src\cast.rs ?

datafusion\common\src\cast.rs is my first idea but I just don't want to create a new file :) cast fits better.

Before opening pr, I want to be sure that this sample is implemented correctly. I added this function:

Thanks @retikulum -- that looks good to me

What I would suggest is doing this in a series of smaller PRs -- perhaps add a few functions like as_int32_array, as_uint32_array, etc and consolidate some of the uses.

That way we can then get feedback from the wider community before you spend a large amount of your time changing the entire codebase

This is exactly what I planned. Moreover, I will not get lost in the ocean. Thanks for feedback @alamb

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants