Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix DataFrame::with_column to handle creating column names with a period #3700

Merged
merged 3 commits into from
Oct 7, 2022

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Oct 4, 2022

Which issue does this PR close?

fixes #3682

Rationale for this change

What changes are included in this PR?

  1. Changes: do not use col(), which tries to parse strings as identifiers (e.g. splits on .)

Are there any user-facing changes?

One fewer bug

@github-actions github-actions bot added the core Core DataFusion crate label Oct 4, 2022
@@ -1378,6 +1383,43 @@ mod tests {
Ok(())
}

#[tokio::test]
async fn with_column_name() -> Result<()> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to @hengfeiyang for this reproducer

// try and create a column with a '.' in it
.with_column("f.c2", lit("hello"))
.unwrap();
// Note trying to select causes an error (todo file a separate ticket)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it turns it something still isn't right (select doesn't work) -- I plan to file an issue (or maybe someone else will have the time to debug it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hengfeiyang
Copy link
Contributor

hengfeiyang commented Oct 7, 2022

@alamb i tested with #3733 , it looks with_column still can't work.

i tested use config:

Cargo.toml

datafusion = { git = "https://github.com/apache/arrow-datafusion" }

with the example code:

use std::sync::Arc;

use datafusion::arrow::array::Int32Array;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::error::Result;
use datafusion::from_slice::FromSlice;
use datafusion::prelude::{lit, SessionContext};

/// This example demonstrates how to use the DataFrame API against in-memory data.
#[tokio::main]
async fn main() -> Result<()> {
    // define a schema.
    let schema = Arc::new(Schema::new(vec![Field::new("f.c", DataType::Int32, false)]));

    // define data.
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from_slice([1, 10, 10, 100]))],
    )?;

    // declare a new context. In spark API, this corresponds to a new spark SQLsession
    let ctx = SessionContext::new();

    // declare a table in memory. In spark API, this corresponds to createDataFrame(...).
    let provider = MemTable::try_new(schema.clone(), vec![vec![batch]])?;
    ctx.register_table("t", Arc::new(provider))?;
    let df = ctx.table("t")?;
    let df = df.with_column("new_field", lit("hi"))?;

    // print the results
    df.show().await?;

    Ok(())
}

Result:

Error: SchemaError(FieldNotFound { qualifier: Some("f"), name: "c", valid_fields: Some(["t.f.c"]) })

@alamb
Copy link
Contributor Author

alamb commented Oct 7, 2022

col("f.c").eq(lit(10));

You can't use col with a qualified name (with a '.')

Does it work if you use

 Expr::Column(Column {
                        relation: None,
                        name: "f.c".into(),
                    })

instead of col("f.c:)

?

If so we should probably offer an alternate to col that doesn't try and parse the relation that is not as wordy 🤔

@hengfeiyang
Copy link
Contributor

Sorry for that, i updated my test code above, I use with_column, this case still can't work.

@alamb
Copy link
Contributor Author

alamb commented Oct 7, 2022

Sorry for that, i updated my test code above, I use with_column, this case still can't work.

I will merge this PR and then try your example

@alamb alamb merged commit 1528961 into apache:master Oct 7, 2022
@alamb alamb deleted the alamb/fix_with_column branch October 7, 2022 19:54
@ursabot
Copy link

ursabot commented Oct 7, 2022

Benchmark runs are scheduled for baseline = 1d634e5 and contender = 1528961. 1528961 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@alamb
Copy link
Contributor Author

alamb commented Oct 7, 2022

@hengfeiyang I have merged this PR and now when I run your example from #3700 (comment) it produces this output:

+-----+-----------+
| f.c | new_field |
+-----+-----------+
| 1   | hi        |
| 10  | hi        |
| 10  | hi        |
| 100 | hi        |
+-----+-----------+

@hengfeiyang
Copy link
Contributor

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Field names containing periods such as f.c cannot work
3 participants