Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix querying and defining table / view names with period #4530

Merged
merged 7 commits into from
Dec 7, 2022

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Dec 6, 2022

Which issue does this PR close?

Closes #4513
Closes #4531

This is really not that large of a PR

This looks like a massive change but a substantial amount of it is generated protobuf code

Rationale for this change

See more details on #4513

TLDR is that this SQL:

select * from "foo.bar";
Plan("failed to resolve schema: foo")

Should not be treating "foo" as a schema name

The issue is that the SQL planner was flattening multi part identifiers like ['"foo"', 'bar'] into Strings like '"foo".bar' and then the strings were being parsed by https://github.com/apache/arrow-datafusion/blob/740a4fa2c6ba4b85875a433bb86e5b00435a5969/datafusion/common/src/table_reference.rs#L97 which does not parse them properly (instead it simply splits them on . rather than handling " properly)

What changes are included in this PR?

Changes:

  1. Add OwnedTableReference, and change the planner and relevant DML plan nodes to use that rather than a String
  2. Add docstrings in impl<'a> From<&'a str> for TableReference<'a> { to document its limitations.
  3. Added protobuf serialization support
  4. Add new .slt tests for same

The correct way to parse strings to references is to run through the SQL parser and our planner, but this API seems to be a fairly large part of the public API. I filed a follow on ticket to track properly supporting name resolution via the API: #4532

Are these changes tested?

Yes (both extensively with existing tests as well as new tests)

Are there any user-facing changes?

Better support for names with periods

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Dec 6, 2022
@@ -206,7 +206,7 @@ impl DFSchema {
(Some(qq), None) => {
// the original field may now be aliased with a name that matches the
// original qualified name
let table_ref: TableReference = field.name().as_str().into();
let table_ref = TableReference::parse_str(field.name().as_str());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this is correct as it potentially splits on "." -- but I chose to keep the existing behavior

@@ -1058,7 +1060,7 @@ pub struct CreateCatalogSchema {
#[derive(Clone)]
pub struct DropTable {
/// The table name
pub name: String,
pub name: OwnedTableReference,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the key change -- previously in DataFusion relation (table, view, etc) were stored as String which was then "parsed" (really split on .). This PR changes the code so the original parsed multi-part TableReference is used instead

@@ -915,6 +917,29 @@ message StringifiedPlan {
string plan = 2;
}

message BareTableReference {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the corresponding serialization support for this feature.

@@ -542,3 +542,11 @@ pub(crate) fn normalize_ident(id: &Ident) -> String {
None => id.value.to_ascii_lowercase(),
}
}

// Normalize an owned identifier to a lowercase string unless the identifier is quoted.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove the non-owned variant in a follow on PR -- but I was trying to minimize the size of this one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -262,7 +264,7 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
};

Ok(LogicalPlan::CreateMemoryTable(CreateMemoryTable {
name: name.to_string(),
name: object_name_to_table_reference(name)?,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than encoding the name in a string, it is now encoded as a reference

0 => Err(ParserError("Missing table name.".to_string()).into()),
1 => object_name_to_table_reference(names.pop().unwrap()),
_ => {
Err(ParserError("Multiple objects not supported".to_string())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously if multiple objects were specified, the second was ignored: #4531

@@ -592,6 +602,9 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {

let schema = self.build_schema(columns)?;

// External tables do not support schemas at the moment, so the name is just a table name
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the parser returns a String not an ObjectIdentifier, so I kept the behavior the same

}))
}
}
// only support "schema.table" type identifiers here
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this has the same behavior as master but I think the logic is clearer

// First, use sqlparser to get table name and insert values
let table_name;
let table_reference;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the test harness to use the new name handling as well

@alamb alamb added the api change Changes the API exposed to users of the crate label Dec 6, 2022
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed insert.slt and appended to it.

Writing tests using sqllogictest is life changing . Thank you @mvanschellebeeck and @xudong963 for driving this work forward!

@alamb alamb marked this pull request as ready for review December 6, 2022 20:22
@alamb alamb requested review from xudong963 and andygrove December 6, 2022 20:25
@alamb
Copy link
Contributor Author

alamb commented Dec 6, 2022

I am sorry to request a large PR code review @xudong963 and @andygrove but it is mostly tests and generated code.

cc @returnString in case you are around as you wrote the initial TableReference implementation

@alamb
Copy link
Contributor Author

alamb commented Dec 6, 2022

cc @NGA-TRAN in case you are interested

Copy link
Contributor

@NGA-TRAN NGA-TRAN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR does what described

##########

# Verify that creating tables with periods in their name works
# (note "foo.bar" is the table name, NOT table "bar" in schema "foo")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the period in a table name only works if the name is in double quotes, right? Without double quotes. foo will be treated as a schema?
Is this "foo.bar.baz" a valid table name? can we add test for it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without double quotes. foo foo will be treated as a schema?

Yes

Is this "foo.bar.baz" a valid table name? can we add test for it?

It is a valid name. Added tests in c03cf2b -- thank you for the suggestion

@xudong963
Copy link
Member

I'll review it carefully in the evening.

# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed this file to ddl.slt and added some more tests, but the github UI doesn't seem to recognize that

Copy link
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ticket makes sense to me, thanks @alamb

FYI, databend processes the case in parser: catalog, database, and table are separated in parser

// Table name
Table {
    span: &'a [Token<'a>],
    catalog: Option<Identifier<'a>>,
    database: Option<Identifier<'a>>,
    table: Identifier<'a>,
    alias: Option<TableAlias<'a>>,
    travel_point: Option<TimeTravelPoint<'a>>,
},

/// Parse one two three idents seperated by a period, fulfilling from the right.
///
/// Example: `db.table.column`
#[allow(clippy::needless_lifetimes)]
pub fn peroid_separated_idents_1_to_3<'a>(
    i: Input<'a>,
) -> IResult<'a, (Option<Identifier>, Option<Identifier>, Identifier)> {
    map(
        rule! {
            #ident ~ ("." ~ #ident ~ ("." ~ #ident)?)?
        },
        |res| match res {
            (ident2, None) => (None, None, ident2),
            (ident1, Some((_, ident2, None))) => (None, Some(ident1), ident2),
            (ident0, Some((_, ident1, Some((_, ident2))))) => (Some(ident0), Some(ident1), ident2),
        },
    )(i)
}

@alamb
Copy link
Contributor Author

alamb commented Dec 7, 2022

FYI, databend processes the case in parser: catalog, database, and table are separated in parser

Makes sense @xudong963 -- this is basically what sqlparser-rs does as well (parses into Vec<Ident>). Thank you for the pointer.

@alamb
Copy link
Contributor Author

alamb commented Dec 7, 2022

I am merging this one in as I have a few PRs backed up behind it. Thanks again @NGA-TRAN and @xudong963 for your quick review.

@alamb alamb merged commit fbadebb into apache:master Dec 7, 2022
@ursabot
Copy link

ursabot commented Dec 7, 2022

Benchmark runs are scheduled for baseline = 4c1f60c and contender = fbadebb. fbadebb is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
4 participants