feat: [sc-58279] [core] add tiledb_array_schema_get_enumeration API #5359

Merged: rroelke merged 24 commits into dev from rr/sc-58279-tiledb-array-schema-get-enumeration on Nov 1, 2024

@rroelke rroelke commented Oct 23, 2024

Story details: https://app.shortcut.com/tiledb-inc/story/58279

In tiledb-rs we have "property-based testing" strategies which generate arbitrary tiledb schemata. We would like to extend these strategies to generate arbitrary enumerations to put in those schemata.

As a step towards this we must add an API tiledb_array_schema_get_enumeration so that we can load the values in an enumeration without having a pointer to the open array in the same function.

It is not required for this API to actually load the enumeration, just to return its contents if it is already loaded. There should be a graceful signal to distinguish between an enumeration not being found and an enumeration not being loaded.
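
The distinction asked for above can be sketched with a three-way result. This is a minimal, self-contained illustration and not the TileDB implementation: `ToySchema`, `EnumerationLookup`, and `get_enumeration_if_loaded` are hypothetical stand-ins, and enumeration values are modeled as plain strings.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in types for illustration; not TileDB's real classes.
enum class EnumerationLookup { Loaded, NotLoaded, NotFound };

struct ToySchema {
  // Enumeration name -> values. std::nullopt means the schema knows the name
  // but the values have not been read from storage yet.
  std::map<std::string, std::optional<std::vector<std::string>>> enumerations;
};

// Reports the lookup outcome and, when the enumeration is loaded, copies its
// values into `out`. Performs no I/O.
EnumerationLookup get_enumeration_if_loaded(
    const ToySchema& schema,
    const std::string& name,
    std::vector<std::string>& out) {
  auto it = schema.enumerations.find(name);
  if (it == schema.enumerations.end()) {
    return EnumerationLookup::NotFound;
  }
  if (!it->second.has_value()) {
    return EnumerationLookup::NotLoaded;
  }
  out = *it->second;
  return EnumerationLookup::Loaded;
}
```

A caller can then treat `NotLoaded` as a recoverable condition (for example, by opening the array and loading the enumeration) while treating `NotFound` as a genuine error.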

This pull request goes all the way and adds functions tiledb_array_schema_get_enumeration_from_name and tiledb_array_schema_get_enumeration_from_attribute_name which load the contents of an enumeration and return a handle to the user.

See this comment for further discussion.


TYPE: FEATURE | C_API | CPP_API
DESC: add tiledb_array_schema_get_enumeration

@rroelke rroelke requested review from davisp and shaunrd0 October 23, 2024 20:46

@shaunrd0 shaunrd0 left a comment

Just left some small comments, looks great so far 👍

Resolved thread: tiledb/sm/cpp_api/array.h
Resolved thread (outdated): test/src/unit-cppapi-enumerations.cc

@davisp davisp left a comment

+1 from me. @ypatia already suggested the REST test which is all I had thought of.

Resolved thread (outdated): tiledb/sm/cpp_api/array_schema_experimental.h
@rroelke rroelke requested review from ypatia, davisp and shaunrd0 October 25, 2024 01:42

rroelke commented Oct 25, 2024

> +1 from me. @ypatia already suggested the REST test which is all I had thought of.

@davisp I mis-clicked when I requested a re-review, but feel free :)

All the changes are strictly in response to the review comments.


@shaunrd0 shaunrd0 left a comment

Nice! 👍

…r attribute name, and also use context to do the I/O

rroelke commented Oct 29, 2024

After using this branch in my tiledb-rs branch I was not very satisfied, so instead of debugging the REST failure today I instead spent time making the schema lookup actually load the enumeration (instead of returning data only if the enumeration was already loaded).

While doing so I also noticed that Array::get_enumeration uses the attribute name as the lookup key, not the enumeration name. This does not match what I had done here.

So the most recent commit does the following:

  • Removes tiledb_array_schema_get_enumeration_if_loaded
  • Adds tiledb_array_schema_get_enumeration_from_name
  • Adds tiledb_array_schema_get_enumeration_from_attribute_name
  • Adds the internal method ArraySchema::load_enumeration which does the I/O and updates the ArraySchema internal pointer, and calls this method from the C API handler

I reckon this merits a re-review since it's a pretty big change.

@rroelke rroelke requested a review from shaunrd0 October 29, 2024 02:51
@@ -36,6 +36,7 @@ commence(object_library capi_array_schema_stub)
this_target_sources(${SOURCES})
this_target_link_libraries(export)
this_target_object_libraries(array_schema)
this_target_object_libraries(array)

This is because tiledb/sm/array_schema/array_schema.cc calls upon tiledb/sm/array/array_directory.cc. So now the unit test needs to link with that.


rroelke commented Oct 30, 2024

@shaunrd0 @davisp @ypatia

I think this is all set: CI passes and I don't intend to make any more changes. I'd recommend re-reviewing as this has drifted a bit since it was approved.

Since the last review I determined I was unsatisfied with the "return the enumeration only if loaded" approach and have updated it to load the enumeration as part of the API request.

The code was easy enough but the troubles come with avoiding circular dependencies. array/array_directory.cc is used to do the I/O, but the array "object library" depends on array_schema - so we can't have array_schema circularly depend on array. The solution to this is to have the C API handler call a separate function to load the enumeration.
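
A rough sketch of that layering, under stated assumptions: `SchemaModel`, `Reader`, and the body of `load_enumeration_into_schema` below are hypothetical toy versions, with the reader callback standing in for the directory I/O. The point is only the direction of the dependency: the lower layer never references the I/O layer, and the free function that does the loading sits above both.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Lower layer (plays the role of "array_schema"): stores data, knows no I/O.
struct SchemaModel {
  std::map<std::string, std::string> enumeration_paths;    // name -> storage path
  std::map<std::string, std::vector<std::string>> loaded;  // name -> values

  void store_enumeration(
      const std::string& name, std::vector<std::string> values) {
    loaded[name] = std::move(values);
  }
};

// Higher layer (plays the role of "array"): performs I/O, depends on the schema.
using Reader = std::function<std::vector<std::string>(const std::string&)>;

// Free function called by the API handler. It sits above both layers, so the
// schema never has to reference the I/O code.
void load_enumeration_into_schema(
    const Reader& read, const std::string& enmr_name, SchemaModel& schema) {
  auto path = schema.enumeration_paths.find(enmr_name);
  if (path == schema.enumeration_paths.end()) {
    throw std::invalid_argument("unknown enumeration: " + enmr_name);
  }
  // Only read from storage when the values are not already cached.
  if (schema.loaded.count(enmr_name) == 0) {
    schema.store_enumeration(enmr_name, read(path->second));
  }
}
```

Because the schema type only exposes `store_enumeration`, it can be compiled and unit-tested without linking the I/O code, which is the property the object-library split is meant to preserve.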

The other change since the last review is tweaking the API to be tiledb_array_schema_get_enumeration_from_name and tiledb_array_schema_get_enumeration_from_attribute_name. I noticed that tiledb_array_get_enumeration used the attribute name, not the enumeration name, as the lookup key, and I wanted to disambiguate for these versions. The implementation of tiledb_array_schema_get_enumeration_from_attribute_name calls tiledb_attribute_get_enumeration which required a few more changes to some "object libraries" but I thought this would make sense so that any errors would be presented the same way.
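
To make the two lookup keys concrete, here is a small self-contained model. It is only a sketch: `AttrSchema` and both functions are hypothetical stand-ins rather than the real C API, and enumeration values are modeled as strings. The attribute-name variant resolves the attribute first and then reuses the by-name lookup, so a missing enumeration is reported the same way on both paths.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical toy model; not TileDB's real types.
struct AttrSchema {
  // Attribute name -> name of the enumeration attached to it, if any.
  std::map<std::string, std::optional<std::string>> attribute_enumeration;
  // Enumeration name -> values.
  std::map<std::string, std::vector<std::string>> enumerations;
};

// Lookup keyed by the enumeration's own name.
const std::vector<std::string>* enumeration_from_name(
    const AttrSchema& schema, const std::string& enmr_name) {
  auto it = schema.enumerations.find(enmr_name);
  return it == schema.enumerations.end() ? nullptr : &it->second;
}

// Lookup keyed by attribute name: resolve the attribute, then delegate.
const std::vector<std::string>* enumeration_from_attribute_name(
    const AttrSchema& schema, const std::string& attr_name) {
  auto it = schema.attribute_enumeration.find(attr_name);
  if (it == schema.attribute_enumeration.end() || !it->second.has_value()) {
    return nullptr;  // unknown attribute, or attribute without an enumeration
  }
  return enumeration_from_name(schema, *it->second);
}
```

With an attribute `color` that uses an enumeration named `color_names`, passing `"color"` to the by-name function finds nothing, which is the ambiguity the two distinct function names are meant to remove.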

The good news is that my downstream tiledb-rs is very happy with this code and does exactly what I want it to. Hooray!

@ihnorton ihnorton requested review from davisp and removed request for davisp October 31, 2024 18:35

@shaunrd0 shaunrd0 left a comment

Very nice! Most comments were nits on updating docs since things have changed a bit since the first pass. I don't think there should be too many changes from my comments but if you want me to take another look let me know. Just as a heads up we should probably update the PR description before merging.

auto tracker = ctx.resources().ephemeral_memory_tracker();
// Pass an empty list of enumeration names. REST will use timestamps to
// load all enumerations on all schemas for the array within that range.
auto ret = rest_client.post_enumerations_from_rest(

It would have been great to avoid the second REST request here, but this is probably the best we can do without changing REST. I'm OK with this for the time being; #5181 is a WIP, and once that's finished this can definitely happen in a single request. Also, the rest.load_enumerations_on_array_open option defaults to false, so until that work is finished we won't make this second request unless the client asks for it.


Yeah, I didn't love this implementation, but... it gets the job done without having any obvious problems


@shaunrd0 please add a TODO in your #5181 or its ticket to address this. I don't know of any other place where we do 2 REST requests so we should really treat that as temporary and fix it asap.

Resolved thread (outdated): tiledb/sm/rest/rest_client_remote.h
@@ -717,6 +717,32 @@ class Array {
return ArraySchema(ctx, schema);
}

/**
* Loads the array schema from an array.
* Options to load additional features are read from the optionally-provided

optionally-provided

}

std::string enumeration_name(enumeration_name_inner->view());
return api_entry_with_context<

Could we do return tiledb_array_schema_get_enumeration_from_name(ctx, array_schema, enumeration_name.c_str(), enumeration); to shorten this a bit?


Surprisingly I think the answer is "no" without making changes elsewhere.

tiledb_string_t has no c_str(), just view()

and I was initially surprised to see that string_view also has no c_str(), though in retrospect it makes sense.
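
The reason is that a `std::string_view` is a pointer plus a length, and the viewed characters are not guaranteed to be followed by a NUL terminator. A minimal sketch of the usual bridge to a C-string API follows; `c_api_name_length` is a hypothetical stand-in for any function that takes `const char*`.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Stand-in for any C API that expects a NUL-terminated string.
std::size_t c_api_name_length(const char* name) {
  return std::strlen(name);
}

// Bridge from a view to such an API: copy into an owning std::string first,
// which guarantees a terminator right after the viewed characters.
std::size_t name_length_via_copy(std::string_view view) {
  const std::string owned(view);
  return c_api_name_length(owned.c_str());
}
```

If a view covers only the first 11 characters of a longer buffer, passing `view.data()` directly to the C function would read past the end of the view, whereas the copy sees exactly those 11 characters.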

* @param enmr_name the requested enumeration
* @param schema the target schema
*/
void load_enumeration_into_schema(

Could this maybe go into array_schema_operations.cc, or did you also run into problems there with dependencies? At a glance it seems like it could work


I believe it won't work there either. I am pretty new to cmake but my impression is that all the array_schema/*.cc files are compiled into the same "object library", hence putting this code in array_schema_operations.cc would also have the circular dependency between array and array_schema


namespace tiledb::test {

bool is_equivalent_enumeration(

Thanks for putting this here, I'm likely going to use it when you merge

@rroelke rroelke merged commit aad089c into dev Nov 1, 2024
64 checks passed
@rroelke rroelke deleted the rr/sc-58279-tiledb-array-schema-get-enumeration branch November 1, 2024 00:27

@ypatia ypatia left a comment

I was sick Thursday and this went in before I could review it. I am sorry for the delayed comments, but the ones around the new C-API body would need to be addressed.

auto array_schema = serialization::array_schema_deserialize(
serialization_type_, returned_data, memory_tracker_);

array_schema->set_array_uri(uri);

Some background on why this was needed? Some test failing?


IIRC this change came about when testing against the REST server. The REST server does not serialize the array URI (and I think it may also be different on the backend).

I will remove this line and run the unit tests to see what happened - probably will get to that later this week


Without this we get an error on the client side in tiledb::sm::load_enumeration_into_schema.

if (array_schema.array_uri().is_tiledb()) {
    /* it's remote, talk to REST server */
} else {
    /* it's local, read filesystem */
}

If we don't set the array URI here then it is blank. As a result is_tiledb is false and the schema tries to read the local filesystem, which results in an error when it tries to open the empty path.
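
The failure mode can be reproduced with a toy model of the dispatch. This is a sketch under an assumption: `ToyUri`, `Backend`, and `choose_backend` are hypothetical, and `is_tiledb` is assumed here to mean "the URI starts with the `tiledb://` scheme".

```cpp
#include <string>

// Hypothetical stand-in for a URI wrapper. Assumption: REST-backed arrays are
// identified by a "tiledb://" scheme prefix.
struct ToyUri {
  std::string value;

  bool is_tiledb() const {
    // rfind with position 0 only matches a prefix at the start of the string.
    return value.rfind("tiledb://", 0) == 0;
  }
};

enum class Backend { Rest, LocalFilesystem };

// Mirrors the branch in the snippet above: remote URIs go to the REST
// server, everything else (including an empty URI) is read locally.
Backend choose_backend(const ToyUri& array_uri) {
  return array_uri.is_tiledb() ? Backend::Rest : Backend::LocalFilesystem;
}
```

An empty URI falls through to the local branch, which matches the error described: a schema deserialized from REST without its URI set ends up trying to open an empty local path.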

Resolved thread: tiledb/sm/array/array.cc

ihnorton pushed a commit that referenced this pull request Nov 14, 2024
#5359 was merged and then there were some additional changes requested.
This pull request implements those changes.
- [fix a comment typo](#5359 (comment))
- [avoid nested C API calls and fix memory leak](#5359 (comment))
- [add a test with schema evolution](#5359 (comment))

---
TYPE: BUG
DESC: Follow up on #5359