Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(csharp): redefine C# APIs to prioritize full async support #1865

Merged
merged 5 commits into from
May 19, 2024

Conversation

CurtHagenlocher
Copy link
Contributor

@CurtHagenlocher CurtHagenlocher commented May 15, 2024

For #1843, redefines the C# APIs to prioritize full async support. As this is making a number of breaking changes already, it also takes the opportunity to do some general cleanup.

Async methods that are generally expected to run locally (e.g. GetOption/SetOption) are defined to return ValueTask and have their default implementation be synchronous. Async methods that are generally expected to run remotely are defined to return Task and have their default implementations be asynchronous.

My mental model for the four ADBC "object" types is as follows:

The driver is analogous to the ODBC or JDBC driver, or ADO.NET provider. In JDBC, this type is represented by the java.sql.Driver interface. In ADO.NET, it's represented by the DbProviderFactory class. Because this object is strictly about code, it's not expected to do any IO other than that potentially required by any code e.g. to bring in pages of a binary image from disk.

The database is analogous to the JDBC DataSource, the ODBC "DSN" or the ADO.NET connection string. It represents the information and capability required to create a database connection but does not itself do IO until it tries to create a connection. (This would imply that parameter validation which requires network access -- e.g. to validate a host name -- is deferred until the connection is created. Perhaps that's too limiting?)

Because neither the driver nor the database is doing IO, neither of them need to have async methods other than Connect, including async cleanup.

The connection represents an actual session with a database. This matches an ODBC connection, a JDBC java.sql.Connection or an ADO.NET DbConnection. Opening a connection, closing it or using it to fetch information about the data source are all operations likely to require IO so these all require async implementations.

The statement is a unit of bookkeeping related to certain types of database operations. In some cases, a connection can only have a single active statement running against it, but it can be useful to have multiple statements even then if, for instance, each one is a prepared statement that represents both client-side and server-side resources. The statement is analogous to an ODBC statement, a JDBC java.sql.Statement or an ADO.NET DbCommand. Due to the need to clean up an in progress operation or to release server-side resources, the cleanup of a statement might do IO and should therefore support asynchrony. But when the statement is first created, it only represents the potential for future work and so creation is always synchronous.

I'd be curious to hear how well this aligns with others' points of view. lidavidm? davidhcoe? ("ping" removed)

@lidavidm
Copy link
Member

A method may return an instance of this value type when it's likely that the result of its operation will be available synchronously, and when it's expected to be invoked so frequently that the cost of allocating a new Task for each call will be prohibitive.

There are tradeoffs to using a ValueTask instead of a Task. For example, while a ValueTask can help avoid an allocation in the case where the successful result is available synchronously, it also contains multiple fields, whereas a Task as a reference type is a single field. This means that returning a ValueTask from a method results in copying more data. It also means, that if a method that returns a ValueTask is awaited within an async method, the state machine for that async method will be larger, because it must store a struct containing multiple fields instead of a single reference.

Wow, C#/.NET has thought this all through, pretty neat.

/// </summary>
/// <param name="codes">The metadata items to fetch.</param>
/// <returns>Metadata about the driver and/or database</returns>
public virtual Task<IArrowArrayStream> GetInfo(ReadOnlySpan<AdbcInfoCode> codes, CancellationToken cancellationToken = default)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetInfoAsync?

/// <param name="targetTableName">The table name</param>
/// <param name="mode">The ingest mode</param>
/// <param name="isTemporary">True for a temporary table. Catalog and Schema must be null when true.</param>
public virtual AdbcStatement BulkIngest(string? targetCatalog, string? targetDbSchema, string targetTableName, BulkIngestMode mode, bool isTemporary)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this have an equivalent async?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this does is setup a statement for later binding and execution. It's not intended to do anything on the server side. We could instead have something a little more complete like

void BulkIngest(string? targetCatalog, string? targetDbSchema, string targetTableName, BulkIngestMode mode, bool isTemporary, IArrowArrayStream data);
Task BulkIngestAsync(string? targetCatalog, string? targetDbSchema, string targetTableName, BulkIngestMode mode, bool isTemporary, IArrowArrayStream data, CancellationToken cancellationToken);

and with that definition we'd obviously need an async version.

There were a few reasons I chose not to do this (or not to do this now) including the need to have both variations (the equivalents of Bind and BindStream) and the longer-term desire to support the progress API which might then require additional overloads. But I'm definitely conflicted on the best thing to do for right now.

That said, if CreateStatement has an async version then this would get one too.

/// <summary>
/// Create a new statement that can be executed.
/// </summary>
public abstract AdbcStatement CreateStatement(IReadOnlyDictionary<string, object>? options = default);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't an expensive operation, but does it need an Async equivalent for consistency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my mind, part of the goal of the API is to signal to implementers what they should be doing. I see a statement as a kind of unit of internal bookkeeping which doesn't use database resources until you ask it to do something else. By not having an async function for creating a statement, this intent is communicated fairly clearly to implementers. This is also consistent with ADO.NET's DbCommand, which doesn't have an async construction function (though there were likely other considerations there that we're not faced with).

If there's a demonstrated need to do something differently, we can later add an async method that returns a ValueTask and that would be consistent with everything else we have.

That said, I'm mostly neutral on adding it now so if you think it's better to have it then I'll change it.

{
}

public ValueTask DisposeAsync()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need an equivalent #if NET5_0_OR_GREATER directive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not strictly required because you can define whatever function you want; only the interface it implements is missing in older versions of .NET. I guess my design principle here is to #ifdef only when necessary; but does that give a misleading impression in this case?

/// </summary>
/// <param name="key">Option name</param>
/// <returns>Option value.</returns>
public virtual object GetOption(string key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs equivalent Async

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to Driver.Open: if the ADBC "database" is like a connection string then the options should all be local; we haven't connected to anything yet. But I'll take a look at e.g. the Flight implementation to see how well my interpretation holds up.

/// For example, in-memory databases can place ownership of the actual
/// database in this object.
/// </summary>
public abstract class AdbcDatabase11 : IDisposable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

include IAsyncDisposable?

/// This provides a common interface for vendor-specific driver
/// initialization routines.
/// </summary>
public abstract class AdbcDriver11 : IDisposable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

include IAsyncDisposable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidhcoe - I don't think we need IAsyncDisposable because there are no async methods in the class.

/// Open a database via this driver.
/// </summary>
/// <param name="parameters">Driver-specific parameters.</param>
public abstract AdbcDatabase Open(IReadOnlyDictionary<string, string> parameters);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need an equivalent OpenAsync?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is similar to CreateStatement. I see an ADBC database as being a bookkeeping thing; more like a connection string than something that's actually connected to anything. Through that interpretation, the name "Open" is misleading here and should probably be replaced.

If we change this for CreateStatement now then we'd change this here as well.

Copy link
Contributor

@birschick-bq birschick-bq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some comments/suggestions with regards to inheritance.

/// This provides a common interface for vendor-specific driver
/// initialization routines.
/// </summary>
public abstract class AdbcDriver11 : IDisposable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidhcoe - I don't think we need IAsyncDisposable because there are no async methods in the class.

/// <summary>
/// Returns the version of the ADBC spec supported by this driver.
/// </summary>
public virtual int Version => AdbcVersion.Version_1_0_0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this property be available in all classes? Or just use .NET type management?
I see AdbcDriver.DriverVersion is the functional equivalent of this property. If we inherit from AdbcDriver, I guess we should stick to DriverVersion. Personally, I'd prefer this property be named AdbcVersion to disambiguate between driver, dbms, etc? Especially since it's an int.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a consumer that cares would check this when loading the driver and remember the value. Putting it on all the types seems like it would be annoying for the implementer.

Copy link
Contributor

@birschick-bq birschick-bq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's all I have for now.

///
/// This must always be thread-safe.
/// </remarks>
public virtual void Cancel()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thought I had was that, in the past, I've had to send a message to the server to stop processing and end the session on the server-side. So it might be useful to have a CancelAsync (and in Statement) if the implementation requires a call to the server.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cancel is what you call when you're using the synchronous API. This already has to be on another thread because the synchronous API for e.g. ExecuteQuery will block the thread it's called from. If you're using the asynchronous API, then you should use the cancellation token instead of calling this method.

@CurtHagenlocher CurtHagenlocher marked this pull request as ready for review May 17, 2024 23:37
@CurtHagenlocher CurtHagenlocher requested a review from lidavidm as a code owner May 17, 2024 23:37
@CurtHagenlocher
Copy link
Contributor Author

In the absence of additional feedback I'm going to check these in now and start refactoring the code to reference the newer classes. I would say the only open questions are whether or not there should be more async for the Driver and Database (I argue "no" but was hoping for more discussion) and the "BulkIngest" API which feels incomplete or wrong.

@CurtHagenlocher CurtHagenlocher merged commit ddb1bcf into apache:main May 19, 2024
6 checks passed
@CurtHagenlocher CurtHagenlocher deleted the adbc#1.1 branch May 19, 2024 20:49
@lidavidm lidavidm added this to the ADBC Libraries 13 milestone May 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants