Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why execute() doesn't support arguments? #51

Closed
tailhook opened this issue Nov 30, 2019 · 19 comments
Closed

Why execute() doesn't support arguments? #51

tailhook opened this issue Nov 30, 2019 · 19 comments
Labels
enhancement New feature or request

Comments

@tailhook
Copy link
Contributor

My understanding that switching from fetchone to execute should be seamless if you don't need the result. But it's not the case:

  File "app.py", line 76, in login
    await request.app['edb'].execute("""
TypeError: execute() got an unexpected keyword argument 'cookie_id'

The error is confusing.

Technically, I understand that it's because there is such thing in a protocol. And that thing in protocol
may make sense for scripts or something, but it doesn't make sense from the point of view of the user of python bindings (other database bindings have no such issue as far as I remember).

So perhaps execute should accept arguments and silently switch to fetchone internally when at least one argument is specified?

@elprans
Copy link
Member

elprans commented Nov 30, 2019

execute() is designed to run (possibly) multiple statements, and it is not possible to parametrize an arbitrary EdgeQL script per the current protocol. Even if the driver was able to distinguish single statement from multiple in execute(), it would still be a bad API to allow arguments only in some cases.

Your confusion possibly stems from exposure to Python's DB API, where execute() is how you run all queries, but then you have to obtain the results separately, which is extremely weird.

@1st1 let's consider alternative names for the "statement" and the "script" methods, because I've seen similar confusion from asyncpg users.

@1st1
Copy link
Member

1st1 commented Nov 30, 2019

Technically, I understand that it's because there is such thing in a protocol. And that thing in protocol may make sense for scripts or something

Yes

but it doesn't make sense from the point of view of the user of python bindings (other database bindings have no such issue as far as I remember).

asyncpg is designed exactly the same way. We've never had a user requesting execute() to behave like fetchval() when it has arguments.

Most Python bindings adhere to the Python DB API which is broken on pretty much all levels. Like messed up arguments encoding, or inability to prepare statements, or misusing the "cursor" terminology. Unfortunately (and I really mean that) we can't use existing Python bindings as an example, simply because their API is broken.

So perhaps execute should accept arguments and silently switch to fetchone internally when at least one argument is specified?

I think it would just lead to more a more confusing UX. If we do it the way you suggest we will have bug reports like "why aren't arguments passed to all queries in the script I pass to the execute() method".

As an alternative we can consider renaming execute() to execute_script() or something. But again, due to the fact that asyncpg has a very similar API and it seems that there's no confusion here, I don't think the motivation is strong enough here.

@elprans what do you think?

@1st1
Copy link
Member

1st1 commented Nov 30, 2019

@1st1 let's consider alternative names for the "statement" and the "script" methods, because I've seen similar confusion from asyncpg users.

Some links from my slack conversation with @elprans:

There's some confusion with execute() and kwargs but it doesn't have enough traction IMO. Some asyncpg issues have 50 stars and 50 comments in them with people posting "+1" -- this isn't one of them.

The bigger issue, though, is that people don't realize that they can use asyncpg.Connection.fetchval() to run INSERT queries. That one is real and I forgot about it; thanks for reminding.

This isn't new though, we had multiple discussions about INSERT / fetch() mental mismatch without any good ideas how to fix it. Let's brainstorm, I guess.


I'll start. One way would be to re-purpose execute():

  • Connection.fetchone() would become Connection.fetch() -- fetch a singleton-result query. args & kwargs are supported.
  • Connection.fetchall() would become Connection.execute() -- fetch all results of a qurey. args & kwargs are supported.
  • Connection.execute() would become Connection.run() or Connection.execute_script(). No args/kwargs are supported (at least for now).

@1st1
Copy link
Member

1st1 commented Nov 30, 2019

I started to write a big comment in edb/server/edgecon.pyx to explain why we compile one EdgeDB SimpleQuery message to one PostgreSQL SimpleQuery message only to see that it's no longer the case. The protocol implementation saw a lot of churn in the beginning so I'll use this issue to clear a few things up and document what I think should be our API strategy for language bindings.

Simple Query Protocol

Originally we did have a 1 to 1 mapping between our SimpleQuery messages and what we send to PostgreSQL. Eventually we changed our strategy to 1 to many. Here's why:

  • A Postgres SimpleQuery message commits transactions immediately on COMMIT even if there happen to occur exceptions right after it:

    START TRANSACTION;
      CREATE TABLE foo(...);
    COMMIT;
    
    SELECT 1 / 0;
    # Error! But the "foo" table has been created already and will stay.
  • So SQL transactions within a single SimpleQuery are immediately committed. Here's a more relevant example for us:

    START TRANSACTION;
      CREATE TABLE foo(...);
    COMMIT;
    
    SELECT pg_sleep(100);
    
    START TRANSACTION;
      CREATE TABLE bar(...);
    COMMIT;

    Essentially, we create a "foo" table, wait a bit, and create a "bar" table. The "foo" table is committed right before the pg_sleep() call.

  • Why is this important in the context of EdgeDB?

    PostgreSQL is solely responsible for its state.

    EdgeDB is responsible for handling its own state spread across several processes (current schema version, current session and system settings, etc) and for making sure that PostgreSQL's state is in full sync with EdgeDB's state.

    This complication means that a single SimpleQuery with EdgeQL like the below:

    START TRANSACTION;
      ALTER TYPE foo { ... };
    COMMIT;
    
    SELECT sys::sleep(100);
    
    START TRANSACTION;
      CREATE TYPE bar { ...};
    COMMIT;

    Would be compiled into 3 SQL blocks each run via Postgres' SimpleQuery:

    START TRANSACTION;
      ALTER TABLE foo (...);
    COMMIT;

    EdgeDB will know that the first query block modifies the schema state if executed to completion. Therefore EdgeDB will signal all compilers to reload schema.

    SELECT sys::sleep(100);
    START TRANSACTION;
      CREATE TABLE bar (...);
    COMMIT;

    EdgeDB will know that the last query block also modifies the schema state. Therefore EdgeDB will signal all compilers to reload schema again.

Extended Query Protocol

In PostgreSQL the Simple Query protocol is just one way of executing queries. It can execute a batch of SQL queries but:

  • it doesn't allow to use query arguments;
  • it returns data in TEXT encoding.

There's an alternative, called Extended Query protocol. It allows to return data in BINARY, to receive query parameters, to prepare named and anonymous statements, etc.

EdgeDB Extended Query protocol is a pretty much one-to-one mapping to Postgres Extended Query protocol. We do a bunch of tricks on the server side to maintain an LRU cache and auto-prepare queries, but generally it's the same thing.

EdgeDB Protocol & Python API

Currently in EdgeDB Python API we have the following APIs:

  • Connection.fetchone()—Implemented via our Extended Query protocol version. Returns data, accepts arguments. Only works with single-command queries that return singular results.

  • Connection.fetchall()—Implemented via our Extended Query protocol version. Returns data, accepts arguments. Works with any single-command query.

  • Connection.execute()—Implemented via our version of Simple Query protocol. Does not return data (EdgeDB ignores any data returned from Postgres' Simple Query). Works with any single- or multi-command query.

Paul's original question is valid: the inability of Connection.execute() to receive arguments is inconvenient. Especially in cases when a user just wants to insert/update some data and doesn't care about the results.

There's also another angle how to look at this: in asyncpg we have Connection.executemany() that allows to batch a query with a series of arguments—essentially pipeline it. We do want to have that in EdgeDB eventually.

One way to add "executemany" would be to add a new protocol message that would receive an EdgeQL command(s) and a stream of arguments. This way in the future we can extend Connection.execute() to accept arguments and add Connection.executemany().

That new protocol flow will use a combination of Simple Query and Extended Query Postgres messages. The former is optimal for blocks of SQL without parameters, the latter can receive arguments. It's all complicated by the fact that we need to add cancellation handling to all of that, making this quite a tricky thing to implement correctly. We'll do that eventually, though.


Therefore I think that right now it's better to do nothing. :) We can always extend our protocol with this "batch execute" message flow. We can at later point add arguments to Connection.execute().

Another thing to consider: adding all kinds of APIs to the Python driver means that we'd want to add them to all other supported languages — this significantly increases the surface for us. We don't have that time. We should focus on the hosted version and releasing a stable 1.0.

In conclusion I'd say that the current API we have in Python is forward-compatible. Let's keep it as is.

@tailhook
Copy link
Contributor Author

tailhook commented Dec 1, 2019

I'm not sure how @1st1's comment addresses the multi-statement issue.

I mean. I'm okay if we rename execute -> execute_script or execute_batch or alike (and leave everything as is)

But if we keep current semantics I don't think we can add arguments later and it will also require renaming executemany to something else (because I don't think it's a good idea to allow multi-statements with many parameter sets)

@elprans
Copy link
Member

elprans commented Dec 1, 2019

because I don't think it's a good idea to allow multi-statements with many parameter sets

@tailhook can you elaborate on this concern please?

@1st1
Copy link
Member

1st1 commented Dec 1, 2019

I mean. I'm okay if we rename execute -> execute_script or execute_batch or alike (and leave everything as is)

No, we rename nothing now. Later we add a new flow for execute and will be able to add args to Connection.execute() as well as implement Connection.executemany().

because I don't think it's a good idea to allow multi-statements with many parameter sets

@tailhook can you elaborate on this concern please?

I'm also curious.

@tailhook
Copy link
Contributor Author

tailhook commented Dec 1, 2019

  • If we allow execute("insert X { a: = 1}; insert Y { b: = 2 };") now.
  • We must allow execute("insert X { a: = $a}; insert Y { b: = $b };", a=1, b=2)
  • And also executemany("insert X { a: = $a}; insert Y { b: = $b };", [dict(a=1, b=2)])

The last two are hard to implement. So I propose to rename multi-statement function to execute_batch. And reserve execute and executemany functions to single-statement queries.

@elprans
Copy link
Member

elprans commented Dec 1, 2019

The last two are hard to implement.

How did you come to this conclusion? Seems straightforward to me, if we add an appropriate protocol message.

@tailhook
Copy link
Contributor Author

tailhook commented Dec 1, 2019

How did you come to this conclusion? Seems straightforward to me, if we add an appropriate protocol message.

Well, off the top of my head:

  1. Semantics is unclear: are x; y; z; run as x; x; y; y; z; z or x; y; z; x; y; z; (for the standpoint of the protocol optimization is last one, but as a user?)
  2. You need to parse and validate all the three queries before running the first one
  3. Variable types must match if variable used in multiple queries (can variables be cast to different types?)
  4. Unused variables figured out between queries
  5. More tests must be written
  6. Not sure, but I also expect some issues if transactional statements used in the script.

I'm not sure how far it's on the spectrum of just a bit more work or super-hard to implement. But I would start with single-statement impl first.

@tailhook
Copy link
Contributor Author

tailhook commented Dec 1, 2019

By the way, does it change the design if we need execmany returning result? Say insert 10 objects and return ids? (fetchonemany isn't good name)

@elprans
Copy link
Member

elprans commented Dec 1, 2019

Semantics is unclear: are x; y; z; run as x; x; y; y; z; z or x; y; z; x; y; z; (for the standpoint of the protocol optimization is last one, but as a user?)

As a user, I would expect the last one as well. This is solved by documentation.

You need to parse and validate all the three queries before running the first one

Sure, that's doesn't seem like a big problem.

Variable types must match if variable used in multiple queries (can variables be cast to different types?)

The parameter type must be unambiguous, this is easy to validate and report. If you absolutely must pass the same value to multiple queries in a different type context, use text and casts or converters.

Unused variables figured out between queries

This is not a problem, as the entire script is parsed and validated as a unit.

More tests must be written

Like with any functionality.

Not sure, but I also expect some issues if transactional statements used in the script.

What issues? How is the situation different from an unparametrized script?

@elprans
Copy link
Member

elprans commented Dec 1, 2019

By the way, does it change the design if we need execmany returning result? Say insert 10 objects and return ids? (fetchonemany isn't good name)

Returning methods must be single-statement, because there is no reasonable way to reconcile the result type shape for arbitrary statements. Also, the only reason why you might need a separate executemany protocol for the returning statements is pipelining, but it's far from clear if this is really necessary in the field.

If you want to insert and return, then you're likely inserting a small amount of data, and pipelining is only worth it on huge volumes, like ETL workloads, and there you are almost always better off by inserting with executemany() followed by fetchall() (since the point of your load was most likely to run some transform in the database).

@tailhook
Copy link
Contributor Author

tailhook commented Dec 1, 2019

You're probably better at estimating this specific thing. If you think this is not hard enough I'm okay with that.

Let's keep the issue open in the meantime until arguments are implemented.

@1st1 1st1 added the enhancement New feature or request label Dec 2, 2019
@1st1
Copy link
Member

1st1 commented Mar 10, 2020

Latest thoughts on this:

The motivation to going away from the "fetch" term is explained in comments above; essentially it boils down to people being confused that fetch should be used to execute INSERT commands.

"fetch" is probably coming from asyncpg / Python DB API; "query" is used in Golang / JS / Rust / etc.

@aeros
Copy link
Contributor

aeros commented Apr 16, 2020

@1st1

Rename Connection.fetchone() to Connection.queryone()
Rename Connection.fetchall() to Connection.query()

The motivation to going away from the "fetch" term is explained in comments above; essentially it boils down to people being confused that fetch should be used to execute INSERT commands.

Looking back in retrospect from the college project I worked on w/ EdgeDB, I can attest to the Connection API being a bit confusing at first. Not just coming from Python, but also in Java DB APIs where it's rather common to use Statement.execute() for any SQL statement and potentially fetch the results separately if it returned true. As @elprans mentioned earlier:

Your confusion possibly stems from exposure to Python's DB API, where execute() is how you run all queries, but then you have to obtain the results separately, which is extremely weird.

I can agree that it's weird and not great from an API design perspective, but the reality is that many DB API users are very much used to it. Weirdness and poor design can easily become intuitive if exposed to it enough; I think that's human nature to some degree. That's not to say that EdgeDB's API should suffer in quality because of existing precedent of course, but IMO it should be taken into account in the docs.

Thus, I think it would be worthwhile to briefly mention in the docs for Connection.execute() as a note at the bottom that if the results are desired, queryone() or query() should be used instead. This should help significantly in redirecting users that may naturally gravitate towards using execute() for everything in other DB APIs. Although they're mentioned in the same page(s), it's highly common in my experience for users to only read the documentation for a single method in isolation if they're looking for something specific; e.g. an execute() method.

If it would be helpful, I can work on the above in addition to the rename. I should have time in the next week, if not sooner, to open a PR (or two) for it.

Quick question though: should there be some form of deprecation process or anything to smooth the transition from fetchone() -> queryone() and fetchall() -> query()? My initial assumption would be that backwards API compatibility isn't as much of a concern for EdgeDB with it currently being in alpha, but I'd like to be certain; especially coming from experience w/ CPython where it's a primary concern.

1st1 added a commit that referenced this issue Jul 13, 2020
This commit renames the following APIs:

* fetchall() -> query()
* fetchone() -> query_one()
* fetchall_json() -> query_json()
* fetchone_json() -> query_one_json()

The above methods were renamed for the async connection,
the blocking connection, and for the async pool.

See issue #51 for details.
1st1 added a commit that referenced this issue Jul 13, 2020
This commit renames the following APIs:

* fetchall() -> query()
* fetchone() -> query_one()
* fetchall_json() -> query_json()
* fetchone_json() -> query_one_json()

The above methods were renamed for the async connection,
the blocking connection, and for the async pool.

See issue #51 for details.
1st1 added a commit that referenced this issue Jul 13, 2020
* API rename: "fetch" to "query". See issue #51 for details.
@1st1 1st1 mentioned this issue Jul 13, 2020
1st1 added a commit that referenced this issue Jul 13, 2020
* API rename: "fetch" to "query". See issue #51 for details.
1st1 added a commit to geldata/gel-js that referenced this issue Jul 14, 2020
1st1 added a commit to geldata/gel-js that referenced this issue Jul 14, 2020
1st1 added a commit that referenced this issue Jul 14, 2020
* API rename: "fetch" to "query" (#51)
* Expose `query*()` methods in the Pool class directly
@1st1 1st1 mentioned this issue Jul 14, 2020
1st1 added a commit that referenced this issue Jul 14, 2020
* API rename: "fetch" to "query" (#51)
* Expose `query*()` methods in the Pool class directly
@1st1
Copy link
Member

1st1 commented Jul 15, 2020

Thus, I think it would be worthwhile to briefly mention in the docs for Connection.execute() as a note at the bottom that if the results are desired, queryone() or query() should be used instead.

Absolutely, would appreciate PRs.

Quick question though: should there be some form of deprecation process or anything to smooth the transition from fetchone() -> queryone() and fetchall() -> query()? My initial assumption would be that backwards API compatibility isn't as much of a concern for EdgeDB with it currently being in alpha, but I'd like to be certain; especially coming from experience w/ CPython where it's a primary concern.

Of course, the change is backwards compatible. The old fetch*() API is going to be available for some time, albeit with deprecation warnings.

@1st1
Copy link
Member

1st1 commented Jul 15, 2020

I'll keep the issue open until execute() gets support for *args and **kwargs

@1st1
Copy link
Member

1st1 commented Oct 21, 2022

This has been implemented.

renawolford6 added a commit to renawolford6/edg-edb-javascript that referenced this issue Nov 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants