osm2pgsql: stuck lease during data import involving DROP TABLE #52065

Closed
otan opened this issue Jul 29, 2020 · 4 comments · Fixed by #52384
Labels
A-schema-changes C-investigation Further steps needed to qualify. C-label will change.

Comments

@otan
Contributor

otan commented Jul 29, 2020

We currently do this with four tables at the same time, non-transactionally:

  • DROP TABLE IF EXISTS planet_osm_point
  • DROP TABLE IF EXISTS planet_osm_point_tmp
  • CREATE TABLE planet_osm_point ( .... )
  • COPY in data
  • CREATE TABLE planet_osm_point_tmp AS SELECT * FROM planet_osm_point WHERE ST_IsValid(way) ORDER BY way
  • DROP TABLE planet_osm_point

The final DROP gets stuck for 5 minutes. I've noticed this fails even with the concurrency of the tool set to one, but it doesn't seem to fail when using the cockroach sql shell.

Reproduction steps:
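For reference, here is a minimal, hypothetical sketch of that statement sequence as a standalone Go program against a local single-node cluster. The connection string, two-column schema, and sample row are illustrative only (the real osm2pgsql schema is much larger), and lib/pq drives COPY inside an explicit transaction whereas osm2pgsql issues it non-transactionally, so this may not exercise exactly the same code path.

```go
package main

import (
	"database/sql"
	"log"

	"github.com/lib/pq" // registers the "postgres" driver and provides pq.CopyIn
)

func main() {
	// Assumes `cockroach start-single-node --insecure` listening on 26257.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	exec := func(stmt string) {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatalf("%s: %v", stmt, err)
		}
	}

	// Drop any leftovers and create the target table.
	exec(`DROP TABLE IF EXISTS planet_osm_point`)
	exec(`DROP TABLE IF EXISTS planet_osm_point_tmp`)
	exec(`CREATE TABLE planet_osm_point (osm_id INT8, way GEOMETRY)`)

	// COPY in some data. lib/pq wraps COPY in a transaction.
	txn, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	copyStmt, err := txn.Prepare(pq.CopyIn("planet_osm_point", "osm_id", "way"))
	if err != nil {
		log.Fatal(err)
	}
	if _, err := copyStmt.Exec(1, "POINT(1 1)"); err != nil { // WKT for the geometry column
		log.Fatal(err)
	}
	if _, err := copyStmt.Exec(); err != nil { // flush the buffered rows
		log.Fatal(err)
	}
	if err := copyStmt.Close(); err != nil {
		log.Fatal(err)
	}
	if err := txn.Commit(); err != nil {
		log.Fatal(err)
	}

	// Post-process into a temp table, then drop the original. The final DROP
	// is the statement that was observed to hang for ~5 minutes.
	exec(`CREATE TABLE planet_osm_point_tmp AS SELECT * FROM planet_osm_point WHERE ST_IsValid(way) ORDER BY way`)
	exec(`DROP TABLE planet_osm_point`)
}
```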

The SHOW JOBS output appears as follows:

        job_id       |     job_type     |                               description                                | statement | user_name | status  | running_status |             created              |             started              | finished |             modified             | fraction_completed | error | coordinator_id
---------------------+------------------+--------------------------------------------------------------------------+-----------+-----------+---------+----------------+----------------------------------+----------------------------------+----------+----------------------------------+--------------------+-------+----------------
  576613482874667009 | SCHEMA CHANGE    | DROP TABLE defaultdb.public.planet_osm_polygon                           |           | root      | running | NULL           | 2020-07-29 16:07:50.113896+00:00 | 2020-07-29 16:07:50.227218+00:00 | NULL     | 2020-07-29 16:07:50.113896+00:00 |                  0 |       |              1
....
  576613268613726209 | SCHEMA CHANGE GC | GC for DROP TABLE IF EXISTS defaultdb.public.planet_osm_polygon CASCADE  |           | root      | running | NULL           | 2020-07-29 16:06:44.728305+00:00 | 2020-07-29 16:06:44.780214+00:00 | NULL     | 2020-07-29 16:06:44.815253+00:00 |                  0 |       |              1

(The GC job probably exists because I created and dropped an earlier table).

Here is a dump of the logs from cockroach start-single-node on a clean cockroach-data directory: https://gist.github.com/0475e3348605fd59ef2bb9e1d5de6a60. There seem to be a lot of statistics changes; I wonder if that's part of the problem.

@otan otan added A-schema-changes C-investigation Further steps needed to qualify. C-label will change. labels Jul 29, 2020
@cockroachdb cockroachdb deleted a comment from blathers-crl bot Jul 29, 2020
@otan otan changed the title osm2pgsql: stuck lease during bulk IMPORT osm2pgsql: stuck lease during data import involving DROP TABLE Jul 29, 2020
@ajwerner
Contributor

These symptoms indicate that somewhere we're holding on to a lease for planet_osm_point. @rohany indicated that the hack here might be causing problems:

// TODO (rohany): This is a hack and should be removed once #51865 lands.
// In case we used this planner to access any descriptors, we need to
// release any leases we might have acquired during planning and execution.
defer func() { localPlanner.Descriptors().ReleaseAll(ctx) }()
stmt, err := parser.ParseOne(table.CreateQuery)
if err != nil {
	return err
}

@otan
Contributor Author

otan commented Jul 29, 2020

I tried with that line removed, didn't seem to fix it.

@ajwerner
Contributor

ajwerner commented Aug 5, 2020

Well, this one turned out to be fun. We apparently have this thing called the copyMachine to deal with the COPY protocol, which I know nothing about and which is apparently what's getting used here. Anyway, it sidesteps the usual connExecutor state machine but didn't clean up after itself.
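As a rough mental model of why that matters, here is a self-contained toy sketch, not the real connExecutor/copyMachine code; the type and helper names below are only borrowed from the commit message that follows. Leases picked up while servicing COPY outside a transaction have to be released explicitly, otherwise they sit around until they expire and block schema changes like the final DROP.

```go
package main

import "fmt"

// Toy stand-ins; none of these are the real CockroachDB types.
type lease struct{ table string }

type executor struct {
	inTxn  bool
	leases []lease
}

// resetExtraTxnState mimics the cleanup helper named in the fix: it releases
// whatever descriptor leases the executor is still holding.
func (ex *executor) resetExtraTxnState() {
	fmt.Printf("releasing %d lease(s)\n", len(ex.leases))
	ex.leases = nil
}

// execCopyIn models the COPY entry point. In the no-txn case the copy
// machine manages its own transaction, so nothing else will release the
// leases acquired while planning against the target table; the fix is to
// do it ourselves before handing the executor back in a fresh state.
func (ex *executor) execCopyIn(table string) {
	if !ex.inTxn {
		defer ex.resetExtraTxnState()
	}
	ex.leases = append(ex.leases, lease{table: table}) // pretend planning leased the table
	// ... stream rows in ...
}

func main() {
	ex := &executor{inTxn: false}
	ex.execCopyIn("planet_osm_point")
	// Without the deferred cleanup this would report a held lease, which is
	// what kept the later DROP TABLE waiting for up to the ~5 minute lease
	// expiration.
	fmt.Println("leases held after COPY:", len(ex.leases))
}
```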

ajwerner added a commit to ajwerner/cockroach that referenced this issue Aug 5, 2020
Apparently we support some sort of COPY protocol that I know nothing about.
We allow operations in that protocol in the open state and in the noTxn state
in the connExecutor. In the noTxn state we let the `copyMachine` handle its
transaction lifecycle all on its own. In that case, we are also handed the
`connExecutor` in a fresh state when calling `execCopyIn()`. During the
execution of `COPY`, it seems like sometimes we can pick up table descriptor
leases. In the noTxn case we'd like to drop those leases and generally leave
the executor in the fresh state in which it was handed to us. To deal with
that, we call `resetExtraTxnState` before returning in the happy noTxn case.

Fixes cockroachdb#52065.

Release note (bug fix): Fixed a bug when using the COPY protocol which could
prevent schema changes for up to 5 minutes.
ajwerner added a commit to ajwerner/cockroach that referenced this issue Aug 5, 2020
craig bot pushed a commit that referenced this issue Aug 5, 2020
48842: sql: fix portals after exhausting rows r=yuzefovich a=yuzefovich

Previously, we would erroneously restart the execution from the very
beginning of empty, unclosed portals after they had been fully
exhausted, when we should be returning no rows or an error in such
scenarios. This is now fixed by tracking whether a portal is exhausted
or not and intercepting the calls to `execStmt` when the conn executor
state machine is in an open state.

Note that the current solution has known deviations from Postgres:
- when attempting to execute portals of statements that don't return row
sets, on the second and subsequent attempts PG returns an error while we
silently do nothing (meaning we do not run the statement at all
and return 0 rows)
- we incorrectly populate the "command tag" field of pgwire messages for some
rows-returning statements after the portal suspension (for example,
a suspended UPDATE RETURNING in PG will return the total count of "rows
affected" while we will return the count since the last suspension).

These deviations are deemed acceptable since this commit fixes a much
worse problem - re-executing an exhausted portal (which could be
a mutation, meaning that, previously, we could have executed a mutation
multiple times).

The reasons why this commit does not address these deviations are:
- Postgres has a concept of "portal strategy"
(see https://github.com/postgres/postgres/blob/2f9661311b83dc481fc19f6e3bda015392010a40/src/include/utils/portal.h#L89).
- Postgres has a concept of "command" type (these are things like
SELECTs, UPDATEs, INSERTs, etc,
see https://github.com/postgres/postgres/blob/1aac32df89eb19949050f6f27c268122833ad036/src/include/nodes/nodes.h#L672).

CRDB doesn't have these concepts, and without them any attempt to
simulate Postgres results in very error-prone and brittle code.

Fixes: #48448.

Release note (bug fix): Previously, CockroachDB would erroneously
restart the execution of empty, unclosed portals after they had been
fully exhausted; this has been fixed.

52098: sql: better distribute distinct processors r=yuzefovich a=yuzefovich

**sql: better distribute distinct processors**

The distinct processors are planned in two stages - first, a "local"
stage is planned on the same nodes as the previous stage, then,
a "global" stage is added. Previously, the global stage would be planned
on the gateway, and this commit changes that to make it distributed - by
planning "global" distinct processors on the same nodes as the "local"
ones and connecting them via a hash router hashing the distinct columns.

Release note: None
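
A self-contained toy illustration of that two-stage shape (not actual DistSQL code; the node inputs and processor count are made up): each node deduplicates its own rows first, then rows are hash-routed on the distinct column so that every copy of a given value lands on the same "global" distinct processor.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// localDistinct deduplicates the rows produced on a single node ("local" stage).
func localDistinct(rows []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, r := range rows {
		if !seen[r] {
			seen[r] = true
			out = append(out, r)
		}
	}
	return out
}

// route hashes the distinct column so equal values always pick the same
// "global" distinct processor, which can then deduplicate across nodes.
func route(row string, numProcessors uint32) int {
	h := fnv.New32a()
	h.Write([]byte(row))
	return int(h.Sum32() % numProcessors)
}

func main() {
	// Pretend two nodes each produced some rows in the previous stage.
	node1 := localDistinct([]string{"a", "a", "b"})
	node2 := localDistinct([]string{"b", "c", "c"})

	const globalProcs = 2
	buckets := make([][]string, globalProcs)
	for _, r := range append(node1, node2...) {
		i := route(r, globalProcs)
		buckets[i] = append(buckets[i], r)
	}
	for i, b := range buckets {
		fmt.Printf("global distinct %d receives %v -> emits %v\n", i, b, localDistinct(b))
	}
}
```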

**sql: implement ConstructDistinct in the new factory**

Addresses: #47473.

Release note: None

52358: engine: set the MVCC timestamp on reads due to historical intents r=ajwerner a=ajwerner

Before this commit, we'd return a zero-value MVCC timestamp when reading an
intent from the intent history. This was problematic because elsewhere in the
code we assume that we will always get a non-zero MVCC timestamp when a read
returns a value. This is especially bizarre given that a read of the latest
intent will return its write timestamp.

The semantics here are such that we'll return the timestamp of the MVCC
metadata for the row. I think this is the most reasonable thing to do as
that timestamp ought to reflect the timestamp we return when we return the
current intent and furthermore is the only timestamp we really have around.
We could return the transaction's current read or write timestamp, but those
both seem like worse choices.

It's also worth noting that in the case where we need to read a non-zero
value, we don't really care what that value is and the fact that we are
reading this intent itself is somewhat suspect. That being said, I still
think we should make this change in addition to any change made to prevent
the code referenced in #49266 from needing it.

Fixes #49266.
Informs #50102.

Release note: None
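
To make that concrete, here is a self-contained toy model (not the real MVCC/engine code; the struct layout and sequence-number lookup are simplified assumptions): a read served from the intent history reports the timestamp taken from the intent's MVCC metadata rather than a zero timestamp, matching what a read of the latest intent already returns.

```go
package main

import "fmt"

type value struct {
	seq int
	raw string
}

// intent is a toy stand-in for a key's provisional write: the latest value
// plus an intent history of earlier values written by the same transaction.
type intent struct {
	writeTS int64 // timestamp from the intent's MVCC metadata
	latest  value
	history []value
}

// readAtSeq returns the value visible at the given sequence number together
// with the MVCC timestamp the read should report. Before the fix, reads
// served from the history reported 0; after it, they report writeTS.
func readAtSeq(in intent, seq int) (string, int64) {
	if seq >= in.latest.seq {
		return in.latest.raw, in.writeTS
	}
	for i := len(in.history) - 1; i >= 0; i-- {
		if in.history[i].seq <= seq {
			return in.history[i].raw, in.writeTS
		}
	}
	return "", 0 // nothing visible at this sequence number
}

func main() {
	in := intent{
		writeTS: 42,
		latest:  value{seq: 3, raw: "v3"},
		history: []value{{seq: 1, raw: "v1"}, {seq: 2, raw: "v2"}},
	}
	v, ts := readAtSeq(in, 2)
	fmt.Println(v, ts) // v2 42
}
```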

52384: sql: properly reset extraTxnState in COPY r=ajwerner a=ajwerner

Apparently we support some sort of COPY protocol that I know nothing about.
We allow operations in that protocol in the open state and in the noTxn state
in the connExecutor. In the noTxn state we let the `copyMachine` handle its
transaction lifecycle all on its own. In that case, we are also handed the
`connExecutor` in a fresh state when calling `execCopyIn()`. During the
execution of `COPY`, it seems like sometimes we can pick up table descriptor
leases. In the noTxn case we'd like to drop those leases and generally leave
the executor in the fresh state in which it was handed to us. To deal with
that, we call `resetExtraTxnState` before returning in the happy noTxn case.

Fixes #52065.

Release note (bug fix): Fixed a bug when using the COPY protocol which could
prevent schema changes for up to 5 minutes.

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
@craig craig bot closed this as completed in f5b7b67 Aug 6, 2020
@otan
Contributor Author

otan commented Aug 6, 2020

thanks!!!

ajwerner added a commit to ajwerner/cockroach that referenced this issue Aug 6, 2020