osm2pgsql: stuck lease during data import involving DROP TABLE #52065

Closed
otan opened this issue Jul 29, 2020 · 4 comments · Fixed by #52384
Labels
A-schema-changes C-investigation Further steps needed to qualify. C-label will change.

Comments

@otan
Contributor

otan commented Jul 29, 2020

We currently do this with four tables at the same time, non-transactionally:

  • DROP TABLE IF EXISTS planet_osm_point
  • DROP TABLE IF EXISTS planet_osm_point_tmp
  • CREATE TABLE planet_osm_point ( .... )
  • COPY in data
  • CREATE TABLE planet_osm_point_tmp AS SELECT * FROM planet_osm_point WHERE ST_IsValid(way) ORDER BY way
  • DROP TABLE planet_osm_point

The final DROP gets stuck for 5 minutes. I've noticed this fails even with the concurrency of the tool set to one, but it doesn't seem to fail when using the cockroach sql shell.

Reproduction steps:
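For reference, here is a minimal, hypothetical sketch of that statement sequence as a standalone Go program against a local single-node cluster. The connection string, two-column schema, and sample row are illustrative only (the real osm2pgsql schema is much larger), and lib/pq drives COPY inside an explicit transaction whereas osm2pgsql issues it non-transactionally, so this may not exercise exactly the same code path.

```go
package main

import (
	"database/sql"
	"log"

	"github.com/lib/pq" // registers the "postgres" driver and provides pq.CopyIn
)

func main() {
	// Assumes `cockroach start-single-node --insecure` listening on 26257.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	exec := func(stmt string) {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatalf("%s: %v", stmt, err)
		}
	}

	// Drop any leftovers and create the target table.
	exec(`DROP TABLE IF EXISTS planet_osm_point`)
	exec(`DROP TABLE IF EXISTS planet_osm_point_tmp`)
	exec(`CREATE TABLE planet_osm_point (osm_id INT8, way GEOMETRY)`)

	// COPY in some data. lib/pq wraps COPY in a transaction.
	txn, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	copyStmt, err := txn.Prepare(pq.CopyIn("planet_osm_point", "osm_id", "way"))
	if err != nil {
		log.Fatal(err)
	}
	if _, err := copyStmt.Exec(1, "POINT(1 1)"); err != nil { // WKT for the geometry column
		log.Fatal(err)
	}
	if _, err := copyStmt.Exec(); err != nil { // flush the buffered rows
		log.Fatal(err)
	}
	if err := copyStmt.Close(); err != nil {
		log.Fatal(err)
	}
	if err := txn.Commit(); err != nil {
		log.Fatal(err)
	}

	// Post-process into a temp table, then drop the original. The final DROP
	// is the statement that was observed to hang for ~5 minutes.
	exec(`CREATE TABLE planet_osm_point_tmp AS SELECT * FROM planet_osm_point WHERE ST_IsValid(way) ORDER BY way`)
	exec(`DROP TABLE planet_osm_point`)
}
```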

The SHOW JOBS output appears as follows:

        job_id       |     job_type     |                               description                                | statement | user_name | status  | running_status |             created              |             started              | finished |             modified             | fraction_completed | error | coordinator_id
---------------------+------------------+--------------------------------------------------------------------------+-----------+-----------+---------+----------------+----------------------------------+----------------------------------+----------+----------------------------------+--------------------+-------+----------------
  576613482874667009 | SCHEMA CHANGE    | DROP TABLE defaultdb.public.planet_osm_polygon                           |           | root      | running | NULL           | 2020-07-29 16:07:50.113896+00:00 | 2020-07-29 16:07:50.227218+00:00 | NULL     | 2020-07-29 16:07:50.113896+00:00 |                  0 |       |              1
....
  576613268613726209 | SCHEMA CHANGE GC | GC for DROP TABLE IF EXISTS defaultdb.public.planet_osm_polygon CASCADE  |           | root      | running | NULL           | 2020-07-29 16:06:44.728305+00:00 | 2020-07-29 16:06:44.780214+00:00 | NULL     | 2020-07-29 16:06:44.815253+00:00 |                  0 |       |              1

(The GC job probably exists because I created and dropped an earlier table).

Here is a dump of the logs from cockroach start-single-node on a clean cockroach-data directory: https://gist.github.com/0475e3348605fd59ef2bb9e1d5de6a60. There seem to be a lot of statistics changes; I wonder if that's part of the problem.

@otan otan added A-schema-changes C-investigation Further steps needed to qualify. C-label will change. labels Jul 29, 2020
@cockroachdb cockroachdb deleted a comment from blathers-crl bot Jul 29, 2020
@otan otan changed the title osm2pgsql: stuck lease during bulk IMPORT osm2pgsql: stuck lease during data import involving DROP TABLE Jul 29, 2020
@ajwerner
Contributor

These symptoms indicate that somewhere we're holding on to a lease for planet_osm_point. @rohany indicated that the hack here might be causing problems:

// TODO (rohany): This is a hack and should be removed once #51865 lands.
// In case we used this planner to access any descriptors, we need to
// release any leases we might have acquired during planning and execution.
defer func() { localPlanner.Descriptors().ReleaseAll(ctx) }()
stmt, err := parser.ParseOne(table.CreateQuery)
if err != nil {
	return err
}

@otan
Contributor Author

otan commented Jul 29, 2020

I tried with that line removed, didn't seem to fix it.

@ajwerner
Contributor

ajwerner commented Aug 5, 2020

Well, this one turned out to be fun. We apparently have this thing called the copyMachine to deal with the COPY protocol, which I know nothing about and which is apparently what's getting used here. Anyway, it sidesteps the usual connExecutor state machine but didn't clean up after itself.
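As a rough mental model of why that matters, here is a self-contained toy sketch, not the real connExecutor/copyMachine code; the type and helper names below are only borrowed from the commit message that follows. Leases picked up while servicing COPY outside a transaction have to be released explicitly, otherwise they sit around until they expire and block schema changes like the final DROP.

```go
package main

import "fmt"

// Toy stand-ins; none of these are the real CockroachDB types.
type lease struct{ table string }

type executor struct {
	inTxn  bool
	leases []lease
}

// resetExtraTxnState mimics the cleanup helper named in the fix: it releases
// whatever descriptor leases the executor is still holding.
func (ex *executor) resetExtraTxnState() {
	fmt.Printf("releasing %d lease(s)\n", len(ex.leases))
	ex.leases = nil
}

// execCopyIn models the COPY entry point. In the no-txn case the copy
// machine manages its own transaction, so nothing else will release the
// leases acquired while planning against the target table; the fix is to
// do it ourselves before handing the executor back in a fresh state.
func (ex *executor) execCopyIn(table string) {
	if !ex.inTxn {
		defer ex.resetExtraTxnState()
	}
	ex.leases = append(ex.leases, lease{table: table}) // pretend planning leased the table
	// ... stream rows in ...
}

func main() {
	ex := &executor{inTxn: false}
	ex.execCopyIn("planet_osm_point")
	// Without the deferred cleanup this would report a held lease, which is
	// what kept the later DROP TABLE waiting for up to the ~5 minute lease
	// expiration.
	fmt.Println("leases held after COPY:", len(ex.leases))
}
```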

ajwerner added a commit to ajwerner/cockroach that referenced this issue Aug 5, 2020
Apparently we support some sort of COPY protocol that I know nothing about.
We allow operations in that protocol in the open state and in the noTxn state
in the connExecutor. In the noTxn state we let the `copyMachine` handle its
transaction lifecycle all on its own. In that case, we are also handed the
`connExecutor` in a fresh state when calling `execCopyIn()`. During the
execution of `COPY`, it seems like sometimes we can pick up table descriptor
leases. In the noTxn case we'd like to drop those leases and generally leave
the executor in the fresh state in which it was handed to us. To deal with
that, we call `resetExtraTxnState` before returning in the happy noTxn case.

Fixes cockroachdb#52065.

Release note (bug fix): Fixed a bug when using the COPY protocol which could
prevent schema changes for up to 5 minutes.
ajwerner added a commit to ajwerner/cockroach that referenced this issue Aug 5, 2020
craig bot pushed a commit that referenced this issue Aug 5, 2020
48842: sql: fix portals after exhausting rows r=yuzefovich a=yuzefovich

Previously, we would erroneously restart the execution from the very
beginning of empty, unclosed portals after they had been fully
exhausted, when we should be returning no rows or an error in such
scenarios. This is now fixed by tracking whether a portal is exhausted
or not and intercepting the calls to `execStmt` when the conn executor
state machine is in an open state.

Note that the current solution has known deviations from Postgres:
- when attempting to execute portals of statements that don't return row
sets, on the second and subsequent attempts PG returns an error while we
silently do nothing (meaning we do not run the statement at all
and return 0 rows)
- we incorrectly populate the "command tag" field of pgwire messages for some
rows-returning statements after the portal suspension (for example,
a suspended UPDATE RETURNING in PG will return the total count of "rows
affected" while we will return the count since the last suspension).

These deviations are deemed acceptable since this commit fixes a much
worse problem - re-executing an exhausted portal (which could be
a mutation, meaning that, previously, we could have executed a mutation
multiple times).

The reasons why this commit does not address these deviations are:
- Postgres has a concept of "portal strategy"
(see https://github.com/postgres/postgres/blob/2f9661311b83dc481fc19f6e3bda015392010a40/src/include/utils/portal.h#L89).
- Postgres has a concept of "command" type (these are things like
SELECTs, UPDATEs, INSERTs, etc,
see https://github.com/postgres/postgres/blob/1aac32df89eb19949050f6f27c268122833ad036/src/include/nodes/nodes.h#L672).

CRDB doesn't have these concepts, and without them any attempt to
simulate Postgres results in very error-prone and brittle code.

Fixes: #48448.

Release note (bug fix): Previously, CockroachDB would erroneously
restart the execution of empty, unclosed portals after they had been
fully exhausted; this has been fixed.

52098: sql: better distribute distinct processors r=yuzefovich a=yuzefovich

**sql: better distribute distinct processors**

The distinct processors are planned in two stages - first, a "local"
stage is planned on the same nodes as the previous stage, then,
a "global" stage is added. Previously, the global stage would be planned
on the gateway, and this commit changes that to make it distributed - by
planning "global" distinct processors on the same nodes as the "local"
ones and connecting them via a hash router hashing the distinct columns.

Release note: None
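
A self-contained toy illustration of that two-stage shape (not actual DistSQL code; the node inputs and processor count are made up): each node deduplicates its own rows first, then rows are hash-routed on the distinct column so that every copy of a given value lands on the same "global" distinct processor.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// localDistinct deduplicates the rows produced on a single node ("local" stage).
func localDistinct(rows []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, r := range rows {
		if !seen[r] {
			seen[r] = true
			out = append(out, r)
		}
	}
	return out
}

// route hashes the distinct column so equal values always pick the same
// "global" distinct processor, which can then deduplicate across nodes.
func route(row string, numProcessors uint32) int {
	h := fnv.New32a()
	h.Write([]byte(row))
	return int(h.Sum32() % numProcessors)
}

func main() {
	// Pretend two nodes each produced some rows in the previous stage.
	node1 := localDistinct([]string{"a", "a", "b"})
	node2 := localDistinct([]string{"b", "c", "c"})

	const globalProcs = 2
	buckets := make([][]string, globalProcs)
	for _, r := range append(node1, node2...) {
		i := route(r, globalProcs)
		buckets[i] = append(buckets[i], r)
	}
	for i, b := range buckets {
		fmt.Printf("global distinct %d receives %v -> emits %v\n", i, b, localDistinct(b))
	}
}
```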

**sql: implement ConstructDistinct in the new factory**

Addresses: #47473.

Release note: None

52358: engine: set the MVCC timestamp on reads due to historical intents r=ajwerner a=ajwerner

Before this commit, we'd return a zero-value MVCC timestamp when reading an
intent from the intent history. This was problematic because elsewhere in the
code we assume that we will always get a non-zero MVCC timestamp when a read
returns a value. This is especially bizarre given that a read of the latest
intent will return its write timestamp.

The semantics here are such that we'll return the timestamp of the MVCC
metadata for the row. I think this is the most reasonable thing to do as
that timestamp ought to reflect the timestamp we return when we return the
current intent and furthermore is the only timestamp we really have around.
We could return the transaction's current read or write timestamp, but those
both seem like worse choices.

It's also worth noting that in the case where we need to read a non-zero
value, we don't really care what that value is and the fact that we are
reading this intent itself is somewhat suspect. That being said, I still
think we should make this change in addition to any change made to prevent
the code referenced in #49266 from needing it.

Fixes #49266.
Informs #50102.

Release note: None
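
To make that concrete, here is a self-contained toy model (not the real MVCC/engine code; the struct layout and sequence-number lookup are simplified assumptions): a read served from the intent history reports the timestamp taken from the intent's MVCC metadata rather than a zero timestamp, matching what a read of the latest intent already returns.

```go
package main

import "fmt"

type value struct {
	seq int
	raw string
}

// intent is a toy stand-in for a key's provisional write: the latest value
// plus an intent history of earlier values written by the same transaction.
type intent struct {
	writeTS int64 // timestamp from the intent's MVCC metadata
	latest  value
	history []value
}

// readAtSeq returns the value visible at the given sequence number together
// with the MVCC timestamp the read should report. Before the fix, reads
// served from the history reported 0; after it, they report writeTS.
func readAtSeq(in intent, seq int) (string, int64) {
	if seq >= in.latest.seq {
		return in.latest.raw, in.writeTS
	}
	for i := len(in.history) - 1; i >= 0; i-- {
		if in.history[i].seq <= seq {
			return in.history[i].raw, in.writeTS
		}
	}
	return "", 0 // nothing visible at this sequence number
}

func main() {
	in := intent{
		writeTS: 42,
		latest:  value{seq: 3, raw: "v3"},
		history: []value{{seq: 1, raw: "v1"}, {seq: 2, raw: "v2"}},
	}
	v, ts := readAtSeq(in, 2)
	fmt.Println(v, ts) // v2 42
}
```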

52384: sql: properly reset extraTxnState in COPY r=ajwerner a=ajwerner

Apparently we support some sort of COPY protocol that I know nothing about.
We allow operations in that protocol in the open state and in the noTxn state
in the connExecutor. In the noTxn state we let the `copyMachine` handle its
transaction lifecycle all on its own. In that case, we are also handed the
`connExecutor` in a fresh state when calling `execCopyIn()`. During the
execution of `COPY`, it seems like sometimes we can pick up table descriptor
leases. In the noTxn case we'd like to drop those leases and generally leave
the executor in the fresh state in which it was handed to us. To deal with
that, we call `resetExtraTxnState` before returning in the happy noTxn case.

Fixes #52065.

Release note (bug fix): Fixed a bug when using the COPY protocol which could
prevent schema changes for up to 5 minutes.

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
@craig craig bot closed this as completed in f5b7b67 Aug 6, 2020
@otan
Contributor Author

otan commented Aug 6, 2020

thanks!!!

ajwerner added a commit to ajwerner/cockroach that referenced this issue Aug 6, 2020