Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TimescaleDB Background Worker Scheduler crash with "malloc.c:2394: sysmalloc: Assertion ... failed" #3469

Closed
jankatins opened this issue Aug 10, 2021 · 8 comments · Fixed by #3557
Assignees
Labels
bug segfault Segmentation fault

Comments

@jankatins
Copy link

jankatins commented Aug 10, 2021

Relevant system information:

  • OS: fedora 33
  • PostgreSQL version (output of postgres --version): 12.7
  • TimescaleDB version (output of \dx in psql): 2.4
  • Installation method: internal build

Describe the bug
During a automated test, we saw the following error in the log which lead to crashing tests. This error does not happen often (just saw it for the second time). The failing test was running DROP EXTENSION timescaledb CASCADE in a new connection at that time (unsure what else was running in the background). Later connections showed that the DB was was in recovery mode.

pid=55895,user=avnadmin,db=defaultdb,app=[unknown],client=[local] LOG:  disconnection: session time: 0:00:00.179 user=avnadmin database=defaultdb host=[local]
pid=55896,user=,db=,app=,client= LOG:  terminating TimescaleDB job scheduler due to administrator command
pid=55896,user=,db=,app=,client= FATAL:  terminating connection due to administrator command
pid=55417,user=,db=,app=,client= LOG:  background worker "TimescaleDB Background Worker Scheduler" (PID 55896) exited with exit code 1
pid=55898,user=[unknown],db=[unknown],app=[unknown],client=[local] LOG:  connection received: host=[local]
pid=55898,user=avnadmin,db=defaultdb,app=[unknown],client=[local] LOG:  connection authorized: user=avnadmin database=defaultdb
eDB Background Worker Scheduler   : malloc.c:2394: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.
pid=55417,user=,db=,app=,client= LOG:  background worker "TimescaleDB Background Worker Scheduler" (PID 55897) was terminated by signal 6: Aborted
pid=55417,user=,db=,app=,client= LOG:  terminating any other active server processes
@mfundul
Copy link
Contributor

mfundul commented Aug 11, 2021

@jankatins can you get a backtrace of this?

It appears this is some malloc memory corruption, which timescaledb seldomly uses.

This could be a bug in timescaledb connection code which explicitly uses malloc(), or, probably less likely, in remote execution code or fdw code.

It could also be a postgresql race condition for DROP EXTENSION but I don't believe that it is at this point.

@jankatins
Copy link
Author

This happens rarely, I cannot reproduce at will :-(. In this case, the above is the only output I have (happened on a jenkins).

@jankatins
Copy link
Author

Some background, as I just saw it again in the same test:

        try:
            cursor.execute("CREATE TABLE foohyper (time TIMESTAMPTZ NOT NULL, loc TEXT NOT NULL, temp FLOAT, humid FLOAT)")
            cursor.execute("SELECT create_hypertable('foohyper', 'time')")
            cursor.execute("INSERT INTO foohyper(time, loc, temp, humid) VALUES (NOW(), 'restaurant', 65.0, 66.0)")
            cursor.execute("SELECT * FROM foohyper ORDER BY time DESC LIMIT 100")
            assert cursor.fetchall()[0]["loc"] == "restaurant"
            cursor.execute("SELECT * FROM _timescaledb_internal.get_os_info()")
            assert cursor.fetchone()["sysname"] == "Linux"
            cursor.execute("SELECT * FROM timescaledb_pre_restore()")
            assert cursor.fetchone()["timescaledb_pre_restore"] is True
            cursor.execute("SELECT * FROM timescaledb_post_restore()")
            assert cursor.fetchone()["timescaledb_post_restore"] is True
        finally:
>           always_reconnecting_cursor.execute("DROP EXTENSION timescaledb CASCADE")

It crashes in the last step. always_reconnecting_cursor is basically opening a new connection and cursor and then executes this.

I also think (so no hard data :-() that I only saw it recently, so only with tsdb 2.4. At least twice in two days makes it at least more frequent in the 2.4 release.

@mfundul
Copy link
Contributor

mfundul commented Aug 12, 2021

I wonder if it's related to #3434
This PR was introduced in v2.4.0

@mkindahl mkindahl added bug segfault Segmentation fault labels Aug 17, 2021
@jankatins
Copy link
Author

We now saw it again in a simplified test case:
Create PG, wait to be able to connect, create extensions, run SELECT * FROM timescaledb_pre_restore(). Logs:

pid=180616,user=avnadmin,db=defaultdb,app=[unknown],client=[local] WARNING:  
	WELCOME TO
	 _____ _                               _     ____________  
	|_   _(_)                             | |    |  _  \ ___ \ 
	  | |  _ _ __ ___   ___  ___  ___ __ _| | ___| | | | |_/ / 
	  | | | |  _ ` _ \ / _ \/ __|/ __/ _` | |/ _ \ | | | ___ \ 
	  | | | | | | | | |  __/\__ \ (_| (_| | |  __/ |/ /| |_/ /
	  |_| |_|_| |_| |_|\___||___/\___\__,_|_|\___|___/ \____/
	               Running version 2.4.1
	For more information on TimescaleDB, please visit the following links:
	
	 1. Getting started: https://docs.timescale.com/timescaledb/latest/getting-started
	 2. API reference documentation: https://docs.timescale.com/api/latest
	 3. How TimescaleDB is designed: https://docs.timescale.com/timescaledb/latest/overview/core-concepts
	
	Note: Please enable telemetry to help us improve our product by running: ALTER DATABASE "defaultdb" SET timescaledb.telemetry_level = 'basic';
	
pid=180616,user=avnadmin,db=defaultdb,app=[unknown],client=[local] CONTEXT:  PL/pgSQL function inline_code_block line 12 at RAISE
eDB Background Worker Scheduler : malloc.c:2394: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.
pid=180623,user=,db=,app=,client= LOG:  pg_cron scheduler started
pid=180590,user=,db=,app=,client= LOG:  background worker "TimescaleDB Background Worker Scheduler" (PID 180620) was terminated by signal 6: Aborted
pid=180590,user=,db=,app=,client= LOG:  terminating any other active server processes
pid=180597,user=,db=,app=,client= WARNING:  terminating connection because of crash of another server process
pid=180597,user=,db=,app=,client= DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
pid=180597,user=,db=,app=,client= HINT:  In a moment you should be able to reconnect to the database and repeat your command.
pid=180616,user=avnadmin,db=defaultdb,app=[unknown],client=[local] WARNING:  terminating connection because of crash of another server process
pid=180616,user=avnadmin,db=defaultdb,app=[unknown],client=[local] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
pid=180616,user=avnadmin,db=defaultdb,app=[unknown],client=[local] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
pid=180616,user=avnadmin,db=defaultdb,app=[unknown],client=[local] CONTEXT:  SQL statement "SELECT _timescaledb_internal.stop_background_workers()"
	PL/pgSQL function public.timescaledb_pre_restore() line 8 at PERFORM
pid=180609,user=postgres,db=_aiven,app=aiven-pruned,client=[local] WARNING:  terminating connection because of crash of another server process
pid=180609,user=postgres,db=_aiven,app=aiven-pruned,client=[local] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
pid=180609,user=postgres,db=_aiven,app=aiven-pruned,client=[local] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
pid=180590,user=,db=,app=,client= LOG:  all server processes terminated; reinitializing

@jankatins
Copy link
Author

And we have a stack trace:

* 1    Thread 0x7f65f2006900 (LWP 180620) 0x00007f65f42aa9e5 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f65f42aa9e5 in raise () from /lib64/libc.so.6
#1  0x00007f65f42938a4 in abort () from /lib64/libc.so.6
#2  0x00007f65f42f4eba in __malloc_assert () from /lib64/libc.so.6
#3  0x00007f65f42f6f24 in sysmalloc () from /lib64/libc.so.6
#4  0x00007f65f42f87c2 in _int_malloc () from /lib64/libc.so.6
#5  0x00007f65f42f9754 in malloc () from /lib64/libc.so.6
#6  0x00007f65f42a74f0 in _nl_make_l10nflist.localalias () from /lib64/libc.so.6
#7  0x00007f65f42a5317 in _nl_find_domain () from /lib64/libc.so.6
#8  0x00007f65f42a4bad in __dcigettext () from /lib64/libc.so.6
#9  0x0000000000981cb1 in errmsg ()
#10 0x00007f65e4a07fd8 in entrypoint_sigterm (postgres_signal_arg=15)
    at /srv/jenkins/workspace/user-aiven-deps-f33/deps/timescaledb/build-asl-13/src/loader/bgw_launcher.c:813
#11 <signal handler called>
#12 0x00007f65f42f6949 in sysmalloc () from /lib64/libc.so.6
#13 0x00007f65f42f87c2 in _int_malloc () from /lib64/libc.so.6
#14 0x00007f65f42f9754 in malloc () from /lib64/libc.so.6
#15 0x00007f65f4d31966 in OPENSSL_LH_insert () from /lib64/libcrypto.so.1.1
#16 0x00007f65f4cfde03 in ERR_load_strings_const () from /lib64/libcrypto.so.1.1
#17 0x00007f65f4c75464 in ERR_load_ASN1_strings () from /lib64/libcrypto.so.1.1
#18 0x00007f65f4d2fc86 in ossl_init_load_crypto_strings_ossl_ () from /lib64/libcrypto.so.1.1
#19 0x00007f65f512617f in __pthread_once_slow () from /lib64/libpthread.so.0
#20 0x00007f65f4d6f76d in CRYPTO_THREAD_run_once () from /lib64/libcrypto.so.1.1
#21 0x00007f65f4d2d11b in OPENSSL_init_crypto () from /lib64/libcrypto.so.1.1
#22 0x00007f65f4cfea86 in ERR_get_state () from /lib64/libcrypto.so.1.1
#23 0x00007f65f4d01fcd in ERR_clear_error () from /lib64/libcrypto.so.1.1
#24 0x00007f65f4d2c8e7 in ossl_init_engine_rdrand_ossl_ () from /lib64/libcrypto.so.1.1
#25 0x00007f65f512617f in __pthread_once_slow () from /lib64/libpthread.so.0
#26 0x00007f65f4d6f76d in CRYPTO_THREAD_run_once () from /lib64/libcrypto.so.1.1
#27 0x00007f65f4d2d2fb in OPENSSL_init_crypto () from /lib64/libcrypto.so.1.1
#28 0x00007f65f4d2c6a4 in ossl_init_config_ossl_ () from /lib64/libcrypto.so.1.1
#29 0x00007f65f512617f in __pthread_once_slow () from /lib64/libpthread.so.0
#30 0x00007f65f4d6f76d in CRYPTO_THREAD_run_once () from /lib64/libcrypto.so.1.1
#31 0x00007f65f4d2d266 in OPENSSL_init_crypto () from /lib64/libcrypto.so.1.1
#32 0x00007f65f4f14655 in OPENSSL_init_ssl () from /lib64/libssl.so.1.1
--Type <RET> for more, q to quit, c to continue without paging--
#33 0x00007f65d35b2a15 in _conn_ssl_init () at /srv/jenkins/workspace/user-aiven-deps-f33/deps/timescaledb/build-asl-13/src/net/conn_ssl.c:261
#34 _PG_init () at /srv/jenkins/workspace/user-aiven-deps-f33/deps/timescaledb/build-asl-13/src/init.c:89
#35 0x00000000009869df in internal_load_library.lto_priv ()
#36 0x00000000009872a6 in load_external_function ()
#37 0x00007f65e4a06bc3 in do_load () at /srv/jenkins/workspace/user-aiven-deps-f33/deps/timescaledb/build-asl-13/src/loader/loader.c:648
#38 0x00007f65e4a08d2e in extension_check () at /srv/jenkins/workspace/user-aiven-deps-f33/deps/timescaledb/build-asl-13/src/loader/loader.c:683
#39 extension_check () at /srv/jenkins/workspace/user-aiven-deps-f33/deps/timescaledb/build-asl-13/src/loader/loader.c:666
#40 ts_loader_extension_check () at /srv/jenkins/workspace/user-aiven-deps-f33/deps/timescaledb/build-asl-13/src/loader/loader.c:695
#41 ts_bgw_db_scheduler_entrypoint (fcinfo=<optimized out>)
    at /srv/jenkins/workspace/user-aiven-deps-f33/deps/timescaledb/build-asl-13/src/loader/bgw_launcher.c:935
#42 0x00000000007b6e84 in StartBackgroundWorker ()
#43 0x00000000007c6082 in maybe_start_bgworkers ()
#44 0x00000000007c62cd in sigusr1_handler ()
#45 <signal handler called>
#46 0x00007f65f4365b3a in select () from /lib64/libc.so.6
#47 0x00000000007c72a4 in ServerLoop ()
#48 0x00000000007c8eb2 in PostmasterMain ()
#49 0x00000000004f90d3 in main ()
(gdb) 

@mfundul
Copy link
Contributor

mfundul commented Sep 7, 2021

@mkindahl
Copy link
Contributor

mkindahl commented Sep 7, 2021

The steps triggering this bug are:

  1. Background workers are starting, probably as a result of timescaledb_post_restore and start to initialize SSL.
  2. While the background workers are starting, a sigterm is sent to the background worker, triggering the signal handler. This could be caused by the DROP EXTENSION.
  3. The signal handler is running a malloc (because of the ereport in launcher_sigterm) in the middle of a malloc (called from SSL_library_init), which is not signal-safe.
  4. Second call of malloc triggers the assert above, which tears down the server.

mkindahl added a commit to mkindahl/timescaledb that referenced this issue Sep 7, 2021
Reporting errors using `ereport` can call `malloc()`, which is not
signal-safe. Using `ereport()` in a signal handler can therefore cause
`malloc()` to run nested inside `malloc()` if the termination handler
is called in the middle of a `malloc()` call, which will trigger an
assertion in `malloc()` that will take down the server.

This commit fixes this by using the signal-safe `write_stderr()` inside
the signal handlers for the background workers.

Fixes timescale#3469
@mkindahl mkindahl self-assigned this Sep 7, 2021
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Sep 7, 2021
Reporting errors using `ereport` can call `malloc()`, which is not
signal-safe. Using `ereport()` in a signal handler can therefore cause
`malloc()` to run nested inside `malloc()` if the termination handler
is called in the middle of a `malloc()` call, which will trigger an
assertion in `malloc()` that will take down the server.

This commit fixes this by using the signal-safe `write_stderr()` inside
the signal handlers for the background workers.

Fixes timescale#3469
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Sep 8, 2021
Reporting errors using `ereport` can call `malloc()`, which is not
signal-safe. Using `ereport()` in a signal handler can therefore cause
`malloc()` to run nested inside `malloc()` if the termination handler
is called in the middle of a `malloc()` call, which will trigger an
assertion in `malloc()` that will take down the server.

This commit fixes this by using the signal-safe `write_stderr()` inside
the signal handlers for the background workers.

Fixes timescale#3469
mkindahl added a commit that referenced this issue Sep 8, 2021
Reporting errors using `ereport` can call `malloc()`, which is not
signal-safe. Using `ereport()` in a signal handler can therefore cause
`malloc()` to run nested inside `malloc()` if the termination handler
is called in the middle of a `malloc()` call, which will trigger an
assertion in `malloc()` that will take down the server.

This commit fixes this by using the signal-safe `write_stderr()` inside
the signal handlers for the background workers.

Fixes #3469
svenklemm pushed a commit that referenced this issue Sep 18, 2021
Reporting errors using `ereport` can call `malloc()`, which is not
signal-safe. Using `ereport()` in a signal handler can therefore cause
`malloc()` to run nested inside `malloc()` if the termination handler
is called in the middle of a `malloc()` call, which will trigger an
assertion in `malloc()` that will take down the server.

This commit fixes this by using the signal-safe `write_stderr()` inside
the signal handlers for the background workers.

Fixes #3469
svenklemm pushed a commit that referenced this issue Sep 18, 2021
Reporting errors using `ereport` can call `malloc()`, which is not
signal-safe. Using `ereport()` in a signal handler can therefore cause
`malloc()` to run nested inside `malloc()` if the termination handler
is called in the middle of a `malloc()` call, which will trigger an
assertion in `malloc()` that will take down the server.

This commit fixes this by using the signal-safe `write_stderr()` inside
the signal handlers for the background workers.

Fixes #3469
svenklemm added a commit to svenklemm/timescaledb that referenced this issue Sep 20, 2021
This release contains bug fixes since the 2.4.1 release.
We deem it high priority to upgrade.

**Bugfixes**
* timescale#3437 Rename on all continuous aggregate objects
* timescale#3469 Use signal-safe functions in signal handler
* timescale#3520 Modify compression job processing logic
* timescale#3527 Fix time_bucket_ng behaviour with origin argument
* timescale#3532 Fix bootstrap with regresschecks disabled
* timescale#3574 Fix failure on job execution by background worker
* timescale#3590 Call cleanup functions on backend exit

**Thanks**
* @jankatins for reporting a crash with background workers
* @LutzWeischerFujitsu for reporting an issue with bootstrap
svenklemm added a commit that referenced this issue Sep 20, 2021
This release contains bug fixes since the 2.4.1 release.
We deem it high priority to upgrade.

**Bugfixes**
* #3437 Rename on all continuous aggregate objects
* #3469 Use signal-safe functions in signal handler
* #3520 Modify compression job processing logic
* #3527 Fix time_bucket_ng behaviour with origin argument
* #3532 Fix bootstrap with regresschecks disabled
* #3574 Fix failure on job execution by background worker
* #3590 Call cleanup functions on backend exit

**Thanks**
* @jankatins for reporting a crash with background workers
* @LutzWeischerFujitsu for reporting an issue with bootstrap
svenklemm added a commit that referenced this issue Sep 20, 2021
This release contains bug fixes since the 2.4.1 release.
We deem it high priority to upgrade.

**Bugfixes**
* #3437 Rename on all continuous aggregate objects
* #3469 Use signal-safe functions in signal handler
* #3520 Modify compression job processing logic
* #3527 Fix time_bucket_ng behaviour with origin argument
* #3532 Fix bootstrap with regresschecks disabled
* #3574 Fix failure on job execution by background worker
* #3590 Call cleanup functions on backend exit

**Thanks**
* @jankatins for reporting a crash with background workers
* @LutzWeischerFujitsu for reporting an issue with bootstrap
svenklemm pushed a commit that referenced this issue Sep 20, 2021
Reporting errors using `ereport` can call `malloc()`, which is not
signal-safe. Using `ereport()` in a signal handler can therefore cause
`malloc()` to run nested inside `malloc()` if the termination handler
is called in the middle of a `malloc()` call, which will trigger an
assertion in `malloc()` that will take down the server.

This commit fixes this by using the signal-safe `write_stderr()` inside
the signal handlers for the background workers.

Fixes #3469
svenklemm added a commit that referenced this issue Sep 20, 2021
This release contains bug fixes since the 2.4.1 release.
We deem it high priority to upgrade.

**Bugfixes**
* #3437 Rename on all continuous aggregate objects
* #3469 Use signal-safe functions in signal handler
* #3520 Modify compression job processing logic
* #3527 Fix time_bucket_ng behaviour with origin argument
* #3532 Fix bootstrap with regresschecks disabled
* #3574 Fix failure on job execution by background worker
* #3590 Call cleanup functions on backend exit

**Thanks**
* @jankatins for reporting a crash with background workers
* @LutzWeischerFujitsu for reporting an issue with bootstrap
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug segfault Segmentation fault
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants