Implement sandbox to run WAL redo routines in #36

Closed

hlinnaka opened this issue Mar 9, 2021 · 18 comments
Labels: c/storage/pageserver Component: storage: pageserver

@hlinnaka
Contributor

hlinnaka commented Mar 9, 2021

No description provided.

@knizhnik
Contributor

I wonder what kind of attacks we want to address here. What kinds of vulnerabilities and injections are we expecting?

Do we assume that Postgres core (without user extensions) always produces valid WAL records?

If we are talking about extensions, do we plan to support only well-known (verified) extensions, or any user-defined extension? How can an extension produce incorrect WAL records? Do extensions use custom WAL records (or logical WAL records) whose correctness Postgres does not check? Or do we expect that a "Byzantine" extension will get direct access to the WAL files and intentionally corrupt them? If so, how are we going to protect ourselves from DoS-style attacks, where the WAL records are formally valid but cause very expensive operations consuming a lot of disk and CPU resources? A sandbox will not help much in that case, will it?

Final question: if we rewrite the redo handlers in some safe language, like Rust, then a sandbox will not be needed, will it? I looked at the redo handlers and it seems non-trivial, because they use a lot of AM-specific code. But it should be more efficient than running Postgres in a VM.

As far as I understand, the primary idea behind multi-tenant storage nodes was to reduce per-client overhead. But if we have to process each client's WAL records in a separate VM, won't that be even more expensive than dropping multi-tenancy support at the storage level (as well as at the compute nodes) and running a separate pager process for each client?

@kelvich
Contributor

kelvich commented Mar 13, 2021

Do we assume that Postgres core (without user extension) always produce valid WAL records?

A user can buy an exploit for a not-yet-disclosed CVE and take over the processing node -- after that we can expect malicious WAL to come into the storage node. It may just crash it, or take over the node if there is some bug in redo. Knowing the code that we are running on the processing node (postgres redo), this user can look for wrong boundary checks around WAL reading and just execute their own code. So IMO the whole attack is not a theoretical one -- an attacker just needs to wait a few months for the next CVE to write malicious WAL, and finding an exploitable problem in the redo code should also not be a big problem.

if we rewrite redo handlers in some safe language, like Rust, then sandbox will not needed, will it?

Rust safety means that the type system, when properly used, prevents data races and dangling pointers. But array boundary checks are still up to the programmer -- and that can lead to problems.

it should be more efficient that running Postgres in VM

I believe we can get the same performance between the host page server <-> postgres in a VM as between two processes talking to each other via a shared-memory queue. We may actually avoid running an OS in the VM at all (not sure that we should, but we can) -- just some parts of postgres reading one memory queue and writing to another, no files, no processes, etc.

Some links:

https://firecracker-microvm.github.io/
https://mirage.io/
https://qemu.readthedocs.io/en/latest/system/ivshmem.html

@knizhnik
Contributor

knizhnik commented Mar 13, 2021

But array boundary checks are still on programmer -- and that can lead to the problems.

I am not a Rust expert, but that does not seem to be true: Rust performs out-of-bounds checking at runtime.
There may be a compiler switch that disables the checks, and maybe the compiler can elide them when it can prove there is no out-of-bounds access, but by default they are performed:
https://users.rust-lang.org/t/bound-checking-really-present/21809
https://users.rust-lang.org/t/how-to-avoid-bounds-checking/4433/4
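
For example, a minimal sketch: indexing past the end of a Vec panics at runtime instead of reading out of bounds, in debug and release builds alike.

  // Panics with "index out of bounds: the len is 3 but the index is 10",
  // whether compiled with --release or not.
  fn main() {
      let v = vec![1, 2, 3];
      let i = 10; // imagine this index came from untrusted input
      println!("{}", v[i]);
  }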

@kelvich
Contributor

kelvich commented Mar 13, 2021

Yes, it is usually switched on in a debug build and switched off in a release build. We may keep it on for release, but working with raw memory means unsafe { } code blocks, with all guarantees off. To write something into a Rust array/vector from WAL you would first need to read a len from the raw buffer, then memcpy() data of that len. After that the Rust runtime can do boundary checks -- but will you believe that len in the first place? That is what I meant by manual boundary checks. So I don't understand how rewriting redo generally helps. It may be harder or impossible to penetrate such a system, but causing a crash/OOM can be done instead.
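
To make the untrusted-len point concrete, here is a minimal sketch (hypothetical record format: a 4-byte little-endian length prefix) of the validation a Rust redo implementation would need before copying anything:

  use std::convert::TryInto;

  // Parse a length-prefixed payload from an untrusted buffer. `len` is
  // attacker-controlled, so it must be checked against the actual buffer
  // size; `get()` returns None instead of reading out of bounds.
  fn read_payload(buf: &[u8]) -> Option<&[u8]> {
      let len_bytes: [u8; 4] = buf.get(..4)?.try_into().ok()?;
      let len = u32::from_le_bytes(len_bytes) as usize;
      buf.get(4..len.checked_add(4)?) // None if the claimed length overruns the buffer
  }

  fn main() {
      let wire = [3u8, 0, 0, 0, b'a', b'b', b'c'];
      assert_eq!(read_payload(&wire), Some(&b"abc"[..]));

      let lying = [255u8, 0, 0, 0, b'a']; // claims 255 payload bytes, has 1
      assert_eq!(read_payload(&lying), None);
  }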

@kelvich
Contributor

kelvich commented Mar 13, 2021

Thinking about this a bit more, it seems we could skip calling the VM redo code on getPage requests and use redo only for periodic WAL-chain compaction. So on a getPage request we could return the page plus the few records on top of it, and use the VM only to build a full-page image from time to time (to avoid sending it from the compute node), when the WAL chain becomes long.

@adsharma
Contributor

XLogInsert(RmgrId rmid, uint8 info) - can we get a frequency histogram of <rmid, info> for a typical WAL?

That would help us determine how much needs to be implemented in Rust, and where we can defer to redo code running elsewhere.
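
(pg_waldump --stats=record can already print a per-record-type breakdown like this for existing WAL.) As a sketch of the aggregation itself, assuming some decoder that yields one (rmid, info) pair per record:

  use std::collections::HashMap;

  // Count how often each <rmid, info> combination occurs; the sample input
  // below stands in for pairs decoded from a real WAL stream.
  fn histogram(records: impl IntoIterator<Item = (u8, u8)>) -> HashMap<(u8, u8), u64> {
      let mut counts = HashMap::new();
      for key in records {
          *counts.entry(key).or_insert(0) += 1;
      }
      counts
  }

  fn main() {
      let sample = vec![(10, 0x00), (10, 0x00), (11, 0x10)];
      for ((rmid, info), n) in histogram(sample) {
          println!("rmid={rmid} info={info:#04x} count={n}");
      }
  }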

@kelvich kelvich transferred this issue from neondatabase/postgres Apr 15, 2021
@kelvich
Contributor

kelvich commented May 10, 2021

One wild idea that came up during a talk with @ericseppanen: if we are able to compile and run postgres for a wasm target, we could also use it as a sandbox to apply WAL. While wasm should generally be slower, the absence of inter-VM communication may speed things up. I don't expect this approach would really work, for a variety of reasons, but it is worth investigating.

@hlinnaka
Contributor Author

Yeah, webassembly is worth investigating. Another interesting project I bumped into recently is https://gvisor.dev/. Also, there's https://github.com/rust-vmm for a full VM-level sandbox.

@ericseppanen
Contributor

I threw out the wasm idea as a random crazy thought, but having thought about it more I don't think sandboxing is a great idea.

I would think that WAL processing is performance-critical. Having to cross a sandbox boundary seems like a lot of overhead.

If we could reimplement WAL handling in Rust, we would get a memory safety guarantee. That code would have to be free of Rust panics (which could be used as DoS attacks), but that's not too difficult.

There are some earlier comments above talking about bounds checks that I think are incorrect: Rust bounds checks are always enabled: there is no way (short of a buggy library using unsafe code) to disable bounds checks in release mode or in any other mode. The only things that I can think of that are off by default in release mode are integer overflow checks, and they can be turned back on in Cargo.toml.
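
For reference, turning integer overflow checks back on for release builds is a one-line profile setting (a minimal Cargo.toml fragment):

  # Bounds checks need no such setting -- they are always on in safe Rust.
  [profile.release]
  overflow-checks = true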

@hlinnaka
Contributor Author

If we could reimplement WAL handling in Rust, we would get a memory safety guarantee. That code would have to be free of Rust panics (which could be used as DoS attacks), but that's not too difficult.

Yes. But then we would have to reimplement the WAL handling in Rust. And keep maintaining that for every PostgreSQL major release. Sometimes there have been bug fixes in minor releases too. It's not a huge amount of code, but it's still a fair amount and it's pretty deep down and scattered in the Postgres sources. I would really rather not go there if we can find another way.

The first TODO here is to set up some performance testing, though, to see how much overhead we're really talking about. It might be a non-issue, given all the other overhead, or it could be super-critical. Or maybe it can be mitigated e.g. by writing a full-page image WAL record every 1000 modifications to a page in the compute node, so that there's less WAL to replay in the common cases.
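
A minimal sketch of that mitigation (hypothetical names, not Neon's actual mechanism): count modifications per block in the compute node and emit a full-page image once a threshold is crossed, which bounds the length of the redo chain.

  use std::collections::HashMap;

  const FPI_INTERVAL: u64 = 1000; // modifications between full-page images

  // Tracks, per block number, how many modifications happened since the
  // last full-page image was written.
  struct FpiTracker {
      mods: HashMap<u32, u64>,
  }

  impl FpiTracker {
      fn new() -> Self {
          Self { mods: HashMap::new() }
      }

      // Returns true when this modification should be logged as a
      // full-page image rather than an incremental WAL record.
      fn record_modification(&mut self, blkno: u32) -> bool {
          let n = self.mods.entry(blkno).or_insert(0);
          *n += 1;
          if *n >= FPI_INTERVAL {
              *n = 0;
              true
          } else {
              false
          }
      }
  }

  fn main() {
      let mut tracker = FpiTracker::new();
      let fpis = (0..2500).filter(|_| tracker.record_modification(7)).count();
      assert_eq!(fpis, 2); // modifications 1000 and 2000 trigger full-page images
  }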

On a more philosophical level, I wish we could find a good, fast way to run Postgres code safely in the page server and elsewhere. Aside from replaying WAL, that would open up a lot of possibilities, like doing logical decoding, and running VACUUM and other jobs in the background.

@ericseppanen
Contributor

And keep maintaining that for every PostgreSQL major release. Sometimes there have been bug fixes in minor releases too.

We could have fuzz tests that run against every release. The fuzzer could generate inputs, run them through postgres, run the same inputs through our code, and verify the result is the same.
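
A minimal sketch of that differential check, with both redo oracles stubbed out (in a real harness one side would drive the C redo code and the other the Rust rewrite, over fuzzer-generated records):

  // Hypothetical oracles: each applies one WAL record to an 8 KB page image.
  fn redo_rust(page: &[u8; 8192], _record: &[u8]) -> [u8; 8192] { *page }     // stub
  fn redo_postgres(page: &[u8; 8192], _record: &[u8]) -> [u8; 8192] { *page } // stub

  fn differential_check(page: &[u8; 8192], record: &[u8]) {
      assert_eq!(
          redo_rust(page, record),
          redo_postgres(page, record),
          "redo implementations diverged on this record"
      );
  }

  fn main() {
      // A fuzzer would generate many (page, record) inputs; one fixed example here.
      differential_check(&[0u8; 8192], b"example-record-bytes");
  }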

It's not a huge amount of code, but it's still a fair amount and it's pretty deep down and scattered in the Postgres sources.

Would we need to extract the "deep down" parts into a standalone module, to run in the sandbox? Or is the sandboxed code a full copy of postgres?

@ericseppanen
Contributor

I do agree that we should understand the performance first. I was guessing that this is performance-critical -- if that's not true, then I don't think heroic "rewrite it in Rust" efforts are called for :)

@ericseppanen
Contributor

Another idea to add to the list here: Linux SECCOMP_SET_MODE_STRICT will allow you to run a process that can only read from & write to already-opened file descriptors. I think the intent is that the sandboxed program inherits some pipe fds for communication, and isn't allowed to do anything else.
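
A minimal sketch of entering that mode (assuming the libc crate, using the older prctl spelling of SECCOMP_SET_MODE_STRICT). After the call, only read(2)/write(2) on already-open fds, _exit(2), and sigreturn(2) are permitted -- any other syscall, including allocation via mmap/brk, kills the process.

  fn enter_strict_sandbox() {
      let rc = unsafe { libc::prctl(libc::PR_SET_SECCOMP, libc::SECCOMP_MODE_STRICT, 0, 0, 0) };
      assert_eq!(rc, 0, "failed to enter seccomp strict mode");
  }

  fn main() {
      // Open or inherit all communication fds *before* locking down.
      enter_strict_sandbox();
      let msg = b"still able to write\n";
      unsafe {
          libc::write(1, msg.as_ptr() as *const libc::c_void, msg.len());
          // Rust's normal shutdown path calls exit_group(2), which strict
          // mode does not whitelist, so terminate via _exit(2) explicitly.
          libc::_exit(0);
      }
  }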

@ericseppanen
Contributor

I chatted with @kelvich about this. I will try to summarize:

  • wasm will take an unknown amount of effort to work. I'm scared of this becoming an infinite time-sink.
  • It's probably too early to talk about pushing this work to a cloud-level VM.
  • Firecracker VMs look OK. Should investigate.
  • containers don't provide great security guarantees (large syscall surface area). dockerd is overkill, runc is a little under-developed and hard to get right.
  • Linux seccomp looks OK. Should investigate.

We didn't talk much about the "native Rust" option. My current thinking:

"Rewrite it in rust" may take a lot of time, and will need constant maintenance to stay in sync with new Postgres releases. So even if we might like this as a long-term option, it would be nice to have something easy in the short term. Plus, even if we do a Rust wal_redo implementation, it would be good to keep the C wal_redo_main so we can verify that the two implementations return the same result.

@hlinnaka hlinnaka added this to the Launch milestone Jun 4, 2021
@hlinnaka
Contributor Author

hlinnaka commented Jun 4, 2021

neondatabase/postgres#52 will take care of this, with seccomp

@kelvich kelvich added the c/storage/pageserver Component: storage: pageserver label Jul 8, 2021
@mklgallegos

hi @funbringer it looks like this issue is waiting on a code review from you. Not trying to bother, just wanted to check in since this issue is scheduled to be completed in milestone v0.1 (which concludes at the end of the month).

Let me know if this looks like something that will slip past the end of the month.

@funbringer
Contributor

funbringer commented Jul 16, 2021

@mklgallegos Eric's patch has been superseded by neondatabase/postgres#56, and I've already pushed a bunch of bug fixes in neondatabase/postgres#58. There's also an ongoing discussion here, which will hopefully be resolved after I've finished yet another patch.

@mklgallegos

Hi @funbringer - this item was put into the done category during the milestone v0.2 planning session on 2021-07-30. I'm closing this out. Please re-open if I am mistaken.

problame added a commit that referenced this issue Jan 14, 2025
This approach doesn't work because it slows down all the responses.

The workload() thread gets stuck in auth, probably with zero pipeline depth:

  0x00007fa28fe48e63 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  (gdb) bt
  #0  0x00007fa28fe48e63 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  #1  0x0000561a285bf44e in WaitEventSetWaitBlock (set=0x561a292723e8, cur_timeout=9999, occurred_events=0x7ffd1f11c970, nevents=1)
      at /home/christian/src/neon//vendor/postgres-v16/src/backend/storage/ipc/latch.c:1535
  #2  0x0000561a285bf338 in WaitEventSetWait (set=0x561a292723e8, timeout=9999, occurred_events=0x7ffd1f11c970, nevents=1, wait_event_info=117440512)
      at /home/christian/src/neon//vendor/postgres-v16/src/backend/storage/ipc/latch.c:1481
  #3  0x00007fa2904a7345 in call_PQgetCopyData (shard_no=0, buffer=0x7ffd1f11cad0) at /home/christian/src/neon//pgxn/neon/libpagestore.c:703
  #4  0x00007fa2904a7aec in pageserver_receive (shard_no=0) at /home/christian/src/neon//pgxn/neon/libpagestore.c:899
  #5  0x00007fa2904af471 in prefetch_read (slot=0x561a292863b0) at /home/christian/src/neon//pgxn/neon/pagestore_smgr.c:644
  #6  0x00007fa2904af26b in prefetch_wait_for (ring_index=0) at /home/christian/src/neon//pgxn/neon/pagestore_smgr.c:596
  #7  0x00007fa2904b489d in neon_read_at_lsnv (rinfo=..., forkNum=MAIN_FORKNUM, base_blockno=0, request_lsns=0x7ffd1f11cd60, buffers=0x7ffd1f11cd30, nblocks=1, mask=0x0)
      at /home/christian/src/neon//pgxn/neon/pagestore_smgr.c:3024
  #8  0x00007fa2904b4f34 in neon_read_at_lsn (rinfo=..., forkNum=MAIN_FORKNUM, blkno=0, request_lsns=..., buffer=0x7fa28b969000) at /home/christian/src/neon//pgxn/neon/pagestore_smgr.c:3104
  #9  0x00007fa2904b515d in neon_read (reln=0x561a292ef448, forkNum=MAIN_FORKNUM, blkno=0, buffer=0x7fa28b969000) at /home/christian/src/neon//pgxn/neon/pagestore_smgr.c:3146
  #10 0x0000561a285f1ed5 in smgrread (reln=0x561a292ef448, forknum=MAIN_FORKNUM, blocknum=0, buffer=0x7fa28b969000) at /home/christian/src/neon//vendor/postgres-v16/src/backend/storage/smgr/smgr.c:567
  #11 0x0000561a285a528a in ReadBuffer_common (smgr=0x561a292ef448, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0, hit=0x7ffd1f11cf1b)
      at /home/christian/src/neon//vendor/postgres-v16/src/backend/storage/buffer/bufmgr.c:1134
  #12 0x0000561a285a47e3 in ReadBufferExtended (reln=0x7fa28ce1c888, forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0)
      at /home/christian/src/neon//vendor/postgres-v16/src/backend/storage/buffer/bufmgr.c:781
  #13 0x0000561a285a46ad in ReadBuffer (reln=0x7fa28ce1c888, blockNum=0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/storage/buffer/bufmgr.c:715
  #14 0x0000561a2811d511 in _bt_getbuf (rel=0x7fa28ce1c888, blkno=0, access=1) at /home/christian/src/neon//vendor/postgres-v16/src/backend/access/nbtree/nbtpage.c:852
  #15 0x0000561a2811d1b2 in _bt_metaversion (rel=0x7fa28ce1c888, heapkeyspace=0x7ffd1f11d9f0, allequalimage=0x7ffd1f11d9f1) at /home/christian/src/neon//vendor/postgres-v16/src/backend/access/nbtree/nbtpage.c:747
  #16 0x0000561a28126220 in _bt_first (scan=0x561a292d0348, dir=ForwardScanDirection) at /home/christian/src/neon//vendor/postgres-v16/src/backend/access/nbtree/nbtsearch.c:1465
  #17 0x0000561a28121a07 in btgettuple (scan=0x561a292d0348, dir=ForwardScanDirection) at /home/christian/src/neon//vendor/postgres-v16/src/backend/access/nbtree/nbtree.c:246
  #18 0x0000561a28111afa in index_getnext_tid (scan=0x561a292d0348, direction=ForwardScanDirection) at /home/christian/src/neon//vendor/postgres-v16/src/backend/access/index/indexam.c:583
  #19 0x0000561a28111d14 in index_getnext_slot (scan=0x561a292d0348, direction=ForwardScanDirection, slot=0x561a292d01a8) at /home/christian/src/neon//vendor/postgres-v16/src/backend/access/index/indexam.c:675
  #20 0x0000561a2810fbcc in systable_getnext (sysscan=0x561a292d0158) at /home/christian/src/neon//vendor/postgres-v16/src/backend/access/index/genam.c:512
  #21 0x0000561a287a1ee1 in SearchCatCacheMiss (cache=0x561a292a0f80, nkeys=1, hashValue=3028463561, hashIndex=1, v1=94670359561576, v2=0, v3=0, v4=0)
      at /home/christian/src/neon//vendor/postgres-v16/src/backend/utils/cache/catcache.c:1440
  #22 0x0000561a287a1d8a in SearchCatCacheInternal (cache=0x561a292a0f80, nkeys=1, v1=94670359561576, v2=0, v3=0, v4=0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/utils/cache/catcache.c:1360
  #23 0x0000561a287a1a4f in SearchCatCache (cache=0x561a292a0f80, v1=94670359561576, v2=0, v3=0, v4=0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/utils/cache/catcache.c:1214
  #24 0x0000561a287be060 in SearchSysCache (cacheId=10, key1=94670359561576, key2=0, key3=0, key4=0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/utils/cache/syscache.c:817
  #25 0x0000561a287be66f in GetSysCacheOid (cacheId=10, oidcol=1, key1=94670359561576, key2=0, key3=0, key4=0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/utils/cache/syscache.c:1055
  #26 0x0000561a286319a5 in get_role_oid (rolname=0x561a29270568 "cloud_admin", missing_ok=true) at /home/christian/src/neon//vendor/postgres-v16/src/backend/utils/adt/acl.c:5251
  #27 0x0000561a283d42ca in check_hba (port=0x561a29268de0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/libpq/hba.c:2493
  #28 0x0000561a283d5537 in hba_getauthmethod (port=0x561a29268de0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/libpq/hba.c:3067
  #29 0x0000561a283c6fd7 in ClientAuthentication (port=0x561a29268de0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/libpq/auth.c:395
  #30 0x0000561a287dc943 in PerformAuthentication (port=0x561a29268de0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/utils/init/postinit.c:247
  #31 0x0000561a287dd9cd in InitPostgres (in_dbname=0x561a29270588 "postgres", dboid=0, username=0x561a29270568 "cloud_admin", useroid=0, load_session_libraries=true, override_allow_connections=false,
      out_dbname=0x0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/utils/init/postinit.c:929
  #32 0x0000561a285fa10b in PostgresMain (dbname=0x561a29270588 "postgres", username=0x561a29270568 "cloud_admin") at /home/christian/src/neon//vendor/postgres-v16/src/backend/tcop/postgres.c:4293
  #33 0x0000561a28524ce4 in BackendRun (port=0x561a29268de0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/postmaster/postmaster.c:4465
  #34 0x0000561a285245da in BackendStartup (port=0x561a29268de0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/postmaster/postmaster.c:4193
  #35 0x0000561a285209c4 in ServerLoop () at /home/christian/src/neon//vendor/postgres-v16/src/backend/postmaster/postmaster.c:1782
  #36 0x0000561a2852030f in PostmasterMain (argc=3, argv=0x561a291c5fc0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/postmaster/postmaster.c:1466
  #37 0x0000561a283dd987 in main (argc=3, argv=0x561a291c5fc0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/main/main.c:238