Clean up debug.zig, particularly stack trace formating #8228

rgreenblatt · 2021-03-12T23:37:08Z

The idea with this PR is to allow for overriding at a more granular level than just overriding panic. This is useful for the freestanding context so you don't have to implement all of the necessary os functions. It also might be useful when on a supported OS because it allows for more precise overriding. This should allow for easier usage of std.testing, GeneralPurposeAllocator, and panicExtra in the freestanding context.

This PR is now 95% clean up of the debug interface and only adds the ability to overriding the mapping of addresses to symbols and how stack traces are collected. This will allow for easer usage of std.testing and panicExtra, but not GeneralPurposeAllocator.

~~See #8228 (comment) for what a full config looks like.~~

I have also heavily refactored how std.debug is layed out including removing printSourceAtAddress. ~~I believe there are no other changes (other than additions) to the public interface of std.debug, but I could be wrong.~~ This PR also changes the name of getStderrMutex to getPrintMutex ~~(because the writer can now be overriden)~~, makes it so that captureStackTrace takes an allocator and can return an error, and changes the inputs of panicExtra. I also cleaned up a bunch of duplicated code related to dumping stack traces.

This functions adds the ability to override the writer used by std.debug. As such, we must also add the capacity to override detectTTYConfig. ~~I went ahead and also added these capabilites to std.log.~~ See here: #8228 (comment)

Related: #8214

TODO:

Misc design issues (naming, overrides, etc.)
Fixing existing code in the std library which this PR breaks.
Testing (I have done manual testing, but maybe some actual unit tests would be a good idea?)
Docs

lib/std/held_value.zig

lib/std/debug.zig

lib/std/log.zig

lib/std/debug.zig

rgreenblatt · 2021-03-13T00:08:31Z

Note that a bunch of the overrides have now been removed to a separate PR.

Here are all of the new overrides (better names/design would be appreciated):

const std = @import("std");

pub const debug_config = struct {
    pub const SymbolMap = struct {
        const SymbolInfo = std.debug.SymbolInfo;

        pub fn addressToSymbol(self: *@This(), address: usize) !SymbolInfo {
            @compileError("Unimplemented");
        }

        pub fn init(allocator: *std.mem.Allocator) !@This() {
            @compileError("Unimplemented");
        }

        pub fn deinit(self : *@This()) void {
            @compileError("Unimplemented");
        }
    };

    pub fn captureStackTraceFrom(
        allocator: *std.mem.Allocator,
        first_address: ?usize,
        base_pointer: ?usize,
        stack_trace: *std.builtin.StackTrace,
    ) !void {
        @compileError("Unimplemented");
    }
};

LemonBoy · 2021-03-13T08:36:58Z

I generally like this idea but I'd take it a step further.
To print a stack trace one must collect the frame addresses and then map them into symbols, the two steps are separated and independent one from the other. AddressSymbolMapping (bikeshed moment, I prefer SymbolMap) is a nice abstraction for the symbolication phase but the former is still non-overrideable by the user and hardwired to the manual frame-walking StackIterator.
We need a further bit in the API:

fn collectStackTrace(allocator: *std.mem.Allocator, trace *StackTrace) !void;

The allocator part is needed for some strategies (eg. I've implemented a DWARF unwinder and it needs to parse quite a bit of things) while trace is a user-provided buffer we're going to fill.

Nits about AddressSymbolMapping API:

Needs a deinit to free all the allocated memory, this is not important if the program is going to abort but it's useful if some user is only collecting some traces at runtime.
You can move std.Debug.SymbolInfo definition inside the struct and call it Symbol, let's keep the API neat and tidy by having a single symbol info.

lib/std/builtin.zig

lib/std/debug.zig

LemonBoy · 2021-03-13T08:46:07Z

lib/std/debug.zig

 }

 /// TODO multithreaded awareness
-var self_debug_info: ?DebugInfo = null;
+var symbol_at_address: ?AddrSymMapping = null;


While we're at it let's consider if this global cache should stay, the hard part is deciding when to deinitialize it.

rgreenblatt · 2021-03-13T15:17:38Z

AddressSymbolMapping (bikeshed moment, I prefer SymbolMap)

I think this is better, I went ahead and renamed.

rgreenblatt · 2021-03-13T15:42:44Z

To print a stack trace one must collect the frame addresses and then map them into symbols, the two steps are separated and independent one from the other. AddressSymbolMapping (bikeshed moment, I prefer SymbolMap) is a nice abstraction for the symbolication phase but the former is still non-overrideable by the user and hardwired to the manual frame-walking StackIterator.
We need a further bit in the API:
fn collectStackTrace(allocator: *std.mem.Allocator, trace *StackTrace) !void;
The allocator part is needed for some strategies (eg. I've implemented a DWARF unwinder and it needs to parse quite a bit of things) while trace is a user-provided buffer we're going to fill.

This function already (sort of) exists. It's called captureStackTrace:

zig/lib/std/debug.zig

Lines 144 to 150 in ce14bc7

    
           /// Returns a slice with the same pointer as addresses, with a potentially smaller len. 
        
           /// On Windows, when first_address is not null, we ask for at least 32 stack frames, 
        
           /// and then try to find the first address. If addresses.len is more than 32, we 
        
           /// capture that many stack frames exactly, and then look for the first address, 
        
           /// chopping off the irrelevant frames and shifting so that the returned addresses pointer 
        
           /// equals the passed in addresses pointer. 
        
           pub fn captureStackTrace(first_address: ?usize, stack_trace: *builtin.StackTrace) void {

It only captures up to the number of addresses in the passed in StackTrace and it doesn't take in a allocator.

If we always collect the addrs before printing (instead of iterating and printing as we get them), then we will require allocation which isn't actually needed. I don't know how much this matters, but it might be a good idea to avoid allocation where possible in panic because the panic could be due to OOM. I suppose allocation will need to happen for the debug info anyway if that hasn't already been opened. One possible solution would be to have a fixed max number of addresses (maybe 1024) and use an array instead. Note that I think that an allocator should be passed in, I am just wondering if we should try to avoid using it when possible.

Always collecting would be considerably cleaner than the current approach because we could merge together a bunch of different pretty similar functions. The collect function would also need to take first_address: ?usize, fp: ?usize.

rgreenblatt · 2021-03-13T15:45:19Z

Nits about AddressSymbolMapping API:

Needs a deinit to free all the allocated memory, this is not important if the program is going to abort but it's useful if some user is only collecting some traces at runtime.

Done

You can move std.Debug.SymbolInfo definition inside the struct and call it Symbol, let's keep the API neat and tidy by having a single symbol info.

No, this can't be done. The AddressSymbolMapping (now SymbolMap) is a type that the user can override and I don't think we want the user to be able to override the std.debug.SymbolInfo type.

LemonBoy · 2021-03-13T16:06:16Z

This function already (sort of) exists. It's called captureStackTrace:

Yes, I'm proposing to make it part of the API and make it flexible enough to handle the many frame collection strategies.
The plan is to separate the Capture / Symbolicate / Print steps and allow us and the user to mix&match the pieces they need for each platform.

It only captures up to the number of addresses in the passed in StackTrace and it doesn't take in a allocator.

Some non-trivial unwinding mechanisms need an allocator.

If we always collect the addrs before printing (instead of iterating and printing as we get them), then we will require allocation which isn't actually needed

The collection happens in a user provided buffer (the StackTrace), a dumb stack-walker would make no use of the provided allocator.

I don't know how much this matters, but it might be a good idea to avoid allocation where possible in panic because the panic could be due to OOM.

It really depends on the kind of OOM, if you really have no more memory left we're going to crash and burn anyway (or the kernel, or some other piece).

One possible solution would be to have a fixed max number of addresses (maybe 1024) and use an array instead.

Yep, that's the idea.

Always collecting would be considerably cleaner than the current approach because we could merge together a bunch of different pretty similar functions.

That's the plan!

The collect function would also need to take first_address: ?usize, fp: ?usize

This is a bit of a hack needed by the current frame walker and its inability to find its way past the signal handler frame.
I'd propose to split the API into collectStackTrace and collectStackTraceFrom where the latter gets an array filled with the register contents.
The filtering on first_address can be done during the collection phase or in a generic fashion after we've collected all the addresses.

rgreenblatt · 2021-03-17T04:11:18Z

The allocator part is needed for some strategies (eg. I've implemented a DWARF unwinder and it needs to parse quite a bit of things) while trace is a user-provided buffer we're going to fill.

Hmm, passing this allocator is proving to be a real PITA in GeneralPurposeAllocator. We can't pass ourselves (at least not naively): that could lead to unbounded recursion. ~~My currently plan is to pass around a "allow capture" parameter. I'm not sure ugly this will end up being...~~ Actually, I think it will make more sense to use an arena allocator instead.

changes

rgreenblatt · 2021-03-20T19:37:35Z

I split out the extra overrides into a new PR rgreenblatt#1. Reviewing this new PR can wait until this is merged...

rgreenblatt · 2021-03-20T19:39:25Z

lib/std/meta.zig

+/// Get a decl on T by indexing into each field. Returns null if one of the
+/// sub-decls doesn't exist, `compileError` if a sub-decl other than the last one
+/// is a value instead of a type.
+pub fn lookupDecl(comptime T: type, comptime names: []const []const u8) ?(lookupDeclType(T, names) orelse type) {


I also added this utility function. This is generally useful for the if (@hasDecl(root, "name")) root.name else default pattern. This ended up being considerably grosser to implement than I expected and I am considering removing it.

lib/std/debug.zig

rgreenblatt · 2021-03-21T18:41:43Z

lib/std/symbol_map_unix.zig

+    fn readElfDebugInfo(allocator: *mem.Allocator, elf_file: File) !Self {
+        nosuspend {
+            const mapped_mem = try debug_info.mapWholeFile(elf_file);
+            const hdr = @ptrCast(*const elf.Ehdr, &mapped_mem[0]);
+            if (!mem.eql(u8, hdr.e_ident[0..4], "\x7fELF")) return error.InvalidElfMagic;


While looking into how to properly turn things into interfaces, I noticed that a lot of the error handling is a bit odd.
It seems like most of the errors which could be bubbled up via addressToSymbol are really just symptoms of the debug symbols being invalid. For instance, this InvalidElfMagic error would be bubbled up and printed out halting printing the rest of the stack trace. This will only only occur if elf_file is invalid.

So how do we want to handle errors like this? Silently swallow? Bubble them up and stop printing? Note that this function is private and only used for SymbolMap.

As an aside, #2647 would be useful for making the interface simple if wanted to support many "types" of errors.

At the moment, I have just used anyerror. This should (probably?) be changed before merge.

rgreenblatt · 2021-03-27T04:49:16Z

@LemonBoy Are you still reviewing this? Or should some else? (no rush of course).

Vexu · 2021-06-21T16:34:49Z

What's the status of this pr? There hasn't been any work on it in nearly three months, there are a bunch of conflicts, it seems unfinished and it looks like the goal shifted during development. It is also hard to review with so many commits.

Would you mind opening a new pr rebased on master with a bit cleaner history if you are still interested in working on this?

rgreenblatt · 2021-06-21T23:42:39Z

What's the status of this pr? There hasn't been any work on it in nearly three months, there are a bunch of conflicts, it seems unfinished and it looks like the goal shifted during development. It is also hard to review with so many commits.

Would you mind opening a new pr rebased on master with a bit cleaner history if you are still interested in working on this?

I was waiting on @LemonBoy to respond - at some point he mentioned that he might want to start a totally separate PR doing some of the same things and I didn't want to waste effort.

He never responded and I forgot to bump, so here we are...

This is definitely unfinished with several design and implementation issues.
I was running into issues debugging an bug with macos; I couldn't find a sane way to run macos binaries (I couldn't get darling to build on archlinux).

I will probably open a new PR rebased off of master today or tomorrow. I will try to summarize the outstanding issues etc. Hopefully this time the goals of the PR won't shift during implementation...

LemonBoy · 2021-06-22T07:08:13Z

I was waiting on @LemonBoy to respond

Respond to what? GH's PR interface makes it extremely hard to see what's going on.

at some point he mentioned that he might want to start a totally separate PR doing some of the same things and I didn't want to waste effort.

I'm eventually going to add a new unwinder (soon ™️) and will need parts of this changeset or have to implement them myself, coordination is the key.

Hopefully this time the goals of the PR won't shift during implementation...

The initial "Allow for overriding a variety of different stack trace functions" was quite underwhelming as it merely allowed to swap out bits and pieces of the debug infrastructure in a rather inelegant and inflexible way. Having a wee bit of experience with Zig and its debug infrastructure I tried to explain how to fix some of the shortcomings and, after some back and forth, the code is starting to look nice (even though I'm still not sold on the use of lookupConfigItem).

That's the whole point of a code review, discussing the changes with the big picture in mind. If you just want a "LGTM 👍" then ask for another reviewer.

andrewrk · 2021-06-25T01:07:29Z

This is definitely unfinished with several design and implementation issues.

Alright I think it's time for this to be closed and any improvements extracted out of it into smaller, more easily reviewable and mergeable pieces. I look forward to those PRs if you still have the time & energy to make them :)

rgreenblatt · 2021-06-27T00:37:36Z

See above three mentioned PRs. Sorry about the wait.

Those PRs don't include quite everything from here, but I will wait for #9252 to be merged or at least more polished before creating the final PR(s).

rgreenblatt added 3 commits March 11, 2021 23:52

out_stream -> writer

ca15a64

use name addressToSymbol

7020fcd

Heavily refactor debug to better handle dumping stack traces

99cead9