Replace LLVM with Cranelift #674

Closed
yorickpeterse opened this issue Dec 20, 2023 · 26 comments
Labels: compiler (Changes related to the compiler), feature (New things to add to Inko, such as a new standard library module)

Comments

@yorickpeterse (Collaborator) commented Dec 20, 2023

Description

We currently use LLVM, mainly because at the time of switching to a native code compiler, Cranelift wasn't quite ready yet. I think it has since gained the features we need (e.g. checked integer arithmetic). Switching would significantly simplify distribution and improve compile times. This issue is meant to keep track of what to keep in mind, what we'd have to change, and so on.

It's important to keep in mind that we're not going to support both LLVM and Cranelift. Instead, the long-term goal is to write optimizations ourselves and use Cranelift (or whatever else) just to turn the result into machine code.

Related work

No response

Related issues

Missing Cranelift features

@yorickpeterse yorickpeterse added feature New things to add to Inko, such as a new standard library module compiler Changes related to the compiler labels Dec 20, 2023
@yorickpeterse (Collaborator Author)

One issue I encountered last time, and which still seems present, is that it's not clear how one would produce basic debug information (enough to produce a stacktrace). I asked about this in bytecodealliance/wasmtime#5141, but this didn't yield a simple/clear answer.

@yorickpeterse (Collaborator Author)

We use structure returns in a few places, mainly to interface with the Rust runtime. Cranelift seems to support this, though it's not clear to what degree.
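For reference, Cranelift models this with ArgumentPurpose::StructReturn on a parameter. A minimal sketch, assuming a 64-bit pointer type and the SystemV calling convention (neither of which is necessarily what Inko would use):

```rust
use cranelift_codegen::ir::{types, AbiParam, ArgumentPurpose, Signature};
use cranelift_codegen::isa::CallConv;

// A signature for a function that returns a large struct through a
// caller-provided pointer (sret). Pointer width and calling convention
// are assumptions, not Inko's actual ABI.
fn struct_return_signature() -> Signature {
    let mut sig = Signature::new(CallConv::SystemV);
    sig.params
        .push(AbiParam::special(types::I64, ArgumentPurpose::StructReturn));
    sig.params.push(AbiParam::new(types::I64)); // a regular argument
    sig
}
```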

@yorickpeterse (Collaborator Author)

Tail calls are currently not supported: bytecodealliance/wasmtime#1065. This doesn't matter much though as we don't make use of them, and LLVM's support for them is spotty as well.

@yorickpeterse (Collaborator Author)

LLVM has the getelementptr instruction to automatically calculate field offsets. In Cranelift you need to manually calculate these and pass them to the load() instruction. I'm not sure this is much of an issue though, as we can calculate these at compile-time and probably just pass them as a u32/whatever literal to load().
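A sketch of what that looks like, assuming a FunctionBuilder named builder and an example layout of { i64, i64 }:

```rust
use cranelift_codegen::ir::{types, MemFlags, Value};
use cranelift_frontend::FunctionBuilder;

// Load field 1 of a struct { i64, i64 } through a base pointer. The byte
// offset (8) is something we'd compute at compile time and pass to load()
// as an immediate; the layout here is just an example.
fn load_second_field(builder: &mut FunctionBuilder, base: Value) -> Value {
    let offset: i32 = 8;
    builder.ins().load(types::I64, MemFlags::new(), base, offset)
}
```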

@yorickpeterse (Collaborator Author)

For stack slots, Cranelift has the methods declare_var, def_var, and use_var.

declare_var declares a variable and its type, def_var assigns a value to it, and use_var reads it. Since use_var also loads the value of the variable, no extra load() is needed. Cranelift then turns this into SSA form for you.
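As a minimal sketch (API as of late 2023; the FunctionBuilder is assumed to be mid-construction):

```rust
use cranelift_codegen::ir::{types, Value};
use cranelift_frontend::{FunctionBuilder, Variable};

// Declare an i64 variable, assign to it, and read it back. use_var yields
// the current value directly; Cranelift handles the SSA construction.
fn variable_example(builder: &mut FunctionBuilder) -> Value {
    let var = Variable::from_u32(0);
    builder.declare_var(var, types::I64);
    let ten = builder.ins().iconst(types::I64, 10);
    builder.def_var(var, ten);
    builder.use_var(var)
}
```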

@yorickpeterse (Collaborator Author)

Digging through the Cranelift code and associated projects, it seems producing debug info is still painful, requiring knowledge of raw DWARF data and generating it by hand through the gimli crate. I don't have such knowledge at the moment, so it will likely take a while before we can swap out LLVM for Cranelift.

@yorickpeterse (Collaborator Author)

Per https://docs.rs/cranelift-codegen/latest/cranelift_codegen/ir/trait.InstBuilder.html#method.atomic_load and related methods, it seems Cranelift enforces sequentially consistent ordering for atomics. This could have a rather significant impact, as we use atomics in a few places where we neither need nor want sequential consistency.
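For illustration, atomic_load takes a type, flags, and an address, but no ordering argument; sequential consistency is implied:

```rust
use cranelift_codegen::ir::{types, MemFlags, Value};
use cranelift_frontend::FunctionBuilder;

// There is no ordering parameter to pick e.g. acquire/release semantics.
fn seqcst_load(builder: &mut FunctionBuilder, addr: Value) -> Value {
    builder.ins().atomic_load(types::I64, MemFlags::new(), addr)
}
```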

@yorickpeterse (Collaborator Author)

https://github.com/bytecodealliance/wasmtime/blob/b583c54fda13b53dea362861125dd1e2ced1381d/cranelift/codegen/src/isa/x64/lower.isle#L3213-L3223 states that at least on x86, atomics are just regular loads and stores (optionally with a fence for stores). That's not too bad, but it would be nice to have precise control over this.

@yorickpeterse (Collaborator Author)

Cranelift doesn't have an equivalent of llvm.powi. This means we need to use a runtime function call for that.
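Such a runtime fallback could be as simple as an extern "C" wrapper around Rust's own powi; the function name here is hypothetical:

```rust
// Hypothetical runtime function that generated code would call instead of
// the llvm.powi intrinsic.
#[no_mangle]
pub extern "C" fn inko_float_powi(base: f64, exponent: i32) -> f64 {
    base.powi(exponent)
}
```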

@yorickpeterse (Collaborator Author) commented Dec 22, 2023

To define constant values, it seems you have to use declare_data_in_data to define it, then use DataDescription::define() to write the value.

You then "import" to the global into a function using declare_data_in_func, and load it using global_value, at least by the looks of it.

@yorickpeterse (Collaborator Author)

Calling a function is done as follows:

  • Build the instruction using e.g. call
  • Obtain the results using inst_results (see the sketch below)
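A sketch of both steps, assuming the FuncRef and argument values already exist:

```rust
use cranelift_codegen::ir::{FuncRef, Value};
use cranelift_frontend::FunctionBuilder;

// call() returns the instruction; the produced values are retrieved
// separately via inst_results().
fn call_example(builder: &mut FunctionBuilder, callee: FuncRef, arg: Value) -> Value {
    let call = builder.ins().call(callee, &[arg]);
    builder.inst_results(call)[0]
}
```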

@yorickpeterse (Collaborator Author)

Function imports are done using declare_function to import the signature, followed by declare_func_in_func to reference it.
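A sketch, with a made-up function name and signature (module is any cranelift_module::Module implementation):

```rust
use cranelift_codegen::ir::{types, AbiParam, FuncRef};
use cranelift_frontend::FunctionBuilder;
use cranelift_module::{Linkage, Module};

// Import an external function and make it referenceable from the function
// currently being built. The name and signature are made up.
fn import_example(module: &mut impl Module, builder: &mut FunctionBuilder) -> FuncRef {
    let mut sig = module.make_signature();
    sig.params.push(AbiParam::new(types::I64));
    sig.returns.push(AbiParam::new(types::I64));
    let func_id = module
        .declare_function("inko_example", Linkage::Import, &sig)
        .unwrap();
    module.declare_func_in_func(func_id, builder.func)
}
```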

@yorickpeterse (Collaborator Author)

In the LLVM backend, we rely on the mem2reg pass to construct the IR in SSA form. This requires allocating temporaries using alloca() at the start of a function. For this we use llvm::builder::Builder::new_stack_slot(), which injects the alloca() calls into the start of the function, regardless of where we currently are.

In Cranelift this doesn't seem necessary, as one can just call declare_var at will. This means we could probably get rid of new_stack_slot entirely.
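For comparison, the entry-block alloca dance on the LLVM side looks roughly like this (a sketch against an inkwell 0.2-era API, where build_alloca returns the value directly; the caller repositions the builder afterwards):

```rust
use inkwell::basic_block::BasicBlock;
use inkwell::builder::Builder;
use inkwell::types::BasicTypeEnum;
use inkwell::values::PointerValue;

// Rough sketch of the entry-block alloca pattern new_stack_slot() relies
// on: emit the alloca at the very start of the entry block, regardless of
// where code generation currently is, so mem2reg can promote it.
fn entry_block_alloca<'ctx>(
    builder: &Builder<'ctx>,
    entry: BasicBlock<'ctx>,
    ty: BasicTypeEnum<'ctx>,
) -> PointerValue<'ctx> {
    match entry.get_first_instruction() {
        Some(first) => builder.position_before(&first),
        None => builder.position_at_end(entry),
    }
    builder.build_alloca(ty, "slot")
}
```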

@yorickpeterse (Collaborator Author)

Jump tables/switches are done using br_table. The jump tables are created using create_jump_table.

I believe our Switch instruction always uses monotonically increasing numbers that start at zero, so we should be able to lower that directly to the br_table instruction.

@bjorn3 commented Dec 22, 2023

I believe our Switch instruction always uses monotonically increasing numbers that start at zero

If they don't, you can use cranelift_frontend::Switch to generate a combination of jump tables and a decision tree, as sketched below. It is almost certainly not optimal, but it should save some effort, and any optimizations to it would benefit cg_clif too :)
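A sketch of what using it looks like (the blocks and the discriminant Value are assumed to exist):

```rust
use cranelift_codegen::ir::{Block, Value};
use cranelift_frontend::{FunctionBuilder, Switch};

// Dispatch on `discriminant`, falling through to `otherwise` when no
// entry matches. Switch emits jump tables and/or comparisons as needed.
fn switch_example(
    builder: &mut FunctionBuilder,
    discriminant: Value,
    zero: Block,
    one: Block,
    otherwise: Block,
) {
    let mut switch = Switch::new();
    switch.set_entry(0, zero);
    switch.set_entry(1, one);
    switch.emit(builder, discriminant, otherwise);
}
```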

@bjorn3 commented Dec 22, 2023

To define constant values, it seems you have to use declare_data_in_data to define it, then use DataDescription::define() to write the value.

declare_data_in_data is for referencing a global variable from another global variable. To define one, you first use declare_data and then define_data with a DataDescription as the argument. The DataDescription can be filled in with DataDescription::define. You can then use write_data_addr and write_function_addr, with the results of declare_data_in_data and declare_func_in_data respectively, to reference other global variables and functions from this global variable.
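Putting those steps together (the constant's name and contents are made up; the pointer type is passed in):

```rust
use cranelift_codegen::ir::{Type, Value};
use cranelift_frontend::FunctionBuilder;
use cranelift_module::{DataDescription, Linkage, Module};

// Define a constant and load its address inside a function.
fn constant_example(
    module: &mut impl Module,
    builder: &mut FunctionBuilder,
    pointer_type: Type,
) -> Value {
    let data_id = module
        .declare_data("example_constant", Linkage::Local, false, false)
        .unwrap();
    let mut desc = DataDescription::new();
    desc.define(Box::new(*b"hello")); // the raw bytes of the constant
    module.define_data(data_id, &desc).unwrap();

    // "Import" the global into the current function, then take its address:
    let gv = module.declare_data_in_func(data_id, builder.func);
    builder.ins().global_value(pointer_type, gv)
}
```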

@yorickpeterse (Collaborator Author)

@bjorn3 Thanks for the details! 😃

@yorickpeterse (Collaborator Author)

Per Zulip and the documentation: the Value type only represents simple types such as integers and floats. For stack allocated structures, one needs to use create_sized_stack_slot and related methods and types. The stack slots registered this way are not reused through some sort of clever mechanism, so we'd need to come up with a scheme of our own for reusing slots that are no longer live, otherwise we might end up using more stack space than necessary.

I'm not sure how we'd implement that though, because when generating stack loads/stores it's not clear whether the slot is still used afterwards. This would probably require some form of liveness analysis.
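A sketch of the stack slot API (sizes and offsets are examples; StackSlotData::new took a kind and a byte size as of late 2023):

```rust
use cranelift_codegen::ir::{types, StackSlotData, StackSlotKind, Value};
use cranelift_frontend::FunctionBuilder;

// Allocate a 16-byte slot for a structure, store into its first field and
// load from its second.
fn stack_slot_example(builder: &mut FunctionBuilder, value: Value) -> Value {
    let data = StackSlotData::new(StackSlotKind::ExplicitSlot, 16);
    let slot = builder.create_sized_stack_slot(data);
    builder.ins().stack_store(value, slot, 0);
    builder.ins().stack_load(types::I64, slot, 8)
}
```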

@yorickpeterse (Collaborator Author) commented Dec 23, 2023

Related to that:

We'll need to manually generate the correct signatures for functions that take or return structures. This in turn needs to take the structure sizes into account, as structures <= 8 bytes are passed directly while larger ones use an extra argument.

We could simplify things by dropping support for structure returns/arguments, but this requires adjusting the runtime library in a whole bunch of places, which I'd like to avoid.
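A sketch of how the signature generation could branch on the structure size (the 8-byte cut-off is taken from the comment above, and I64 stands in for both the direct value and the pointer; a real ABI needs more care):

```rust
use cranelift_codegen::ir::{types, AbiParam, ArgumentPurpose, Signature};

// Return small structs directly; pass an sret pointer for larger ones.
fn add_struct_return(sig: &mut Signature, size: u64) {
    if size <= 8 {
        sig.returns.push(AbiParam::new(types::I64));
    } else {
        sig.params
            .push(AbiParam::special(types::I64, ArgumentPurpose::StructReturn));
    }
}
```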

@yorickpeterse (Collaborator Author)

I think my biggest concern with Cranelift so far is that it's perhaps a bit too low-level. For example, we need to calculate structure sizes, field offsets, etc. ourselves. Because of this, I suspect the amount of code needed in the backend would roughly equal that of our current LLVM backend, as we'd simplify things in one area but complicate them in another. In particular, we have a bunch of code to specify structure sizes and the like so that LLVM can calculate the right offsets; I had initially hoped that most of this could go away with Cranelift, but it turns out we'd have to keep most of it.

I think Cranelift would really benefit from a slightly more high-level wrapper that abstracts over the various platform differences and what not, basically LLVM--, but such a library doesn't exist unfortunately.

I'm going to experiment with doing LLVM generation in parallel, to see if/to what degree that's even possible with Inkwell. I'll also need to really think hard if Cranelift is worth the trouble in its current state.

@yorickpeterse (Collaborator Author) commented Dec 24, 2023

Running LLVM/Inkwell in parallel appears tricky. In spite of some of the types being Sync, it appears that even if most of the data needed is created on a per-thread basis, Inkwell/LLVM segfaults. I'm guessing this is because we do share some StructType structures between threads in my quick hack, and those are probably mutated in some way.

Unfortunately, due to how Inkwell/LLVM is set up, we can't really work around sharing that particular data as it defines all the structure sizes and what not, and creating those for every thread is a waste.

If we do create that data on a per-thread basis, and ignore the code potentially being incorrect for a moment, compile times for the standard library tests go down from 3.3 seconds to 1.15 seconds with 8 threads. That's only a roughly 3x speedup from 8x the concurrency, which is disappointing. Note that for this experiment the queue is just a Mutex<Vec<_>> and we could do better, but I'd be surprised if changing that yields another 5x improvement.

@yorickpeterse (Collaborator Author)

To add to the above: Inkwell exposes basically everything as immutable types/methods, even though LLVM mutates things under the hood. This means that even if we got things working in parallel, it may cause problems in the future if LLVM starts to mutate/share data where this isn't expected.

@yorickpeterse (Collaborator Author)

Regarding stack slots: based on how we use structure returns right now, and the fact that Inko values are heap allocated (apart from Int/Float/etc), I think the lack of stack slot reuse isn't a big deal for the time being.

@yorickpeterse (Collaborator Author)

Another challenge: our FFI supports variadic function arguments, something that's needed to interface with certain C libraries (e.g. cURL). Per this issue Cranelift has no support for this whatsoever, and per this comment you basically have to emulate it.

I'm starting to lean more towards sticking with LLVM for the time being, and focusing more on making it parallel and incremental where possible, seeing as transitioning to Cranelift is likely going to require a lot of work.

@yorickpeterse yorickpeterse added this to the Future ideas milestone Dec 24, 2023
@yorickpeterse yorickpeterse removed this from the Future ideas milestone Aug 5, 2024
@yorickpeterse (Collaborator Author)

I'm going to close this, as there are no plans to do this for at least the next several years, and Cranelift lacks various features that we need (as discussed in this thread).
