Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

helix job control freezes murex #863

Closed
atagen opened this issue Aug 22, 2024 · 16 comments
Closed

helix job control freezes murex #863

atagen opened this issue Aug 22, 2024 · 16 comments
Labels
bug Unexpected behavior deployed to `develop` Feature built. Currently BETA testing in the `develop` branch

Comments

@atagen
Copy link

atagen commented Aug 22, 2024

Describe the bug:
while using helix editor, trying to background the process with ctrl+z results in murex freezing completely; the terminal buffer is flipped back but any input is simply echoed. killing the helix process has no effect.

Expected behaviour:
helix is suspended, murex provides a shell, helix is resumable

Screenshots:
If applicable, add screenshots to help explain your problem.

Platform (please complete the following information):

  • OS, output from uname -a if supported: Linux quiver 6.10.5-xanmod1 #1-NixOS SMP PREEMPT_DYNAMIC Tue Jan 1 00:00:00 UTC 1980 x86_64 GNU/Linux
  • Terminal Emulator: Kitty
  • Murex version, output from version --no-app-name: 6.2.4000
@atagen atagen added the bug Unexpected behavior label Aug 22, 2024
@atagen
Copy link
Author

atagen commented Aug 22, 2024

this appears to be related, from another Go based shell, but with Vim: elves/elvish#988

@atagen
Copy link
Author

atagen commented Aug 22, 2024

as per that thread, also partly helix's fault, I guess? helix-editor/helix#3321

@lmorg
Copy link
Owner

lmorg commented Aug 22, 2024

I’ve been looking into this more general issue off and on for a while now.

I think you’re definitely right that it’s something to do with how Go. I think the issue resides in how it’s forking processes so it isn’t attaching a group ID correctly so when the stop signal is being sent it’s not getting handled correctly. It’s generally not a problem but if a terminal program reads from stdin, it’s looking like both murex and the child process are fighting to read from it.

I’m back from holiday soon so in a couple of days I can spend some proper time reinvestigating this.

Also thank you for the detailed bug report and research. It’s really useful.

@lmorg
Copy link
Owner

lmorg commented Aug 25, 2024

@atagen

The issue does appear to be related to Murex not creating a new PGID. Technically Helix shouldn't be listening for SIGTSTP either, it's not well behaving in that regard, but there are changes we can make to Murex to resolve this.

I've spent a couple of days refactoring and rewriting the job control logic (PR: #865)

Helix now doesn't stop Murex when Helix receives a SIGTSTP, but control isn't passed back to Murex either. I think this is related to the SID (session ID) not being set to Murex. The problem I have hear is the related syscalls fail, which I think is related to the fact that I'm launching test versions from another shell (zsh to be specific) and that shell would have already created a SID. Meaning I might need to create a PTY in instances where the SID fails.

The PR has also only been tested on macOS. The code should work with all Unixes and Linux too, but I haven't tested them. The code will fail to even compile on non-POSIX platforms because I haven't yet written any handling code for Windows, WASM nor Plan 9.

It might take a few more days to get the PR code into a usable state just of macOS. But there's at least progress happening.

@lmorg
Copy link
Owner

lmorg commented Aug 26, 2024

The SID theory was incorrect, though not helped by what appears to be a bug in ps for macOS:
https://superuser.com/questions/1295565/on-macos-why-ps-always-shows-session-id-as-0

I've now got a working prototype pushed to 863/job-control branch. I'm going to hold off merging with develop for a little while because this branch is a pretty heavy re-write of some core process management mechanisms and thus there might be some edge case regression bugs that aren't revealed by the current unit tests.

@atagen
Copy link
Author

atagen commented Aug 26, 2024

I tested the 836/job-control branch - it seems the fix is in there somewhere, but the current behaviour is a mixed bag:

  1. hx .
  2. a) helix may be instantly suspended by murex (it gets one paint in, though)..
    b) ..else, we successfully edit something, and ctrl-z suspends helix appropriately
  3. a) terminal may be stuck in raw mode and broken, with some debug messages interspersed in the output..
    b) ..else, fg <pid> returns us to 2.a

The above usually takes 2 cycles to break, but varies - so slightly non deterministic but consistently reproducible

@lmorg
Copy link
Owner

lmorg commented Aug 26, 2024

I’m seeing some show stopper bugs around different TTYs too.

This fix definitely breaks more things than it fixes.

What happens if you run the following as your first command on the shell?

config set proc force-tty true

?

This forces Murex to assign “real” posix TTYs to processes rather than fake ones that Murex owns. So you lose things like stderr being highlighted red, but it’s plays nicer with posix standards.

If you can try that with the branch release please (it’s currently the only version that sets PGID).

@lmorg lmorg added the in progress Issue is currently being worked on (possibly in a feature branch) label Aug 26, 2024
@atagen
Copy link
Author

atagen commented Aug 27, 2024

Similar behavioural loop up to 3.a, but it no longer gets stuck in raw mode ignoring input. Newlines and cursors break, but I can still interact with murex.

@lmorg
Copy link
Owner

lmorg commented Aug 27, 2024

Awesome. That helps a lot. I'll investigate further. Thank you

@lmorg
Copy link
Owner

lmorg commented Sep 1, 2024

@atagen I've been a little quiet because I haven't really had enough time to focus on this problem again. Until today.

I think I might have cracked it. However there are a couple of caveats:

  1. I've only tested this on macOS. I do have several Linux systems lying about so I will fire up this branch later to test on them
  2. It only works if Murex can be the session leader. This means Murex has to be started with the terminal emulator and NOT started from within another shell running on that terminal. This is a limitation is TTY permissions and directly a result of Helix doing funky things with signals (ie other editors like vim work fine -- it's just Helix that misbehaves).
  3. It only works if force-tty is enabled. Again, other editors behave fine, it's just Helix that's the problem. I've created a wrapper script that enables force-tty for Helix by default. So you don't need to remember to do it shell wide (and, frankly, you wouldn't want to have it enabled shell wide anyway).

Despite all these caveats, as long as Murex is your terminal's shell, everything should work transparently. If it's going to work on Linux as well, that is.

@lmorg
Copy link
Owner

lmorg commented Sep 1, 2024

So it seems to be working for macOS (iTerm) but still hanging on Linux with Konsole and Kitty.

Even more curiously, it's failing in different ways in Konsole vs Kitty.

Working with TTY's is "fun" ☹️

@lmorg
Copy link
Owner

lmorg commented Sep 12, 2024

Update: The solution I've landed on is the following:

  • Murex will still run as it had before, ie without forcing itself to become the session leader nor fiddling with group leaders. This is definitely "incorrect" in terms of what POSIX expects, but every time I try to implement job control "correctly", it then breaks the status reports you get when you press ^z as well as the "force everything to close" behavior Murex has binded with ^\. The only way to have these features is to break from POSIX convention.

  • This then causes Helix to think that Murex hasn't implemented job control even though Murex has. So there's a special start up mode now added to Murex (just been merged into develop branch): -setsid. When Murex is started with that flag, it starts behaving a lot more like a traditional shell.

  • To get Helix to adopt that, it's then just an alias:

    alias helix=exec $MUREX_EXE -quiet -setsid -execute-as exec helix
    

    This basically wraps Helix around another Murex session. It looks ugly, but it works and in a way that doesn't also break existing Murex functionality

  • Did I say it works? That's actually only half true. It works on macOS. Linux is proving a little harder to coerce but I'm confident this new refactor should make it possible.

  • I've been a little bit critical of Helix, and the POSIX standard for job control, but in fairness I deserve a lot of blame too. I'm trying to do things in a non-standard way and in a language (Go lang) which isn't all that good for doing non-standard things at a low level. I don't regret any decisions I've made, but there's no denying that I haven't made it easy for myself. Though one could argue that if I wanted an easy life then I shouldn't have written a shell to begin with 🤔

I'm not sure when I'll get chance to do some proper debugging in Linux. I do have Linux systems to test Murex on, but this is a problem I really need several hours of uninterruptible time if I'm going to properly crack it. So it might be a few more weeks before it's fully solved.

lmorg added a commit that referenced this issue Sep 13, 2024
lmorg added a commit that referenced this issue Sep 13, 2024
@lmorg
Copy link
Owner

lmorg commented Sep 13, 2024

I might have spoken too soon when I said it would be a few more weeks. I think I've nailed it.

Works on both macOS and Linux now -- for me at least. Tested in a couple of terminal emulators too.

Can you have a moment, can you give this a try too please

@lmorg lmorg added deployed to `develop` Feature built. Currently BETA testing in the `develop` branch and removed in progress Issue is currently being worked on (possibly in a feature branch) labels Sep 17, 2024
@lmorg
Copy link
Owner

lmorg commented Sep 18, 2024

@atagen I'll done some more testing at it's definitely working across the various different systems (Linux and Windows, iTerm2, Kitty, Konsole) I have. So I'll be looking to release the current build tonight.

It's possible there might still be some edge cases where things fail, there's so many different configurations and implementations of terminal emulators and TTYs that there'll inevitably be edge cases. So I'll keep this issue open for a little while longer in case you run into further problems.

@atagen
Copy link
Author

atagen commented Oct 4, 2024

seems to be working fine so far - the only hiccup I've had is when opening helix in a TTY it only takes up 3/4 of my screen for some reason, but that's a different issue and hardly a dealbreaker. daily driving murex now, thanks for your very committed support :)

@atagen atagen closed this as completed Oct 4, 2024
@lmorg
Copy link
Owner

lmorg commented Oct 7, 2024

@atagen You are very welcome.

It'd be interested to know more about the 3/4 issue. At first guess, Helix isn't able to get the TTY size so it is defaulting to 80x25. If I recall correctly, the syscall to do that can be a bit unreliable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Unexpected behavior deployed to `develop` Feature built. Currently BETA testing in the `develop` branch
Projects
None yet
Development

No branches or pull requests

2 participants