Random errors occurring with code that is not being executed (!!??) #623
Here's another example that shows the CPU behaving strangely. There's this function:
I realized I had an off-by-one error and I just replaced the loop to be
By some miracle, once I removed the printf call I could test the changed loop condition, and it solved the bug in my own code, but I lost the ability to printf-debug. And of course it still doesn't make any sense...
Hey @agamez! I am not sure if this is really caused by a hardware bug... This damaged message looks like some problem with the stack:
Maybe it is a problem on the software side. Maybe a clash of the stack and the bss segment, or some rogue code that overwrites parts of the program code. Anyway, could you provide a minimal example program?
Hi @stnolting! I've tried to reduce the offending code as much as possible, and I've reached a state that is quite small. I'm now observing a different error, which isn't triggering a TRAP, but I think you are right and it has something to do with a stack overflow bug, because it's corrupting the serial output. The current code should just print:
I've tried to make the code even smaller, but at this point any change I make (removing a variable or struct member, removing a loop, reducing the number of function arguments, or anything else) makes the proper string appear on the UART. Even moving the function from the second file into the main one makes the error disappear. The only relatively unusual thing in there, I believe, is the union, but that's not that rare, is it? Removing it makes the message come out OK, but so does any other change, so I'm not sure that's the culprit. Anyway, since there are a couple of files, I created a repo to store them, although I'm including the files here too for completeness. Thanks a lot!
I just tested the code you have attached to your last comment. I just modified the NEORV32 home folder path in the makefile and executed a simple GHDL simulation using the default testbench (latest version of the core):
The very last console line is the actual UART output, which looks fine, right? So maybe there is a configuration issue (hardware or software) on your side?!
Hi!
I don't know why I didn't think of simulating it before... My simulation results are not the same as yours; instead, they match what I see on the terminal.
It looks like it grew tired of simulating it all :) But it's still visible that it was writing neorv32_application_image.vhd.txt
Indeed, that is the expected output
I'm using v1.8.4. I've just upgraded to the latest version and the same thing is happening. Although this time the implementation didn't reach timing closure, so I can't be entirely sure of these results.
Nope, neither the size of memory, they are the default
The one I built from https://github.com/riscv-collab/riscv-gnu-toolchain using your instructions in the user manual. It is version:
I hope this all helps, thanks a lot!
Right, here's something strange. If I simulate with your command line:
Then my simulation output is exactly the same as yours: I see the whole string on the console and the message is the full
If I simulate without defining DUART0_SIM_MODE:
Then I see what I wrote on my post above, the characters one by one and the message that is being output is wrong:
Changing BAUD_RATE to 115200 makes it even worse and shows just nonsense:
Is this just a timing issue? But reducing UART speed to 1200 does not make it any better:
But setting it to 1200 and simulation with
After seeing all this... I'm also wondering if this is the underlying cause of my traps or if they are two different things. In any case, it'd be good to solve this.
I checked your code a little bit further. You are right, using Adding
When the UART simulation mode is disabled, all UART data is sent to a "simulation receiver" using an actual physical transmission based on the configured baud rate: https://github.com/stnolting/neorv32/blob/main/sim/simple/uart_rx.simple.vhd
Note that the baud rate of this simulation receiver is hardwired to 19200. If you change the baud rate on the software side, you will get garbage as output. The simulation mode is not affected by this, as there is no "real/physical" transmission.
Anyway, I had a close look at the actual executable being executed. And it seems like we have a problem there. If the executable (or more specifically, the
Here, address
So this is a problem with the linker script. I think I found a way to fix this, but I need some more testing. I will file a PR when everything is ready. Thank you very much for finding this (heisen-)bug! This is a very interesting corner case you have found! 😉
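The baud-rate mismatch described in this comment can be illustrated with a small model. This is only a sketch under simplifying assumptions, not the logic of uart_rx.simple.vhd: it assumes an ideal 8N1 frame, a receiver that samples each data bit in the middle of its own bit period, and a single isolated byte on the line. The names `tx_line_level` and `uart_receive` are hypothetical.

```c
#include <stdint.h>

/* Level of an idealised 8N1 TX line during transmitter bit slot `slot`:
 * slot 0 is the start bit (low), slots 1..8 are the data bits (LSB first),
 * and anything from slot 9 onwards is the stop bit / idle line (high). */
static int tx_line_level(uint8_t byte, unsigned long slot)
{
    if (slot == 0) {
        return 0;
    }
    if (slot <= 8) {
        return (byte >> (slot - 1)) & 1;
    }
    return 1;
}

/* Model a receiver running at rx_baud that looks at a line driven at tx_baud.
 * Data bit i is sampled (i + 1.5) receiver bit periods after the start edge.
 * Converted to transmitter bit slots, that is (2i + 3) * tx_baud / (2 * rx_baud);
 * integer division picks the slot the sample point falls into. */
uint8_t uart_receive(uint8_t sent, unsigned long tx_baud, unsigned long rx_baud)
{
    uint8_t received = 0;
    for (unsigned long i = 0; i < 8; i++) {
        unsigned long slot = ((2 * i + 3) * tx_baud) / (2 * rx_baud);
        received |= (uint8_t)(tx_line_level(sent, slot) << i);
    }
    return received;
}
```

In this model, `uart_receive('A', 19200, 19200)` returns `'A'`, while a transmitter at 115200 or 1200 against the fixed 19200 receiver yields `0xFF` and `0x00` respectively, because every sample point lands on the idle line or inside the start bit.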
That definitely explains why a working program would never fail, but a broken one would fail even when the change was in a region of code that was never reached. With a little bit of luck, only the strings were misaligned and the code could still execute; otherwise the instructions would all have been mangled. The existence of const pointers to functions probably made it easier to spot.
Well, my. Thanks to you, definitely. I wouldn't have ever guessed any of that. I truly appreciate your assistance. The quality of your work is more than remarkable!
I have started implementing a fix for this in #626.
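The thread does not show the actual linker-script change, so the following is only a generic illustration of the word alignment being discussed, assuming 4-byte words as on a 32-bit RISC-V target. The helper names `align_up` and `padding_bytes` are hypothetical and not part of NEORV32.

```c
#include <stdint.h>

/* Round `size` up to the next multiple of `align`.
 * `align` must be a power of two (for example 4 for 32-bit words). */
uint32_t align_up(uint32_t size, uint32_t align)
{
    return (size + align - 1u) & ~(align - 1u);
}

/* Number of padding bytes needed so that whatever follows a region of
 * `size` bytes starts on an `align`-byte boundary. */
uint32_t padding_bytes(uint32_t size, uint32_t align)
{
    return align_up(size, align) - size;
}
```

For instance, a 13-byte region needs 3 padding bytes before the next item can start on a 4-byte boundary. Adding or removing a single string constant can change that remainder, which fits the observation that unrelated edits made the symptom appear and disappear.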
Hi!
I'm experiencing some strange bugs that have me completely flabbergasted. I am writing a relatively simple program that makes use of an external interrupt to reprogram some SPI devices. The software also presents an ASCII-based user interface through the serial port. I don't think any of this is out of the ordinary; I mention it only to describe how the neorv32 is configured.
The architecture of the software, however, is maybe a little bit complex because I'm using OOP concepts (structs with pointers to functions, vtables, etc.), so it's sometimes a bit difficult to tell which function is being called. Now, I'm trying to printf-debug this UART interface because I don't have an available JTAG connection (this is implemented on an iCEBreaker and I don't have external hardware to connect JTAG to the output pins). I have my software running flawlessly and reacting perfectly fine to the interrupts, but I want to debug this little function, which is only called when I type a specific command through the serial port. If there's no interaction on my part, the CPU will continue to react to the external stimuli and process interrupts.
So I added that little printf at the beginning of the function. And, when I do that and load that new neorv32_exe.bin file, the CPU spits this out through the serial port without the find_reg function even having been requested to execute:
So, changes in pieces of code that are not being executed produce some kind of error on the CPU. Even the message it spits out is somewhat mangled.
I've been experiencing effects like this for the last couple of days and I've tried almost everything I can think of. This is running on an ice40up5k, which synthesizes with an estimate of 24.81 MHz, and I've tried running the CPU at just 16 MHz and exactly the same thing happens, so it doesn't seem like a timing issue.
Not being able to run gdb on this is definitely hindering my progress, but my guess is this would probably be a heisenbug which wouldn't show itself when being debugged, as I don't believe this to be a software issue but a hardware (RTL) one.
Any and all kinds of help would be greatly appreciated. Thanks!
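For readers unfamiliar with the OOP-in-C style mentioned above (structs with pointers to functions), here is a minimal sketch of the pattern. It is not the code from this project; every name in it (`device`, `device_ops`, `adc_read_reg`, `dac_read_reg`, `device_read_reg`) is hypothetical.

```c
#include <stddef.h>

struct device;

/* Table of operations shared by all devices of one kind (a hand-made "vtable"). */
struct device_ops {
    int (*read_reg)(const struct device *dev, unsigned addr);
};

/* Each object carries a pointer to the operations table of its kind. */
struct device {
    const struct device_ops *ops;
    int base;
};

/* Two hypothetical implementations of the same operation. */
static int adc_read_reg(const struct device *dev, unsigned addr)
{
    return dev->base + (int)addr;
}

static int dac_read_reg(const struct device *dev, unsigned addr)
{
    return dev->base - (int)addr;
}

static const struct device_ops adc_ops = { adc_read_reg };
static const struct device_ops dac_ops = { dac_read_reg };

/* Call sites only see this dispatcher; the function that actually runs is
 * chosen at run time through the pointer stored in the object. */
int device_read_reg(const struct device *dev, unsigned addr)
{
    return dev->ops->read_reg(dev, addr);
}
```

Because the call target is read from memory at run time, it is not obvious from the source which function executes, and a corrupted or shifted constant table would send the call somewhere unexpected.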