Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FR] Support for gcode sending with line numbers and checksums to improve communicaton reliability #2839

Closed
rondlh opened this issue Sep 20, 2023 · 4 comments · Fixed by #2889
Labels
enhancement New feature or request

Comments

@rondlh
Copy link

rondlh commented Sep 20, 2023

Recently I've tested some serial communication algorithms by comparing the original gcode to the gcodes received by the motherboard running Marlin. I did that by logging the received commands, filtering out some lines (temp reports, M220, M221 etc.) and comparing the files with WinMerge.
To my surprise I found data corruption, which typically showed as a character missing at the host side. I never saw that any data got changed.
The resulting Gcode was still valid, so Marlin (almost) never reported any warnings/errors.
A typical line of gcode: "G1 X117.317 Y86.664 E190.36879"
When considering losing 1 character, then only in a few cases Marlin will complain:
Missing first "1" (invalid gcode command), or missing last "." (too long extrusion prevented if enabled in Marlin).
So the user might stay unaware of any issues.

I found missing character happened about once every 6-10MB of Gcode send at 250K baud.
It still happened at 115200, but a lower error rate.

Online I found that 2-wire serial communication cannot be considered to be reliable. Lowering the transmission speed can help, but it only reduces the data error rate. So the recommendation is:
"You don't achieve reliable communication by eliminating errors - you achieve it by detecting and recovering from errors."
I think that is a good point.

Marlin offers a way to achieve this by using line number and CRC checks for the gcode.
You could allow for higher baud rates, because errors can be handles when needed.
Programs like Pronterface and Octoprint use this strategy.

Online I found this:
Reset line number count with gcode command: "M110 N"

The CRC and line number can be added to a line of gcode like this: (thanks @digant73)

static uint32_t line_number = 0;
char cmd_ptr[100];
stripChecksum(cmd);                     // e.g. "G28\n" -> "G28"
sprintf(cmd_ptr, "N%d %s", line_number++, cmd);                     // e.g. "G28" -> "N2 G28"
sprintf(&cmd_ptr[strlen(cmd_ptr)], "*%d\n", getChecksum(cmd_ptr));  // e.g. "N2 G28" -> "N2 G28*56\n"

Some error messages need to be processed:
"Error:Line Number is not Last Line Number+1, Last Line: "
"Error: checksum mismatch, Last Line: "
"Error: No Checksum with line number, Last Line: "

And resend request:
"Resend: "

When not using ADVANCED_OK, this becomes quite trivial, because the last line just needs to be resend.
When using ADVANCED_OK an index of the last line numbers and gcode file indexes could be kept like this:
Line 120 --> file index 1000
Line 121 --> file index 1040
Line 122 --> file index 1082
Line 123 --> file index 1090
Line 124 --> file index 1130
Line 125 --> file index 1160
Line 126 --> file index 1170
Line 127 --> file index 1200
After a "Resend: 123", printing should resume at line 123, file index 1090

@rondlh rondlh added the enhancement New feature or request label Sep 20, 2023
@rondlh
Copy link
Author

rondlh commented Oct 1, 2023

I analyzed the data corruption I'm experiencing. Sometimes a bytes is lost.
When I look up the serial read code in Marlin I find that it will drop the byte that is being read if there is a frame error or parity error. A frame error means that the alignment of the received byte is out of sync. Also an Over Run Error can occur when the serial byte being received is not read before the next byte overwrites it. This will also lead to the kind of serial transmission errors I found.
I will add some debug code to Marlin to see which case is happening here, I expect the Over Run Error because I see this issue exclusively when I push the printer for stress testing.

I am working on adding checksum capabilities to the TFT, and have some first results. This proved to be simpler than I initially anticipated because some code I needed is already there.
The Marlin log below shows what commands Marlin received, and if it found an issue in the gcode line number of checksum. In the log you can see that Marlin found an error, and requested a resend, but the TFT was already sending the next line, so it had to go 2 steps back, which it did successfully.
Under the Marlin log I show a fragment of the Gcode file that hold the correct gcodes that Marlin should process.

MARLIN LOG

N93 G1X118.661Y91.005Z36.719E917.46676*61  <--- (92 OK)
N94 G1X119.028Y92.607Z36.72E917.50092*6    <--- (93 OK)
N95 G1X119.345Y94.255Z36.721E917.53581*63  <--- (95 OK)
N96 GX119.619Y95.963Z36.723E917.57177*52   <--- CORRUPTED LINE "G1X" --> "GX"
Error:Line Number is not Last Line Number+1, Last Line: 95  <--- Marlin says line 95 was ok
Resend: 96                                 <--- Resend request 96
N97 G1X119.844Y97.716Z36.724E917.60851*51  <--- This line was already on the way
Error:Line Number is not Last Line Number+1, Last Line:95
Resend: 96                                 <--- Resend request again 96
N96 G1X119.619Y95.963Z36.723E917.57177*52  <--- TFT Responded, resents 96
N97 G1X119.844Y97.716Z36.724E917.60851*51  <--- Continues at 97, correct!
N98 G1X120.023Y99.512Z36.726E917.64603*56  <--- 98
N99 G1X120.148Y101.32Z36.727E917.6837*14   <--- 99
N100 G1X120.224Y103.154Z36.729E917.72186*9 <--- 100

GCODE FILE These are the Gcodes that Marlin should receive

G1X118.253Y89.479Z36.717E917.43392  <--- 92
G1X118.661Y91.005Z36.719E917.46676  <--- 93
G1X119.028Y92.607Z36.72E917.50092   <--- 94
G1X119.345Y94.255Z36.721E917.53581  <--- 95 last correct line
G1X119.619Y95.963Z36.723E917.57177  <--- 96 line with problem
G1X119.844Y97.716Z36.724E917.60851  <--- 97
G1X120.023Y99.512Z36.726E917.64603  <--- 98
G1X120.148Y101.32Z36.727E917.6837   <--- 99
G1X120.224Y103.154Z36.729E917.72186 <--- 100

Conclusion, serial data corruption occurred, but was successfully handled and corrected.

@rondlh
Copy link
Author

rondlh commented Oct 4, 2023

I did some more tests to find out what causes the serial data corruption. I connected a PC serial interface to the RX pin of the motherboard (MKS Monster8 V1.0). This way I can review the data that is send to the motherboard..
When the motherboard reports a communication errors (detected using the CRC function) then I can check what the PC received.
Motherboard: 1 character missing
PC: data is complete, no corruption

The strange thing is that I never get a corrupted character, I only have a missing character. Perhaps the motherboard is not fast enough to always reliably process the data at 250K baud?

Marlin uses Arduino STM32 to read (and send) serial data with the help of interrupts. At 250K baud this means that a character takes 40us. So if the motherboard is for some reason not able to process the received character within 40us then a byte could be lost, which should trigger an Overrun error (if enabled). So let me see what error code is reported.
I can see the function HAL_UART_ErrorCallback is triggered when the character is dropped, but there is no ErrorCode or error flag set. It's always 0, HAL_UART_GetError(huart) also gives 0. Very odd!

void HAL_UART_ErrorCallback(UART_HandleTypeDef *huart)
{
#if defined(STM32F1xx) || defined(STM32F2xx) || defined(STM32F4xx) || defined(STM32L1xx)
  if (__HAL_UART_GET_FLAG(huart, UART_FLAG_PE) != RESET) {
    __HAL_UART_CLEAR_PEFLAG(huart); /* Clear PE flag */
  } else if (__HAL_UART_GET_FLAG(huart, UART_FLAG_FE) != RESET) {
    __HAL_UART_CLEAR_FEFLAG(huart); /* Clear FE flag */
  } else if (__HAL_UART_GET_FLAG(huart, UART_FLAG_NE) != RESET) {
    __HAL_UART_CLEAR_NEFLAG(huart); /* Clear NE flag */
  } else if (__HAL_UART_GET_FLAG(huart, UART_FLAG_ORE) != RESET) {
    __HAL_UART_CLEAR_OREFLAG(huart); /* Clear ORE flag */
  }
#else
  if (__HAL_UART_GET_FLAG(huart, UART_FLAG_PE) != RESET) {
    __HAL_UART_CLEAR_FLAG(huart, UART_CLEAR_PEF); /* Clear PE flag */
  } else if (__HAL_UART_GET_FLAG(huart, UART_FLAG_FE) != RESET) {
    __HAL_UART_CLEAR_FLAG(huart, UART_CLEAR_FEF); /* Clear FE flag */
  } else if (__HAL_UART_GET_FLAG(huart, UART_FLAG_NE) != RESET) {
    __HAL_UART_CLEAR_FLAG(huart, UART_CLEAR_NEF); /* Clear NE flag */
  } else if (__HAL_UART_GET_FLAG(huart, UART_FLAG_ORE) != RESET) {
    __HAL_UART_CLEAR_FLAG(huart, UART_CLEAR_OREF); /* Clear ORE flag */
  }
#endif

  /* Restart receive interrupt after any error */
  serial_t *obj = get_serial_obj(huart);
  if (obj && !serial_rx_active(obj)) {
    HAL_UART_Receive_IT(huart, &(obj->recv), 1);
  }
}

When I disable the code at the end, which should help to recover from the error, the communication just continues. I found that obj->recv hold the last correctly received character. I'm tempted to replace this interrupt based serial reading with DMA based code...

@digant73
Copy link
Contributor

digant73 commented Jan 26, 2024

@rondlh if you have already implemented a working (also supporting the re-transmission of corrupted gcodes (not implemented by me in the past)) CRC feature it could be useful in this moment also to me just to simplify the flow analysis in one of my printers.
If so, please consider to post a PR or share the code. I will have also a look into it. Thanks.

Copy link

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked and limited conversation to collaborators May 19, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants