Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Writing on Win10 occasionally fails (and loses data) #781

Closed
giseburt opened this issue May 2, 2016 · 21 comments
Closed

Writing on Win10 occasionally fails (and loses data) #781

giseburt opened this issue May 2, 2016 · 21 comments
Labels

Comments

@giseburt
Copy link
Contributor

giseburt commented May 2, 2016

Operating System and Hardware: Windows 10
NodeJS Version: electron 0.37.2
Serialport version: 2.1.2
Python Version: (unknown)

Expected behavior

Writing to the serialport would occasionally block in a thread, as it does on Linux and OS X. The WriteQueue for that fd will continue to be added to for further writes. (See https://github.com/voodootikigod/node-serialport/blame/master/src/serialport.cpp#L298 , which is code I helped add. 😄 ).

Alternatively, if blocking in a thread isn't acceptable for Windows, the timeout should NOT pop that write from the queue but instead modify it to prevent double-sending of sent portion of the data, and ensure that the send will be attempted again very shortly. (We don't want to cause a performance issue here.)

It's vital that we don't add semantics that pay attention to the flow control of the device. This is regularly abused and causes all sorts of troubles. (Flow control is currently broken in Python for us for this very reason.)

Actual behavior

An error (generated from https://github.com/voodootikigod/node-serialport/blob/master/src/serialport_win.cpp#L374 ) with error code 121 ("Semphore Timout": https://msdn.microsoft.com/en-us/library/windows/desktop/ms681382(v=vs.85).aspx#ERROR_SEM_TIMEOUT ).

Steps and/or code to reproduce the issue

Basically, you want to coerce the device to not read what we write, forcing Windows to block while writing and eventually timing out, losing data in the process.

Option 1 (explicit flow control):

  1. Open connection to a device with flow control enabled (hardware is what we are using)
  2. write() data to the device until the device's buffer fills and it signals that flow control should stop. NOTE: This should NOT be explicitly handled by serialport. The OS handles this properly and that should be left as-is.
  3. write() more data to the serialport. The error will happen and the data written will be lost instead of queued.

Option 2 (USB-based flow control):

  1. Open a connection to a device such as an Arduino Due (native port) that has a USB virtual-serial port sketch running.
  2. Instruct the Due to not read from that serialport, allowing it's buffer to full and it will then stop accepting packets from the USB host.
    3.write() data to device until you see the error.
@reconbot
Copy link
Member

reconbot commented May 4, 2016

Oh nice bug report!

@reconbot
Copy link
Member

reconbot commented May 5, 2016

There are a few changes that write needs. This is one of them. The other is it calls the callback when there is still data to be written, and end up calling it multiple times. I'm going to fix the later bug first and get good integration tests on it, then work on this one. If you want to land a patch in the meantime I can help with the automated testing.

@reconbot
Copy link
Member

I've written the tests that writes 5k at 9600 baud. This was enough to force multiple writes on unix but doesn't seem to cause this issue on our CI server. We should probably test on an actual computer, but it should exercise your proposed option 1.

@reconbot
Copy link
Member

The branch is write-bindings where I'm doing the work.

@reconbot
Copy link
Member

@giseburt I've done a lot more research and I've got a PR #808 out for testing. On our CI (Windows Server 2012 R2 ) a 5k write at 9600 baud doesn't cause the issue. I don't have a windows 10 box around for testing. I'm curious to why we're getting ERROR_SEM_TIMEOUT and what it means. As far as I can tell we're opening for async writing and we'll block to wait for that write. If it's async, what's timing out?

If the ERROR_SEM_TIMEOUT is coming from WriteFile then I'm confused, this doesn't seem like it should happen, and I don't know if any data is actually being queued for writing or how to handle the error.

If the ERROR_SEM_TIMEOUT is coming from GetOverlappedResult then I suppose I understand, but I don't know what to do, should we try calling GetOverlappedResult in a loop until we don't get that error anymore?

You can test this branch by changing your package.json to include {"serialport": "voodootikigod/node-serialport#write-bindings"}

Do you have the full text of your error?

I'm glad you're back helping improve serialport! =)

@reconbot
Copy link
Member

Also it looks like we don't lose data, just repeat data if it does get written which I can't seem to find any documentation to confirm.

@reconbot
Copy link
Member

@giseburt I'd love to get your feedback on this, I'm unable to reproduce it but I'm happy to add the correct logic if you can help explain it further.

@giseburt
Copy link
Contributor Author

I'm just now starting to work on this again. We have customers waiting on it, so I'll have some more info later today.

@giseburt
Copy link
Contributor Author

giseburt commented Jul 21, 2016

@reconbot Some more background on the problem, how we're using it, and what I think/hope will fix this.

Note that this issue doesn't exist on Linux or OS X. In both of those OSs, the write it truly blocking -- which is fine, it's in it's own thread.

Background: I work at @synthetos on the TinyG and G2 Core (the next generation of TinyG) projects. In both cases, we use either a USB->serial port or a true serial port (from a BeagleBone, C.H.I.P, or Pi, for example) to talk to the board.

In G2 Core, we are doing high-speed control of machines like 3D printers. We queue up movements that come in from the serial port, and when our queue or serial port buffers are full, we stop reading from the serial port.

To prevent the data from getting lost to /dev/null during this time, we use hardware flow control (real or simulated with the USB->serial) to tell the host OS (which is the part running node-serialport) to stop writing.

Problem Description: When the host OS is told to stop writing, it continues to buffer data before it starts to block writing to the file descriptor. On OS X and Linux we tell it to use block without a timeout, and that works perfectly. On Windows, we configure the timeouts to non-zero (see docs), and then when it does timeout we throw away the "packet" that partially sent and then move to the next packet.

Probable Solution: Simply set the write timeouts to zero. Then the thread would completely block for the write to complete, or the port to close.

There is one other concern. When you compare the linux write call to the windows write call you see that the windows version writes from data->bufferData for data->bufferLength bytes, even though it's in a while loop that adjusts data->offset. The linux version has it right by writing from data->bufferData + data->offset for data->bufferLength - data->offset bytes.

I'll now attempt to work out these fixes and see if I can find a clean test to prove not-working before and working after.

** UPDATED ** - I found that we ARE setting the timeouts.

@fduman
Copy link

fduman commented Aug 2, 2016

I sent some data to ESP8266's NodeMCU. Same data loss happened. When I set baudrate to 9600 it works perfectly.

@giseburt, I tried to apply your solution but I haven't succeded yet. The same data loss happened again on Windows 10.

I tried to set "write timeouts" to 0. The same issue happened. I also fixed the data->bufferData + data->offset issue. But i think that this is not the case. I believe that it doesn't iterate. So that one has no effect at all.

reconbot added a commit that referenced this issue Oct 20, 2016
@reconbot
Copy link
Member

Should be fixed in master, I'll have a beta out shortly

@reconbot
Copy link
Member

Thoughts on the read timeouts? It doesn't seem like we'd want them either

@giseburt
Copy link
Contributor Author

giseburt commented Oct 20, 2016

I thought about that too. I wasn't sure enough to change it without proof
that it was broken.

Logically, it seems that there should not be timeouts for reads either. I
can't think of what it could harm to disable them as well.

@reconbot
Copy link
Member

reconbot commented Oct 20, 2016

Lets do it and give it a good manual testing.

@giseburt
Copy link
Contributor Author

giseburt commented Oct 20, 2016

Sounds good. 👍🏻

reconbot added a commit that referenced this issue Oct 20, 2016
reconbot added a commit that referenced this issue Oct 20, 2016
- [all] fixed a crash when pausing while reading thanks to @bminer and @baffo32 and others to debug and fix this
- [all] ReadLineParser has been renamed to ReadlineParser to match worldwide convention
- [windows] Fix flush behavior using PurgeComm fixing #962 via @samisaham
- [windows] Remove read and write timeouts solving #781 via @giseburt
- [docs] Electron build docs #965 via @chalkers
- [docs] Spelling fixes via @Awk34
- [docs] Update parser docs to be correct #970 via @jacobq
- [notes] buffer-indexof is now completely compliant with latest node 6 buffer.indexOf because it's nice to have and we now use it on older nodes
- [notes] dropped node 5 support, should still work but you shouldn't use it
@Zensey
Copy link
Contributor

Zensey commented Nov 4, 2016

I gave a try to version 5.0.0-beta2.
Result: programm gets no response from device.
Even Device Monitoring Studio shows no signs of response.
I rollwd-back patch #974 and rebuilt binding from source and program started to function again.

I use rs-485, no flow control.

@reconbot reconbot reopened this Nov 4, 2016
@reconbot
Copy link
Member

reconbot commented Nov 4, 2016

How does v4.0.5 work for you?

@giseburt
Copy link
Contributor Author

giseburt commented Nov 4, 2016

According to https://msdn.microsoft.com/en-us/library/windows/desktop/aa363190(v=vs.85).aspx and other sources referring to that same:

ReadIntervalTimeout

The maximum time allowed to elapse before the arrival of the next byte on the communications line, in milliseconds. If the interval between the arrival of any two bytes exceeds this amount, the ReadFile operation is completed and any buffered data is returned. A value of zero indicates that interval time-outs are not used.

A value of MAXDWORD, combined with zero values for both the ReadTotalTimeoutConstant and ReadTotalTimeoutMultiplier members, specifies that the read operation is to return immediately with the bytes that have already been received, even if no bytes have been received.

So I believe what we want, what other platforms call "non-blocking read," is to set ReadIntervalTimeout to MAXDWORD.

(I emailed this and it's not showing up so hopefully it doesn't duplicate.)

@Zensey
Copy link
Contributor

Zensey commented Nov 4, 2016

How does v4.0.5 work for you?

With [email protected] everything is ok.

@reconbot
Copy link
Member

reconbot commented Jan 4, 2017

I've updated the timeout again, sorry about taking so long. I'll see about releasing another beta.

@reconbot reconbot removed this from the 4.0 milestone Jan 4, 2017
@reconbot
Copy link
Member

I have issue #1221 which usurps this, and will be releasing 5.0.0 pretty soon, so I'm going to close this issue. I'd love to see if it's still happening though I think #1221 is the way to go to work around it.

@lock lock bot locked as resolved and limited conversation to collaborators May 6, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Development

No branches or pull requests

4 participants