Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New dhcp state machine #95

Merged
merged 88 commits into from
May 1, 2023
Merged

New dhcp state machine #95

merged 88 commits into from
May 1, 2023

Conversation

BiffoBear
Copy link
Contributor

closes #94 New DHCP FSM implements exponential fallback and correctly renews or rebinds. The messages are sent in unicast mode. Also added a debug message helper function with a hexdump to improve code readability. DHCP messages are now sent and received without using the socket module. This should save some memory.

BiffoBear and others added 30 commits December 17, 2022 11:16
@BiffoBear
Copy link
Contributor Author

I think I've successfully merged the changes. Squished a bug and increased the buffer size from 318 to 512 after receiving a record breaking DHCP response.
DHCP connection successful.
Runs these examples:

  1. wiznet5k_simpletest.py
  2. wiznet5k_simpleserver.py

@FoamyGuy
Copy link
Contributor

@BiffoBear I've gotten wireshark set up and have figured out how to capture the traffic in there.

I have seen a few different ways that it can fail with the version from this PR, and unfortunately have not had a complete success yet with this version.

For one thing, I think there may be some other "timeout 1000 seconds" issue that remains in this version (I know #97 solved something like that). The behavior that I see most typically is the device seemingly hanging or waiting forever after:

Receiving a DHCP response.
+ Beginning to receive…
socket_available called on socket 5, protocol 2 # <- this line looped over and over seemingly forever

from the debug output. I added an additional debug output between the loop and the if statement here:

        while bytes_read < minimum_packet_length and time.monotonic() < timeout:
            debug_msg(f"{bytes_read} < {minimum_packet_length} and {time.monotonic()} < {timeout}", self._debug)
            if self._eth.socket_available(self._wiz_sock, _SNMR_UDP):

And from that I can see that it does have ~ 1000 seconds on the timeout counter.

0 < 236 and 1127.94 < 2258.2

This log file contains the output from debug logs as well as the packet copied from wireshark for one of these "hang forever" attempts.
wiznet_log_pr_hangforever_trying_dhcp.txt

This log file contains wireshark output from currently released library (not this PR) doing a successfull DHCP negotiation (successful HTTP requests occured after this).
wiznet_log_released_success.txt I'm not entirely sure what to make of these outputs, one thing I noticed is that this successfull run had 'offer' and 'ack' packets inside wireshark that I think were sent from the router back to the device. I've not seen any packets like that with the PR version so far. It's always either 'Discover' and 'Request' or 'Discover' only.

This log file contains output from this PR version getting and printing an IP address successfully, but then seems to just hang afterward. Never prints the hostname, or any other request responses. Wireshark dumps from the two packets that were send during this time are included.
wiznet_log_pr_ipsuccess_then_hang.txt

So far, on all attempts with the version from this PR I either:

  • Get and print the IP successfully, then seemingly hang forever, never printing the hostname results, or any requests results
  • Don't get or print the IP, but still seemingly hang forever.

I probably have not waited the full 1000 seconds if there are still timeouts like that though, so perhaps my "seemingly hang forever" would have actually done something else with more time. I only added additional log messages to try to verify that during the last bit of my testing, much of it occurred before I had those extra prints.

All testing was with simpletest from this repo on Feather ESP32-S2 TFT.

@BiffoBear
Copy link
Contributor Author

Great catch on the timeout. That's a bug where I was calculating a timeout based on the value of time.monotonic, then later on forgetting that I'd done that and adding time.monotonic on again. So that is fixed.

I'm perplexed by your Wireshark logs. Your wiznet_log_pr_ipsuccess_then_hang.txt log only contains two discover packets sent by your WIZNET device.
0000 ff ff ff ff ff ff de ad be ef fe ed 08 00 45 00 bytes 0000 - 0005 are the broadcast MAC address and 0006 - 000c are the sender's MAC address.
Based on this, there shouldn't be an IP address assigned as the DHCP server is never responding.

I've pushed a fix for the timeout bug. I'll spend some time decoding the DHCP packets and get back to you later about that.

Thanks for your review!

@BiffoBear
Copy link
Contributor Author

@FoamyGuy The output from your file wiznet_log_pr_hangforever_trying_dhcp.txt contains:

Chip Version: w5500
MAC Address: ['0xde', '0xad', '0xbe', '0xef', '0xfe', '0xed']
My IP address is: 192.168.1.208

These print statements execute after eth = WIZNET5K(spi_bus, cs) and before print("IP lookup adafruit.com: %s" % eth.pretty_ip(eth.get_host_by_name("adafruit.com"))) so the issue seems to be in the DNS lookup.

I'll dig out a w5500 that I have knocking around and see whether I get the same issue with that. I've been testing on a w5100s.

@BiffoBear BiffoBear marked this pull request as draft March 30, 2023 12:12
@BiffoBear
Copy link
Contributor Author

@FoamyGuy I think I've found where it is hanging. I've added a timeout that raises a RuntimeError to break the hang some debug statements that should reveal where the error is occurring. This may not address the underlying cause of the hang though, so I've made the PR a draft for now.

When you have a moment I would appreciate a copy of the serial out with debugging on for a failed connection, in case your wonderfully weird network is doing something different to mine. Thx!

@FoamyGuy
Copy link
Contributor

@BiffoBear here are logs from the two different states I'm seeing with the latest version.

This log is after a cold boot up (i.e. unplug / replug) it is partially successful it seems like it got the DNS lookup of adafruit.com done successfully (though outputs as hex instead of dotted string). But then it failed after that with an exception that is raised.
partial_success_i_think.txt

After that cold boot and partial success if I then do ctrl-C / ctrl-D to restart the script and have it run again I seem to always end up with a "retried 3 times and failed" type error. Here is the log from this cast:
failed_after_3_tries.txt

Let me know if there is anything else that would be helpful to provide or test out on my end. Thank you so much for sticking with this and all of the improvements you've submitted for this library in this PR and beyond!

@BiffoBear
Copy link
Contributor Author

@BiffoBear here are logs from the two different states I'm seeing with the latest version.

This log is after a cold boot up (i.e. unplug / replug) it is partially successful it seems like it got the DNS lookup of adafruit.com done successfully (though outputs as hex instead of dotted string). But then it failed after that with an exception that is raised. partial_success_i_think.txt

After that cold boot and partial success if I then do ctrl-C / ctrl-D to restart the script and have it run again I seem to always end up with a "retried 3 times and failed" type error. Here is the log from this cast: failed_after_3_tries.txt

Let me know if there is anything else that would be helpful to provide or test out on my end. Thank you so much for sticking with this and all of the improvements you've submitted for this library in this PR and beyond!

Thanks, that partial success showed me where my logic was flawed. I've pushed a band-aid fix. Once this PR is done, I've got some ideas for the WIZNET5K class.

I don't see why there is apparently no reply from the DHCP server after a warm reboot. That really puzzles me.

Please would you record the exchange in Wireshark? If you use the capture filter portrange 67-68 that will only capture DHCP packets. Once the DHCP fails after 3 retries please stop the capture and save the results as a Wireshark/… - pcapng file. I can open this file in Wireshark here and get a better idea of the exchange with the server. Maybe my PR is missing valid packets from the server?

Thanks for all your help and encouragement. No rush.

Martin

@BiffoBear
Copy link
Contributor Author

@FoamyGuy I may have been barking up the wrong tree. I've added an explicit reset to the w5500 detection code. When you have a moment, please see if this solves your no DHCP after soft reset issue. Thanks!

@BiffoBear BiffoBear marked this pull request as ready for review April 17, 2023 12:04
@FoamyGuy
Copy link
Contributor

@BiffoBear I have tested out the latest versions

Unfortunately the latest commit seems to cause more issues in my environment. If I test the absolute latest version including the newest commit I consistently end up with Failed to initialize WIZnet module. exception raised. I tried many times both from cold boot and after reset and always had the same results. My instinct was that perhaps we needed to sleep some time after the reset command before attempting anything else. I did tinker with adding a sleep, but it didn't seem to have any effect. I still kept getting failure to initialize.

In my local copy I commented out the lines added in the latest commit so that I could test the version from before that and capture some additional logs.

All of this output was obtained on a Feather S2 TFT with the latest version minus the last commit with the reset command, that is commented out in the version running while these logs were captured:

debug=True logs from serial console:
debug_dhcp_failed_from_reset_after_success.txt
debug_full_success_from_cold_boot.txt

wireshark logs from failed DHCP attempt:
wireshark_fail_dhcp_0.txt
wireshark_fail_dhcp_1.txt
wireshark_fail_dhcp_2.txt
wireshark_fail_dhcp_3.txt

Wireshark logs from a successful DHCP and simpletest exectution:
wireshark_success_from_cold_boot_0.txt
wireshark_success_from_cold_boot_1.txt

My results with this version seem to be consistently that after a cold boot up it will successfully connect, get an IP, and successfully fetch data from the internet. Then if I ctrl-D / ctrl-C to reset it will consistently fail with No response from DHCP server after 3 retries. Resetting subsequent times doesn't change the behavior seemingly it's always this 'no response' until power cycled.

After a full power cycle it boots up and succeeds fully 1 time again, then upon first reset is back to the 'no response' state.

@BiffoBear
Copy link
Contributor Author

OK. I'll save it as a draft and wait until I can get my hands on a W5500 to test against Thanks for checking!

@BiffoBear BiffoBear marked this pull request as draft April 22, 2023 18:07
@BiffoBear BiffoBear marked this pull request as ready for review April 26, 2023 05:46
@BiffoBear
Copy link
Contributor Author

@FoamyGuy This time I promise 🤪! I fixed my w5500 so I could test on it. On the current main branch I get the error after 5 x ctrl-D with a w5500, never with a w5100s. On the latest commit for this PR I don't get any errors after 10 x ctrl-D on the W5500 or the w5100s.

@FoamyGuy
Copy link
Contributor

@BiffoBear nice!

I gave the latest version a try and I am getting continued success with it after several resets, running the simpletest.

I'm thinking this is likely good now, but I am going to plan to come back once more and do some more thorough testing with more of the examples.

Thank you so much for seeing it through and making all of these improvements!

@BiffoBear
Copy link
Contributor Author

@BiffoBear nice!

I gave the latest version a try and I am getting continued success with it after several resets, running the simpletest.

I'm thinking this is likely good now, but I am going to plan to come back once more and do some more thorough testing with more of the examples.

Thank you so much for seeing it through and making all of these improvements!

@FoamyGuy And thank you for being such a patient, insightful and thorough reviewer.

Copy link
Contributor

@FoamyGuy FoamyGuy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are looking good to me at this point.

I tested successfully several of the examples on a Feather RP2040 v8.1.0-beta.2. Everything is functioning as intended and seems to run reliably even through several soft resets.

Thanks again for these improvements @BiffoBear!

@FoamyGuy FoamyGuy merged commit 4db91ac into adafruit:main May 1, 2023
@BiffoBear BiffoBear deleted the new_dhcp_state_machine branch May 1, 2023 17:18
adafruit-adabot added a commit to adafruit/Adafruit_CircuitPython_Bundle that referenced this pull request May 2, 2023
Updating https://github.com/adafruit/Adafruit_CircuitPython_ATECC to 1.2.17 from 1.2.15:
  > Merge pull request adafruit/Adafruit_CircuitPython_ATECC#33 from FoamyGuy/fix_docs
  > Merge pull request adafruit/Adafruit_CircuitPython_ATECC#31 from brass75/issue_25/missing-typehints
  > Add upload url to release action
  > Add .venv to .gitignore
  > Update .pylintrc for v2.15.5
  > Fix release CI files
  > Update pylint to 2.15.5
  > Updated pylint version to 2.13.0
  > Switching to composite actions

Updating https://github.com/adafruit/Adafruit_CircuitPython_LC709203F to 2.3.0 from 2.2.11:
  > Merge pull request adafruit/Adafruit_CircuitPython_LC709203F#25 from jrrickerson/config_voltage_alarm

Updating https://github.com/adafruit/Adafruit_CircuitPython_RockBlock to 1.3.12 from 1.3.11:
  > Merge pull request adafruit/Adafruit_CircuitPython_RockBlock#30 from zachariahpifer/add-type-annotations
  > Add upload url to release action
  > Add .venv to .gitignore

Updating https://github.com/adafruit/Adafruit_CircuitPython_TCS34725 to 3.3.17 from 3.3.16:
  > Merge pull request adafruit/Adafruit_CircuitPython_TCS34725#42 from zemyblue/feat/support_tcs34727
  > Add upload url to release action
  > Add .venv to .gitignore

Updating https://github.com/adafruit/Adafruit_CircuitPython_Touchscreen to 1.2.1 from 1.2.0:
  > Merge pull request adafruit/Adafruit_CircuitPython_Touchscreen#25 from CedarGroveStudios/patch-1

Updating https://github.com/adafruit/Adafruit_CircuitPython_Wiznet5k to 2.5.0 from 2.4.2:
  > Merge pull request adafruit/Adafruit_CircuitPython_Wiznet5k#95 from BiffoBear/new_dhcp_state_machine

Updating https://github.com/adafruit/Adafruit_CircuitPython_MacroPad to 2.2.0 from 2.1.8:
  > Merge pull request adafruit/Adafruit_CircuitPython_MacroPad#44 from kalehmann/feature/display_sleep
  > Add upload url to release action
  > Update .pylintrc for v2.15.5
  > Fix release CI files
  > Update pylint to 2.15.5
  > Updated pylint version to 2.13.0
  > Switching to composite actions

Updating https://github.com/adafruit/Adafruit_CircuitPython_SimpleIO to 3.0.10 from 3.0.9:
  > Merge pull request adafruit/Adafruit_CircuitPython_SimpleIO#67 from RossK1/adding_type_hints
  > Add upload url to release action
  > Add .venv to .gitignore

Updating https://github.com/adafruit/Adafruit_CircuitPython_Bundle/circuitpython_library_list.md to NA from NA:
  > Updated download stats for the libraries
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DHCP Finite State Machine mishandles renew and rebind.
3 participants