Update event stream #6853

Merged 14 commits into master from dmcneil/event-stream on Aug 19, 2019

Conversation

davidMcneil
Contributor

Resolves #6761 and #6740

This PR first addresses #6740 using the existing natsio and nitox NATS clients. It then removes those clients and switches to a non-streaming client, rust-nats. In initial testing, rust-nats, and non-streaming connections in general, appear much more reliable. However, testing was done against a standalone NATS server rather than directly against Automate; testing against Automate is blocked on Automate supporting a plain NATS connection.

There are several TODOs in this PR. Those should be straightforward to address once we verify this is the direction we want to go for our NATS client. #6770 is the spike to evaluate the NATS clients.

@chef-expeditor
Contributor

Hello davidMcneil! Thanks for the pull request!

Here is what will happen next:

  1. Your PR will be reviewed by the maintainers.
  2. If everything looks good, one of them will approve it, and your PR will be merged.

Thank you for contributing!

@davidMcneil
Contributor Author

This resolves #6740 by adding the event-stream-connect-timeout CLI option, which can also be set with the HAB_EVENT_STREAM_CONNECT_TIMEOUT environment variable. The option takes a numeric value representing the number of seconds to wait for an event stream connection before exiting the Supervisor. The value 0 is treated specially: it indicates there is no timeout, and the Supervisor should start immediately regardless of the state of the event stream connection.
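
For illustration, the "0 means no timeout" semantics map naturally onto an Option; a minimal sketch, where connect_timeout is a hypothetical helper and not the Supervisor's actual code:

```rust
use std::time::Duration;

// Hypothetical helper: interpret the configured timeout. 0 means "no
// timeout": start immediately and keep connecting in the background.
fn connect_timeout(secs: u64) -> Option<Duration> {
    if secs == 0 {
        None
    } else {
        Some(Duration::from_secs(secs))
    }
}
```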

Signed-off-by: David McNeil <[email protected]>
@davidMcneil davidMcneil force-pushed the dmcneil/event-stream branch from f299d31 to 60df8c6 Compare August 14, 2019 20:45
@davidMcneil
Contributor Author

davidMcneil commented Aug 15, 2019

Unfortunately, we were not able to avoid forking the rust-nats library. We needed to make the following changes:

  • make the connect method public
  • add support for auth token credentials
  • percent-decode the appropriate parts of the NATS connection string (see the sketch below)

The fork can be found here.
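
As a rough illustration of that last change, percent-decoding a component of a connection string with the percent-encoding crate (2.x API) looks something like this; the decode helper is a made-up name, not the fork's actual code:

```rust
use percent_encoding::percent_decode_str;

// Hypothetical helper: percent-decode one component of a connection
// string, e.g. an auth token whose '/' was encoded as %2F.
fn decode(part: &str) -> String {
    percent_decode_str(part).decode_utf8_lossy().into_owned()
}

fn main() {
    assert_eq!(decode("abc%2Fdef"), "abc/def");
}
```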

Signed-off-by: David McNeil <[email protected]>
@davidMcneil davidMcneil force-pushed the dmcneil/event-stream branch from 35e4fdb to d137ddc Compare August 15, 2019 14:53
Contributor

@christophermaier christophermaier left a comment

Looks good overall... I had a couple observations / documentation tweaks, though. I think we may need to adjust how we're handling subjects, too.

Nice work!

 trace!("About to queue an event: {:?}", event);
-if let Err(e) = self.0.unbounded_send(event) {
+if let Err(e) = self.0.try_send(event) {
     error!("Failed to queue event: {:?}", e);
Contributor

Probably worth documenting that if we fill up the channel (because we're not currently connected to the NATS server), we'll drop additional messages on the floor here, since try_send will return an error.

Actually, it'd be good to use TrySendError::is_full() to get more information in our error logging.
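
For example, something along these lines; a sketch against the futures 0.1 mpsc API, where TrySendError exposes is_full() (queue_event and the String payload are illustrative):

```rust
use futures::sync::mpsc::Sender;
use log::error;

// Illustrative sketch: log a more specific message when the channel is
// full (i.e. we are disconnected from NATS and the buffer is at capacity).
fn queue_event(tx: &mut Sender<String>, event: String) {
    if let Err(e) = tx.try_send(event) {
        if e.is_full() {
            // The event is dropped on the floor here.
            error!("Event channel full; dropping event: {:?}", e);
        } else {
            error!("Failed to queue event: {:?}", e);
        }
    }
}
```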

Contributor Author

@davidMcneil davidMcneil Aug 16, 2019

The current error message we give when try_send fails due to a full channel is "Failed to queue event: send failed because channel is full". Is there something more you would like to see?

Contributor

Nope, that's good!

Contributor

I think a comment in the code about dropping messages would still be useful as documentation of intent.

Ok(())
});

Runtime::new().expect("Couldn't create event stream runtime!")
Contributor

It's probably worth noting here: the reason all this was initially running in a thread in the first place is that the nitox library had an issue where it didn't play well with other futures on its reactor. To work around that, I just put it off on its own reactor on a separate thread.

Since rust-nats presumably doesn't have that issue, we could theoretically move all this to run directly on the Supervisor's main reactor. If we were to do that, however, we'd need to do it in such a way that we could cleanly shut it down when the Supervisor needs to come down, or we'd run into the same underlying issue that was behind #6712 and fixed by #6717.

(It's perfectly fine to leave it as is; I'm just thinking it would be good to leave a comment here to ensure that any well-meaning refactoring engineer who comes after us knows what the trade-offs are.)
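
For reference, the "own reactor on its own thread" workaround is roughly this pattern (tokio 0.1; start_event_thread is a made-up name, and future::ok(()) stands in for the actual NATS publishing future):

```rust
use std::thread;
use futures::future;
use tokio::runtime::current_thread::Runtime;

// Sketch: the event stream's futures get a dedicated single-threaded
// runtime on their own thread, isolated from the Supervisor's reactor.
fn start_event_thread() {
    thread::Builder::new()
        .name("event-stream".into())
        .spawn(|| {
            let mut runtime =
                Runtime::new().expect("Couldn't create event stream runtime!");
            runtime.spawn(future::ok(())); // stand-in for the publishing future
            runtime.run().expect("Event stream runtime failed!");
        })
        .expect("Couldn't start event stream thread!");
}
```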

Contributor Author

@davidMcneil davidMcneil Aug 16, 2019

These commits here and here address this.

It seems like we need a more robust and ergonomic solution for ensuring all futures are stopped before shutting down. That is outside the scope of this PR, but just to get some ideas out there:

What if we made a "wrapper" around two runtimes? This wrapper could have a spawn_divergent call (or some name indicating that the future never ends) that would spawn the future on a runtime that calls shutdown_now instead of shutdown_on_idle when we shut down.

I'm not sold on this solution, but it would be nice not to have to keep a handle for every divergent future. I wonder how other tokio projects handle correctly ending unbounded futures. Thoughts?
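
To make the idea concrete, a very rough sketch (tokio 0.1; DualRuntime and spawn_divergent are hypothetical names, not an existing API):

```rust
use futures::Future;
use tokio::runtime::Runtime;

// Hypothetical wrapper around two runtimes: one drained politely on
// shutdown, one for never-ending ("divergent") futures that is killed.
struct DualRuntime {
    normal: Runtime,
    divergent: Runtime,
}

impl DualRuntime {
    fn spawn<F>(&mut self, f: F)
        where F: Future<Item = (), Error = ()> + Send + 'static
    {
        self.normal.spawn(f);
    }

    fn spawn_divergent<F>(&mut self, f: F)
        where F: Future<Item = (), Error = ()> + Send + 'static
    {
        self.divergent.spawn(f);
    }

    fn shutdown(self) {
        // Wait for in-flight work to finish, but stop divergent futures now.
        self.normal.shutdown_on_idle().wait().unwrap();
        self.divergent.shutdown_now().wait().unwrap();
    }
}
```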

Contributor

Neat 😄

I'm not sure what the best way forward is here. I like the "handle" approach, since it's explicit, but it does require a little bookkeeping. I haven't seen other approaches for this, though (which is what motivated that handle solution initially).

Your "two Runtimes" approach is also an interesting one, and is worth digging into, I think.

As long as the code in this PR doesn't get us back into a 0.83.0 bug situation, I'm 👍 on merging it.

@@ -11,7 +11,7 @@ use tokio::{prelude::Stream,
             runtime::current_thread::Runtime};

 /// All messages are published under this subject.
-const HABITAT_SUBJECT: &str = "habitat";
+const HABITAT_SUBJECT: &str = "habitat.event.healthcheck";
Contributor

We send out more events than just healthchecks, though... if we're going to have different subjects per event, we'll need to handle that a little differently.
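
For illustration, one way to handle per-event subjects would be to derive the subject from the event type rather than using a single constant. The Event variants and the non-healthcheck subject strings below are made up, not the Supervisor's actual event types:

```rust
// Hypothetical event types, for illustration only.
enum Event {
    HealthCheck,
    ServiceStarted,
    ServiceStopped,
}

// Derive the NATS subject from the event type instead of one constant.
fn subject_for(event: &Event) -> &'static str {
    match event {
        Event::HealthCheck => "habitat.event.healthcheck",
        Event::ServiceStarted => "habitat.event.service_started",
        Event::ServiceStopped => "habitat.event.service_stopped",
    }
}
```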

Contributor Author

@davidMcneil davidMcneil Aug 16, 2019

Thanks for catching this! I will follow up with the Automate team and see what they expect.

@davidMcneil davidMcneil force-pushed the dmcneil/event-stream branch from e62b6ef to b9d80a2 Compare August 16, 2019 21:00
Contributor

@christophermaier christophermaier left a comment

Once the event subject situation is resolved, I'm 👍

@davidMcneil davidMcneil merged commit 160756e into master Aug 19, 2019
@chef-ci chef-ci deleted the dmcneil/event-stream branch August 19, 2019 14:34
@ericcalabretta
Contributor

@davidMcneil If a user sets HAB_EVENT_STREAM_CONNECT_TIMEOUT=0, would the Supervisor attempt to connect to Automate if it was not initially available?

Automate may be unavailable because of an upgrade, a failure, etc., and users may still want to start the Supervisor and its services. It sounds like setting the value to 0 would accomplish this, but users would also want the Supervisor to connect to Automate when/if it becomes available again, once the upgrade is complete or service is restored from whatever the failure was.

@davidMcneil
Contributor Author

@ericcalabretta Regardless of the value of HAB_EVENT_STREAM_CONNECT_TIMEOUT, Habitat will always try to connect to Automate when it goes to publish an event (if it is disconnected). HAB_EVENT_STREAM_CONNECT_TIMEOUT only affects startup behavior. So if a value of 0 is used, Habitat will eventually connect to Automate, provided Automate comes up at the correct URL with the correct auth token.
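
In other words, the behavior is roughly the following (purely illustrative; EventStream and its methods are made-up names, not the real client):

```rust
// Sketch of the described behavior: the connect timeout only gates
// startup; publishing always retries the connection when disconnected.
struct EventStream {
    connected: bool,
}

impl EventStream {
    fn publish(&mut self, event: &str) {
        if !self.connected {
            self.connected = self.try_connect();
        }
        if self.connected {
            println!("published: {}", event);
        } else {
            eprintln!("still disconnected; dropping event: {}", event);
        }
    }

    fn try_connect(&self) -> bool {
        // The real client would attempt a NATS connection to Automate here.
        false
    }
}
```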

@ericcalabretta
Contributor

@davidMcneil That's perfect, thanks for the clarifications.

@christophermaier christophermaier added the Type:Feature label and removed the X-feature label on Jul 24, 2020