Update event stream #6853

Merged 14 commits into master from dmcneil/event-stream on Aug 19, 2019

Conversation

davidMcneil
Contributor

Resolves #6761 and #6740

This PR first addresses #6740 using the existing natsio and nitox NATS clients. It then removes those clients and switches to a non-streaming client, rust-nats. In initial testing, rust-nats, and non-streaming connections in general, appear much more reliable. However, testing was done against a standalone NATS server rather than directly against Automate; testing against Automate is blocked on Automate supporting a plain NATS connection.

There are several TODOs in this PR. Those should be straightforward to address once we verify this is the direction we want to go for our NATS client. #6770 is the spike to evaluate the NATS clients.

@chef-expeditor
Contributor

Hello davidMcneil! Thanks for the pull request!

Here is what will happen next:

  1. Your PR will be reviewed by the maintainers.
  2. If everything looks good, one of them will approve it, and your PR will be merged.

Thank you for contributing!

@davidMcneil
Contributor Author

This resolves #6740 by adding the event-stream-connect-timeout CLI option, which can also be set with the HAB_EVENT_STREAM_CONNECT_TIMEOUT environment variable. The option takes a numeric value representing the number of seconds to wait for an event stream connection before exiting the Supervisor. The value 0 is treated specially: it indicates there is no timeout, and the Supervisor should start immediately regardless of the state of the event stream connection.
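
For illustration, the "0 means no timeout" semantics map naturally onto an Option; a minimal sketch, where connect_timeout is a hypothetical helper and not the Supervisor's actual code:

```rust
use std::time::Duration;

// Hypothetical helper: interpret the configured timeout. 0 means "no
// timeout": start immediately and keep connecting in the background.
fn connect_timeout(secs: u64) -> Option<Duration> {
    if secs == 0 {
        None
    } else {
        Some(Duration::from_secs(secs))
    }
}
```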

Signed-off-by: David McNeil <[email protected]>
@davidMcneil davidMcneil force-pushed the dmcneil/event-stream branch from f299d31 to 60df8c6 Compare August 14, 2019 20:45
@davidMcneil
Contributor Author

davidMcneil commented Aug 15, 2019

Unfortunately, we were not able to avoid forking the rust-nats library. We needed to make the following changes:

  • make the connect method public
  • add support for auth token credentials
  • percent-decode the appropriate parts of the NATS connection string (see the sketch below)

The fork can be found here.
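
As a rough illustration of that last change, percent-decoding a component of a connection string with the percent-encoding crate (2.x API) looks something like this; the decode helper is a made-up name, not the fork's actual code:

```rust
use percent_encoding::percent_decode_str;

// Hypothetical helper: percent-decode one component of a connection
// string, e.g. an auth token whose '/' was encoded as %2F.
fn decode(part: &str) -> String {
    percent_decode_str(part).decode_utf8_lossy().into_owned()
}

fn main() {
    assert_eq!(decode("abc%2Fdef"), "abc/def");
}
```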

Signed-off-by: David McNeil <[email protected]>
@davidMcneil davidMcneil force-pushed the dmcneil/event-stream branch from 35e4fdb to d137ddc Compare August 15, 2019 14:53
Contributor

@christophermaier christophermaier left a comment

Looks good overall... I had a couple observations / documentation tweaks, though. I think we may need to adjust how we're handling subjects, too.

Nice work!

 trace!("About to queue an event: {:?}", event);
-if let Err(e) = self.0.unbounded_send(event) {
+if let Err(e) = self.0.try_send(event) {
     error!("Failed to queue event: {:?}", e);
Contributor

Probably worth documenting that if we fill up the channel (because we're not currently connected to the NATS server), we'll drop additional messages on the floor here, since try_send will return an error.

Actually, it'd be good to use TrySendError::is_full() to get more information in our error logging.
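
For example, something along these lines; a sketch against the futures 0.1 mpsc API, where TrySendError exposes is_full() (queue_event and the String payload are illustrative):

```rust
use futures::sync::mpsc::Sender;
use log::error;

// Illustrative sketch: log a more specific message when the channel is
// full (i.e. we are disconnected from NATS and the buffer is at capacity).
fn queue_event(tx: &mut Sender<String>, event: String) {
    if let Err(e) = tx.try_send(event) {
        if e.is_full() {
            // The event is dropped on the floor here.
            error!("Event channel full; dropping event: {:?}", e);
        } else {
            error!("Failed to queue event: {:?}", e);
        }
    }
}
```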

Contributor Author

@davidMcneil davidMcneil Aug 16, 2019

The current error message we give when try_send fails due to a full channel is "Failed to queue event: send failed because channel is full". Is there something more you would like to see?

Contributor

Nope, that's good!

Contributor

I think a comment in the code about dropping messages would still be useful as documentation of intent.

Ok(())
});

Runtime::new().expect("Couldn't create event stream runtime!")
Contributor

It's probably worth noting here: the reason all this was initially running in a thread in the first place is that the nitox library had an issue where it didn't play well with other futures on its reactor. To work around that, I just put it off on its own reactor on a separate thread.

Since rust-nats presumably doesn't have that issue, we could theoretically move all this to run directly on the Supervisor's main reactor. If we were to do that, however, we'd need to do it in such a way that we could cleanly shut it down when the Supervisor needs to come down, or we'd run into the same underlying issue that was behind #6712 and fixed by #6717.

(It's perfectly fine to leave it as is; I'm just thinking it would be good to leave a comment here to ensure that any well-meaning refactoring engineer who comes after us knows what the trade-offs are.)
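
For reference, the "own reactor on its own thread" workaround is roughly this pattern (tokio 0.1; start_event_thread is a made-up name, and future::ok(()) stands in for the actual NATS publishing future):

```rust
use std::thread;
use futures::future;
use tokio::runtime::current_thread::Runtime;

// Sketch: the event stream's futures get a dedicated single-threaded
// runtime on their own thread, isolated from the Supervisor's reactor.
fn start_event_thread() {
    thread::Builder::new()
        .name("event-stream".into())
        .spawn(|| {
            let mut runtime =
                Runtime::new().expect("Couldn't create event stream runtime!");
            runtime.spawn(future::ok(())); // stand-in for the publishing future
            runtime.run().expect("Event stream runtime failed!");
        })
        .expect("Couldn't start event stream thread!");
}
```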

Contributor Author

@davidMcneil davidMcneil Aug 16, 2019

These commits here and here address this.

It seems like we need a more robust and ergonomic solution for ensuring all futures are stopped before shutting down. That is outside the scope of this PR, but just to get some ideas out there:

What if we made a "wrapper" around two runtimes? This wrapper could have a spawn_divergent call (or some name indicating that the future never ends) that would spawn the future on a runtime that calls shutdown_now instead of shutdown_on_idle when we shut down.

I'm not sold on this solution, but it would be nice not to have to keep a handle for every divergent future. I wonder how other tokio projects handle correctly ending unbounded futures. Thoughts?
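
To make the idea concrete, a very rough sketch (tokio 0.1; DualRuntime and spawn_divergent are hypothetical names, not an existing API):

```rust
use futures::Future;
use tokio::runtime::Runtime;

// Hypothetical wrapper around two runtimes: one drained politely on
// shutdown, one for never-ending ("divergent") futures that is killed.
struct DualRuntime {
    normal: Runtime,
    divergent: Runtime,
}

impl DualRuntime {
    fn spawn<F>(&mut self, f: F)
        where F: Future<Item = (), Error = ()> + Send + 'static
    {
        self.normal.spawn(f);
    }

    fn spawn_divergent<F>(&mut self, f: F)
        where F: Future<Item = (), Error = ()> + Send + 'static
    {
        self.divergent.spawn(f);
    }

    fn shutdown(self) {
        // Wait for in-flight work to finish, but stop divergent futures now.
        self.normal.shutdown_on_idle().wait().unwrap();
        self.divergent.shutdown_now().wait().unwrap();
    }
}
```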

Contributor

Neat 😄

I'm not sure what the best way forward is here. I like the "handle" approach, since it's explicit, but it does require a little bookkeeping. I haven't seen other approaches for this, though (which is what motivated that handle solution initially).

Your "two Runtimes" approach is also an interesting one, and is worth digging into, I think.

As long as the code in this PR doesn't get us back into a 0.83.0 bug situation, I'm 👍 on merging it.

@@ -11,7 +11,7 @@ use tokio::{prelude::Stream,
             runtime::current_thread::Runtime};

 /// All messages are published under this subject.
-const HABITAT_SUBJECT: &str = "habitat";
+const HABITAT_SUBJECT: &str = "habitat.event.healthcheck";
Contributor

We send out more events than just healthchecks, though... if we're going to have different subjects per event, we'll need to handle that a little differently.
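
For illustration, one way to handle per-event subjects would be to derive the subject from the event type rather than using a single constant. The Event variants and the non-healthcheck subject strings below are made up, not the Supervisor's actual event types:

```rust
// Hypothetical event types, for illustration only.
enum Event {
    HealthCheck,
    ServiceStarted,
    ServiceStopped,
}

// Derive the NATS subject from the event type instead of one constant.
fn subject_for(event: &Event) -> &'static str {
    match event {
        Event::HealthCheck => "habitat.event.healthcheck",
        Event::ServiceStarted => "habitat.event.service_started",
        Event::ServiceStopped => "habitat.event.service_stopped",
    }
}
```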

Contributor Author

@davidMcneil davidMcneil Aug 16, 2019

Thanks for catching this! I will follow up with the Automate team and see what they expect.

@davidMcneil davidMcneil force-pushed the dmcneil/event-stream branch from e62b6ef to b9d80a2 Compare August 16, 2019 21:00
Contributor

@christophermaier christophermaier left a comment

Once the event subject situation is resolved, I'm 👍

@davidMcneil davidMcneil merged commit 160756e into master Aug 19, 2019
@chef-ci chef-ci deleted the dmcneil/event-stream branch August 19, 2019 14:34
@ericcalabretta
Contributor

@davidMcneil If a user sets HAB_EVENT_STREAM_CONNECT_TIMEOUT=0, would the Supervisor attempt to connect to Automate if it was not initially available?

Automate may be unavailable because of an upgrade, a failure, etc., and users may still want to start the Supervisor and its services. It sounds like setting the value to 0 would accomplish this, but users would also want the Supervisor to connect to Automate when/if it becomes available again, once the upgrade is complete or service is restored from whatever the failure was.

@davidMcneil
Contributor Author

@ericcalabretta Regardless of the value of HAB_EVENT_STREAM_CONNECT_TIMEOUT, Habitat will always try to connect to Automate when it goes to publish an event (if it is disconnected). HAB_EVENT_STREAM_CONNECT_TIMEOUT only affects startup behavior. So if a value of 0 is used, Habitat will eventually connect to Automate, provided Automate comes up at the correct URL with the correct auth token.
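
In other words, the behavior is roughly the following (purely illustrative; EventStream and its methods are made-up names, not the real client):

```rust
// Sketch of the described behavior: the connect timeout only gates
// startup; publishing always retries the connection when disconnected.
struct EventStream {
    connected: bool,
}

impl EventStream {
    fn publish(&mut self, event: &str) {
        if !self.connected {
            self.connected = self.try_connect();
        }
        if self.connected {
            println!("published: {}", event);
        } else {
            eprintln!("still disconnected; dropping event: {}", event);
        }
    }

    fn try_connect(&self) -> bool {
        // The real client would attempt a NATS connection to Automate here.
        false
    }
}
```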

@ericcalabretta
Contributor

@davidMcneil That's perfect, thanks for the clarifications.

@christophermaier christophermaier added the Type:Feature label and removed the X-feature label on Jul 24, 2020