
Documentation on skip_grains #53603

Closed

wants to merge 18 commits into from

Conversation

@marbx (Contributor) commented Jun 26, 2019

Introduction

Our master became overloaded when we deployed a new minion configuration in which minions regularly “ping” the salt-master and the salt-master reacts to the minion_ping event (with a runner).

The load exceeded 0.5 after approx. 10% of the minions were pinging.
The salt-master receives a minion_ping every 2 seconds, each from a different minion.

Therefore we stopped the deployment and profiled the salt-master.

Profiling showed that loading grains (green) creates 33.61% of the load, called by MasterMinion (yellow), which in turn is called by RunnerClient (orange).

[Figure: profiling graph of the salt-master]

We found this description of MasterMinion in salt/minion.py:

A MasterMinion is “a fully loaded minion function object for generic use on the master. What makes this class different is that the pillar is omitted, otherwise everything else is loaded cleanly.”

A RunnerClient creates a MasterMinion but additionally omits states and renderers, in salt/client/mixins.py:

self._mminion = salt.minion.MasterMinion(self.opts, states=False, rend=False)

Our RunnerClients don’t need grains. Therefore, we omitted them with the undocumented skip_grains setting.

We set skip_grains: True at 06-18 15:45 and observed a significant (16-fold) and necessary load drop:

  • load = 0.5 with skip_grains: False
  • load = 0.03 with skip_grains: True

[Figure: load graph before and after enabling skip_grains]

We later also observed a significant improvement when restarting the master:

  • at 13:29 with skip_grains: False
  • at 13:56 with skip_grains: False
  • at 15:06 with skip_grains: True

[Figure: master restart load with and without skip_grains]
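
For illustration, a minimal sketch of how the effect can be checked from Python, assuming salt.loader.grains short-circuits and returns an empty dict when skip_grains is set (exact internals may differ between Salt versions):

    import salt.config
    import salt.minion

    # Sketch only: check that a MasterMinion built with skip_grains ends up
    # with empty grains. Assumes salt.loader.grains returns {} when
    # opts['skip_grains'] is True; internals may differ between Salt versions.
    opts = salt.config.master_config('/etc/salt/master')
    opts['skip_grains'] = True  # same effect as skip_grains: True in the master config

    mminion = salt.minion.MasterMinion(opts, states=False, rend=False)
    print(mminion.opts['grains'])  # expected: {} when grains are skipped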

Content

Created documentation on skip_grains

History

This PR first contained new config items to omit grains in the MasterMinion, along the lines of:

    if grains and self.opts.get('masterminion_grains', True):
        # Load grains as before when the proposed config item is unset or True
        self.opts['grains'] = salt.loader.grains(self.opts)
    else:
        # Proposed: skip the expensive grains load entirely
        self.opts['grains'] = {}

Adding new config items received constructive negative feedback; see below.
As we found that the existing config item skip_grains serves our purpose when set to True on the master, we removed the new config items from this PR.

@marbx marbx requested a review from a team as a code owner June 26, 2019 11:33
@waynew (Contributor) commented Jun 26, 2019

Nice! I'm not familiar with the implications here, but it certainly seems like a reasonable improvement! 👍

@marbx (Contributor, Author) commented Jun 26, 2019

@dwoz, I would kindly ask for guidance on how to write tests for this PR.

I think testing must show that all current tests of RunnerClient and MasterMinion, as long as they do not use grains, continue to pass when the proposed options are set to False.

I assume the setUp() procedure can reconfigure a master.

My understanding is that runners are tested in https://github.com/saltstack/salt/blob/develop/tests/integration/client/test_runner.py

Can the setUp() routine get two boolean parameters and be invoked 4 times?

setUp(runnerclient_grains=True, masterminion_grains=True)
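
A hypothetical sketch of such a parametrized setup (runnerclient_grains and masterminion_grains are the names proposed in this PR, not existing Salt options; configure_master is a placeholder):

    import itertools

    # Hypothetical sketch: exercise the runner tests once per combination of
    # the two proposed options. configure_master is a placeholder for however
    # the test suite writes options into the test master's configuration.
    def configure_master(runnerclient_grains, masterminion_grains):
        return {
            'runnerclient_grains': runnerclient_grains,
            'masterminion_grains': masterminion_grains,
        }

    for combo in itertools.product([True, False], repeat=2):
        opts = configure_master(*combo)
        print('would run the runner tests with', opts)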

The test of EventReturn must be modified accordingly, because it invokes MasterMinion, too.

@mattp- (Contributor) commented Jun 27, 2019

@markuskramerIgitt I'm not sure it makes sense for this change to add Yet Another Config Item (or two) for a special case. What you can do instead to effectively mitigate this is something like:

  • enable grains_cache: True / grains_cache_ttl: N seconds
  • set up a grains refresh that is <= the TTL; this will put the blocking grains refresh into the scheduler's path

This should essentially give you the same performance gain (see the sketch below). Tbh I think grains in general should be configured by default to work this way, so that no one ever gets hit with this blocking performance loss. What are your thoughts on that approach?
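
The caching idea in general terms, as a minimal sketch (this is the plain TTL pattern, not Salt's actual grains_cache implementation):

    import time

    GRAINS_CACHE_TTL = 300  # analogous to grains_cache_ttl, in seconds

    _cache = {'grains': None, 'loaded_at': 0.0}

    def cached_grains(load_grains):
        """Return cached grains, re-running the expensive load only after the TTL expires."""
        now = time.time()
        if _cache['grains'] is None or now - _cache['loaded_at'] > GRAINS_CACHE_TTL:
            _cache['grains'] = load_grains()  # the blocking refresh
            _cache['loaded_at'] = now
        return _cache['grains']

    # The expensive loader runs once; repeat calls within the TTL hit the cache.
    print(cached_grains(lambda: {'id': 'minion1'}))
    print(cached_grains(lambda: {'id': 'minion1'}))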

@max-arnold (Contributor) commented Jun 27, 2019

Good point!

What bothers me most is that nobody knows what else will break when the grains are disabled. Existing unit tests can't uncover potential problems because they do not use these options. Someone WILL attempt to run other code paths in this mode, and it will lead to more stale issues and more complaints that Salt is complex and buggy.

This performance problem is important, but the resulting complexity and combinatorial explosion of possible code paths are worse and much harder to fix.

It is not the first time a grains-related optimization hack has been submitted: #49481

UPD: I recently spent hours debugging the grains caching issue #53536

@marbx (Contributor, Author) commented Jun 28, 2019

Hi @mattp-, I have to admit that I too dislike the addition of these two new config items.

You introduced LazyLoading for functions in PR #53553, which I noticed a few days ago, only now realizing that you are the author.

If grains could be lazy-loaded, it could deliver the performance improvement we need without adding config items, while those who use grains should not notice a difference.

Regarding grains_cache:

In our setup, the only grain we need is the id of the minion, which comes with the event. So essentially we don't use grains. Therefore, I would not profit from the content of a grains_cache, while it still must be loaded once per grains_cache_ttl. We have 20K minions. For performance (CPU and network), we would tend to increase the TTL, which would reduce the validity of some of the grains, which we don't need in the first place. We also have had very mixed results from trying to use schedulers, so I would wait for a use case before I try these techniques (again).

Combining grains_cache and “lazy loaded” grains would result in “lazy loaded cached” grains.

Therefore I would like to continue with “lazy loaded grains”.
What do you think?

@marbx (Contributor, Author) commented Jun 28, 2019

Hi @max-arnold, thank you for pointing to the existing grains performance discussions!
What a long story…

I completely agree with you that config items complicate testing (“combinatorial explosion of possible code paths”). I now realize that my proposed new config items were very naïve.

Please allow me to replace them with a better solution - what do you think of “lazy loaded” grains?

@max-arnold (Contributor) commented Jun 28, 2019

@markuskramerIgitt

Please allow me to replace them with a better solution - what do you think of “lazy loaded” grains?

I do not know the codebase well enough, but at first glance it looks like a good idea that could be explored. Do you want to lazy-load grains on a minion? How could this affect accessing grains on a master (for example, in a pillar file)?

Another option is to achieve your goal in a different way:

  1. What do you want to achieve by storing minion ping events in a Redis DB?
  2. What does the word "runner" mean to you? I guess it is not a runner module?
  3. Could minions access your Redis instance directly? Or use a returner?
  4. Could a custom engine be used instead? (example1, example2)
  5. Why is it necessary to instantiate a new MasterMinion on each event?

And (out of curiosity), how did you make this nice profiling graph on a master?

@mattp- (Contributor) commented Jun 28, 2019

@markuskramerIgitt that's an interesting idea! But let me revisit your response to the grains cache first: this PR only relates to runners instantiating grains and getting hit, not the minion side. So you would only need to turn on the cache on your masters; you could leave it off on the minion fleet. That would short-circuit the grains load on any RunnerClient to the equivalent of loading the serialized msgpack data.

Wrt lazy-loading the grains, that's probably not what I'd call it. It's a bit confusing, but salt.loader.grains() returns a plain dict, with salt.loader.grain_funcs() returning the actual LazyLoader object (equivalent to minion_mods, or any other, etc). Due to the behavior of current grains, there is no 1-to-1 mapping from a root key in grains to a given grains module/function; they can actually overlap, such that a grains function can return any data to any key it wants.

What would make sense is deferring this call, such that the dict returned by salt.loader.grains is actually a wrapper: when __getitem__ is hit, it will then call salt.loader.grains(). So similar to LazyLoader, but not quite 🙂 In the default case like above, I believe a runner would actually never trigger it unless you have code that reaches into __grains__ to do a thing.
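
A minimal sketch of such a deferring wrapper (illustrative only; DeferredGrains is not part of Salt):

    from collections.abc import Mapping

    class DeferredGrains(Mapping):
        """Behaves like the dict returned by salt.loader.grains(), but only
        performs the expensive load on first access (illustrative, not Salt code)."""

        def __init__(self, loader):
            self._loader = loader  # e.g. functools.partial(salt.loader.grains, opts)
            self._data = None

        def _load(self):
            if self._data is None:
                self._data = self._loader()  # deferred: runs once, on first access
            return self._data

        def __getitem__(self, key):
            return self._load()[key]

        def __iter__(self):
            return iter(self._load())

        def __len__(self):
            return len(self._load())

    # A runner that never reaches into __grains__ never triggers the load:
    grains = DeferredGrains(lambda: {'id': 'minion1'})  # stand-in loader
    print(grains['id'])  # first access triggers the (stand-in) load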

@mattp- (Contributor) commented Jun 28, 2019

@markuskramerIgitt also echoing @max-arnold, I'm curious what you used to create that profiling output; it seems very useful and easy on the eyes 🙂

@marbx marbx changed the title Option to disable Runners or MasterMinions grains to enhance performance Adding more test cases for runners and discussion on grains Jun 28, 2019
@marbx (Contributor, Author) commented Jun 28, 2019

Hi @max-arnold,

Do you want to lazy-load grains on a minion?

My focus is on enhancing the performance of the master, but I guess lazy loading (if feasible) should not differentiate between master and minion.

How could this affect accessing grains on a master (for example, in a pillar file)?

My understanding is that “lazy loading” grains would happen “as late as possible”, in contrast to the “as soon as possible” of today, but otherwise it must be completely transparent. I am not familiar with pillars, but which entity loads the grains should also not matter.
We should create test cases in this PR to have proof.

What do you want to achieve by storing minion ping events in a Redis DB?

If the runner that reacts to the minion_ping finds no Redis flag all_task_done:MINION, it calls the local module all_tasks.main on that minion. Another runner, which reacts to all_tasks.main, writes the Redis flag all_task_done:MINION.
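
A hypothetical sketch of that flow (all_tasks.main and the all_task_done:<minion> key come from the description above; the Redis connection details and the wiring of these functions to events are assumptions):

    import redis
    import salt.client

    # Hypothetical runner sketch of the flow described above; the connection
    # details and how these functions react to events are assumed.
    r = redis.Redis(host='localhost')
    local = salt.client.LocalClient()

    def on_minion_ping(minion_id):
        """React to minion_ping: start all_tasks.main unless the flag is set."""
        if not r.exists('all_task_done:{}'.format(minion_id)):
            local.cmd(minion_id, 'all_tasks.main')

    def on_all_tasks_done(minion_id):
        """React to the all_tasks.main return: write the done flag."""
        r.set('all_task_done:{}'.format(minion_id), 1)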

What does the word "runner" mean to you? I guess it is not a runner module?

I actually mean Python modules of this type (runners).

Could minions access your Redis instance directly?

I must restrict access to Redis (or maybe memcache, which seems to be installed with Salt) to the master.

Or use a returner? Could a custom engine be used instead?

Thank you, I need to assess these concepts and will come back to you.

Why is it necessary to instantiate a new MasterMinion on each event?

I would not know. I don't intentionally call them; the RunnerClient does. I only observe them in the profile output.

how did you make this nice profiling graph on a master?

Thank you, but I made them on Windows; see How to create master profiling graphs on Windows. I guess the same tools could run on a master (I only have tty access to the master).
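
For reference, a generic cProfile run along those lines (workload is a stand-in for the code under investigation; the actual tools used for the graphs in this thread are described in the linked document):

    import cProfile
    import pstats

    def workload():
        # Stand-in for the code under investigation
        sum(i * i for i in range(10 ** 6))

    cProfile.run('workload()', 'master.prof')
    pstats.Stats('master.prof').sort_stats('cumulative').print_stats(10)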

@marbx (Contributor, Author) commented Jul 2, 2019

Hi @mattp-

[...] you would only need to turn on the cache on your masters; you could leave it off on the minion fleet. That would short-circuit the grains load on any RunnerClient […]

I will try that later. The short circuit is only effective on a repetition within grains_cache_ttl seconds. I expect the master to still overload at restart and, because we have several thousand minions, to yield only a small benefit.

[…] a wrapper: when __getitem__ is hit, it will then call salt.loader.grains().

I am starting to understand that “deferring” and “lazy loading” are separate effects.
I would benefit from a “deferred” call. How would you implement that?

I fail to access the grains that the MasterMinion in the RunnerClient loads.
How do you access them? I would like to write a test case.

@waynew waynew added the Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged label Jul 24, 2019
@marbx marbx closed this Nov 15, 2019
@marbx marbx deleted the option_omit_grains_on_master branch November 15, 2019 13:22
@marbx marbx restored the option_omit_grains_on_master branch November 15, 2019 13:39
@marbx marbx reopened this Nov 15, 2019
@marbx marbx changed the title Adding more test cases for runners and discussion on grains Documentation on skip_grains Feb 8, 2020
@marbx (Contributor, Author) commented Feb 8, 2020

@waynew, I removed the test cases and turned this PR into pure documentation. Could you please review?

@marbx marbx mentioned this pull request Feb 8, 2020
@waynew (Contributor) commented Feb 27, 2020

re-run all

@DmitryKuzmenko (Contributor) commented:

@markuskramerIgitt Could you please rebase it onto the master branch or re-create the PR targeted on master?

@marbx (Contributor, Author) commented Apr 6, 2020

Hi @DmitryKuzmenko
I re-created the PR targeted on master in #56098

@DmitryKuzmenko (Contributor) commented:

@markuskramerIgitt thank you!

dwoz pushed a commit that referenced this pull request Apr 11, 2020
@marbx marbx deleted the option_omit_grains_on_master branch April 19, 2020 21:00