Expired Locks remain in zset of digests [using "until_expired" lock] #653

Closed
francesmcmullin opened this issue Nov 11, 2021 · 9 comments · Fixed by #661

Comments

@francesmcmullin (Contributor)

Describe the bug
When locks are set up to expire using the TTL setting (both globally and at the job level), Redis memory usage should stay stable. However, while a lock's associated string keys are expired as expected, the digest in the uniquejobs:digests zset is never cleaned up.

(This may be a duplicate of #637; I've opened a fresh issue because that thread seems focused on jobs that exit early with an error, whereas I've observed this issue in the absence of any errors.)

[ It's entirely possible I've misunderstood some aspect of how this gem can or should work; any corrections to my assumptions or configuration are very welcome =) ]

Expected behavior
For locks/digests which have expired, I expect all data and traces to be cleaned up, leaving old locks with a memory footprint of zero.

Current behavior
Strings under uniquejobs:* and hashes of uniquejobs:*:LOCKED and uniquejobs:*:UPGRADED are all expired according to our configured TTL (24 hours), but the zset uniquejobs:digests continues to grow and includes digests many days old.
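
A quick way to confirm this from a console (a minimal inspection sketch; the digest key below is a hypothetical placeholder):

Sidekiq.redis do |conn|
  conn.ttl("uniquejobs:SOME_DIGEST")                         # per-lock keys expire as configured
  conn.memory(:usage, "uniquejobs:digests")                  # ...but this zset keeps growing
  conn.zrange("uniquejobs:digests", 0, 4, with_scores: true) # scores look like epoch seconds
end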

Worker class

class MyWorker
  include Sidekiq::Worker
  sidekiq_options(
    lock: :until_expired,
    on_conflict: :log,
    lock_ttl: 24 * 60 * 60,
    unique_args: ->(args) { [args[0], args[1], args[5]] }
  )
  def perform(*args); end
end

Sidekiq version: 6.2.2
Sidekiq Unique Jobs version: 7.1.8

Additional context
For our use case, most of our locks will be short-lived. Our job arguments usually include an ID which is at risk of being handled multiple times briefly, but will never be used again after a few days. We get no value from holding digests any longer, and it's an active drain on performance and a risk to uptime if our Redis memory usage grows forever.

I do not know if this is the sort of thing the reaper should be cleaning up, but as far as I can tell we do have the reaper enabled (default config):

irb(main):001:0> SidekiqUniqueJobs.config.reaper
=> :ruby
irb(main):002:0> SidekiqUniqueJobs.config.reaper_count
=> 1000
irb(main):003:0> SidekiqUniqueJobs.config.reaper_interval
=> 600
irb(main):004:0> SidekiqUniqueJobs.config.reaper_timeout
=> 10

Also thanks so much for a great gem and all the work on version 7 👏🏻 👏🏻 👏🏻

@mhenrixon (Owner)

Thank you for the wonderful report (and the praise of course) 😁

Thanks to your detailed explanation, I have an idea of what could go wrong. I believe the key doesn't have any entries but the key itself remains. I am not yet entirely sure, but it might be that when I remove an entry from a hash, set, or list, the key's expiration is reset. I will do some digging on this.

@francesmcmullin (Author)

This may be a complete distraction, and I don't want to rush you for a response (any time is fine), but I'm feeding back some more info from my own research in case it's helpful.

It seems as though uniquejobs:digests retains data in all circumstances. Even after deleting all locks via the web UI, it still held approximately 109 MB of data.

It seems that individual entries in a Redis sorted set (a.k.a. zset) cannot have their own expiry or TTL. However, looking at the entries in uniquejobs:digests, I found that their scores all look like epoch timestamps in seconds. I'm speculating at this stage, but could these be either the timestamp when the digest was last used, or the time when it should expire? To test my theory, I ran zremrangebyscore uniquejobs:digests 1 1636641496 to remove digests more than 24 hours old, and the zset's memory usage dropped from 109 MB to 13 MB, which is about what I'd expect if only 24 hours' worth of digests were present.

On the bright side, from my perspective, this gives me a pretty clear mid-term solution for controlling memory usage: I can just schedule a job to run zremrangebyscore with the correct range once a day (roughly the sketch below) and we'll be fine.
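
Roughly what I have in mind (the worker name and the 24-hour window are placeholders of my own, not anything from the gem):

class UniquejobsDigestCleanupWorker
  include Sidekiq::Worker

  ONE_DAY = 24 * 60 * 60

  def perform
    cutoff = Time.now.to_i - ONE_DAY
    Sidekiq.redis do |conn|
      # the scores in uniquejobs:digests look like epoch seconds,
      # so drop anything older than 24 hours
      conn.zremrangebyscore("uniquejobs:digests", 1, cutoff)
    end
  end
end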

@mhenrixon (Owner)

Weird, this is the opposite of what my tests say.

I'll have to try and find a way to replicate this with a failing test.

None of my current tests leave locks behind.

@francesmcmullin (Author)

Important note, with more info tomorrow: I believe a substantial portion of my trouble has been caused by passing a namespace through Sidekiq's Redis config. This relies on redis-namespace, which does its best to invisibly add and remove the namespace as appropriate for each command, but I suspect it can't possibly anticipate and adapt to everything Lua scripts might do.

I'm now removing this namespace, and will do some fresh testing tomorrow to try and figure out how much (if any!) of a problem still remains. This could potentially be a nice simple outcome, fixed with a little documentation warning against namespaces 🤞🏻

@mhenrixon (Owner)

@francesmcmullin the Redis namespace shouldn't be a problem, as I have others using it. This is something else entirely, but I am working on finding out exactly what.

@francesmcmullin (Author) commented Nov 29, 2021

Thanks for that tip! Just a small update on my end, likely not news for you but just documenting my own understanding:

  • You were right of course, removing the namespace did not resolve the problem I described initially, although it did simplify some inspection steps.
  • I tried to narrow the problem down by adding a spec with similar behaviour and found:
    • A spec that simply creates an :until_expired lock and then sleeps for the TTL (I used 3 seconds, for convenience) shows that the lock remains in the digests zset.
    • However, if you include SidekiqUniqueJobs::Orphans::Reaper.call, then the digest is removed 🎉
  • I guessed that, for some reason, the reaper may not be running on my instance.
    • I triple-checked my config; it matches the docs, and the output I included in the issue description is accurate.
    • Watching Sidekiq.redis { |r| r.memory :usage, 'uniquejobs:digests' } and Sidekiq.redis { |r| r.get 'uniquejobs:reaper' }, I can even see the reaper run every 10 minutes, and sure enough, digests are cleaned up!
    • The above holds under "normal" conditions: low traffic, and most jobs using other locks (not :until_expired). Our usage includes bursts/surges when large numbers of jobs (for us, hundreds of thousands) using :until_expired are processed. I am expecting one of these today and will be able to observe how the reaper responds. My current theory is that it gets overwhelmed by the volume in some way, and that configuring the reaper resurrector, switching to the lua reaper, or changing the interval may help.
    • A little arithmetic suggests we definitely need to reduce the reaper interval and/or increase the reaper count: with the default config (1,000 digests per run, every 10 minutes), cleaning up 100,000 digests takes 100 runs, i.e. about 16.6 hours. That is significant, but it should mean memory usage drops slowly rather than stalls. I suspect that under certain circumstances, instead of reaping slowly but steadily, the reaper is overwhelmed entirely and cannot recover. I'll be able to test this idea tomorrow; a tuning sketch follows after this list.
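
For reference, the kind of tuning I have in mind looks something like this (the numbers are illustrative placeholders, not recommendations):

SidekiqUniqueJobs.configure do |config|
  config.reaper_count    = 10_000 # digests per reaper run, default 1_000
  config.reaper_interval = 60     # seconds between runs, default 600
end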

@francesmcmullin (Author)

Good news! 🎉 I am quite confident I have discovered our problem. I believe that, once you have more orphaned digests than your reaper_count limit, the Ruby reaper's #orphans method breaks and stops returning any orphans. The method looks like this:

def orphans
  conn.zrevrange(digests.key, 0, -1).each_with_object([]) do |digest, memo|
    next if belongs_to_job?(digest)

    memo << digest
    break if memo.size >= reaper_count
  end
end

Confusingly (in my opinion), Ruby's each_with_object discards the memo if you use break to exit the loop: a bare break makes the whole each_with_object call return nil. Fortunately, it is possible to pass a return value to break, so we can fix it like this:

   break memo if memo.size >= reaper_count
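
A quick irb check of that behaviour in plain Ruby (nothing gem-specific):

(1..10).each_with_object([]) { |i, memo| memo << i; break if memo.size >= 3 }
# => nil (the memo is discarded)

(1..10).each_with_object([]) { |i, memo| memo << i; break memo if memo.size >= 3 }
# => [1, 2, 3]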

If it's helpful, I can try to write up a PR with a spec for this scenario. Of course, I may have misunderstood something about the internals or intentions with the reaper, in which case I'll happily follow your direction instead.

@mhenrixon (Owner)

You are right, of course; I'll see if I can write a failing test for this scenario!

mhenrixon added a commit that referenced this issue Nov 30, 2021
* Fix edge case in ruby reaper

Closes #653

The problem was that the reaper returns nil instead of the memo. Many thanks to @francesmcmullin for the wonderfully detailed report and investigation. Now we have a regression test for this behavior.

* Mandatory rubocop commit

@mhenrixon (Owner)

Released as v7.1.11. @francesmcmullin, thank you so much for the help!
