Expired Locks remain in zset of digests [using "until_expired" lock] #653

Closed
francesmcmullin opened this issue Nov 11, 2021 · 9 comments · Fixed by #661

Comments

@francesmcmullin (Contributor)

Describe the bug
When locks are set up to expire using the TTL setting (both globally and at the job level), Redis memory usage should stay stable. However, while a lock's associated string keys are expired as expected, the digest in the uniquejobs:digests zset is never cleaned up.

(This may be a duplicate of #637; I've opened a fresh issue because that thread seems focused on jobs that exit early with an error, whereas I've observed this issue in the absence of any errors.)

[ It's entirely possible I've misunderstood some aspect of how this gem can or should work; any corrections to my assumptions or configuration are very welcome =) ]

Expected behavior
For locks/digests which have expired, I expect all data and traces to be cleaned up, leaving old locks with a memory footprint of zero.

Current behavior
Strings under uniquejobs:* and hashes of uniquejobs:*:LOCKED and uniquejobs:*:UPGRADED are all expired according to our configured TTL (24 hours), but the zset uniquejobs:digests continues to grow and includes digests many days old.
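
A quick way to confirm this from a console (a minimal inspection sketch; the digest key below is a hypothetical placeholder):

Sidekiq.redis do |conn|
  conn.ttl("uniquejobs:SOME_DIGEST")                         # per-lock keys expire as configured
  conn.memory(:usage, "uniquejobs:digests")                  # ...but this zset keeps growing
  conn.zrange("uniquejobs:digests", 0, 4, with_scores: true) # scores look like epoch seconds
end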

Worker class

class MyWorker
  include Sidekiq::Worker
  sidekiq_options(
    lock: :until_expired,
    on_conflict: :log,
    lock_ttl: 24 * 60 * 60,
    unique_args: ->(args) { [args[0], args[1], args[5]] }
  )
  def perform(*args); end
end

Sidekiq version: 6.2.2
Sidekiq Unique Jobs version: 7.1.8

Additional context
For our use case, most of our locks will be short-lived. Our job arguments usually include an ID which is at risk of being handled multiple times briefly, but will never be used again after a few days. We get no value from holding digests any longer, and it's an active drain on performance and a risk to uptime if our Redis memory usage grows forever.

I do not know if this is the sort of thing the reaper should be cleaning up, but as far as I can tell we do have the reaper enabled (default config):

irb(main):001:0> SidekiqUniqueJobs.config.reaper
=> :ruby
irb(main):002:0> SidekiqUniqueJobs.config.reaper_count
=> 1000
irb(main):003:0> SidekiqUniqueJobs.config.reaper_interval
=> 600
irb(main):004:0> SidekiqUniqueJobs.config.reaper_timeout
=> 10

Also thanks so much for a great gem and all the work on version 7 👏🏻 👏🏻 👏🏻

@mhenrixon (Owner)

Thank you for the wonderful report (and the praise of course) 😁

Thanks to your detailed explanation, I have an idea of what could go wrong. I believe the key doesn't have any entries but the key itself remains. I am not yet entirely sure, but it might be that when I remove an entry from a hash, set, or list, the key's expiration is reset. I will do some digging on this.

@francesmcmullin (Author)

This may be a complete distraction, and I don't want to rush you for a response (any time is fine), but I'm feeding back some more info from my own research in case it's helpful.

It seems as though uniquejobs:digests retains data in all circumstances. Even after deleting all locks via the web UI, it still held approximately 109 MB of data.

It seems that individual entries in a Redis sorted set (a.k.a. zset) cannot have their own expiry or TTL. However, looking at the entries in uniquejobs:digests, I found that their scores all look like epoch timestamps in seconds. I'm speculating at this stage, but could these be either the timestamp when the digest was last used, or the time when it should expire? To test my theory, I ran zremrangebyscore uniquejobs:digests 1 1636641496 to remove digests more than 24 hours old, and the zset's memory usage dropped from 109 MB to 13 MB, which is about what I'd expect if only 24 hours' worth of digests were present.

On the bright side, from my perspective, this gives me a pretty clear mid-term solution for controlling memory usage: I can just schedule a job to run zremrangebyscore with the correct range once a day (roughly the sketch below) and we'll be fine.
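
Roughly what I have in mind (the worker name and the 24-hour window are placeholders of my own, not anything from the gem):

class UniquejobsDigestCleanupWorker
  include Sidekiq::Worker

  ONE_DAY = 24 * 60 * 60

  def perform
    cutoff = Time.now.to_i - ONE_DAY
    Sidekiq.redis do |conn|
      # the scores in uniquejobs:digests look like epoch seconds,
      # so drop anything older than 24 hours
      conn.zremrangebyscore("uniquejobs:digests", 1, cutoff)
    end
  end
end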

@mhenrixon (Owner)

Weird, this is the opposite of what my tests say.

I'll have to try and find a way to replicate this with a failing test.

None of my current tests leave locks behind.

@francesmcmullin (Author)

Important note, with more info tomorrow: I believe a substantial portion of my trouble has been caused by passing a namespace through Sidekiq's Redis config. This relies on redis-namespace, which does its best to invisibly add and remove the namespace as appropriate for each command, but I suspect it can't possibly anticipate and adapt to everything Lua scripts might do.

I'm now removing this namespace, and will do some fresh testing tomorrow to try and figure out how much (if any!) of a problem still remains. This could potentially be a nice simple outcome, fixed with a little documentation warning against namespaces 🤞🏻

@mhenrixon (Owner)

@francesmcmullin the Redis namespace shouldn't be a problem, as I have others using it. This is something else entirely, but I am working on finding out exactly what.

@francesmcmullin (Author) commented Nov 29, 2021

Thanks for that tip! Just a small update on my end, likely not news for you but just documenting my own understanding:

  • You were right of course, removing the namespace did not resolve the problem I described initially, although it did simplify some inspection steps.
  • I tried to narrow the problem down by adding a spec with similar behaviour and found:
    • A spec that simply creates an :until_expired lock and then sleeps for the TTL (I used 3 seconds, for convenience) shows that the lock remains in the digests zset.
    • However, if you include SidekiqUniqueJobs::Orphans::Reaper.call, then the digest is removed 🎉
  • I guessed that, for some reason, the reaper may not be running on my instance.
    • I triple-checked my config; it matches the docs, and the output I included in the issue description is accurate.
    • Watching Sidekiq.redis { |r| r.memory :usage, 'uniquejobs:digests' } and Sidekiq.redis { |r| r.get 'uniquejobs:reaper' }, I can even see the reaper run every 10 minutes, and sure enough, digests are cleaned up!
    • The above holds under "normal" conditions: low traffic, and most jobs using other locks (not :until_expired). Our usage includes bursts/surges when large numbers of jobs (for us, hundreds of thousands) using :until_expired are processed. I am expecting one of these today and will be able to observe how the reaper responds. My current theory is that it gets overwhelmed by the volume in some way, and that configuring the reaper resurrector, switching to the lua reaper, or changing the interval may help.
    • A little arithmetic suggests we definitely need to reduce the reaper interval and/or increase the reaper count: with the default config (1,000 digests per run, every 10 minutes), cleaning up 100,000 digests takes 100 runs, i.e. about 16.6 hours. That is significant, but it should mean memory usage drops slowly rather than stalls. I suspect that under certain circumstances, instead of reaping slowly but steadily, the reaper is overwhelmed entirely and cannot recover. I'll be able to test this idea tomorrow; a tuning sketch follows after this list.
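
For reference, the kind of tuning I have in mind looks something like this (the numbers are illustrative placeholders, not recommendations):

SidekiqUniqueJobs.configure do |config|
  config.reaper_count    = 10_000 # digests per reaper run, default 1_000
  config.reaper_interval = 60     # seconds between runs, default 600
end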

@francesmcmullin (Author)

Good news! 🎉 I am quite confident I have discovered our problem. I believe that, once you have more orphaned digests than your reaper_count limit, the Ruby reaper's #orphans method breaks and stops returning any orphans. The method looks like this:

def orphans
  conn.zrevrange(digests.key, 0, -1).each_with_object([]) do |digest, memo|
    next if belongs_to_job?(digest)

    memo << digest
    break if memo.size >= reaper_count
  end
end

Confusingly (in my opinion), Ruby's each_with_object discards the memo if you use break to exit the loop: a bare break makes the whole each_with_object call return nil. Fortunately, it is possible to pass a return value to break, so we can fix it like this:

   break memo if memo.size >= reaper_count
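
A quick irb check of that behaviour in plain Ruby (nothing gem-specific):

(1..10).each_with_object([]) { |i, memo| memo << i; break if memo.size >= 3 }
# => nil (the memo is discarded)

(1..10).each_with_object([]) { |i, memo| memo << i; break memo if memo.size >= 3 }
# => [1, 2, 3]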

If it's helpful, I can try to write up a PR with a spec for this scenario. Of course, I may have misunderstood something about the internals or intentions with the reaper, in which case I'll happily follow your direction instead.

@mhenrixon (Owner)

You are right, of course; I'll see if I can write a failing test for this scenario!

mhenrixon added a commit that referenced this issue Nov 30, 2021
* Fix edge case in ruby reaper

Closes #653

The problem was that the reaper returns nil instead of the memo. Many thanks to @francesmcmullin for the wonderfully detailed report and investigation. Now we have a regression test for this behavior.

* Mandatory rubocop commit

@mhenrixon (Owner)

Released as v7.1.11. @francesmcmullin, thank you so much for the help!
