hive corrupt box analysis and root cause #378

Closed
2 of 3 tasks
murali-shris opened this issue Nov 3, 2021 · 15 comments
Labels: 8 SP (8 Story Points - 5 Days Large), bug (Something isn't working), Data Integrity (Issue related to data integrity), P1 (Priority 1), PR25 (Nov 2021 Sprint Planning), PR26 (Nov | Dec 2021 Sprint Planning)

Comments

@murali-shris (Member) commented Nov 3, 2021

  • Analyse the hive corrupt box issue occurring in production
  • Figure out the root cause (any data pattern causing the issue, or code related to hive openBox/compaction)
  • Add functional tests to cover multi-threaded writes
@murali-shris murali-shris self-assigned this Nov 3, 2021
@murali-shris murali-shris added PR24 Nov 2021 Sprint Planning P1 Priority 1 8 SP 8 Story Points - 5 Days Large labels Nov 3, 2021
@murali-shris (Member, Author)

Analysis so far:

  • Went through the hive implementation to check how recovery happens. Hive marks a byte offset up to which data can be read; the rest of the corrupt data is truncated (see the sketch after this list).
  • Went through the commits since the hive lazy box was pushed to prod (approx. 30 days ago) to check whether multiple writes to the hive box (a possible cause of a corrupted box) can happen at the same time. No anomalies found yet on this aspect.
  • Tried switching the box type from in-memory to lazy and then back to in-memory. Unable to replicate the issue.
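To make the recovery behaviour concrete, here is a minimal sketch (in Dart, not Hive's actual source) of the idea: scan the box file frame by frame, remember the offset of the last frame that parsed cleanly, and truncate everything after it. The frame parser below is hypothetical.

```dart
import 'dart:io';
import 'dart:typed_data';

/// Hypothetical frame parser: validate the length prefix and checksum at
/// [offset] and return the total frame size, or -1 if the frame is corrupt
/// or truncated.
int tryParseFrame(Uint8List bytes, int offset) {
  // ... validate length prefix and CRC32 here ...
  return -1;
}

/// Scan the box file and drop any corrupt tail, keeping every byte up to the
/// last frame that could be read successfully.
Future<void> recoverBoxFile(String path) async {
  final file = File(path);
  final bytes = await file.readAsBytes();

  var lastGoodOffset = 0;
  var offset = 0;
  while (offset < bytes.length) {
    final frameLength = tryParseFrame(bytes, offset);
    if (frameLength <= 0) break;     // corrupt or incomplete frame: stop here
    offset += frameLength;
    lastGoodOffset = offset;         // everything up to here is readable
  }

  if (lastGoodOffset < bytes.length) {
    final raf = await file.open(mode: FileMode.append);
    await raf.truncate(lastGoodOffset); // truncate the corrupt tail
    await raf.close();
  }
}
```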

@murali-shris (Member, Author)

  • Analysed the internals of hive for reading, writing and compacting data.
  • Requested a hive storage backup to check which exact key is causing the corruption.

@murali-shris murali-shris added 13 SP 13 Story Points - 8 to 10 Days XL and removed 8 SP 8 Story Points - 5 Days Large labels Nov 15, 2021
@murali-shris (Member, Author)

  • Multiple open issues in the hive repo report the same exception. Many users have faced this issue, but there are no definite steps to reproduce it. The issue could happen due to multiple threads writing to the box, or during compaction.

@murali-shris murali-shris added 8 SP 8 Story Points - 5 Days Large and removed 13 SP 13 Story Points - 8 to 10 Days XL labels Nov 15, 2021
@gkc gkc reopened this Nov 15, 2021
@gkc (Contributor) commented Nov 15, 2021

Hi Murali - you mention "Issue could happen due to multiple threads writing to box or during compaction" - two questions:
(1) As I understand it, there is no current possibility of multiple concurrent writes to the box during normal operation - is my understanding correct? Is this verified in functional tests?
(2) Are we guarding against concurrent writes when a normal update operation happens to coincide with a hive compaction? Is this verified in functional tests?

@murali-shris (Member, Author)

@gkc #1 and #2 are not covered in functional tests. Will take this up.

@gkc (Contributor) commented Nov 15, 2021

Thanks @murali-shris ... I'm quite interested in this part of the service. I would like to tackle adding the tests; it will be a good opportunity for me to learn about how all of the persistence works.

@gkc gkc assigned gkc and unassigned murali-shris Nov 15, 2021
@gkc gkc added PR25 Nov 2021 Sprint Planning bug Something isn't working Data Integrity Issue related to data integrity and removed PR24 Nov 2021 Sprint Planning labels Nov 15, 2021
@ksanty (Member) commented Nov 15, 2021

@murali-shris love the lists! One request: can you please use check boxes, so we can report progress against our tasks? TY so much!

@murali-shris (Member, Author)

@cpswan I have added additional logging to detect which frame is corrupted.
Server branch: "hive_corruption_logging"
Please deploy to the secondaries - daily3bossanova and philosophical75
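For illustration only (the real changes are on the hive_corruption_logging branch): one way to surface which box fails, and with what error, is to wrap box opening with a hypothetical helper like this:

```dart
import 'package:hive/hive.dart';

/// Hypothetical wrapper: open a box and, if Hive reports corruption, log the
/// box name and the underlying error/stack before rethrowing, so the failing
/// frame can be traced on the secondaries.
Future<Box<E>> openBoxWithLogging<E>(String name) async {
  try {
    return await Hive.openBox<E>(name);
  } on HiveError catch (e, st) {
    print('Corruption detected while opening box "$name": $e\n$st');
    rethrow;
  }
}
```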

@cpswan (Member) commented Nov 17, 2021

@murali-shris atsigncompany/secondary:hivelog image created from at_server branch hive_corruption_logging and deployed to daily3bossanova and philosophical75

@gkc (Contributor) commented Nov 24, 2021

Re functional tests to cover multi-threaded writes:

I've implemented some simple parallel load tests (four separate Unix processes each running four Isolates generating a mix of Update requests on several hundred keys with string values varying in length from 10 bytes to 110,000 bytes) and have observed that the requests are always handled entirely in sequence at the server - i.e. one request is fully handled before starting on the next request. Even after many hours of running heavy concurrent client load there is no HiveBox corruption evident upon restart.
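Roughly, the shape of each load-generating process looked like the sketch below (the key count, value sizes and the sendUpdate helper are illustrative assumptions, not the actual test code):

```dart
import 'dart:isolate';
import 'dart:math';

/// Hypothetical client call; the real tests issue update requests over the
/// server's client protocol.
Future<void> sendUpdate(String key, String value) async {}

/// One isolate's worth of load: a stream of updates against a few hundred
/// keys, with string values varying from 10 bytes to ~110,000 bytes.
Future<void> updateLoad(int seed) async {
  final rng = Random(seed);
  for (var i = 0; i < 10000; i++) {
    final key = 'load_key_${rng.nextInt(500)}';
    final value = 'x' * (10 + rng.nextInt(110000 - 10));
    await sendUpdate(key, value);
  }
}

Future<void> main() async {
  // Four isolates per process; the test ran four such processes in parallel.
  await Future.wait([
    for (var i = 0; i < 4; i++) Isolate.run(() => updateLoad(i)),
  ]);
}
```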

@gkc (Contributor) commented Nov 24, 2021

Next step on the concurrency hypothesis: extend the test by adding some key expirations, so that expired-key deletions happen while under heavy client load. (Expired-key deletions run within the same Isolate when triggered; the trigger comes from the Cron package, which internally uses Dart Timers.)
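As a sketch of that next step (assuming the cron package and a hypothetical deleteExpiredKeys job; not the actual server wiring):

```dart
import 'package:cron/cron.dart';

/// Hypothetical expired-key sweep: scan boxes and delete entries whose TTL
/// has elapsed.
Future<void> deleteExpiredKeys() async {
  // ... iterate keys and delete expired entries ...
}

/// Schedule the sweep every minute so deletions overlap with the concurrent
/// client load. Cron uses Dart Timers internally, so the job runs in the
/// same isolate that handles normal requests.
void scheduleExpiredKeyDeletion() {
  final cron = Cron();
  cron.schedule(Schedule.parse('*/1 * * * *'), () async {
    await deleteExpiredKeys();
  });
}
```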

@gkc (Contributor) commented Nov 25, 2021

@murali-shris I was thinking of maybe first trying the above as a unit test, where I can exert direct control on the server and more easily force a key deletion to happen concurrently - what do you think?

@gkc (Contributor) commented Nov 25, 2021

Having spoken with Jagan: in addition to what's mentioned in the previous comment, I will also look to test writes concurrent with hive's internal compaction, as well as forcing a server exit during sync, etc.
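For the compaction part, a minimal sketch of one way to force hive compaction to run while writes are in flight (the aggressive compactionStrategy and the write/delete loop are illustrative assumptions, not the actual test):

```dart
import 'package:hive/hive.dart';

/// Open a box with an aggressive compaction strategy so compaction triggers
/// frequently, then interleave puts and deletes so writes overlap with it.
Future<void> writeUnderCompaction() async {
  Hive.init('storage');
  final box = await Hive.openBox<String>(
    'compaction_test',
    // Compact as soon as a handful of entries have been deleted (illustrative).
    compactionStrategy: (entries, deleted) => deleted > 5,
  );

  for (var i = 0; i < 10000; i++) {
    await box.put('key_$i', 'value_$i' * 100);
    if (i.isEven && i > 0) {
      await box.delete('key_${i - 1}'); // deletions drive the compaction strategy
    }
  }
  await box.close();
}
```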

@yahu1031 yahu1031 added 8 SP 8 Story Points - 5 Days Large and removed 8 SP 8 Story Points - 5 Days Large labels Nov 29, 2021
@ksanty ksanty added 5 SP 5 Story Points - 3 Days Medium PR26 Nov | Dec 2021 Sprint Planning and removed 8 SP 8 Story Points - 5 Days Large labels Dec 3, 2021
@murali-shris (Member, Author)

How the internal variables of hive change as frames are read:
https://docs.google.com/spreadsheets/d/1uESbDec3v60B6ABLT63yksKA8SEhPzpdi5Hs3au3ayg/edit

@murali-shris (Member, Author)

Proposed changes to the hive maintainer for handling crash recovery in code:
isar/hive#263
The decision was made to explore hive alternatives for the long run and to stop work on this issue.

@murali-shris murali-shris added 8 SP 8 Story Points - 5 Days Large and removed 5 SP 5 Story Points - 3 Days Medium labels Dec 13, 2021