Hive corrupt box analysis and root cause #378
Analysis so far
Hi Murali, you mention "Issue could happen due to multiple threads writing to box or during compaction". Two questions:
Thanks @murali-shris. I'm quite interested in this part of the service, and I would like to tackle adding the tests; it will be a good opportunity for me to learn how all of the persistence works.
@murali-shris Love the lists! One request: can you please use checkboxes so we can report progress against our tasks? TY so much!
@cpswan I have added additional logging to detect which frame is corrupted.
@murali-shris The atsigncompany/secondary:hivelog image was created from the at_server branch hive_corruption_logging and deployed to daily3bossanova and philosophical75.
Re functional tests to cover multi-threaded writes: I've implemented some simple parallel load tests (four separate Unix processes, each running four Isolates that generate a mix of Update requests on several hundred keys, with string values ranging in length from 10 bytes to 110,000 bytes). I have observed that the requests are always handled entirely in sequence at the server, i.e. one request is fully handled before the next one starts. Even after many hours of heavy concurrent client load, there is no HiveBox corruption evident upon restart.
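One way to make the "handled entirely in sequence" observation checkable, rather than eyeballed, is to record a start and end timestamp for every request the server handles and assert that no two intervals overlap. The sketch below is in Python purely for illustration (the server itself is Dart); RequestSpan and find_overlaps are hypothetical names, and it assumes such per-request timestamps can be pulled from the server logs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RequestSpan:
    """One handled request; hypothetical fields assumed to come from server logs."""
    request_id: str
    start: float  # when the server began handling the request
    end: float    # when the server finished handling it


def find_overlaps(spans: List[RequestSpan]) -> List[Tuple[str, str]]:
    """Report (earlier, later) pairs where a request started before an
    earlier-started request had finished.

    The result is empty exactly when no two handling intervals overlap,
    i.e. the server processed the requests strictly one after another.
    Intervals that merely touch (start == previous end) are not overlaps.
    """
    overlaps: List[Tuple[str, str]] = []
    latest: Optional[RequestSpan] = None  # span with the greatest end so far
    for span in sorted(spans, key=lambda s: s.start):
        if latest is not None and span.start < latest.end:
            overlaps.append((latest.request_id, span.request_id))
        if latest is None or span.end > latest.end:
            latest = span
    return overlaps


if __name__ == "__main__":
    sequential = [RequestSpan("a", 0.0, 1.0), RequestSpan("b", 1.0, 2.5)]
    interleaved = [RequestSpan("a", 0.0, 2.0), RequestSpan("b", 1.0, 1.5)]
    print(find_overlaps(sequential))   # []
    print(find_overlaps(interleaved))  # [('a', 'b')]
```

Tracking only the span with the latest end keeps the check to a single pass after sorting, which matters when a multi-hour run produces millions of log entries.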
Next step on the concurrency hypothesis: extend the test by adding some key expirations, so that expired-key deletions happen while the server is under heavy client load. (Expired-key deletions run within the same Isolate when the job triggers; it is triggered by the Cron package, which internally uses Dart Timers.)
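Because the deletions run in the same Isolate, they cannot preempt a write the way a second thread could; on a single-threaded event loop, a timer callback can only run when the current code returns control, i.e. at an await. The sketch below uses Python's asyncio as a rough stand-in for a Dart Isolate's event loop to show that difference; update, expire, the in-memory store and the event trace are all hypothetical, and loop.call_soon stands in for the Cron/Timer callback becoming due.

```python
import asyncio
from typing import Dict, List

# Hypothetical in-memory stand-ins for the key store and an event trace.
store: Dict[str, str] = {}
events: List[str] = []


async def update(key: str, value: str, yield_midway: bool) -> None:
    """Hypothetical write handler that optionally awaits part-way through."""
    events.append(f"update {key} start")
    store[key] = value
    if yield_midway:
        # Any await (e.g. on file I/O) hands control back to the event loop,
        # so other queued callbacks, such as a due timer, can run here.
        await asyncio.sleep(0)
    events.append(f"update {key} end")


def expire(key: str) -> None:
    """Hypothetical expired-key deletion, run as a plain loop callback."""
    events.append(f"expire {key}")
    store.pop(key, None)


async def run_scenario(yield_midway: bool) -> List[str]:
    store.clear()
    events.clear()
    loop = asyncio.get_running_loop()
    task = asyncio.create_task(update("k", "v", yield_midway))
    # Stand-in for the Cron/Timer firing: queued on the same loop,
    # behind the first step of the update task.
    loop.call_soon(expire, "k")
    await task
    return list(events)


if __name__ == "__main__":
    print(asyncio.run(run_scenario(yield_midway=False)))
    print(asyncio.run(run_scenario(yield_midway=True)))
```

Without an await inside the write, the deletion runs strictly after it; with one, the deletion lands in the middle. Under this model, the interesting interleavings for the expiry test are the await points inside the write path.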
@murali-shris I was thinking of first trying the above as a unit test, where I can exert direct control over the server and more easily force a key deletion to happen concurrently. What do you think?
Having spoken with Jagan: in addition to what's mentioned in the previous comment, I will also look to test writes concurrent with Hive's internal compaction, forcing a server exit during sync, etc.
How Hive's internal variables move around when frames are read
Propose code changes to the Hive maintainer to handle crash recovery.
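To reason about how a reader's offset advances as frames are read, and what crash recovery has to do with a frame left half-written by a server exit, here is a minimal Python sketch of an append-only, checksummed frame log. The layout (4-byte length prefix, payload, CRC32), the names encode_frame, read_frames and recover, and the "truncate after the last valid frame" rule are all assumptions for illustration; this is not Hive's actual on-disk format or API.

```python
import struct
import zlib
from typing import List, Tuple

# Simplified, hypothetical frame layout (NOT Hive's real on-disk format):
#   [4-byte little-endian payload length][payload][4-byte CRC32 of payload]
HEADER = struct.Struct("<I")
CRC = struct.Struct("<I")


def encode_frame(payload: bytes) -> bytes:
    return HEADER.pack(len(payload)) + payload + CRC.pack(zlib.crc32(payload))


def read_frames(data: bytes) -> Tuple[List[bytes], int]:
    """Read frames from the start of `data`.

    Returns the payloads of all valid frames and the offset just past the
    last valid frame (the recovery offset). Reading stops at the first
    frame that is truncated or fails its checksum.
    """
    frames: List[bytes] = []
    offset = 0
    while offset < len(data):
        if offset + HEADER.size > len(data):
            break  # truncated length prefix
        (length,) = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        end = start + length
        if end + CRC.size > len(data):
            break  # truncated payload or checksum, e.g. exit mid-write
        payload = data[start:end]
        (stored_crc,) = CRC.unpack_from(data, end)
        if zlib.crc32(payload) != stored_crc:
            break  # corrupted frame: this is where logging the offset helps
        frames.append(payload)
        offset = end + CRC.size
    return frames, offset


def recover(data: bytes) -> bytes:
    """Crash recovery under this model: drop everything after the last valid frame."""
    _, recovery_offset = read_frames(data)
    return data[:recovery_offset]


if __name__ == "__main__":
    log = encode_frame(b"key1=a") + encode_frame(b"key2=b")
    torn = log + encode_frame(b"key3=c")[:7]  # simulate an exit mid-write
    frames, offset = read_frames(torn)
    print(frames)               # [b'key1=a', b'key2=b']
    print(offset == len(log))   # True
```

Truncating at the recovery offset discards only the torn tail, at the cost of silently losing that last write; whether that trade-off is acceptable is the kind of question a crash-recovery proposal would need to settle.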