
Troubleshooting

Mark Papadakis edited this page Mar 20, 2017 · 1 revision

Data Corruption due to disk space exhaustion

It is possible, though unlikely, that one or more of the segments that make up a partition become corrupt for whatever reason. You can always use tank verify to check that your partition data files are intact. See Managing Segment Files for how that works.

In practice, apart from severe bit-rot or accidental misuse of tools or utilities that corrupts immutable segments (cases Tank already accounts for when executing most operations), what can happen is that you run out of storage space while events are being published to and processed by Tank.

In that case, because Tank calls fdatasync() periodically (once a second, by default), an event can be published by a client, received by Tank, and acknowledged as persisted before Tank learns that the data could not be stored. The failure only surfaces afterwards, when fdatasync() fails and subsequent write()/pwrite() calls also fail because there is no free storage space left. Data written within that window (at most 1 second, by default) is likely lost, corrupt, or partially written, and cannot be used or trusted. Tank will fail gracefully when it processes new publish requests, and when you restart it, the active segment will likely be corrupt. If that is the case, Tank prints a message saying so and suggests setting the environment variable TANK_FORCE_SALVAGE_CURSEGMENT and restarting Tank.
With that variable set, Tank salvages all messages/data from any active segments affected by the out-of-disk-space condition. It will likely end up trimming just a few bytes or KBs from the segment, which is the data that was corrupted or lost during that 1-second window.
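The trimming described above can be pictured with a small sketch. This is not Tank's salvage code and not Tank's segment format: the record layout (a 4-byte length and a 4-byte CRC32 followed by the payload) and the function names `encode_record`, `valid_prefix_length`, and `salvage` are assumptions made up for this illustration. It only shows the general idea of keeping the longest prefix of intact records and cutting off a torn tail.

```python
import struct
import zlib

# Hypothetical record header (NOT Tank's on-disk format):
# little-endian payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")


def encode_record(payload: bytes) -> bytes:
    """Frame a payload as [length][crc32][payload] (illustrative format only)."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def valid_prefix_length(data: bytes) -> int:
    """Return how many leading bytes form complete, checksum-valid records."""
    offset = 0
    while offset + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        end = start + length
        if length == 0:
            break  # treat empty records as invalid so a zero-filled tail is dropped
        if end > len(data):
            break  # record was only partially written
        if zlib.crc32(data[start:end]) != crc:
            break  # payload bytes do not match their checksum
        offset = end
    return offset


def salvage(path: str) -> int:
    """Truncate the file to its valid prefix and return the number of bytes trimmed."""
    with open(path, "r+b") as f:
        data = f.read()
        keep = valid_prefix_length(data)
        f.truncate(keep)
    return len(data) - keep
```

In this toy model, a segment whose last record was cut short by a failed write loses only that tail, which mirrors the "few bytes or KBs" outcome described above. For a real Tank installation, the supported path remains setting TANK_FORCE_SALVAGE_CURSEGMENT and restarting Tank.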

Further enhancements and automation specific to this process will be implemented later, but currently this works well and safeguards you against those edge cases at the expense of losing a few (if any) messages. You should still monitor your disk space availability.
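Since running out of disk space is the trigger for all of the above, a simple free-space check can help catch the problem early. The following is a minimal sketch, not part of Tank: the function name `free_space_ok`, the example path, and the 1 GiB threshold are assumptions chosen for illustration, and you would point it at whatever filesystem holds your Tank data.

```python
import shutil


def free_space_ok(path: str, min_free_bytes: int) -> bool:
    """Return True if the filesystem containing `path` has at least `min_free_bytes` free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free >= min_free_bytes


if __name__ == "__main__":
    data_dir = "."          # hypothetical: replace with your Tank data directory
    threshold = 1 << 30     # hypothetical threshold: 1 GiB
    if not free_space_ok(data_dir, threshold):
        print(f"WARNING: less than {threshold} bytes free on {data_dir}")
```

Run periodically (for example from cron or your monitoring agent), a check like this gives you time to free space or add storage before writes start failing.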
