-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Zen2] Write manifest file #35049
[Zen2] Write manifest file #35049
Conversation
Pinging @elastic/es-distributed |
# Conflicts: # server/src/main/java/org/elasticsearch/gateway/GatewayMetaState.java
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done a first pass and left a handful of thoughts.
server/src/main/java/org/elasticsearch/gateway/MetaDataStateFormat.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/gateway/MetaDataStateFormat.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/gateway/MetaStateService.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/gateway/WriteStateException.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/gateway/MetaStateService.java
Outdated
Show resolved
Hide resolved
@DaveCTurner I've fixed issues in 421caa1 + added some minor fixes not mentioned in your comments. Ready for the second pass. |
@ywelsch I've reopened the PR and made necessary fixes, except not writing non-upgraded index metadata on startup in non bwc mode, which I'll implement tomorrow. Could you please review it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done an initial pass. Have not looked in details at the tests yet.
server/src/main/java/org/elasticsearch/cluster/metadata/Manifest.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/gateway/GatewayMetaState.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/gateway/GatewayMetaState.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/gateway/GatewayMetaState.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/action/admin/indices/create/CreateIndexIT.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/gateway/GatewayMetaStateTests.java
Outdated
Show resolved
Hide resolved
@ywelsch I was working on the tests/bugfixes during the night. And now it's ready for the 3rd pass. |
@@ -224,14 +241,17 @@ public final void write(final T state, final Path... locations) throws WriteStat | |||
copyStateToExtraLocations(directories, tmpFileName); | |||
performRenames(tmpFileName, fileName, directories); | |||
performStateDirectoriesFsync(directories); | |||
} catch (WriteStateException e) { | |||
cleanupOldFiles(oldGenerationId, locations); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is dangerous and can lead to data loss. Assume that you've successfully written a cluster state that contains an index "test" with state file of generation 1. Then you try to write an updated cluster state for index with state file generation 2. Writing the state file for the index is successful, but there's a failure later to write the state file for another index. Now the node crashes or the clean-up logic fails. When you then handle the next cluster state, you will try to write generation 3, which, if it fails, will clean up everything except generation 2.
I think that within MetaDataStateFormat, you can only clean up files within the write method that you know that you have written. In particular, you can't assume (given the manifest-based approach) that the highest generation you see here is the file to keep around (as the manifest might not be pointing to the one with the highest generation).
This clean-up logic will have to be handled at a higher level (GatewayMetaState)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really good catch! Thank you! The reason why I've implemented it this way, because I DO know generations of global metadata and index metadata from the manifest file, but I DON'T know the previous generation of the manifest file to perform cleanup of newly created manifest state file if the manifest write fails. I really would like to avoid returning manifest generation from loadFullState, because it already returns Tuple. Adding generation field to Manifest itself also feels wrong. So instead I've done the work in 2 commits:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left a few more comments
server/src/main/java/org/elasticsearch/gateway/GatewayMetaState.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/gateway/GatewayMetaState.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/gateway/GatewayMetaStateTests.java
Show resolved
Hide resolved
ca44b4d
to
0e63bdc
Compare
…s are stopped (#35494) Refactors and simplifies the logic around stopping nodes, making sure that for a full cluster restart onNodeStopped is only called after the nodes are actually all stopped (and in particular not while starting up some nodes again). This change also ensures that a closed node client is not being used anymore (which required a small change to a test). Relates to #35049
…s are stopped (#35494) Refactors and simplifies the logic around stopping nodes, making sure that for a full cluster restart onNodeStopped is only called after the nodes are actually all stopped (and in particular not while starting up some nodes again). This change also ensures that a closed node client is not being used anymore (which required a small change to a test). Relates to #35049
…ating nodes" This reverts commit f21ddf9
…s are stopped (#35494) Refactors and simplifies the logic around stopping nodes, making sure that for a full cluster restart onNodeStopped is only called after the nodes are actually all stopped (and in particular not while starting up some nodes again). This change also ensures that a closed node client is not being used anymore (which required a small change to a test). Relates to #35049
@ywelsch I've made required fixes and added comments to 2 remaining issues. It's now ready for the next pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made one more request for change and some nits.
@@ -224,14 +244,23 @@ public final void write(final T state, final Path... locations) throws WriteStat | |||
copyStateToExtraLocations(directories, tmpFileName); | |||
performRenames(tmpFileName, fileName, directories); | |||
performStateDirectoriesFsync(directories); | |||
} catch (WriteStateException e) { | |||
if (cleanup) { | |||
cleanupOldFiles(oldGenerationId, locations); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is not really needed for this PR (where we only call write
, but not writeAndCleanup
from GatewayMetaState, with one exception, see below). I think we should leave this extra clean-up out of this PR, because it might introduce regressions to other parts of the system, which are also using MetaDataStateFormat.
There is one exception, writeManifestAndCleanup
. We can change that one as well to only writeManifest
and do the clean up in the same way in GatewayMetaState as for the other files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my comment above why I've done it in this way. Shortly, the reason why I'm using writeManifestAndCleanup, because I don't know manifest file generation. If manifest write succeeds I get current generation returned by the method call, if it fails - I don't have it.
In general, I don't think there is a problem with other callers calling writeAndCleanup, because they don't use the concept of manifest and they want to keep the previously written latest generation.
If you're really afraid of introducing a regression, I can suggest adding "oldGenerationId" to WriteStateException, that when writing manifest failed we can learn which file to delete. IMHO, this is not needed and cleaning leaving only previous generation of state files for all uses of writeAndCleanup is the right thing to do.
server/src/main/java/org/elasticsearch/gateway/MetaDataStateFormat.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/gateway/MetaStateService.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/gateway/GatewayMetaState.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/gateway/GatewayMetaState.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/gateway/GatewayMetaStateTests.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
run gradle build tests please |
@ywelsch thanks for your reviewing efforts! |
Elasticsearch node is responsible for storing cluster metadata.
There are 2 types of metadata: global metadata and index metadata.
GatewayMetaState
implementsClusterStateApplier
and receives allClusterStateChanged
events and is responsible for storing modifiedmetadata to disk.
When new
ClusterStateChanged
event is received,GatewayMetaState
checks if global metadata has changed and if it's the case writes new
global metadata to disk. After that
GatewayMetaState
checks if indexmetadata has changed or there are new indices assigned to this node and
if it's the case writes new index metadata to disk. Atomicity of global
metadata and index metadata writes is ensured by
MetaDataStateFormat
class.
Unfortunately, there is no atomicity when more than one metadata changes
(global and index, or metadata for two indices). And atomicity is
important for Zen2 correctness.
This commit adds atomicity by adding a notion of manifest file,
represented by
MetaState
class.MetaState
contains pointers tocurrent metadata.
More precisely, it stores global state generation as long and map from
Index
to index metadata generation as long. Atomicity of writes formanifest file is ensured by
MetaStateFormat
class.The algorithm of writing changes to the disk would be the following:
it's generation.
it's generation. For each not-changed index use generation from
previous manifest file. If index is removed or this node is no longer
responsible for this index - forget about the index.
MetaState
object using previously remembered generations andwrite it to disk.
manifest.
Additonally new implementation relies on enhanced
MetaDataStateFormat
failure semantics,
applyClusterState
throws IOException, whosedescendant
WriteStateException
could be (and should be in Zen2)explicitly handled.