-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Zen2] Write manifest file (new implementation) #35199
Conversation
Pinging @elastic/es-distributed |
CI failure is sporadic and addressed by #35198. @elasticmachine retest this please. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, much easier to review, partly thanks to our conversation last week and partly because there are fewer changes. I have left a number of comments, mainly small points about naming and other nits, but also asking for some stronger assertions or other tweaks to the tests. The overall shape of this change is good.
} | ||
} | ||
|
||
private static final String META_STATE_FILE_PREFIX = "meta-"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we overuse the prefix meta-
in the names of things. The class name MetaState
makes me a bit uneasy: there's already lots of Meta*
classes that I find hard to keep track of. We've been calling this a manifest, and I think we should use this terminology in the code to better describe what this class is actually doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
|
||
private static final class IndexEntry implements ToXContentFragment { | ||
private static final ParseField INDEX_GENERATION_PARSE_FIELD = new ParseField("generation"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: could you reverse the order of these two fields so they're in the same order as they appear in the parser?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*/ | ||
public class MetaState implements ToXContentFragment { | ||
private final long globalStateGeneration; | ||
private final Map<Index, Long> indices; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps globalMetadataGeneration
and indexMetadataGeneration
? Maybe the Metadata
in the middle is unncessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
globalGeneration and indexGeneration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
|
||
public static MetaState empty() { | ||
return new MetaState(-1, Collections.emptyMap()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think -1
should be a named constant (e.g. MANIFEST_MISSING
) and we should have assertions that we never write out a manifest file with such a generation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer needed after MetaStateService refactoring
} | ||
for (IndexMetaData indexMetaData : upgradedMetaData) { | ||
if (metaData.hasIndexMetaData(indexMetaData) == false) { | ||
metaStateService.writeIndex("upgrade", indexMetaData); | ||
} else { | ||
metaStateService.writeIndex("startup", indexMetaData); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we only need to write out an unchanged index metadata file if the manifest file was missing. If it's present in a manifest file then we know it was fsynced, and I think rewriting every index's metadata on startup will substantially slow down the startup of a node. (The same is true of the global metadata, but the effects of that on startup time will be less pronounced.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added isBwcMode to MetaStateService to check this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertThat(loadedMetaData.index("index"), equalTo(index_v2)); | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: extra whitespace here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import org.elasticsearch.test.ESTestCase; | ||
|
||
import static org.hamcrest.Matchers.equalTo; | ||
import static org.hamcrest.Matchers.is; | ||
import static org.hamcrest.Matchers.nullValue; | ||
|
||
public class MetaStateServiceTests extends ESTestCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these tests should assert that the MetaStateService
is in its reset state after writing the manifest file. Maybe they should also sometimes construct a new MetaStateService
when reading things back in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*/ | ||
public static class FullRestartCallback extends RestartCallback { | ||
public void onAllNodesStopped(List<String> nodeNames) throws Exception { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: blank line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertThat(loadedMetaData.index("index"), equalTo(index_v1)); | ||
|
||
MetaState.FORMAT.cleanupOldFiles(Long.MAX_VALUE, env.nodeDataPaths()); // this will erase manifest file | ||
loadedMetaData = metaStateService.loadMetaData(); //this must load new metadata, because manifest file is gone |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be a different MetaStateService
instance again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MetaState.FORMAT.cleanupOldFiles(Long.MAX_VALUE, env.nodeDataPaths()); // this will erase manifest file | ||
loadedMetaData = metaStateService.loadMetaData(); //this must load new metadata, because manifest file is gone | ||
assertTrue(MetaData.isGlobalStateEquals(loadedMetaData, metaData_v2)); | ||
assertThat(loadedMetaData.index("index"), equalTo(index_v2)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also verify that we can correctly go from not having a manifest file to having one, because that's the sequence that we want to support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DaveCTurner thank you for the review. I've resolved your comments and have performed the major refactoring of MetaStateService - now it only loads Manifest file and MetaData on the startup and after that keeps track of them in memory. The previous implementation was incorrect because loadMetadata could be called concurrently with applying new cluster event. Could you do you please make a second pass? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prior to this PR, MetaStateService was stateless and only GatewayMetaState implemented the logic for incrementally writing out the state. Now this seems to be somewhat shared with the two classes being stateful. I don't see the benefit of having this extra level of abstraction that's now offered by MetaStateService. Can we fold those two classes together into one?
On a related note, I think we can simplify the implementation by only writing the cluster state atomically on master-eligible nodes. On data-only nodes (which have the extra logic for closed indices as well as only writing out the index metadata for the indices that they have) I think it's ok if we don't write out the cluster state atomically, as it's only used for dangling indices functionality, which already has weak consistency guarantees.
With the two points above, I think we should look at having two separate methods for master-eligible and data-only nodes, where both can hopefully be simpler. Let's discuss this tomorrow.
return (Long) generationAndListOfIndexEntries[0]; | ||
} | ||
|
||
private static Map<Index, Long> indicies(Object[] generationAndListOfIndexEntries) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/indicies/indices
} | ||
|
||
private static final ConstructingObjectParser<Manifest, Void> PARSER = new ConstructingObjectParser<>( | ||
"state", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/state/manifest?
if (MetaData.isGlobalStateEquals(metaData, upgradedMetaData) == false) { | ||
metaStateService.writeGlobalState("upgrade", upgradedMetaData); | ||
} else { | ||
if (metaStateService.isBwcMode()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to expose bwc mode? Could we instead just do the keepOrWrite
internally based on BWC mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MetaStateService
does not perform metadata upgrade on startup. Actually, it never initiates writes itself. It's the responsibility to GatewayMetaState
to understand what needs to be written. Here we have pretty sophisticated requirement that we want to write non-upgraded metadata only if we're working in BWC mode. In case we're working in non-BWC mode, we want to re-write manifest only. That's why this flag is exposed.
} catch (Exception e) { | ||
logger.error("failed to read local state, exiting...", e); | ||
logger.error("failed to bootstrap, exiting...", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why call this bootstrap? Maybe "failed to process on-disk cluster state at startup"?
public MetaData getMetaData() { | ||
MetaData metaData = metaStateService.getMetaData(); | ||
//although metadata returned by MetaStateService is nullable, our callers expect non-nullable metadata | ||
return metaData == null ? MetaData.builder().build() : metaData; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MetaData.EMPTY_META_DATA
throw e; | ||
} | ||
} | ||
} | ||
|
||
public MetaData loadMetaState() throws IOException { | ||
return metaStateService.loadFullState(); | ||
public MetaData getMetaData() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe call this getMetaDataOrDefault()
metaStateService.writeManifest("changed"); | ||
previousMetaData = newMetaData; | ||
} catch (WriteStateException e) { | ||
logger.error("Exception occurred", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you provide a better exception message here? e.g. "failed to persist cluster state " + event.state().version()
@ywelsch I've resolved issues that you pointed out to inline in aae96cc. |
Elasticsearch node is responsible for storing cluster metadata.
There are 2 types of metadata: global metadata and index metadata.
GatewayMetaState implements ClusterStateApplier and receives all
ClusterStateChanged events and is responsible for storing modified
metadata to disk.
When new ClusterStateChanged event is received, GatewayMetaState
checks if global metadata has changed and if it's the case writes new
global metadata to disk. After that GatewayMetaState checks if index
metadata has changed or there are new indices assigned to this node and
if it's the case writes new index metadata to disk. Atomicity of global
metadata and index metadata writes is ensured by MetaDataStateFormat
class.
Unfortunately, there is no atomicity when more when one metadata changes
(global and index, or metadata for two indices). And atomicity is
important for Zen2 correctness.
This commit adds atomicity by adding a notion of manifest file,
represented by MetaState class. MetaState contains pointers to
current metadata.
More precisely, it stores global state generation as long and map from
Index to index metadata generation as long. Atomicity of writes for
manifest file is ensured by MetaStateFormat class.
The algorithm of writing changes to the disk would be the following:
it's generation.
it's generation. For each not-changed index use generation from
previous manifest file. If index is removed or this node is no longer
responsible for this index - forget about the index.
write it to disk.
manifest.
Additonally new implementation relies on enhanced MetaDataStateFormat
failure semantics, applyClusterState throws IOException, whose
descendant WriteStateException could be (and should be in Zen2)
explicitly handled.
Obsoletes #35049. This PR
tries to minimize number of changes to GatewayMetaState to make it
easier to review.