Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Zen2] Write manifest file (new implementation) #35199

Closed
wants to merge 25 commits into from

Conversation

andrershov
Copy link
Contributor

@andrershov andrershov commented Nov 2, 2018

Elasticsearch node is responsible for storing cluster metadata.
There are 2 types of metadata: global metadata and index metadata.
GatewayMetaState implements ClusterStateApplier and receives all
ClusterStateChanged events and is responsible for storing modified
metadata to disk.

When new ClusterStateChanged event is received, GatewayMetaState
checks if global metadata has changed and if it's the case writes new
global metadata to disk. After that GatewayMetaState checks if index
metadata has changed or there are new indices assigned to this node and
if it's the case writes new index metadata to disk. Atomicity of global
metadata and index metadata writes is ensured by MetaDataStateFormat
class.

Unfortunately, there is no atomicity when more when one metadata changes
(global and index, or metadata for two indices). And atomicity is
important for Zen2 correctness.
This commit adds atomicity by adding a notion of manifest file,
represented by MetaState class. MetaState contains pointers to
current metadata.
More precisely, it stores global state generation as long and map from
Index to index metadata generation as long. Atomicity of writes for
manifest file is ensured by MetaStateFormat class.

The algorithm of writing changes to the disk would be the following:

  1. Write global metadata state file to disk and remember
    it's generation.
  2. For each new/changed index write state file to disk and remember
    it's generation. For each not-changed index use generation from
    previous manifest file. If index is removed or this node is no longer
    responsible for this index - forget about the index.
  3. Create MetaState object using previously remembered generations and
    write it to disk.
  4. Remove old state files for global metadata, indices metadata and
    manifest.
    Additonally new implementation relies on enhanced MetaDataStateFormat
    failure semantics, applyClusterState throws IOException, whose
    descendant WriteStateException could be (and should be in Zen2)
    explicitly handled.

Obsoletes #35049. This PR
tries to minimize number of changes to GatewayMetaState to make it
easier to review.

@andrershov andrershov added >enhancement :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Nov 2, 2018
@andrershov andrershov self-assigned this Nov 2, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Copy link
Contributor

CI failure is sporadic and addressed by #35198. @elasticmachine retest this please.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, much easier to review, partly thanks to our conversation last week and partly because there are fewer changes. I have left a number of comments, mainly small points about naming and other nits, but also asking for some stronger assertions or other tweaks to the tests. The overall shape of this change is good.

}
}

private static final String META_STATE_FILE_PREFIX = "meta-";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we overuse the prefix meta- in the names of things. The class name MetaState makes me a bit uneasy: there's already lots of Meta* classes that I find hard to keep track of. We've been calling this a manifest, and I think we should use this terminology in the code to better describe what this class is actually doing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}

private static final class IndexEntry implements ToXContentFragment {
private static final ParseField INDEX_GENERATION_PARSE_FIELD = new ParseField("generation");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: could you reverse the order of these two fields so they're in the same order as they appear in the parser?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*/
public class MetaState implements ToXContentFragment {
private final long globalStateGeneration;
private final Map<Index, Long> indices;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps globalMetadataGeneration and indexMetadataGeneration? Maybe the Metadata in the middle is unncessary.

Copy link
Contributor Author

@andrershov andrershov Nov 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

globalGeneration and indexGeneration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}

public static MetaState empty() {
return new MetaState(-1, Collections.emptyMap());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think -1 should be a named constant (e.g. MANIFEST_MISSING) and we should have assertions that we never write out a manifest file with such a generation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer needed after MetaStateService refactoring

}
for (IndexMetaData indexMetaData : upgradedMetaData) {
if (metaData.hasIndexMetaData(indexMetaData) == false) {
metaStateService.writeIndex("upgrade", indexMetaData);
} else {
metaStateService.writeIndex("startup", indexMetaData);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we only need to write out an unchanged index metadata file if the manifest file was missing. If it's present in a manifest file then we know it was fsynced, and I think rewriting every index's metadata on startup will substantially slow down the startup of a node. (The same is true of the global metadata, but the effects of that on startup time will be less pronounced.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added isBwcMode to MetaStateService to check this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assertThat(loadedMetaData.index("index"), equalTo(index_v2));
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: extra whitespace here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import org.elasticsearch.test.ESTestCase;

import static org.hamcrest.Matchers.equalTo;
import static org.hamcrest.Matchers.is;
import static org.hamcrest.Matchers.nullValue;

public class MetaStateServiceTests extends ESTestCase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these tests should assert that the MetaStateService is in its reset state after writing the manifest file. Maybe they should also sometimes construct a new MetaStateService when reading things back in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*/
public static class FullRestartCallback extends RestartCallback {
public void onAllNodesStopped(List<String> nodeNames) throws Exception {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: blank line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assertThat(loadedMetaData.index("index"), equalTo(index_v1));

MetaState.FORMAT.cleanupOldFiles(Long.MAX_VALUE, env.nodeDataPaths()); // this will erase manifest file
loadedMetaData = metaStateService.loadMetaData(); //this must load new metadata, because manifest file is gone
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be a different MetaStateService instance again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MetaState.FORMAT.cleanupOldFiles(Long.MAX_VALUE, env.nodeDataPaths()); // this will erase manifest file
loadedMetaData = metaStateService.loadMetaData(); //this must load new metadata, because manifest file is gone
assertTrue(MetaData.isGlobalStateEquals(loadedMetaData, metaData_v2));
assertThat(loadedMetaData.index("index"), equalTo(index_v2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also verify that we can correctly go from not having a manifest file to having one, because that's the sequence that we want to support.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrershov
Copy link
Contributor Author

@DaveCTurner thank you for the review. I've resolved your comments and have performed the major refactoring of MetaStateService - now it only loads Manifest file and MetaData on the startup and after that keeps track of them in memory. The previous implementation was incorrect because loadMetadata could be called concurrently with applying new cluster event. Could you do you please make a second pass?

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prior to this PR, MetaStateService was stateless and only GatewayMetaState implemented the logic for incrementally writing out the state. Now this seems to be somewhat shared with the two classes being stateful. I don't see the benefit of having this extra level of abstraction that's now offered by MetaStateService. Can we fold those two classes together into one?

On a related note, I think we can simplify the implementation by only writing the cluster state atomically on master-eligible nodes. On data-only nodes (which have the extra logic for closed indices as well as only writing out the index metadata for the indices that they have) I think it's ok if we don't write out the cluster state atomically, as it's only used for dangling indices functionality, which already has weak consistency guarantees.

With the two points above, I think we should look at having two separate methods for master-eligible and data-only nodes, where both can hopefully be simpler. Let's discuss this tomorrow.

return (Long) generationAndListOfIndexEntries[0];
}

private static Map<Index, Long> indicies(Object[] generationAndListOfIndexEntries) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/indicies/indices

}

private static final ConstructingObjectParser<Manifest, Void> PARSER = new ConstructingObjectParser<>(
"state",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/state/manifest?

if (MetaData.isGlobalStateEquals(metaData, upgradedMetaData) == false) {
metaStateService.writeGlobalState("upgrade", upgradedMetaData);
} else {
if (metaStateService.isBwcMode()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to expose bwc mode? Could we instead just do the keepOrWrite internally based on BWC mode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MetaStateService does not perform metadata upgrade on startup. Actually, it never initiates writes itself. It's the responsibility to GatewayMetaState to understand what needs to be written. Here we have pretty sophisticated requirement that we want to write non-upgraded metadata only if we're working in BWC mode. In case we're working in non-BWC mode, we want to re-write manifest only. That's why this flag is exposed.

} catch (Exception e) {
logger.error("failed to read local state, exiting...", e);
logger.error("failed to bootstrap, exiting...", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why call this bootstrap? Maybe "failed to process on-disk cluster state at startup"?

public MetaData getMetaData() {
MetaData metaData = metaStateService.getMetaData();
//although metadata returned by MetaStateService is nullable, our callers expect non-nullable metadata
return metaData == null ? MetaData.builder().build() : metaData;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MetaData.EMPTY_META_DATA

throw e;
}
}
}

public MetaData loadMetaState() throws IOException {
return metaStateService.loadFullState();
public MetaData getMetaData() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe call this getMetaDataOrDefault()

metaStateService.writeManifest("changed");
previousMetaData = newMetaData;
} catch (WriteStateException e) {
logger.error("Exception occurred", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you provide a better exception message here? e.g. "failed to persist cluster state " + event.state().version()

@andrershov
Copy link
Contributor Author

@ywelsch I've resolved issues that you pointed out to inline in aae96cc.
Regarding GatewayMetaState and MetaStateService responsibilities. MetaStateService is responsible for writing/loading metadata to/from disk, so that callers do not worry about the need to keep manifest file up-to-date and the need to be able to support bwc mode, when no manifest file is available. MetaStateService never writes something w/o explicit request from the outside.
GatewayMetaState on the other hand is responsible for processing ClusterStateChange events and figuring out what needs to be re-written.
To cut a long story short, GatewayMetaState is what and MetaStateService is how.
Also, David has requested to make only small modifications to GatewayMetaState to make code easier to review.
One more thing is that our ultimate goal is to implement PersistedState interface that would be different from GatewayMetaState and this implementation will most likely re-use existing MetaStateService functionality.

@andrershov andrershov mentioned this pull request Nov 6, 2018
6 tasks
@andrershov andrershov closed this Nov 12, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants