chore(tsm1): skip WriteSnapshot during backup if snapshotter is busy #16627
Conversation
Force-pushed from 0e5b41a to 4f359b3
tsdb/engine/tsm1/engine.go
Outdated
}

if err != nil {
	e.logger.Info("WAL busy: Backup proceeding without WAL contents", zap.Error(err))
nit: change WAL to cache, since it's Cache.Snapshot() that returns the error.
tsdb/engine/tsm1/engine.go
Outdated
if err != nil {
	switch err {
	case ErrSnapshotInProgress:
		backoff := time.Duration(math.Pow(3.8, float64(i))) * time.Millisecond
Not sure what the best backoff would be. 3.8 seems low. That's 14.4ms on the 3rd iteration. Snapshots are generally < 1s but can be longer even in a healthy system sometimes. Maybe 32 instead of 3.8? That would top out at ~1s.
Force-pushed from 4f359b3 to ea7e060
tsdb/engine/tsm1/engine.go
Outdated
@@ -1893,8 +1893,22 @@ func (e *Engine) WriteSnapshot() (err error) {
// CreateSnapshot will create a temp directory that holds
// temporary hardlinks to the underylyng shard files.
func (e *Engine) CreateSnapshot() (string, error) {
	if err := e.WriteSnapshot(); err != nil {
		return "", err
	var err error
Question - is this only used for backups?
From what I can tell, this is the only place where we use it:
Line 674 in a56e022
func (e *Engine) CreateBackup(ctx context.Context) (int, []string, error) {
So yes.
Since this PR is about fixing a specific backup problem, and not CreateSnapshot per se, I'm wondering if (*Engine).CreateSnapshot should broadly be left alone, and the retry logic should be added to the actual backup code, which would be:
influxdb/tsdb/engine/tsm1/engine.go
Lines 914 to 923 in 741ac8e
func (e *Engine) Backup(w io.Writer, basePath string, since time.Time) error {
	path, err := e.CreateSnapshot()
	if err != nil {
		return err
	}
	// Remove the temporary snapshot dir
	defer os.RemoveAll(path)
	return intar.Stream(w, path, basePath, intar.SinceFilterTarFile(since))
}
I guess?
What do you think?
Force-pushed from ea7e060 to 7d7a25d
	default:
		return err
	}
}
One more tiny nit. Since, technically, after three attempts we will potentially be ignoring an error, we should still handle it. In this case we should log a warning to say that we couldn't snapshot the cache.
Force-pushed from 7d7a25d to bfc5de6
LGTM 👍
Force-pushed from bfc5de6 to 30621ca
When an InfluxDB database is very busy writing new points, the backup process can fail because it cannot write a new snapshot. The error is: `operation timed out with error: create snapshot: snapshot in progress`. This happens because InfluxDB takes a snapshot from the cache almost "continuously", driven by the high number of points ingested. This PR skips the snapshot if the `snapshotter` does not become available after three attempts when a backup is requested. The backup won't contain the data in the cache or the WAL. Signed-off-by: Gianluca Arbezzano <[email protected]>
When an InfluxDB database is very busy writing new points, the backup process can fail because it cannot write a new snapshot. The error is: operation timed out with error: create snapshot: snapshot in progress. This happens because InfluxDB takes a snapshot from the cache almost "continuously", driven by the high number of points ingested. The fix for this was #16627, but it was for OSS only and was not in the code path for backups in clusters. This fix adds a skipCacheOk flag to tsdb.Engine.CreateSnapshot(). A value of true allows the backup to proceed even if a cache snapshot cannot be taken. This flag is set to true in tsm1.Engine.Backup() (the OSS backup code path) and in tsdb.Shard.CreateSnapshot() (the cluster backup code path). This flag is set to false in tsm1.Engine.Export(). influxdata/plutonium#3227
When an InfluxDB database is very busy writing new points, the backup process can fail because it cannot write a new snapshot. The error is: operation timed out with error: create snapshot: snapshot in progress. This happens because InfluxDB takes a snapshot from the cache almost "continuously", driven by the high number of points ingested. The fix for this was #16627, but it was for OSS only and was not in the code path for backups in clusters. This fix adds a skipCacheOk flag to tsdb.Engine.CreateSnapshot(). A value of true allows the backup to proceed even if a cache snapshot cannot be taken. This flag is set to true in tsm1.Engine.Backup() (the OSS backup code path) and in tsdb.Shard.CreateSnapshot() (the cluster backup code path). This flag is set to false in tsm1.Engine.Export(). Adds engine_internal_test.go to test the skipCacheOk flag in tsdb.Shard.CreateSnapshot() and tsdb.Engine.CreateSnapshot(). influxdata/plutonium#3227
When an InfluxDB database is very busy writing new points, the backup process can fail because it cannot write a new snapshot. The error is: operation timed out with error: create snapshot: snapshot in progress. This happens because InfluxDB takes a snapshot from the cache almost "continuously", driven by the high number of points ingested. The fix for this was #16627, but it was for OSS only and was not in the code path for backups in clusters. This fix adds a skipCacheOk flag to tsdb.Engine.CreateSnapshot(). A value of true allows the backup to proceed even if a cache snapshot cannot be taken. This flag is set to true in tsm1.Engine.Backup() (the OSS backup code path) and in tsdb.Shard.CreateSnapshot() (the cluster backup code path). This flag is set to false in tsm1.Engine.Export(). influxdata/plutonium#3227 (cherry picked from commit 23be20b)
When an InfluxDB database is very busy writing new points, the backup process can fail because it cannot write a new snapshot. The error is: operation timed out with error: create snapshot: snapshot in progress. This happens because InfluxDB takes a snapshot from the cache almost "continuously", driven by the high number of points ingested. The fix for this was #16627, but it was for OSS only and was not in the code path for backups in clusters. This fix adds a skipCacheOk flag to tsdb.Engine.CreateSnapshot(). A value of true allows the backup to proceed even if a cache snapshot cannot be taken. This flag is set to true in tsm1.Engine.Backup() (the OSS backup code path) and in tsdb.Shard.CreateSnapshot() (the cluster backup code path). This flag is set to false in tsm1.Engine.Export(). influxdata/plutonium#3227 (cherry picked from commit 23be20b) (cherry picked from commit 0b1ee04)
v1.8.4 [2021-01-27]
-------------------

### Bugfixes

- [#19696](influxdata/influxdb#19697): fix(flux): add durations to Flux logging

v1.8.3 [2020-09-30]
-------------------

### Features

- [#19187](influxdata/influxdb#19187): feat: Collect values written stats.
- [#19611](influxdata/influxdb#19611): feat: Add -lponly flag to export sub-command.

### Bugfixes

- [#19409](influxdata/influxdb#19409): chore: update uuid library from satori to gofrs.
- [#19439](influxdata/influxdb#19439): fix(storage): ArrayFilterCursor truncation for multi-block data.
- [#19460](influxdata/influxdb#19460): chore: Use latest version of influxql package.
- [#19512](influxdata/influxdb#19512): chore: Quiet static analysis tools.
- [#19592](influxdata/influxdb#19592): fix(services/storage): multi measurement queries return all applicable series.
- [#19612](influxdata/influxdb#19612): fix: lock map before writes.

v1.8.2 [2020-08-13]
-------------------

### Bugfixes

- [#19253](influxdata/influxdb#19253): fix(tsdb): Revert disable series id set cache size by default.

v1.8.1 [2020-07-08]
-------------------

### Features

- [#18457](influxdata/influxdb#18457): feat(query): Parallelize field iterator planning.
- [#18886](influxdata/influxdb#18886): feat(http): Allow user supplied HTTP headers.

### Bugfixes

- [#17319](influxdata/influxdb#17319): fix(flux): buckets call no longer panics.
- [#18212](influxdata/influxdb#18212): fix(tsdb): Defer closing of underlying SeriesIDSetIterators.
- [#18286](influxdata/influxdb#18286): fix(tsdb): Disable series id set cache size by default.
- [#18299](influxdata/influxdb#18299): refactor(http): Simplify Authorizer.
- [#18694](influxdata/influxdb#18694): fix(tsi1): wait deleting epoch before dropping shard.
- [#18687](influxdata/influxdb#18687): perf(tsi1): batch write tombstone entries when dropping/deleting.
- [#18826](influxdata/influxdb#18826): fix: gracefully handle errors when creating snapshots.
- [#18849](influxdata/influxdb#18849): chore(build): remove all of the go1.12 references from build.

v1.8.0 [2020-04-11]
-------------------

### Features

- [#15952](influxdata/influxdb#15952): Add influx_inspect verify-tombstone tool.
- [#16542](influxdata/influxdb#16542): Add offline series compaction to influx_inspect buildtsi.
- [#16599](influxdata/influxdb#16599): Make influx CLI support custom HTTP endpoint.
- [#16908](influxdata/influxdb#16908): Add support for InfluxDB 2.0 write API.
- [#17621](influxdata/influxdb#17621): Update Flux to v0.65.0.
- [#17188](influxdata/influxdb#17188): Enhance support for bound parameters.

### Bugfixes

- [#10503](influxdata/influxdb#10503): Delete rebuilds series index when series to be deleted are only found in cache.
- [#10504](https://github.com/influxdata/influxdb/issue/10504): Delete rebuilds series index when series to be deleted are outside timerange.
- [#14485](influxdata/influxdb#14485): Parse Accept header correctly.
- [#16524](influxdata/influxdb#16524): Upgrade compaction error log from `Info` to `Warn`.
- [#16525](influxdata/influxdb#16525): Remove double increment of meta index.
- [#16595](influxdata/influxdb#16595): Improve series cardinality limit for inmem index.
- [#16606](influxdata/influxdb#16606): Ensure all block data returned.
- [#16627](influxdata/influxdb#16627): Skip WriteSnapshot during backup if snapshotter is busy.
- [#16709](influxdata/influxdb#16709): Reduce influxd and influx startup time if Flux isn't used.
- [#16762](influxdata/influxdb#16762): Fix bugs in -compact-series-file.
- [#16944](influxdata/influxdb#16944): Update to Go 1.13.8 and Go modules.
- [#17032](influxdata/influxdb#17032): Fix a SIGSEGV when accessing tsi active log.
- [#17656](influxdata/influxdb#17656): Verify precision in write requests.
- [#17698](influxdata/influxdb#17698): Enable configuration of TLS 1.3.