sync: the include and exclude options consistent with rsync behavior (#1554)

* the `include` and `exclude` options consistent with rsync behavior
zhijian-pro authored Mar 18, 2022
1 parent 88c3f9b commit 079ccf8
Showing 5 changed files with 292 additions and 55 deletions.
29 changes: 26 additions & 3 deletions cmd/sync.go
@@ -46,13 +46,33 @@ func cmdSync() *cli.Command {
This tool spawns multiple threads to concurrently syncs objects of two data storages.
SRC and DST should be [NAME://][ACCESS_KEY:SECRET_KEY@]BUCKET[.ENDPOINT][/PREFIX].
Include/exclude pattern rules:
The include/exclude rules each specify a pattern that is matched against the names of the files that are going to be transferred. These patterns can take several forms:
- if the pattern ends with a / then it will only match a directory, not a file, link, or device.
- it chooses between doing a simple string match and wildcard matching by checking if the pattern contains one of these three wildcard characters: '*', '?', and '[' .
- a '*' matches any non-empty path component (it stops at slashes).
- a '?' matches any character except a slash (/).
- a '[' introduces a character class, such as [a-z] or [[:alpha:]].
- in a wildcard pattern, a backslash can be used to escape a wildcard character, but it is matched literally when no wildcards are present.
- it does a prefix match of pattern, i.e. always recursive
Examples:
# Sync object from OSS to S3
$ juicefs sync oss://mybucket.oss-cn-shanghai.aliyuncs.com s3://mybucket.s3.us-east-2.amazonaws.com
# Sync objects from S3 to JuiceFS
$ juicefs mount -d redis://localhost /mnt/jfs
$ juicefs sync s3://mybucket.s3.us-east-2.amazonaws.com /mnt/jfs
$ juicefs sync s3://mybucket.s3.us-east-2.amazonaws.com/ /mnt/jfs/
# SRC: a1/b1,a2/b2,aaa/b1 DST: empty sync result: aaa/b1
$ juicefs sync --exclude='a?/b*' s3://mybucket.s3.us-east-2.amazonaws.com/ /mnt/jfs/
# SRC: a1/b1,a2/b2,aaa/b1 DST: empty sync result: a1/b1,aaa/b1
$ juicefs sync --include='a1/b1' --exclude='a[1-9]/b*' s3://mybucket.s3.us-east-2.amazonaws.com/ /mnt/jfs/
# SRC: a1/b1,a2/b2,aaa/b1,b1,b2 DST: empty sync result: a1/b1,b2
$ juicefs sync --include='a1/b1' --exclude='a*' --include='b2' --exclude='b?' s3://mybucket.s3.us-east-2.amazonaws.com/ /mnt/jfs/
Supported storage systems: https://juicefs.com/docs/community/how_to_setup_object_storage#supported-object-storage`,
Flags: []cli.Flag{
@@ -111,11 +131,11 @@ Supported storage systems: https://juicefs.com/docs/community/how_to_setup_objec
},
&cli.StringSliceFlag{
Name: "exclude",
Usage: "exclude keys containing `PATTERN` (POSIX regular expressions)",
Usage: "exclude Key matching PATTERN",
},
&cli.StringSliceFlag{
Name: "include",
Usage: "only include keys containing `PATTERN` (POSIX regular expressions)",
Usage: "don't exclude Key matching PATTERN",
},
&cli.StringFlag{
Name: "manager",
@@ -281,6 +301,9 @@ func isS3PathType(endpoint string) bool {

func doSync(c *cli.Context) error {
setup(c, 2)
if c.IsSet("include") && !c.IsSet("exclude") {
logger.Warnf("The include option needs to be used with the exclude option, otherwise the result of the current sync may not match your expectations")
}
config := sync.NewConfigFromCli(c)
go func() { _ = http.ListenAndServe(fmt.Sprintf("127.0.0.1:%d", config.HTTPPort), nil) }()

12 changes: 10 additions & 2 deletions docs/en/reference/command_reference.md
@@ -512,6 +512,10 @@ The format of both the source and destination paths is `[NAME://][ACCESS_KEY:SEC
- `BUCKET[.ENDPOINT]`: The access address of the data storage service, the format may be different for different storage types, please refer to [document](how_to_setup_object_storage.md#supported-object-storage).
- `[/PREFIX]`: Optional, a prefix for the source and destination paths that can be used to limit the synchronization to only data in certain paths.

:::note
If you want to express the concept of a folder in `SRC` or `DST`, make sure the path ends with "/" or "\"; otherwise it will be treated as a prefix of the object name.
:::

#### Options

`--start KEY, -s KEY`<br />
@@ -548,10 +552,14 @@ delete objects from source after synced (default: false)
delete extraneous objects from destination (default: false)

`--exclude PATTERN`<br />
exclude keys containing PATTERN (POSIX regular expressions)
exclude Key matching PATTERN

`--include PATTERN`<br />
only include keys containing PATTERN (POSIX regular expressions)
don't exclude Key matching PATTERN. Needs to be used with `--exclude PATTERN`.

:::tip
The order in which `--exclude` and `--include` are given affects the result. Each object is tested against the patterns in the order the options appear on the command line. As soon as one PATTERN matches, the object is included or excluded according to that option, and later options are not tried. An object that matches none of the patterns is included by default. The `--include` and `--exclude` options are modeled on `rsync`, but the `**` and `***` matching rules of `rsync` are not currently supported.
:::
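
The ordering rule in the tip above can be illustrated with a minimal Go sketch. It assumes single-component keys so that `path.Match` can be applied directly; the `rule` type mirrors the one added in `pkg/sync/sync.go`, while `decide` is a hypothetical helper written only for this illustration.

```go
package main

import (
	"fmt"
	"path"
)

// rule pairs a pattern with the option it came from.
type rule struct {
	pattern string
	include bool // true for --include, false for --exclude
}

// decide is a hypothetical helper: the first matching rule wins, and a key
// matched by no rule is included by default.
func decide(rules []rule, key string) bool {
	for _, r := range rules {
		if ok, err := path.Match(r.pattern, key); err == nil && ok {
			return r.include
		}
	}
	return true
}

func main() {
	// The same two patterns, given in opposite orders.
	includeFirst := []rule{{"b2", true}, {"b?", false}}
	excludeFirst := []rule{{"b?", false}, {"b2", true}}
	for _, key := range []string{"b1", "b2", "c1"} {
		fmt.Printf("%s include-first=%v exclude-first=%v\n",
			key, decide(includeFirst, key), decide(excludeFirst, key))
	}
}
```

With `--include='b2'` first, `b2` is kept; with `--exclude='b?'` first, `b2` is dropped because the exclude pattern matches before the include pattern is tried. `c1` matches nothing and is kept in both cases.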

`--manager value`<br />
manager address
12 changes: 10 additions & 2 deletions docs/zh_cn/reference/command_reference.md
@@ -510,6 +510,10 @@ juicefs sync [command options] SRC DST
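- `BUCKET[.ENDPOINT]`: The access address of the data storage service. The format may differ between storage types; see the [documentation](how_to_setup_object_storage.md#支持的存储服务) for details.
- `[/PREFIX]`: Optional. A prefix for the source and destination paths, which can be used to restrict synchronization to data under certain paths.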

:::tip
If you want to express the concept of a folder in `SRC` or `DST`, make sure the path ends with "/" or "\"; otherwise it will be treated as a prefix of the object name.
:::

#### Options

`--start KEY, -s KEY`<br />
@@ -546,10 +550,14 @@ juicefs sync [command options] SRC DST
Delete extraneous objects from the destination storage (default: false)

`--exclude PATTERN`<br />
Skip object names containing PATTERN (POSIX regular expressions)
Exclude Key matching PATTERN

`--include PATTERN`<br />
Only sync object names containing PATTERN (POSIX regular expressions)
Don't exclude Key matching PATTERN. Needs to be used together with `--exclude`.

:::tip
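The order in which `--exclude` and `--include` are set affects the result. Each object is matched against the two options in the order they appear. Once the PATTERN of an option matches, the object is handled according to that option's type, and options that appear later are not tried. If an object is not matched by any option, it is included by default. The `--include` and `--exclude` options are designed with reference to `rsync`, but the `**` and `***` matching rules of `rsync` are not currently supported.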
:::

`--manager value`<br />
Manager address
89 changes: 61 additions & 28 deletions pkg/sync/sync.go
@@ -22,7 +22,8 @@ import (
"io"
"io/ioutil"
"os"
"regexp"
"path"
"runtime"
"strings"
"sync"
"time"
@@ -586,8 +587,9 @@ func producer(tasks chan<- object.Object, src, dst object.ObjectStorage, config
logger.Fatal(err)
}
if config.Exclude != nil {
srckeys = filter(srckeys, config.Include, config.Exclude)
dstkeys = filter(dstkeys, config.Include, config.Exclude)
rules := initRules(src, dst)
srckeys = filter(srckeys, rules)
dstkeys = filter(dstkeys, rules)
}

defer close(tasks)
@@ -661,51 +663,82 @@ func producer(tasks chan<- object.Object, src, dst object.ObjectStorage, config
}
}

func compileExp(patterns []string) []*regexp.Regexp {
var rs []*regexp.Regexp
for _, p := range patterns {
r, err := regexp.CompilePOSIX(p)
if err != nil {
logger.Fatalf("invalid regular expression `%s`: %s", p, err)
}
rs = append(rs, r)
}
return rs
type rule struct {
pattern string
include bool
}

func findAny(s string, ps []*regexp.Regexp) bool {
for _, p := range ps {
if p.FindString(s) != "" {
return true
func initRules(src, dst object.ObjectStorage) (rules []*rule) {
l := len(os.Args)
for idx, arg := range os.Args {
if l-1 > idx && (arg == "--include" || arg == "-include") {
rules = append(rules, &rule{pattern: os.Args[idx+1], include: true})
} else if l-1 > idx && (arg == "--exclude" || arg == "-exclude") {
rules = append(rules, &rule{pattern: os.Args[idx+1], include: false})
} else if strings.HasPrefix(arg, "--include=") || strings.HasPrefix(arg, "-include=") {
if s := strings.Split(arg, "="); len(s) == 2 && s[1] != "" {
rules = append(rules, &rule{pattern: s[1], include: true})
}
} else if strings.HasPrefix(arg, "--exclude=") || strings.HasPrefix(arg, "-exclude=") {
if s := strings.Split(arg, "="); len(s) == 2 && s[1] != "" {
rules = append(rules, &rule{pattern: s[1], include: false})
}
}
}
return false
if runtime.GOOS == "windows" && (strings.HasPrefix(src.String(), "file:") || strings.HasPrefix(dst.String(), "file:")) {
for _, r := range rules {
r.pattern = strings.Replace(r.pattern, "\\", "/", -1)
}
}
return
}
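
`initRules` reads `os.Args` directly, presumably because the separate `include` and `exclude` flag slices cannot record how the two options were interleaved. One detail of the version in the diff: `strings.Split(arg, "=")` combined with the `len(s) == 2` check means that a pattern which itself contains `=` produces more than two parts and is skipped. The sketch below is a hypothetical alternative, not the commit's code. It takes the argument list as a parameter and uses `strings.SplitN` so that only the first `=` separates the option name from the pattern; `parseRules` is a name invented for this example.

```go
package main

import (
	"fmt"
	"strings"
)

type rule struct {
	pattern string
	include bool
}

// parseRules is a hypothetical helper, not part of the commit. It scans args
// in order and records every include/exclude pattern as it appears.
func parseRules(args []string) []rule {
	var rules []rule
	for i := 0; i < len(args); i++ {
		name, value, hasValue := args[i], "", false
		// Split on the first "=" only, so "--exclude=k=v*" keeps "k=v*".
		if kv := strings.SplitN(args[i], "=", 2); len(kv) == 2 {
			name, value, hasValue = kv[0], kv[1], true
		}
		var include bool
		switch name {
		case "--include", "-include":
			include = true
		case "--exclude", "-exclude":
			include = false
		default:
			continue
		}
		if !hasValue {
			if i+1 >= len(args) {
				break // option given as the last argument, no pattern follows
			}
			i++ // consume the next argument as the pattern
			value = args[i]
		}
		if value != "" {
			rules = append(rules, rule{pattern: value, include: include})
		}
	}
	return rules
}

func main() {
	args := []string{"sync", "--include=a1/b1", "--exclude", "a*", "--exclude=k=v*", "SRC", "DST"}
	for _, r := range parseRules(args) {
		fmt.Printf("include=%v pattern=%q\n", r.include, r.pattern)
	}
}
```

Passing the argument slice in, rather than reading `os.Args` inside the function, also makes the parsing step straightforward to unit test.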

func filter(keys <-chan object.Object, include, exclude []string) <-chan object.Object {
inc := compileExp(include)
exc := compileExp(exclude)
func filter(keys <-chan object.Object, rules []*rule) <-chan object.Object {
r := make(chan object.Object)
go func() {
for o := range keys {
if o == nil {
break
}
if findAny(o.Key(), exc) {
if includeObject(rules, o) {
r <- o
} else {
logger.Debugf("exclude %s", o.Key())
continue
}
if len(inc) > 0 && !findAny(o.Key(), inc) {
logger.Debugf("%s is not included", o.Key())
continue
}
r <- o
}
close(r)
}()
return r
}

func alignPatternAndKey(pattern, key string) string {
sep := "/"
l := strings.Count(pattern, sep) + 1
ps := strings.Split(key, sep)
if len(ps) < l {
return key
} else if strings.HasSuffix(pattern, sep) {
return strings.Join(ps[:l-1], sep) + sep
} else {
return strings.Join(ps[:l], sep)
}
}

// Consistent with rsync behavior, the matching order is adjusted according to the order of the "include" and "exclude" options
func includeObject(rules []*rule, o object.Object) bool {
for _, rule := range rules {
k := alignPatternAndKey(rule.pattern, o.Key())
match, err := path.Match(rule.pattern, k)
if err != nil {
logger.Fatalf("pattern error : %v", err)
}
if match {
return rule.include
}
}
return true
}

// Sync syncs all the keys between two object storages
func Sync(src, dst object.ObjectStorage, config *Config) error {
var bufferSize = 10240