Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix extended glob #2182

Closed

Conversation

Lehmann-Fabian
Copy link
Contributor

This PR solves the problems with the extended glob, which can be seen in #2181. Now it passes all tests.
I promised this fix in #2035.

Signed-off-by: Lehmann-Fabian <[email protected]>
Signed-off-by: Lehmann-Fabian <[email protected]>
@Lehmann-Fabian
Copy link
Contributor Author

I'm not sure why it is failing.

pathes=`ls -1d ${escape.join(' ')} | sort | uniq`
shopt -u globstar extglob || true
set -f
for name in \$pathes; do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is this prevents the wildcards expansion. Why is needed? they should have already been resolved by the ls -1d command

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, set -f prevents it. This is needed, if file names contain a special character like a question mark, which would otherwise be resolved again. There is one test case which fails otherwise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, not fully understanding this. Can you provide a minimal example NF script showing how this works?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, the following process should only find one file - the one with a question mark in the file name.

process a {

    echo true

    output:
    file("ab\\?.txt") into found

    """
    touch abc.txt
    touch ab\\?.txt
    ls
    """

}
found.view()

Don't disable wildcard expansion would lead to a new expansion in the for loop and once again in the copy/move method. Accordingly, both files will be moved/copied.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @pditommaso, I’m not sure whether this helped you understand my contribution or you need an additional example. I look forward to your feedback.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a problem can come from the fact that NF expects the use of the glob: option to switch off the matching of the ? special character. My tests:

without glon option

process a {

    output:
    path "ab?.txt" into found

    """
    touch abc.txt
    touch ab\\?.txt
    ls
    """

}
found.flatten().view()

shell:

» nextflow run test.nf
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test.nf` [elated_nobel] - revision: 198b348c2b
executor >  local (1)
[23/6a07e6] process > a [100%] 1 of 1 ✔
/Users/pditommaso/Projects/nextflow/work/23/6a07e6c49fe17b96cab4c476a2b83f/ab?.txt
/Users/pditommaso/Projects/nextflow/work/23/6a07e6c49fe17b96cab4c476a2b83f/abc.txt

OK

using glob option

process a {

    output:
    path "ab?.txt", glob: false into found

    """
    touch abc.txt
    touch ab\\?.txt
    ls
    """

}
found.flatten().view()

Shell:

» nextflow run test.nf
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test.nf` [curious_brown] - revision: 4cae485aee
executor >  local (1)
[56/bda665] process > a [100%] 1 of 1 ✔
/Users/pditommaso/Projects/nextflow/work/56/bda665ba766227bbecf15e8611cbf6/ab?.txt

OK

glob option + enable scratch option

process a {

    output:
    path "ab?.txt", glob: false into found

    """
    touch abc.txt
    touch ab\\?.txt
    ls
    """

}
found.flatten().view()

Shell command:

» nextflow run test.nf -process.scratch true
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test.nf` [high_minsky] - revision: 4cae485aee
executor >  local (1)
[82/ef2225] process > a [100%] 1 of 1 ✔
/Users/pditommaso/Projects/nextflow/work/82/ef22257c5eb503b4f57e993c8aca51/ab?.txt

OK, though

Work dir:

» ls -lh work/82/ef22257c5eb503b4f57e993c8aca51
total 0
-rw-r--r--  1 pditommaso  staff     0B Oct  6 16:28 ab?.txt
-rw-r--r--  1 pditommaso  staff     0B Oct  6 16:28 abc.txt

I think your concern is the above, that since the ab?.txt was requested, both files were copied from the task scratch copied. Is that right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. My concern is that the file is copied/moved anyways. That is why I prevent the second wildcard expansion since the absolute file paths are known then.
a) If you do not see a problem, we can delete the set -f. But this will generate more I/O.

Anyway, I see my commit does not consider the option glob: false. In that case, we should also say set -f instead of shopt -s globstar extglob || true to activate the glob.
b) How do you think about that?
If you agree to both, I will change a and b as proposed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see two problems: 1) handling correctly glob:false 2) handle ** wildcard

For 1, I think it could be addressed replacing Escape.path with Escape.wildcard when glob==false here

For 2, I think it would be preferable to keep eval subshell, I mean having

for name in \$(eval "ls -1d ${escape.join(' ')}" | sort | uniq); do
... 

Copy link
Contributor Author

@Lehmann-Fabian Lehmann-Fabian Oct 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me start with 2).
I found the problem you meant.
There was a problem if the ** was within paths like a/dir**file.txt, but I fixed it in the last commit.
I prepared 4 test cases and tested them three times each.

  1. with the current Nextflow version
  2. with the current Nextflow version and -process.scratch true
  3. with my fixed Nextflow version and -process.scratch true

For portability, the second and third test case should find the same files as the first one. However, the second and third should only copy the files found.

Testcase 1

process a {

    output:
    path "{a,b,c}/*" into found

    """
    mkdir a b c d
    touch file.txt
    touch a/file.txt
    touch c/file.txt
    touch d/file.txt
    ls
    """

}
found.flatten().view()

1.

$ ./nextflow run test1.nf
N E X T F L O W  ~  version 21.04.3
Launching `test1.nf` [lonely_sanger] - revision: 06c25e47e3
executor >  local (1)
[21/781741] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/21/781741acd0b0369ababfe7c881ebc1/a/file.txt
/home/fabian/extendedGlob/work/21/781741acd0b0369ababfe7c881ebc1/c/file.txt

Files in workdir:

work/21/781741acd0b0369ababfe7c881ebc1/
├── a
│   └── file.txt
├── b
├── c
│   └── file.txt
├── d
│   └── file.txt
└── file.txt

4 directories, 4 files

2.

$ ./nextflow run test1.nf -process.scratch true
N E X T F L O W  ~  version 21.04.3
Launching `test1.nf` [intergalactic_dijkstra] - revision: 06c25e47e3
executor >  local (1)
[5b/b09417] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/5b/b09417152861aa02b76b6b756f92ae/c/a
/home/fabian/extendedGlob/work/5b/b09417152861aa02b76b6b756f92ae/c/b
/home/fabian/extendedGlob/work/5b/b09417152861aa02b76b6b756f92ae/c/file.txt

Files in workdir:

work/5b/b09417152861aa02b76b6b756f92ae/
├── a
├── b
└── c
    ├── a
    ├── b
    └── file.txt

5 directories, 1 file

For the current Nextflow version, the files are copied into the wrong hierarchy. Accordingly, the wrong and not all files are found.

3.

$ ./nextflow-fix run test1.nf -process.scratch true
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test1.nf` [gigantic_stone] - revision: 06c25e47e3
executor >  local (1)
[89/51d3b7] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/89/51d3b76e56ae498416cc307209a3aa/a/file.txt
/home/fabian/extendedGlob/work/89/51d3b76e56ae498416cc307209a3aa/c/file.txt

Files in workdir:

work/89/51d3b76e56ae498416cc307209a3aa/
├── a
│   └── file.txt
└── c
    └── file.txt

2 directories, 2 files

Testcase 2

process a {

    output:
    path "a/*/c/*" into found

    """
    mkdir -p a/b/c/ a/b/d/c/
    touch a/b/c/file.txt
    touch a/b/c/file2.txt
    touch a/b/d/c/file2.txt
    touch a/b/file.txt
    touch a/file.txt
    touch file.txt
    ls
    """

}
found.flatten().view()

1.

$ ./nextflow run test2.nf
N E X T F L O W  ~  version 21.04.3
Launching `test2.nf` [modest_venter] - revision: 53280a4646
executor >  local (1)
[3f/eba9de] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/3f/eba9de27063e039e2d4e5351f15817/a/b/c/file.txt
/home/fabian/extendedGlob/work/3f/eba9de27063e039e2d4e5351f15817/a/b/c/file2.txt

Files in workdir:

work/3f/eba9de27063e039e2d4e5351f15817/
├── a
│   ├── b
│   │   ├── c
│   │   │   ├── file2.txt
│   │   │   └── file.txt
│   │   ├── d
│   │   │   └── c
│   │   │       └── file2.txt
│   │   └── file.txt
│   └── file.txt
└── file.txt

5 directories, 6 files

2.

$ ./nextflow run test2.nf -process.scratch true
N E X T F L O W  ~  version 21.04.3
Launching `test2.nf` [romantic_dijkstra] - revision: 53280a4646
executor >  local (1)
[23/f55e4c] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/23/f55e4c82bf437bc047c8b670c26457/a/*/c/file.txt
/home/fabian/extendedGlob/work/23/f55e4c82bf437bc047c8b670c26457/a/*/c/file2.txt

Files in workdir:

work/23/f55e4c82bf437bc047c8b670c26457/
└── a
    └── *
        └── c
            ├── file2.txt
            └── file.txt

3 directories, 2 files

This time the * is somehow used as the output file name. Accordingly, some files are overwritten, and not all files are found in the end.

3.

$ ./nextflow-fix run test2.nf -process.scratch true
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test2.nf` [festering_jang] - revision: 53280a4646
executor >  local (1)
[8c/cfc9ff] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/8c/cfc9ffcc0949937b0982ea120e5303/a/b/c/file.txt
/home/fabian/extendedGlob/work/8c/cfc9ffcc0949937b0982ea120e5303/a/b/c/file2.txt

Files in workdir:

work/8c/cfc9ffcc0949937b0982ea120e5303/
└── a
    └── b
        └── c
            ├── file2.txt
            └── file.txt

3 directories, 2 files

Testcase 3

process a {

    output:
    path "a/dir*/file.txt" into found

    """
    mkdir -p a/dir1/a a/dir2
    touch a/dir1/file.txt
    touch a/dir2/file.txt
    touch a/dir1/a/file.txt
    ls
    """

}
found.flatten().view()

1.

$ ./nextflow run test3.nf
N E X T F L O W  ~  version 21.04.3
Launching `test3.nf` [gigantic_hodgkin] - revision: 3c4fcfe408
executor >  local (1)
[3b/88992b] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/3b/88992b05421a86ecae9cdee5d415d8/a/dir1/file.txt
/home/fabian/extendedGlob/work/3b/88992b05421a86ecae9cdee5d415d8/a/dir2/file.txt

Files in workdir:

work/3b/88992b05421a86ecae9cdee5d415d8/
└── a
    ├── dir1
    │   ├── a
    │   │   └── file.txt
    │   └── file.txt
    └── dir2
        └── file.txt

4 directories, 3 files

2.

$ ./nextflow run test3.nf -process.scratch true
N E X T F L O W  ~  version 21.04.3
Launching `test3.nf` [magical_rubens] - revision: 3c4fcfe408
executor >  local (1)
[f2/adbebd] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/f2/adbebd320ec29728872033ebb09bf4/a/dir*/file.txt

Files in workdir:

work/f2/adbebd320ec29728872033ebb09bf4/
└── a
    └── dir*
        └── file.txt

2 directories, 1 file

More or less the same behavior as for test case 2.

3.

$ ./nextflow-fix run test3.nf -process.scratch true
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test3.nf` [admiring_blackwell] - revision: 3c4fcfe408
executor >  local (1)
[14/06cc88] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/14/06cc88cf25a6e76ec18ef9789c3478/a/dir1/file.txt
/home/fabian/extendedGlob/work/14/06cc88cf25a6e76ec18ef9789c3478/a/dir2/file.txt

Files in workdir:

work/14/06cc88cf25a6e76ec18ef9789c3478/
└── a
    ├── dir1
    │   └── file.txt
    └── dir2
        └── file.txt

3 directories, 2 files

Testcase 4

process a {

    output:
    path "a/dir**/file.txt" into found

    """
    mkdir -p a/dir1/a/b/ a/dir2 a/dir
    touch a/dir/file.txt
    touch a/dir1/file.txt
    touch a/dir2/file.txt
    touch a/dir1/a/file.txt
    touch a/dir1/a/b/file.txt
    ls
    """

}
found.flatten().view()

1.

$ ./nextflow run test4.nf
N E X T F L O W  ~  version 21.04.3
Launching `test4.nf` [lonely_descartes] - revision: 623cbb1c8b
executor >  local (1)
[f5/36c1a7] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/f5/36c1a762b733e32d9e1bf20cb66780/a/dir/file.txt
/home/fabian/extendedGlob/work/f5/36c1a762b733e32d9e1bf20cb66780/a/dir1/a/b/file.txt
/home/fabian/extendedGlob/work/f5/36c1a762b733e32d9e1bf20cb66780/a/dir1/a/file.txt
/home/fabian/extendedGlob/work/f5/36c1a762b733e32d9e1bf20cb66780/a/dir1/file.txt
/home/fabian/extendedGlob/work/f5/36c1a762b733e32d9e1bf20cb66780/a/dir2/file.txt

Files in workdir:

work/f5/36c1a762b733e32d9e1bf20cb66780/
└── a
    ├── dir
    │   └── file.txt
    ├── dir1
    │   ├── a
    │   │   ├── b
    │   │   │   └── file.txt
    │   │   └── file.txt
    │   └── file.txt
    └── dir2
        └── file.txt

6 directories, 5 files

2.

$ ./nextflow run test4.nf -process.scratch true
N E X T F L O W  ~  version 21.04.3
Launching `test4.nf` [insane_rosalind] - revision: 623cbb1c8b
executor >  local (1)
[2f/b19602] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/2f/b19602d56e03194db570eb45e697bd/a/dir/file.txt
/home/fabian/extendedGlob/work/2f/b19602d56e03194db570eb45e697bd/a/dir1/a/b/file.txt
/home/fabian/extendedGlob/work/2f/b19602d56e03194db570eb45e697bd/a/dir1/a/file.txt
/home/fabian/extendedGlob/work/2f/b19602d56e03194db570eb45e697bd/a/dir1/file.txt
/home/fabian/extendedGlob/work/2f/b19602d56e03194db570eb45e697bd/a/dir2/file.txt

Files in workdir:

work/2f/b19602d56e03194db570eb45e697bd/
└── a
    ├── dir
    │   └── file.txt
    ├── dir1
    │   ├── a
    │   │   ├── b
    │   │   │   └── file.txt
    │   │   └── file.txt
    │   └── file.txt
    └── dir2
        └── file.txt

6 directories, 5 files

3.

$ ./nextflow-fix run test4.nf -process.scratch true
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test4.nf` [happy_golick] - revision: 623cbb1c8b
executor >  local (1)
[ba/6f94f6] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/ba/6f94f6ff172a2cd39f4deeea0d42fd/a/dir/file.txt
/home/fabian/extendedGlob/work/ba/6f94f6ff172a2cd39f4deeea0d42fd/a/dir1/a/b/file.txt
/home/fabian/extendedGlob/work/ba/6f94f6ff172a2cd39f4deeea0d42fd/a/dir1/a/file.txt
/home/fabian/extendedGlob/work/ba/6f94f6ff172a2cd39f4deeea0d42fd/a/dir1/file.txt
/home/fabian/extendedGlob/work/ba/6f94f6ff172a2cd39f4deeea0d42fd/a/dir2/file.txt

Files in workdir:

work/ba/6f94f6ff172a2cd39f4deeea0d42fd/
└── a
    ├── dir
    │   └── file.txt
    ├── dir1
    │   ├── a
    │   │   ├── b
    │   │   │   └── file.txt
    │   │   └── file.txt
    │   └── file.txt
    └── dir2
        └── file.txt

6 directories, 5 files

For ** it also works in my fixed version.

From my tests, you can see that my implementation is consistent with the behavior running without a scratch dir.
I implemented many more test cases here

source | inputffiles | outputfiles
'**.txt' | ['file.txt', 'file2.txt', 'file3.txta', 'a/file.txt', 'a/file3.txta', 'b/file.txt'] | ['file.txt', 'file2.txt', 'a/file.txt', 'b/file.txt']
'**.foo' | ['file.txt', 'file2.txt', 'file3.txta', 'a/file.txt', 'a/file3.txta', 'b/file.txt'] | []
'*.txt' | ['file.txt', 'file2.txt', 'file3.txta'] | ['file.txt', 'file2.txt']
'*' | ['file.txt', 'file2.txt', 'file3.txta', 'a/b.txt'] | ['file.txt', 'file2.txt', 'file3.txta', 'a/b.txt']
'**' | ['file.txt', 'file2.txt', 'file3.txta', 'a/b.txt'] | ['file.txt', 'file2.txt', 'file3.txta', 'a/b.txt']
'a/*' | ['a/file.txt', 'a/file2.txt', 'a/b/a.txt', 'a/d/', 'file.txt'] | ['a/file.txt', 'a/file2.txt', 'a/b/a.txt', 'a/d/']
'a/' | ['a/file.txt', 'a/file2.txt', 'a/b/a.txt', 'a/d/', 'file.txt'] | ['a/file.txt', 'a/file2.txt', 'a/b/a.txt', 'a/d/']
'a' | ['a/file.txt', 'a/file2.txt', 'a/b/a.txt', 'a/d/', 'file.txt'] | ['a/file.txt', 'a/file2.txt', 'a/b/a.txt', 'a/d/']
'a b/*' | ['a b/file.txt', 'a b/file2.txt', 'a b/b/a.txt', 'a b/d/', 'file.txt'] | ['a b/file.txt', 'a b/file2.txt', 'a b/b/a.txt', 'a b/d/']
'a b/' | ['a b/file.txt', 'a b/file2.txt', 'a b/b/a.txt', 'a b/d/', 'file.txt'] | ['a b/file.txt', 'a b/file2.txt', 'a b/b/a.txt', 'a b/d/']
'a b' | ['a b/file.txt', 'a b/file2.txt', 'a b/b/a.txt', 'a b/d/', 'file.txt'] | ['a b/file.txt', 'a b/file2.txt', 'a b/b/a.txt', 'a b/d/']
'a|b/*' | ['a|b/file.txt', 'a|b/file2.txt', 'a|b/b/a.txt', 'a|b/d/', 'file.txt'] | ['a|b/file.txt', 'a|b/file2.txt', 'a|b/b/a.txt', 'a|b/d/']
'a/*/c/*' | ['a/b/c/file.txt', 'a/b/c/file2.txt', 'a/b/d/c/file2.txt', 'a/b/file.txt', 'a/file.txt', 'file.txt'] | ['a/b/c/file.txt', 'a/b/c/file2.txt']
'a/**/c/*' | ['a/b/c/file.txt', 'a/b/c/file2.txt', 'a/b/d/c/file2.txt', 'a/b/file.txt', 'a/file.txt', 'file.txt'] | ['a/b/c/file.txt', 'a/b/c/file2.txt', 'a/b/d/c/file2.txt']
'a/*/c/' | ['a/b/c/file.txt', 'a/b/c/file2.txt', 'a/b/d/c/file2.txt', 'a/b/file.txt', 'a/file.txt', 'file.txt'] | ['a/b/c/file.txt', 'a/b/c/file2.txt']
'a/**/c/' | ['a/b/c/file.txt', 'a/b/c/file2.txt', 'a/b/d/c/file2.txt', 'a/b/file.txt', 'a/file.txt', 'file.txt'] | ['a/b/c/file.txt', 'a/b/c/file2.txt', 'a/b/d/c/file2.txt']
'a/*/c' | ['a/b/c/file.txt', 'a/b/c/file2.txt', 'a/b/d/c/file2.txt', 'a/b/file.txt', 'a/file.txt', 'file.txt'] | ['a/b/c/file.txt', 'a/b/c/file2.txt']
'a/**/c' | ['a/b/c/file.txt', 'a/b/c/file2.txt', 'a/b/d/c/file2.txt', 'a/b/file.txt', 'a/file.txt', 'file.txt'] | ['a/b/c/file.txt', 'a/b/c/file2.txt', 'a/b/d/c/file2.txt']
'a/dir*/file.txt' | ['a/dir/file.txt', 'a/dir1/file.txt', 'a/dir2/afile.txt', 'a/dir1/a/file.txt'] | ['a/dir/file.txt', 'a/dir1/file.txt']
'a/dir**/file.txt' | ['a/dir/file.txt', 'a/dir1/file.txt', 'a/dir2/afile.txt', 'a/dir1/a/file.txt', 'a/dir1/a/b/file.txt'] | ['a/dir/file.txt', 'a/dir1/file.txt', 'a/dir1/a/file.txt', 'a/dir1/a/b/file.txt']
'a/dir**file.txt' | ['a/dirafile.txt', 'a/dir/file.txt', 'a/dir1/file.txt', 'a/dir2/afile.txt', 'a/dir1/a/file.txt', 'a/dir1/a/b/file.txt'] | ['a/dirafile.txt', 'a/dir/file.txt', 'a/dir1/file.txt', 'a/dir2/afile.txt', 'a/dir1/a/file.txt', 'a/dir1/a/b/file.txt']
'a/?/c/*' | ['a/b/c/file.txt', 'a/b/c/file2.txt', 'a/b/d/c/file2.txt', 'a/b/file.txt', 'a/file.txt', 'file.txt'] | ['a/b/c/file.txt', 'a/b/c/file2.txt']
'a/?/c/' | ['a/b/c/file.txt', 'a/b/c/file2.txt', 'a/b/d/c/file2.txt', 'a/b/file.txt', 'a/file.txt', 'file.txt'] | ['a/b/c/file.txt', 'a/b/c/file2.txt']
'a/?/c' | ['a/b/c/file.txt', 'a/b/c/file2.txt', 'a/b/d/c/file2.txt', 'a/b/file.txt', 'a/file.txt', 'file.txt'] | ['a/b/c/file.txt', 'a/b/c/file2.txt']
'{a,b,c}/*' | ['a/file.txt', 'a/file2.txt', 'b/file.txt', 'b/file2.txt', 'd/file.txt', 'file.txt'] | ['a/file.txt', 'a/file2.txt', 'b/file.txt', 'b/file2.txt']
'{a,b,c}/' | ['a/file.txt', 'a/file2.txt', 'b/file.txt', 'b/file2.txt', 'd/file.txt', 'file.txt'] | ['a/file.txt', 'a/file2.txt', 'b/file.txt', 'b/file2.txt']
'{a,b,c}' | ['a/file.txt', 'a/file2.txt', 'b/file.txt', 'b/file2.txt', 'd/file.txt', 'file.txt'] | ['a/file.txt', 'a/file2.txt', 'b/file.txt', 'b/file2.txt']
'{ab*,b*/*,c}/*' | ['a/file.txt', 'abcd/file2.txt', 'b/file.txt', 'b/file2.txt', 'd/file.txt', 'file.txt'] | ['abcd/file2.txt']
'{a[ab]c,b*/*}/*' | ['acc/file.txt', 'abc/file2.txt', 'b/file.txt', 'b/file2.txt', 'd/file.txt', 'file.txt'] | ['abc/file2.txt']
'{a|c*,b*/*}/*' | ['a|c/file.txt', 'abc/file2.txt', 'b/file.txt', 'b/file2.txt', 'd/file.txt', 'file.txt'] | ['a|c/file.txt']
'[A-Z]/*' | ['A/file.txt', 'a/file2.txt', 'b/file.txt', 'b/file2.txt', 'd/file.txt', 'file.txt'] | ['A/file.txt']
'[A-Z]/' | ['A/file.txt', 'a/file2.txt', 'b/file.txt', 'b/file2.txt', 'd/file.txt', 'file.txt'] | ['A/file.txt']
'[A-Z]' | ['A/file.txt', 'a/file2.txt', 'b/file.txt', 'b/file2.txt', 'd/file.txt', 'file.txt'] | ['A/file.txt']
'a' | ['A/file.txt', 'a/', 'b/file.txt', 'b/file2.txt', 'd/file.txt', 'file.txt'] | ['a/']
'a/b/c' | ['A/file.txt', 'a/b/c/', 'a/b/file.txt', 'b/file2.txt', 'd/file.txt', 'file.txt'] | ['a/b/c/']
'abc?\\?.txt' | ['abcde.txt', 'abcd.txt', 'abcd?.txt'] | ['abcd?.txt']
'a"b.txt' | ['a\\"b.txt', 'a"b.txt', 'abc.txt'] | ['a"b.txt']
'a/**/b/*/d.txt' | ['a/c/d/b/x/d.txt', 'a/c/d/b/x/Y/d.txt'] | ['a/c/d/b/x/d.txt']
'a/bc*h/ij*mn/*' | ['a/bcdefgh/ijklmn/a.txt', 'a/bcdfgh/ijklmn/a.txt', 'a/bcdefghi/ijklmn/a.txt'] | ['a/bcdefgh/ijklmn/a.txt', 'a/bcdfgh/ijklmn/a.txt']

The tests check if all files are found, but not more.

Let us talk about 1): How to deal with glob:false.
I have not yet implemented it.
Have you a good idea how to get the information whether an output pattern has to be treated as glob:false here:

String getUnstageOutputFilesScript(List<String> outputFiles, Path targetDir) {

My idea would be to change this:

List<String> getOutputFilesNames() {
def result = []
for( FileOutParam param : getOutputsByType(FileOutParam).keySet() ) {
result.addAll( param.getFilePatterns(context, workDir) )
}
return result.unique()
}

to:

List<Tuple2<String, Boolean>> getOutputFilesNames() {
    def result = []

    for( FileOutParam param : getOutputsByType(FileOutParam).keySet() ) {
        result.addAll( new Tuple2(param.getFilePatterns(context, workDir), param.glob) )
    }

    return result.unique()
}

and adjust all calls accordingly.
But I think there is a better way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commenting in the main thread to have more space

@pditommaso pditommaso force-pushed the master branch 5 times, most recently from 30b2ef4 to cc77472 Compare August 16, 2021 09:30
@pditommaso
Copy link
Member

I think most (all) of the problems you are reporting using the current master version (think 21.08.0-edge) is also ok. I've repeated the tests for your convenience

Test case 1

1. plain

» nextflow run test1.nf 
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test1.nf` [tiny_sanger] - revision: de0bfb4190
executor >  local (1)
[fc/f44914] process > a [100%] 1 of 1 ✔
/Users/pditommaso/Projects/nextflow/work/fc/f44914e12b72b7b91240434bb02ba3/a/file.txt
/Users/pditommaso/Projects/nextflow/work/fc/f44914e12b72b7b91240434bb02ba3/c/file.txt
» tree work/fc/f44914e12b72b7b91240434bb02ba3/
work/fc/f44914e12b72b7b91240434bb02ba3/
├── a
│   └── file.txt
├── b
├── c
│   └── file.txt
├── d
│   └── file.txt
└── file.txt

OK

2. with scratch

» nextflow run test1.nf -process.scratch true
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test1.nf` [ecstatic_laplace] - revision: de0bfb4190
executor >  local (1)
[95/ca7161] process > a [100%] 1 of 1 ✔
/Users/pditommaso/Projects/nextflow/work/95/ca7161ac3fc352037e7820c41aaeb2/a/file.txt
/Users/pditommaso/Projects/nextflow/work/95/ca7161ac3fc352037e7820c41aaeb2/c/file.txt
» tree work/95/ca7161ac3fc352037e7820c41aaeb2/
work/95/ca7161ac3fc352037e7820c41aaeb2/
├── a
│   └── file.txt
└── c
    └── file.txt

OK

Test case 2

1. plain

» nextflow run test2.nf 
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test2.nf` [festering_nightingale] - revision: 53280a4646
executor >  local (1)
[1a/64d051] process > a [100%] 1 of 1 ✔
/Users/pditommaso/Projects/nextflow/work/1a/64d05110333ef2f1f5a593a3d3db31/a/b/c/file.txt
/Users/pditommaso/Projects/nextflow/work/1a/64d05110333ef2f1f5a593a3d3db31/a/b/c/file2.txt
» tree work/1a/64d05110333ef2f1f5a593a3d3db31/
work/1a/64d05110333ef2f1f5a593a3d3db31/
├── a
│   ├── b
│   │   ├── c
│   │   │   ├── file.txt
│   │   │   └── file2.txt
│   │   ├── d
│   │   │   └── c
│   │   │       └── file2.txt
│   │   └── file.txt
│   └── file.txt
└── file.txt

2. with scratch

» nextflow run test2.nf -process.scratch true
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test2.nf` [exotic_hodgkin] - revision: 53280a4646
executor >  local (1)
[9e/6e68b1] process > a [100%] 1 of 1 ✔
/Users/pditommaso/Projects/nextflow/work/9e/6e68b11cdd7de70f51c42756ed3539/a/b/c/file.txt
/Users/pditommaso/Projects/nextflow/work/9e/6e68b11cdd7de70f51c42756ed3539/a/b/c/file2.txt
» tree work/9e/6e68b11cdd7de70f51c42756ed3539/
work/9e/6e68b11cdd7de70f51c42756ed3539/
└── a
    └── b
        └── c
            ├── file.txt
            └── file2.txt

OK

Testcase 3

1. plain

» nextflow run test3.nf 
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test3.nf` [desperate_engelbart] - revision: c4b865cd00
executor >  local (1)
[ba/8774e7] process > a [100%] 1 of 1 ✔
/Users/pditommaso/Projects/nextflow/work/ba/8774e753963eb301a3e4002dafcaf8/a/dir1/file.txt
/Users/pditommaso/Projects/nextflow/work/ba/8774e753963eb301a3e4002dafcaf8/a/dir2/file.txt
» tree work/ba/
work/ba/
└── 8774e753963eb301a3e4002dafcaf8
    └── a
        ├── dir1
        │   ├── a
        │   │   └── file.txt
        │   └── file.txt
        └── dir2
            └── file.txt

OK

2. with scracth

» nextflow run test3.nf -process.scratch 
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test3.nf` [sick_woese] - revision: c4b865cd00
executor >  local (1)
[f3/3845c4] process > a [100%] 1 of 1 ✔
/Users/pditommaso/Projects/nextflow/work/f3/3845c4ff5e6d38a3b2d1833af4eda5/a/dir1/file.txt
/Users/pditommaso/Projects/nextflow/work/f3/3845c4ff5e6d38a3b2d1833af4eda5/a/dir2/file.txt
» tree work/f3/3845c4ff5e6d38a3b2d1833af4eda5/
work/f3/3845c4ff5e6d38a3b2d1833af4eda5/
└── a
    ├── dir1
    │   └── file.txt
    └── dir2
        └── file.txt

Testcase 4

1. plain

» nextflow run test4.nf 
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test4.nf` [infallible_swirles] - revision: 51ac8d5bd7
executor >  local (1)
[96/9d50bf] process > a [100%] 1 of 1 ✔
/Users/pditommaso/Projects/nextflow/work/96/9d50bfd89e8cb6847de9efeb6fa082/a/dir/file.txt
/Users/pditommaso/Projects/nextflow/work/96/9d50bfd89e8cb6847de9efeb6fa082/a/dir1/a/b/file.txt
/Users/pditommaso/Projects/nextflow/work/96/9d50bfd89e8cb6847de9efeb6fa082/a/dir1/a/file.txt
/Users/pditommaso/Projects/nextflow/work/96/9d50bfd89e8cb6847de9efeb6fa082/a/dir1/file.txt
/Users/pditommaso/Projects/nextflow/work/96/9d50bfd89e8cb6847de9efeb6fa082/a/dir2/file.txt
» tree work/
work/
└── 96
    └── 9d50bfd89e8cb6847de9efeb6fa082
        └── a
            ├── dir
            │   └── file.txt
            ├── dir1
            │   ├── a
            │   │   ├── b
            │   │   │   └── file.txt
            │   │   └── file.txt
            │   └── file.txt
            └── dir2
                └── file.txt

2. with scratch

» nextflow run test4.nf -process.scratch
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test4.nf` [mighty_morse] - revision: 51ac8d5bd7
executor >  local (1)
[96/6b2f7c] process > a [100%] 1 of 1 ✔
/Users/pditommaso/Projects/nextflow/work/96/6b2f7c733715596db4f3592b6e3aa7/a/dir/file.txt
/Users/pditommaso/Projects/nextflow/work/96/6b2f7c733715596db4f3592b6e3aa7/a/dir1/a/b/file.txt
/Users/pditommaso/Projects/nextflow/work/96/6b2f7c733715596db4f3592b6e3aa7/a/dir1/a/file.txt
/Users/pditommaso/Projects/nextflow/work/96/6b2f7c733715596db4f3592b6e3aa7/a/dir1/file.txt
/Users/pditommaso/Projects/nextflow/work/96/6b2f7c733715596db4f3592b6e3aa7/a/dir2/file.txt
» tree work/96/6b2f7c733715596db4f3592b6e3aa7/
work/96/6b2f7c733715596db4f3592b6e3aa7/
└── a
    ├── dir
    │   └── file.txt
    ├── dir1
    │   ├── a
    │   │   ├── b
    │   │   │   └── file.txt
    │   │   └── file.txt
    │   └── file.txt
    └── dir2
        └── file.txt

Also when using ** it looks consistent, am I correct?

@pditommaso
Copy link
Member

pditommaso commented Oct 9, 2021

Regarding glob:false, I've created PR implementing a variation of your suggestion (ie. using custom class to hold the glob flag), tho it was more tricky than expected. You can find it here #2379

@Lehmann-Fabian
Copy link
Contributor Author

Ahh, somehow, Nextflow did not execute the latest version, even when I downloaded it.
Anyways, I built your new branch running three new test cases I took from the tests I implemented to validate my solution.
There are still some edge cases working inconsistently. Moreover, using ** copies more data than necessary, reducing the performance.

Testcase 1

process a {

    output:
    path "**.txt" into found

    """
    mkdir -p a
    touch file.txt
    touch file1.txt
    touch a/file.txt
    touch a/file.text
    ls
    """

}
found.flatten().view()

1.

$ ./nextflow-paolofix run test1.nf
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test1.nf` [zen_babbage] - revision: ade87762e2
executor >  local (1)
[0a/a3f279] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/0a/a3f279f8f8349d7f656fd12a2c7080/a/file.txt
/home/fabian/extendedGlob/work/0a/a3f279f8f8349d7f656fd12a2c7080/file.txt
/home/fabian/extendedGlob/work/0a/a3f279f8f8349d7f656fd12a2c7080/file1.txt

Files in workdir:

work/0a/a3f279f8f8349d7f656fd12a2c7080/
├── a
│   ├── file.text
│   └── file.txt
├── file1.txt
└── file.txt

1 directory, 4 files

2.

$ ./nextflow-paolofix run test1.nf -process.scratch true
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test1.nf` [admiring_goodall] - revision: ade87762e2
executor >  local (1)
[04/ce6b73] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/04/ce6b73d231f809cab339d6f8e23c82/a/file.txt
/home/fabian/extendedGlob/work/04/ce6b73d231f809cab339d6f8e23c82/file.txt
/home/fabian/extendedGlob/work/04/ce6b73d231f809cab339d6f8e23c82/file1.txt

Files in workdir:

work/04/ce6b73d231f809cab339d6f8e23c82/
├── a
│   ├── file.text
│   └── file.txt
├── file1.txt
└── file.txt

1 directory, 4 files

Here to much data is copied. This could be problematic, as it could be dozens of gigabytes of temporary data. My solution only copies the required files:

3.

$ ./nextflow-fix run test1.nf -process.scratch true
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test1.nf` [magical_jepsen] - revision: ade87762e2
executor >  local (1)
[e6/16ed93] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/e6/16ed93e155f0542837f2cb6f5c9c1f/a/file.txt
/home/fabian/extendedGlob/work/e6/16ed93e155f0542837f2cb6f5c9c1f/file.txt
/home/fabian/extendedGlob/work/e6/16ed93e155f0542837f2cb6f5c9c1f/file1.txt

Files in workdir:

work/e6/16ed93e155f0542837f2cb6f5c9c1f/
├── a
│   └── file.txt
├── file1.txt
└── file.txt

1 directory, 3 files

Testcase 2

process a {

    output:
    path "abc?\\?.txt" into found

    """
    touch abcde.txt
    touch abcd.txt
    touch abcd\\?.txt
    ls
    """

}
found.flatten().view()

1.

$ ./nextflow-paolofix run test2.nf
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test2.nf` [gigantic_solvay] - revision: 7c108fd5e5
executor >  local (1)
[1a/3434ad] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/1a/3434ad29ebbd745f0c2f89077ecc4a/abcd?.txt

Files in workdir:

work/1a/3434ad29ebbd745f0c2f89077ecc4a/
├── abcde.txt
├── abcd.txt
└── abcd?.txt

0 directories, 3 files

2.

$ ./nextflow-paolofix run test2.nf -process.scratch true
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test2.nf` [reverent_mccarthy] - revision: 7c108fd5e5
executor >  local (1)
[81/90514a] process > a [100%] 1 of 1, failed: 1 ✘
Error executing process > 'a'

Caused by:
  Missing output file(s) `abc?\?.txt` expected by process `a`

Command executed:

  touch abcde.txt
  touch abcd.txt
  touch abcd\?.txt
  ls

Command exit status:
  0

Command output:
  abcde.txt
  abcd.txt
  abcd?.txt

Work dir:
  /home/fabian/extendedGlob/work/81/90514ab4435d5dfd4430bdb62b7586

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line



Files in workdir:

work/81/90514ab4435d5dfd4430bdb62b7586/
└── abcde.txt

0 directories, 1 file

In this case the wrong file is copied, while my solution finds it:

3.

$ ./nextflow-fix run test2.nf -process.scratch true
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test2.nf` [curious_sanger] - revision: 7c108fd5e5
executor >  local (1)
[b7/6eb31f] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/b7/6eb31f6072103d251e569c193ee8dd/abcd?.txt

Files in workdir:

work/b7/6eb31f6072103d251e569c193ee8dd/
└── abcd?.txt

0 directories, 1 file

Testcase 3

process a {

    output:
    path 'a"b.txt' into found

    """
    touch a\\"b.txt
    touch abc.txt
    ls
    """

}
found.flatten().view()

1.

$ ./nextflow-paolofix run test3.nf
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test3.nf` [golden_euclid] - revision: cbe0fca9e4
executor >  local (1)
[f1/490bb6] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/f1/490bb6facbaf18bdd031210265e785/a"b.txt

Files in workdir:

work/f1/490bb6facbaf18bdd031210265e785/
├── abc.txt
└── a"b.txt

0 directories, 2 files

2.

$ ./nextflow-paolofix run test3.nf -process.scratch true
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test3.nf` [irreverent_kilby] - revision: cbe0fca9e4
executor >  local (1)
[19/c0c91f] process > a [100%] 1 of 1, failed: 1 ✘
Error executing process > 'a'

Caused by:
  Missing output file(s) `a"b.txt` expected by process `a`

Command executed:

  touch a\"b.txt
  touch abc.txt
  ls

Command exit status:
  0

Command output:
  abc.txt
  a"b.txt

Work dir:
  /home/fabian/extendedGlob/work/19/c0c91feda1ad1d3a94ac8137bc0025

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`



Files in workdir:

work/19/c0c91feda1ad1d3a94ac8137bc0025/

0 directories, 0 files

Again, the file is not copied and accordingly the pipeline crashed. My solution worked well:

3.

$ ./nextflow-fix run test3.nf -process.scratch true
N E X T F L O W  ~  version 21.09.0-SNAPSHOT
Launching `test3.nf` [backstabbing_wilson] - revision: cbe0fca9e4
executor >  local (1)
[9d/7495cf] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/9d/7495cfb7be01ccf5e649406861f835/a"b.txt

Files in workdir:

work/9d/7495cfb7be01ccf5e649406861f835/
└── a"b.txt

0 directories, 1 file

@pditommaso pditommaso force-pushed the master branch 2 times, most recently from 6564123 to c78965b Compare October 11, 2021 15:30
@pditommaso
Copy link
Member

regarding the first, should it traverse dirs? I'm getting this

bash-5.1#     mkdir -p a
    touch file.txt
    touch file1.txt
    touch a/file.txt
    touch a/file.text
bash-5.1# shopt -s globstar
bash-5.1# ls **.txt
file.txt   file1.txt

@Lehmann-Fabian
Copy link
Contributor Author

Yes, the first run for the first test case is without -process.scratch true, and Nextflow finds the file. Accordingly, my implementation should do as well.
My implementation is entirely consistent with Nextflow's one without using scratch dirs. I put such a focus on it since it took me a lot of time to find out why one of my workflows behaved differently on K8s than locally.
Unix globstar does not interpret ** like Java's definition. Therefore, I added the following replacement:


With that change globstar always traverses through all levels.

@pditommaso pditommaso force-pushed the master branch 2 times, most recently from acb53ee to 2c6665e Compare November 20, 2021 17:52
@Lehmann-Fabian
Copy link
Contributor Author

Hi @pditommaso. As the PR is still open, I wanted to ask you how to proceed.
In PR #2035 you mentioned, that you could imagine implementing this as an optional option: copy-allow-globs. Therefore I would leave the SimpleFileCopyStrategy.groovy as it was and extend it as GlobFileCopyStrategy with the proposed changes using this class only if copy-allow-globs is set to true. So no side effects could occur. But the user has an opportunity to overcome the previously mentioned inconsistencies and increase efficiency by reducing the amount of copied data.

Otherwise, we can close the PR.

@pditommaso
Copy link
Member

I think there are at least two issues here: 1) the correct handling of glob:false and 2) the optimal handling of extended globs.

  1. I think it's preferable going for the PR I made Fix non glob escape #2379
  2. The use of new unstage copy mode could be a good solution, tho don't think introducing a new GlobFileCopyStrategy can help because it cannot be instantiated depending the unstage mode needed by the task

@Lehmann-Fabian
Copy link
Contributor Author

  1. I agree.

  2. Not sure what you mean.
    Should I bring the logic I proposed here into the Fix non glob escape #2379 PR?
    a) Using it always if glob: true
    b) Using it only for outputPathes containing ** -> would still have problems with Testcase 2 and 3 here: Fix extended glob #2182 (comment)
    c) Introducing a new mode moveGlob/copyGlob, which would behave like the move/copy if I do not use -process.scratch true

@pditommaso
Copy link
Member

Good point. Yes, I think the support for extended glob should be done as PR forked from #2379.

Regarding the second point ideally, it should be the default behaviour i.e. case a) in your comment. Though, I don't have a clean idea of the impact of this.

Lehmann-Fabian added a commit to Lehmann-Fabian/nextflow that referenced this pull request Jan 11, 2022
@Lehmann-Fabian
Copy link
Contributor Author

Okay, I created PR #2551 to merge the logic into PR #2379.
Let us continue to discuss it in the new PR.
So, can we close this PR here?

@pditommaso
Copy link
Member

Ok, thanks

@pditommaso pditommaso closed this Jan 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants