Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unicode #2979

Closed
7 tasks done
oliver-sanders opened this issue Mar 7, 2019 · 13 comments
Closed
7 tasks done

unicode #2979

oliver-sanders opened this issue Mar 7, 2019 · 13 comments
Milestone

Comments

@oliver-sanders
Copy link
Member

oliver-sanders commented Mar 7, 2019

Follow on from #2966 to cover unicode support in Cylc.

What unicode should Cylc support and where?

*native: Python3 defaults to the OS "default encoding" for file opens which it typically UTF-8, 3.7+ has a UTF_8 flag which overrides this. Consequently these came for free with the Python3 upgrade.

@hjoliver
Copy link
Member

hjoliver commented Aug 6, 2019

suite names

@oliver-sanders - suite names would be a good one to nail down right away. They are currently completely unrestricted, as you've noted.

Motivation: @dwsutherland needs a delimiter for these IDs in the code:
https://github.com/cylc/cylc-flow/pull/3202/files#diff-3426a12a1f378e02eac6fb12a9610a23R102
He's currently using / but hierarchical suite names are going to break that.

The minimal set of Unicode characters is probably alphanumeric plus dash, underscore, forward-slash:

'^[\w\-_/]+$'

Then @dwsutherland could use % (say) as a delimiter in the code.

Is there any compelling reason to allow more than this in suite names?

@matthewrmshin
Copy link
Contributor

The only other characters that we may want to support are .+@.

@oliver-sanders
Copy link
Member Author

oliver-sanders commented Aug 6, 2019

OK, I've quickly run some characters from different unicode tables against Python re.

The regex:

  • ^[\w_\.+@-]+$

Changes we might want to make:

  • Include =, %
  • Prohibit - and . from being the first character

Conclusions:

  • Text and number characters seem to be well matched.
  • Punctuation characters are not matched.
  • Emoji are not word characters.

I think we are good to go with this Regex?

Matching Chars

basic latin

abc
ABC
0123

explictly supported special chars

_.+@-

latin supplement

àðØ

latin extended

ĐĵŌ

IPA extensions

ɐɶʍ

greek and coptic

ΘχϢ

anchient greek numbers

𐅅𐅉𐅌

cryllic

ФШѸ

armenian

ԱՔփ

Non-Matching Chars

emoji


😅
🐼

special chars


!

/

~
:
|

<

@oliver-sanders
Copy link
Member Author

Motivation: @dwsutherland needs a delimiter for these IDs in the code:
https://github.com/cylc/cylc-flow/pull/3202/files#diff-3426a12a1f378e02eac6fb12a9610a23R102

How about :, it is valid in Unix filenames, but obviously not a sensible choice.

@matthewrmshin
Copy link
Contributor

The colon : is used in PATH-like variables, so no good for anything else.

@oliver-sanders
Copy link
Member Author

Exactly!

@sadielbartholomew
Copy link
Collaborator

sadielbartholomew commented Aug 6, 2019

See also #3117 where I investigated & documented task & (to some extent) suite names that work normally, which is now in the docs here & here.

I suggest we have the same limitations on suite name as on task name, as documented above, as largely in agreement with the comments above.

Note there is also a maximum length that suite names can safely be, given the OS restriction on file names length. It might sound a bit over the top to point this out, but I have seen some exceedingly long suite names across the MO. Can we validate on length too for extra care?

@oliver-sanders
Copy link
Member Author

Can we validate on length too for extra care?

We can check for number of characters easily enough (can even do it in a regex), unfortunately most file-system file-name limits are in bytes.

https://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits

We could just slap on a 255 character limit anyway.

@oliver-sanders
Copy link
Member Author

oliver-sanders commented Aug 6, 2019

See open questions on #3274

@dwsutherland
Copy link
Member

Could we use | as the external ID delimiter?

@dwsutherland
Copy link
Member

Note - The other reason why I was originally vying for / or . was to be consistent with CLI task/job args. But as the external ID has to contain suite and owner/user info, this issue has come about.

@hjoliver
Copy link
Member

hjoliver commented Aug 7, 2019

Could we use | as the external ID delimiter?

I suppose we could (but it looks annoyingly like pipe or OR to me). I think we can go with the possibly-temporary solution and make a final decision once allowed suite-name chars are defined (within a day or so, I expect).

@oliver-sanders
Copy link
Member Author

All items addressed, closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants