[[:punct:]] and \p{Punct} #42

k-takata · 2014-08-09T00:58:25Z

Perl's documentation (perlrecharclass) says:

\p{PosixPunct} and [[:punct:]] in the ASCII range match all non-controls, non-alphanumeric, non-space characters: [-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]

The similarly named property, \p{Punct}, matches a somewhat different set in the ASCII range, namely [-!"#%&'()*,./:;?@[\\\]_{}]. That is, it is missing the nine characters [$+<=>^`|~].
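The gap between the two sets lines up with Unicode general categories: the nine characters are classified as symbols (S*), not punctuation (P*). A minimal Python sketch to check this, assuming that `string.punctuation` is a fair stand-in for the ASCII [[:punct:]] set and that \p{Punct} means "general category P*" (variable names are illustrative only):

```python
import string
import unicodedata

# The 32 ASCII punctuation characters in the POSIX sense ([[:punct:]]).
posix_punct = set(string.punctuation)

# ASCII characters whose Unicode general category is P* (Punctuation).
unicode_punct = {
    c for c in posix_punct if unicodedata.category(c).startswith("P")
}

# Whatever is left over is what \p{Punct} misses in the ASCII range.
symbols = posix_punct - unicode_punct

# Print each leftover character with its general category (Sc, Sk or Sm).
for c in sorted(symbols):
    print(repr(c), unicodedata.category(c))
```

This only inspects Python's copy of the Unicode character database; it does not exercise Onigmo or Perl.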

In current Onigmo, [[:punct:]] and \p{Punct} are the same in the ASCII range, and their behavior depends on the encoding.
If the encoding is a Unicode encoding, [[:punct:]] and \p{Punct} don't match the nine characters.
If the encoding is not a Unicode encoding, [[:punct:]] and \p{Punct} match the nine characters.

Is it OK?


tom-lord · 2015-02-24T11:31:15Z

I think this is wrong; both Unicode and non-Unicode should match the nine characters.

http://search.cpan.org/~shay/perl-5.20.2/pod/perlreref.pod

I believe the difference should actually be that under Unicode encoding, [[:punct:]] should additionally match non-ASCII punctuation. The "symbols" ($+<=>^|~`) should always be matched.

Now /(?u)[[:punct:]]/ and /\p{XPosixPunct}/ have the same meaning when Unicode encodings are used. On the other hand, /\p{Punct}/ is not changed.

/(?u)[[:punct:]]/ == /\p{XPosixPunct}/ == /[\p{Punct}$+<=>^`|~]/

\p{XPosixPunct} can be used only with Unicode encodings. For other encodings, /[[:punct:]]/ is the same as /\p{Punct}/. They both include the nine characters: "$+<=>^`|~".

k-takata · 2016-10-19T12:38:25Z

I have decided to change the behavior of [[:punct:]] on Unicode encodings, and have already committed the change to the devel-6.0 branch.
Now [[:punct:]] matches the nine characters $+<=>^`|~ on all encodings.
The new property \p{XPosixPunct} can be used on Unicode encodings. It is the same as (?u)[[:punct:]].
However, \p{Punct} still behaves differently on Unicode and non-Unicode encodings. It matches the nine characters on non-Unicode encodings, and doesn't match them on Unicode encodings.
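The resulting behavior in the ASCII range can be summarized as a small lookup. The Python sketch below is only a model of the rules stated in this comment, not Onigmo code; the function `ascii_set` and its arguments are hypothetical names introduced for illustration, and `string.punctuation` is assumed to equal the ASCII [[:punct:]] set.

```python
import string
import unicodedata

# All 32 ASCII punctuation characters, including the nine symbols.
ASCII_PUNCT = frozenset(string.punctuation)
NINE = frozenset("$+<=>^`|~")
# ASCII characters with Unicode general category P* (excludes the nine).
GENERAL_CATEGORY_P = frozenset(
    c for c in ASCII_PUNCT if unicodedata.category(c).startswith("P")
)


def ascii_set(construct: str, unicode_encoding: bool) -> frozenset:
    """Hypothetical model: ASCII characters matched after the change."""
    if construct == "[[:punct:]]":
        # Matches the nine characters on all encodings.
        return ASCII_PUNCT
    if construct == r"\p{XPosixPunct}":
        # Only available on Unicode encodings; same as (?u)[[:punct:]].
        if not unicode_encoding:
            raise ValueError("XPosixPunct requires a Unicode encoding")
        return ASCII_PUNCT
    if construct == r"\p{Punct}":
        # Unchanged: still depends on the encoding.
        return GENERAL_CATEGORY_P if unicode_encoding else ASCII_PUNCT
    raise ValueError(f"unknown construct: {construct}")


# Which of the nine characters does each construct match on a Unicode encoding?
for name in ("[[:punct:]]", r"\p{XPosixPunct}", r"\p{Punct}"):
    print(name, sorted(ascii_set(name, True) & NINE))
```

The model covers only ASCII; on Unicode encodings the real classes also match non-ASCII characters, which this sketch ignores.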

k-takata · 2016-10-19T12:38:51Z

Closing.

k-takata · 2016-12-01T16:25:43Z

Related: https://bugs.ruby-lang.org/issues/12577

k-takata added the spec label Aug 9, 2014

k-takata closed this as completed Oct 19, 2016
