Some maintenance work #77

lyrixx · 2022-05-30T14:58:03Z

Allow org_heigl/hyphenator in ^3.0
Simplify composer.json 'scripts' section
Drop support for Symfony < 4.4 + Add support for ^6.0
Run tests on PHP 8.1

lyrixx · 2022-05-30T14:58:57Z

there is a failure on PHP 8.1, but I don't know how to fix it:

1) JoliTypo\Tests\EnglishTest::testFixFullText
Failed asserting that two strings are equal.
--- Expected
+++ Actual
@@ @@
 <h3>Pronun&shy;ci&shy;ation</h3>\n
 \n
 <p>A humor&shy;ous image announ&shy;cing the launch of a White House Tumblr suggests pronoun&shy;cing GIF with a hard &ldquo;G&rdquo;.</p>\n
-<p>The creat&shy;ors of the format pronounced GIF as &ldquo;Jif&rdquo; with a soft &ldquo;G&rdquo; /&#712;d&#658;&#618;f/ as in &ldquo;gin&rdquo;.</p>\n
-<p>An altern&shy;at&shy;ive pronun&shy;ci&shy;ation with a hard &ldquo;G&rdquo; /&#712;&#609;&#618;f/ as in &ldquo;graph&shy;ics&rdquo;, reflect&shy;ing the expan&shy;ded acronym, is in wide&shy;spread usage.</p>\n
+<p>The creat&shy;ors of the format pronounced GIF as &ldquo;Jif&rdquo; with a soft &ldquo;G&rdquo; /&Euml;&circ;d&Ecirc;&rsquo;&Eacute;&ordf;f/ as in &ldquo;gin&rdquo;.</p>\n
+<p>An altern&shy;at&shy;ive pronun&shy;ci&shy;ation with a hard &ldquo;G&rdquo; /&Euml;&circ;&Eacute;&iexcl;&Eacute;&ordf;f/ as in &ldquo;graph&shy;ics&rdquo;, reflect&shy;ing the expan&shy;ded acronym, is in wide&shy;spread usage.</p>\n
 <p>Both pronun&shy;ci&shy;ations are acknow&shy;ledged by the [&hellip;] Merriam-Webster&rsquo;s Collegi&shy;ate Diction&shy;ary.</p>\n
 \n
 <p>We also have &ldquo;<span>HTML in quote</span>&rdquo; to fix&hellip;</p>'

HedicGuibert · 2022-05-31T16:35:54Z

The line causing the bug is this one :
https://github.com/jolicode/JoliTypo/blob/master/src/JoliTypo/Fixer.php#L311

For an unknown reason (to me at least), the behaviour of this function changed in PHP 8.1.
I tried running this code in both PHP 8.0 and PHP 8.1:

$tofix = "/ˈdʒɪf/";

echo(mb_detect_encoding($tofix) . PHP_EOL);

mb_detect_order('ASCII,UTF-8,ISO-8859-1,windows-1252,iso-8859-15');
echo(mb_detect_encoding($tofix));

In PHP 8.0 I get

UTF-8
UTF-8

In PHP 8.1 I get

UTF-8
Windows-1252

Sandbox showing the issue : https://onlinephp.io/c/5f836

HedicGuibert · 2022-06-01T07:25:13Z

So after doing some more research I found out that the function that has changed is not mb_detect_order but mb_detect_encoding. In this bug report we can see the following answer from Alex Dowad :

Question :

I'm guessing that in earlier versions there was no such thing as demerits (or the algo was even more different), so the provided priority list had more impact on the result. Is that right?

Answer :

Yep. The earlier versions of mb_detect_encoding only checked which encodings the input string was valid in, and picked the first one on the list. If the string was valid in more than one encoding, it did not do anything at all to try to figure out which one was most likely.

So now mb_detect_encoding just returns the most likely encoding and doesn't care about the order they were given. This is confirmed by this comment : php/php-src#8279 (comment)

Now it also applies heuristics to detect which of the valid text encodings in the specified list (if there are more than one) is most likely to be correct.

This is not what we want to do in this library. We do not want to get the most likely encoding, we want to validate that the text we are given is valid in some encodings, with some preferences.
Therefore I suggest we test the encodings one by one, probably in a loop as suggested in the comment above. This would restore our previous behaviour.
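
A minimal sketch of that "test the encodings one by one" idea, assuming the same candidate list as in the snippet earlier in this thread. The helper name firstValidEncoding is hypothetical and is not the code that was merged; it only relies on mb_check_encoding, which reports whether a string is valid in a given encoding.

```php
<?php
// Hypothetical helper, not the code merged in this PR: returns the first
// encoding from $candidates in which $text is valid, or null if none match.
function firstValidEncoding(string $text, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        // mb_check_encoding only validates; it applies no likelihood heuristics.
        if (mb_check_encoding($text, $encoding)) {
            return $encoding;
        }
    }

    return null;
}

// Order expresses preference, as the old mb_detect_order call did.
$candidates = ['ASCII', 'UTF-8', 'ISO-8859-1', 'Windows-1252', 'ISO-8859-15'];

// Assumes this file is saved as UTF-8.
echo firstValidEncoding("/ˈdʒɪf/", $candidates) ?? 'none', PHP_EOL; // UTF-8
echo firstValidEncoding('plain text', $candidates) ?? 'none', PHP_EOL; // ASCII
```

Because the loop stops at the first valid candidate, the list order decides the result again, which is the pre-8.1 behaviour described in the quoted answer.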

HedicGuibert · 2022-06-04T08:45:40Z

I tried fixing it :

fix a test that was not working anymore (now it matches the other similar tests).
remove all the now useless mb_detect_order.
found a workaround for mb_detect_encoding to detect UTF-8 in priority but it is now returning UTF-8 all the time. This is what we had before and it seems to be ok.
removed encoding to HTML entities. Doing it with mb_convert_encoding is deprecated in PHP 8.2 (php/php-src@9308974). We actually don't want to decode things like &lt;3 to <3 because then libxml sees an open HTML tag. It was working before because we took advantage of the weird behaviour of mb_detect_encoding with some HTML entities (more info on this in the commit message above).
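
To illustrate the entity-decoding concern in the last item, here is a small standalone example (the sample string is made up for illustration). It shows how html_entity_decode turns an escaped less-than sign back into a raw "<", which an HTML parser could then treat as the start of a tag.

```php
<?php
// Illustration only: decoding entities turns escaped text into raw markup characters.
$input = 'I &lt;3 typography';

$decoded = html_entity_decode($input, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, PHP_EOL; // I <3 typography
```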

lyrixx · 2022-06-04T09:28:46Z

Awesome work!

I'll take care of the deprecation if you want.
Or, if you'd rather do it yourself, you need to add: https://github.com/symfony/recipes/blob/main/symfony/framework-bundle/5.3/config/packages/framework.yaml#L5

damienalexandre · 2022-06-04T09:44:03Z

tests/JoliTypo/Tests/Html5Test.php

@@ -46,7 +46,7 @@ public function testFullPageMarkup()
            HTML;

        $fixed = <<<'STRING'
-            &#8220;Who Let the Dogs Out?&#8221; is a song written and originally recorded by Anslem Douglas (titled &#8220;Doggie&#8221;).
+            “Who Let the Dogs Out?” is a song written and originally recorded by Anslem Douglas (titled “Doggie”).


Why did the behaviour change here?! I don't get it :/

I don't know, encoding seems to have had numerous changes in PHP 8.1.
However, this is what we expected to get in similar tests :
https://github.com/jolicode/JoliTypo/blob/master/tests/JoliTypo/Tests/Fixer/EnglishQuotesTest.php#L42-L45
https://github.com/jolicode/JoliTypo/blob/master/tests/JoliTypo/Tests/JoliTypoTest.php#L175

The new version looks good to me 👍🏼

HedicGuibert · 2022-06-04T10:32:33Z

src/JoliTypo/Fixer.php

@@ -356,7 +361,9 @@ private function fixContentEncoding($content)
                    mb_substr($content, $headPos);
            }

-            $content = mb_convert_encoding($content, 'HTML-ENTITIES', $encoding);
+            if ('UTF-8' !== $encoding) {
+                $content = mb_convert_encoding($content, 'UTF-8', $encoding);


Actually I'm not sure about this

Yes it should be HTML-ENTITIES

And there is no need to test if it's UTF-8.

Using mb_convert_encoding to convert to HTML entities is deprecated in PHP 8.2 and did not function well previously : php/php-src@9308974

And we don't want to use html_entity_decode because it will break the fixer if the user passes something like 1 > 3 or <3.

My concern was more about the fact that we set the charset to $encoding but then we encode the content to UTF-8. This is weird. Either we don't set the charset or we don't convert IMO.
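
For reference, a minimal sketch of the conditional conversion shown in the diff above. The helper name toUtf8 is hypothetical and this is not the merged code; it only demonstrates what mb_convert_encoding does when the detected encoding is not already UTF-8.

```php
<?php
// Hypothetical helper mirroring the idea in the diff above; not the merged code.
function toUtf8(string $content, string $encoding): string
{
    if ('UTF-8' === $encoding) {
        return $content; // already UTF-8, nothing to convert
    }

    return mb_convert_encoding($content, 'UTF-8', $encoding);
}

$latin1 = "caf\xE9"; // "café" encoded as ISO-8859-1

var_dump(toUtf8($latin1, 'ISO-8859-1') === "caf\xC3\xA9"); // bool(true)
var_dump(toUtf8("caf\xC3\xA9", 'UTF-8') === "caf\xC3\xA9"); // bool(true)
```

After such a conversion the content is UTF-8 regardless of the input, which is why declaring the original $encoding as the charset alongside it looks inconsistent.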

damienalexandre · 2022-11-20T21:15:55Z

Thank you both 👍

lyrixx added 4 commits May 30, 2022 16:48

Allow org_heigl/hyphenator in ^3.0

89f27e5

Simplify composer.json 'scripts' section

4a456a7

Drop support for Symfony < 4.4 + Add support for ^6.0

aea4cac

Run tests on PHP 8.1

a3269c8

Fix encoding issues

4b458ba

HedicGuibert force-pushed the php8.1 branch from a128993 to 4b458ba Compare June 4, 2022 09:03

damienalexandre reviewed Jun 4, 2022

disable http_method_override

5eaf040

HedicGuibert reviewed Jun 4, 2022

Use NULL as default encoding if no detected match + CS

8c00c9d

damienalexandre merged commit 9d0b3cc into master Nov 20, 2022

damienalexandre deleted the php8.1 branch November 20, 2022 21:15
