normalize method can't handle URLs with punycoded TLD #28

walro · 2015-04-28T11:23:02Z

[12] pry(main)> Twingly::URL::Normalizer.normalize("http://xn--80aesdcplhhhb0k.xn--p1ai/")
=> []

Page loads fine in Chrome though, whois works fine too.

walro · 2015-04-28T11:43:15Z

It was pointed out that the TLD is really funky here.

dentarg · 2015-04-28T11:49:25Z

Strange, xn--p1ai (or рф) is in the public suffix list

I tested too, with public_suffix 1.5.1, got the same as above.

I don't really know how public_suffix works, whether the list entry above should be enough to support xn--p1ai or whether they are missing something.

Could it be an encoding issue?

walro · 2015-04-28T12:55:21Z

> Could it be an encoding issue?

Yeah, to get it working (with public-suffix) we need to go from punycode back to utf-8:

irb(main):005:0> PublicSuffix.valid?("xn--80aesdcplhhhb0k.xn--p1ai")
=> false
irb(main):006:0> PublicSuffix.valid?("domain.рф")
=> true

Public suffix won't add support for it: weppos/publicsuffix-ruby#24

We could use https://github.com/mmriis/simpleidn (I found other, even less maintained, alternatives too) to do this ourselves.
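For reference, here is a rough sketch of what the punycode-to-Unicode step involves, written in plain Ruby so it runs without extra gems. It follows the decoding procedure from RFC 3492. `PunycodeSketch` and its method names are made up for illustration and are not part of simpleidn or public_suffix; a maintained library would be the better choice in practice.

```ruby
# Illustration only: a minimal Punycode decoder following RFC 3492.
# PunycodeSketch is a hypothetical name, not an existing library.
module PunycodeSketch
  BASE = 36
  TMIN = 1
  TMAX = 26
  SKEW = 38
  DAMP = 700
  INITIAL_BIAS = 72
  INITIAL_N = 128
  ACE_PREFIX = "xn--"

  module_function

  # Bias adaptation function from RFC 3492, section 6.1.
  def adapt(delta, num_points, first_time)
    delta = first_time ? delta / DAMP : delta / 2
    delta += delta / num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) / 2
      delta /= BASE - TMIN
      k += BASE
    end
    k + ((BASE - TMIN + 1) * delta) / (delta + SKEW)
  end

  # Maps a-z to 0..25 and 0-9 to 26..35.
  def digit_value(char)
    case char
    when "a".."z" then char.ord - "a".ord
    when "A".."Z" then char.ord - "A".ord
    when "0".."9" then char.ord - "0".ord + 26
    else raise ArgumentError, "invalid punycode digit: #{char.inspect}"
    end
  end

  # Decodes one label with the "xn--" prefix already removed.
  def decode_label(encoded)
    split_at = encoded.rindex("-")
    # Everything before the last "-" is copied through as literal ASCII.
    output = split_at ? encoded[0...split_at].codepoints : []
    digits = (split_at ? encoded[(split_at + 1)..-1] : encoded).chars

    n = INITIAL_N
    i = 0
    bias = INITIAL_BIAS
    pos = 0
    while pos < digits.length
      old_i = i
      w = 1
      k = BASE
      loop do
        raise ArgumentError, "truncated punycode input" if pos >= digits.length
        digit = digit_value(digits[pos])
        pos += 1
        i += digit * w
        t = (k - bias).clamp(TMIN, TMAX)
        break if digit < t
        w *= BASE - t
        k += BASE
      end
      bias = adapt(i - old_i, output.length + 1, old_i.zero?)
      n += i / (output.length + 1)
      i %= output.length + 1
      output.insert(i, n)
      i += 1
    end
    output.pack("U*")
  end

  # Decodes every "xn--" label of a host name and leaves the others untouched.
  def to_unicode(host)
    host.split(".").map do |label|
      if label.downcase.start_with?(ACE_PREFIX)
        decode_label(label[ACE_PREFIX.length..-1])
      else
        label
      end
    end.join(".")
  end
end

tld = PunycodeSketch.to_unicode("xn--p1ai") # => "рф"
```

The decoded host could then be passed to `PublicSuffix.valid?`, which accepted the Unicode form in the session above.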

jage · 2015-05-12T12:07:50Z

We should analyze our data to see how many punycode TLDs we have.
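A minimal sketch of such a count, assuming the URLs are available as an array of strings. The sample `urls` list and the `punycoded_tld?` helper are hypothetical; the real input would be our own data.

```ruby
require "uri"

# Hypothetical sample data, for illustration only.
urls = [
  "http://xn--80aesdcplhhhb0k.xn--p1ai/",
  "http://example.com/",
  "http://xn--bcher-kva.example/", # punycoded label, but not in the TLD
]

# True when the last label of the host starts with the "xn--" ACE prefix.
def punycoded_tld?(url)
  tld = URI.parse(url).host.to_s.split(".").last
  !tld.nil? && tld.downcase.start_with?("xn--")
rescue URI::InvalidURIError
  false
end

punycoded_count = urls.count { |url| punycoded_tld?(url) }
```

Only the first sample URL is counted, since the check looks at the last host label and ignores punycoded labels elsewhere in the host.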

walro added the bug label Apr 28, 2015

walro changed the title ~~normalize method can't handle certain IDNs~~ normalize method can't handle Punycoded urls May 6, 2015

walro added the critical label May 6, 2015

jage removed the critical label May 11, 2015

dentarg changed the title ~~normalize method can't handle Punycoded urls~~ normalize method can't handle URLs with punycoded TLD May 12, 2015

roback mentioned this issue Sep 9, 2015

Sync known behaviour with .NET #37

Merged

roback closed this as completed in d39a959 Sep 9, 2015
