Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for hostname filters in standard mode #181

Closed
mjethani opened this issue Jul 16, 2021 · 8 comments
Closed

Add support for hostname filters in standard mode #181

mjethani opened this issue Jul 16, 2021 · 8 comments

Comments

@mjethani
Copy link

There was an issue in EasyList recently that broke some ad blockers. Even though the report doesn't mention Brave, the issue should have affected Brave too. Meanwhile, uBlock Origin was barely affected because it interpreted the filter in question as a hostname.

Now this raises the question as to how such filters should be interpreted in general.

The Brave implementation has two modes: standard and hosts. In hosts mode, a hostname on a line by itself is interpreted as a hostname. In standard mode, the same hostname must be specified using Adblock Plus syntax. For example, example.com must be encoded as ||example.com^.

Should Brave adopt the uBlock Origin style in standard mode?

The crux of the issue is that Adblock Plus-style filters tend to overblock by default. A filter like ad.png would block not only the URL https://example.com/ad.png (intended) but also the URL https://example.com/ipad.png (not intended). Similarly, the filter example.com would block https://example.com/ad.png (intended) but would also block https://example.computer/ad.png (not intended).

Whereas filter developers are aware of such nuances, and indeed there are no such filters in EasyList et al, the average user who types a filter into a box (brave/brave-browser#8838) is unlikely to go look up the syntax, learn it, and then correctly specify the hostname example.com as ||example.com^. In other words, the Adblock Plus syntax is not exactly user-friendly. Based on the recent incident, it also appears to be a bit of a footgun. We don't know how many end users are experiencing broken web pages because they were savvy enough to write their own filters by hand.

The principle of least surprise dictates that if the pattern in a URL filter looks like a hostname, it should be interpreted as a hostname.

$ grep -Ei '^(@@)?[a-z0-9]([a-z0-9-]*[a-z0-9])?(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)*(\$.*)?$' easylist.txt easyprivacy.txt | wc -l
0

This new "shorthand" syntax for hostnames would break 0 existing filters in EasyList and EasyPrivacy.

$ grep -Ei '^(@@)?\|\|[a-z0-9]([a-z0-9-]*[a-z0-9])?(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)*\^(\$.*)?$' easylist.txt easyprivacy.txt | wc -l
31825

On the other hand, there are nearly 32,000 filters in EasyList and EasyPrivacy, out of a combined total of about 81,000 filters, that could be converted to use the new syntax at some point, reducing the size of the raw filter text by ~100 KiB.

Not only would the shorthand syntax be more user-friendly, it might also have bandwidth, memory, and disk usage benefits over time.

WDYT?

@mjethani
Copy link
Author

@gorhill CMIIW

uBlock Origin appears to literally translate example.com into ||example.com^. This means it still matches non-hostname parts of the URL. I feel like a hostname filter should be only about the hostname.

Maybe it doesn't matter in practice, but then it makes you wonder why the extended anchor (||) needs to exist at all.

@gorhill
Copy link

gorhill commented Jul 17, 2021

uBlock Origin appears to literally translate example.com into ||example.com^

Yes, the parsing makes it internally an equivalent of the ||...^ form. However I don't understand your "still matches non-hostname parts".

@mjethani
Copy link
Author

I don't understand your "still matches non-hostname parts"

The way it was implemented in Adblock Plus was that the || would match the regular expression ^[\w\-]+:\/+(?!\/)(?:[^\/]+\.)?. This meant that the filter ||example.com^ would match https://foo.example.com/, not just foo.example.com, in the URL https://foo.example.com/bar/. This would matter if there was some kind of substitution (rewriting of parts of the URL), or if there were possible URLs in which the protocol was followed directly by the hostname with no slashes in between, but since neither of these is the case then it doesn't matter.

But then what's the use of the ||?

Filters could be interpreted such that any hostname characters at the beginning of the pattern would automatically make up the hostname. example.com/foo/ could always be interpreted as ||example.com/foo/. uBlock Origin doesn't seem to do this at the moment. If the || is lost entirely it might save a lot of space.

@antonok-edm
Copy link
Collaborator

@mjethani I'm not convinced that is the right move; it's definitely a tradeoff either way.

The majority of filter list maintenance and adblock engine development is done by a handful of people who are very familiar with ABP syntax. There's already a fair amount of... unintuitive features in ABP syntax, and additional special-cased parsing logic causes more surprises for the experts, not less.

That breakage appears to have been a tiny mistake introduced with easylist/easylist@1e83bda. @ryanbr is solidly on the "pro" end of the filter list maintainer spectrum, and he probably knows even more than I do about how ABP syntax works!

I think it's more important that the mistakes are caught and fixed in a timely manner when they happen. If it weren't for the breakage you referenced, that media filter and possibly others could hang around in lists for a long time. Having more "incorrect" filters loaded takes more bandwidth to download and will decrease the performance of every adblocker as well.

@ryanbr
Copy link
Collaborator

ryanbr commented Jul 20, 2021

I implement a commit fix to trim any filters smaller than 9chars long to avoid this in the future. Was a mistake on my part, but was fixed within a couple of hours.

@mjethani
Copy link
Author

@antonok-edm @ryanbr thanks for taking the time to comment on this.

@antonok-edm wrote:

I think it's more important that the mistakes are caught and fixed in a timely manner when they happen.

Perhaps it was not such a great idea to open with that incident as an example. Indeed, it has little to do with this enhancement request. It was only what got me thinking on this topic.

The takeaway from the incident really is that different filter engines have different syntaxes and this itself could become a source of confusion among filter developers.

The more general idea here is that Brave should make improvements to the current syntax. It's not uncommon for an implementation to do this. For example, Adblock Plus changed how it interprets non-ASCII characters and the trailing dot. These little improvements start to make sense over time as the web becomes more standardized.

… additional special-cased parsing logic causes more surprises for the experts, not less.

This specific idea of interpreting example.com as ||example.com^ should be considered on its own merit and perhaps as part of a larger set of syntax improvements.

@mjethani
Copy link
Author

As it turns out, the Adblock Plus project seems to have decided all of a sudden that it wants to be compatible with uBlock Origin too.

This is a total coincidence of course.

@mjethani
Copy link
Author

… perhaps as part of a larger set of syntax improvements.

Here's what I had in mind.

Hostnames:

  • Level 1: If a filter looks like a hostname, it's treated as a hostname. e.g. example.com => ||example.com^
  • Level 2: If the pattern in a filter looks like a hostname, the pattern is treated as a hostname. e.g. example.com$~third-party => ||example.com^$~third-party
  • Level 3: The initial hostname characters in the pattern are treated as the hostname. e.g. example.com/foo.js$script => ||example.com/foo.js$script

With the above, the || would become redundant.

uBlock Origin is already at level 2.

Word boundaries:

  • If a pattern ends with an alphanumeric character, it is treated as if it is followed by a ^. e.g. /ad.js => /ad.js^. The benefit of this would be that it would not block the URL https://example.com/ad.json.

These ideas are based on what I see in EasyList et al at the moment.

@antonok-edm I think I see your point that just the one change might be a hard sell. Maybe it's better to come up with a proposal for an entire set of changes of this nature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants