Peer review process for referral spam hosts #26
I agree with this. We have a fairly good list at this point, so we should be careful with new additions. We could decide that every new issue or pull request needs a +1 from another person before being accepted. Even if that means some additions will have to wait a few days, I think it's fine. Ping @mattab
Good question. I think we can sometimes merge PRs on first read, when there are only a few domains and the names look spammy, or e.g. when the PR author explains how she found the spammers (e.g. in GA/Piwik reports, 100% bounce rate, display spam, dodgy whois, found on another referrer spam blacklist, etc.). If we're not sure, then it sounds good to ask other users to +1 if they also see this spammer, and merge after a +1 has been commented. Maybe we leave this issue open for a while and see how this evolves?
Also I'm seeing some pull requests for larger lists; perhaps each PR should be limited to one domain/URL, that way they can be individually vetted?
I have noticed spammers usually spam a lot of different domains from the same IPs. Once an IP has spammed at least one domain in the blacklist, it is easy to find new domains being spammed (by grepping the IPs in server logs) and add them to the list, without any risk of false positives. I have automated the search for new domains using this approach, and the result is in pull request #87.
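For illustration, here is a minimal sketch of that grep-the-logs idea in Python. It assumes a combined-format access log (`access.log`), a hypothetical `spammer_ips.txt` listing IPs already caught spamming, and the blacklist file (`spammers.txt`); none of these file names come from the actual PR.

```python
import re

# Hypothetical inputs: IPs already caught spamming a blacklisted domain,
# and the blacklist itself (one entry per line in both files).
known_spammer_ips = set(open("spammer_ips.txt").read().split())
blacklist = set(open("spammers.txt").read().split())

# Combined log format: ip ... "GET / HTTP/1.1" status size "referer" "user-agent"
LINE_RE = re.compile(r'^(\S+) .*?"https?://([^/"]+)[^"]*" "[^"]*"$')

new_domains = set()
with open("access.log") as log:
    for line in log:
        m = LINE_RE.match(line)
        if not m:
            continue  # no http(s) referrer on this line
        ip, referrer_domain = m.group(1), m.group(2).lower()
        # Any referrer sent from a known spamming IP is very likely spam too
        if ip in known_spammer_ips and referrer_domain not in blacklist:
            new_domains.add(referrer_domain)

for domain in sorted(new_domains):
    print(domain)
```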
FYI we have been contacted by a webmaster asking for his website to be removed from the list: #90 (see the details in the pull request). I think this mistake (if it is one) should be one more reason to move to a "peer-review-only" kind of process, i.e. add only sites that have been reported or approved by at least 2 people. We should also document in the README that it's better to add one site at a time per pull request (I'll do it straight away); we should avoid "bulk changes" because they are harder to validate. Thoughts?
I can only speak for myself, but I have seen an important increase in referrer spam in recent months. They spam a lot of different domains from dozens of IPs, sometimes without any rate limiting, so I get bursts of dozens of useless requests per second, polluting my analytics and wasting my server resources. And this is on small servers hosting a few low-traffic sites. As soon as I detect referrer spam from an IP, I now automatically block it at the firewall level; even so, I see new domains being spammed from new IPs every day. Most of these domains are registered for a short period of time, are simple redirects, and the spammers will always register new ones to spam. I don't use Piwik, but I find this list very useful. However, let's be honest: if you require a separate pull request and a vote on every domain added, this list will not be updated frequently (if at all), and it will become useless within a few months.
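The automatic firewall blocking described here could look something like the following sketch (this is not the commenter's actual setup): scan the access log for hits whose referrer is already blacklisted and drop the offending IPs with iptables. The log format, regex, and file names are assumptions.

```python
import re
import subprocess

LINE_RE = re.compile(r'^(\S+) .*?"https?://([^/"]+)[^"]*" "[^"]*"$')
blacklist = set(open("spammers.txt").read().split())

def block_ip(ip: str) -> None:
    """Drop all further traffic from `ip` (requires root)."""
    subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"], check=True)

blocked = set()
with open("access.log") as log:
    for line in log:
        m = LINE_RE.match(line)
        if not m:
            continue
        ip, referrer_domain = m.group(1), m.group(2).lower()
        # One blacklisted referrer is enough to ban the sending IP entirely
        if referrer_domain in blacklist and ip not in blocked:
            block_ip(ip)
            blocked.add(ip)
```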
Up to now most pull requests (that have been merged) contained only a single domain (because we also add the domains reported in issues ourselves). If spam becomes more and more of an issue, there will be more and more people looking for solutions, and thus contributing here. When we started working on a new solution against referrer spam I suggested the following idea: build a submission system where users can submit new spammers directly from inside their Piwik. These submissions would be sent to a simple app hosted somewhere (e.g. spam.piwik.org). Then it would be easy to see how many users reported each spammer domain, and above a threshold (or manually) we could add the domain to the blacklist.
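That submission system is only an idea at this point; its core would be a simple threshold rule, sketched below. The threshold value, function name, and example domain are all made up.

```python
REQUIRED_REPORTS = 3  # hypothetical threshold before a domain reaches the blacklist

# domain -> ids of the users who reported it (a set, so each user counts once)
reporters: dict[str, set[str]] = {}

def submit(domain: str, user_id: str) -> bool:
    """Record one user's report; return True once enough distinct users agree."""
    seen = reporters.setdefault(domain, set())
    seen.add(user_id)
    return len(seen) >= REQUIRED_REPORTS

# A domain would be proposed for the blacklist only after three users flag it:
for user in ("alice", "bob", "carol"):
    ready = submit("spammy-domain.example", user)
print(ready)  # True
```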
I hope this repository will get enough activity to make this list useful, but I fear the spammers will always be faster than you. Anyway, since it's in the public domain, I will maintain and use my fork, and merge back changes from this list.
@mnapoli FYI I was able to detect it, and add it to the list again because of the other domains being spammed from the same IP. See an excerpt of my server logs:
The webmaster that contacted you is probably contracting a shady SEO company that uses a botnet to send massive referrer spam without his knowledge.
Could be that indeed. Or it could be that there are multiple websites hosted on the same machine? Or multiple servers behind the same IP?
Mhh, not sure I understand you. EDIT: I realize I may not have given you enough context: the above log excerpt is from a server I own, which hosts very small websites, unrelated to the referrers you see.
Sorry, it's late :/ Rephrasing my thoughts better: the spammer tool (whatever its form) could run from a server which has the same IP address as valid websites. For example it would be very easy to write a referrer spammer script that runs on any shared host. Thus blocking based on the IP address might not always be reliable.
Even if a spammer script is running on a shared host that hosts some websites, it is not supposed to send requests to other websites, no?
They aren't supposed to spam indeed, but my point is that the websites of the shared host are not aware that other users of the server are doing that, and can be blocked as collateral damage (in the case where they send actual referrers to the spammed websites). All in all, the IP address isn't 100% reliable. It's the same problem as when blocking e.g. gamers online, or when blacklisting IPs from connecting over SSH, etc. People/servers can also be in a sub-network and share the same external IP address (companies, universities, etc.).
What I meant is that even if a good website is behind the same IP as a spammer, if you block that IP on your server to protect yourself from the spam, the good website is unaffected, because it is not sending HTTP requests anyway (only serving them, and not to your server). By the way, "blocking" the IP is the list user's choice; we are only talking about adding domains that are obviously being spammed.
That's not how it works in Piwik: when receiving data, Piwik will exclude any hit whose referrer is blacklisted. So if a good website is in the blacklist, it will be affected, because its referrer traffic (traffic going from the good website to other websites tracked with Piwik) will be ignored. It will also affect users of Piwik, because valid traffic coming through their websites will be ignored by Piwik.
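In other words, the list is applied as a filter at data-collection time. A rough sketch of the behavior being described (this is not Piwik's actual code, and `spammers.txt` as the list's file name is an assumption):

```python
from urllib.parse import urlparse

blacklist = set(open("spammers.txt").read().split())  # one domain per line

def keep_hit(referrer_url: str) -> bool:
    """Return False when the hit's referrer domain is on the blacklist."""
    if not referrer_url:
        return True  # direct traffic: nothing to check
    domain = urlparse(referrer_url).hostname or ""
    if domain.startswith("www."):
        domain = domain[4:]  # www.spammer.tld should match spammer.tld
    return domain not in blacklist
```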
There is a misunderstanding here; I'm not talking about a user blocking an IP. I'm talking about the methodology you suggest for adding new spammers to the blacklist. This is how you explained it:

> Once an IP has spammed at least one domain in the blacklist, it is easy to find new domains being spammed (by grepping the IPs in server logs) and add them to the list, without any risk of false positives.
What I'm saying is that if we add spammers to the blacklist like this, we might blacklist good websites. That would hurt both the good websites and Piwik users. Example:
1. We detect badwebsite.com and blacklist it.
2. We see that badwebsite.com comes from IP 1.2.3.4, and we see the referrer goodwebsite.com coming from that IP too.
3. With your idea, we would blacklist goodwebsite.com.

Am I understanding it right?
We are on the same page on that: we should not add good domains to the blacklist.
This is where you lost me. Traffic never goes from website to website.
An example is a good idea :)
When we say that a website "sends referral traffic" to another website, there is never any direct communication between the two servers. What actually happens is the following (I'll reuse your example):

1. A visitor browses goodwebsite.com.
2. The visitor clicks a link to myprettyponey.com.
3. The visitor's browser requests myprettyponey.com with the header "Referer: goodwebsite.com"; this request comes from the visitor's own IP, not from goodwebsite.com's server.
Now if, at the same time, the spammer with the same IP as goodwebsite.com (1.2.3.4) sends HTTP requests to myprettyponey.com with "Referer: pornvidzlolwut.ru", what will happen is that we will block the IP 1.2.3.4.
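To make the mechanics concrete: the Referer header is just a string chosen by whichever client sends the request. A visitor's browser fills it in honestly; a spammer's script forges it, and the request always comes from the sender's IP, never from the named site. A minimal sketch of such a forged request using Python's requests library (domains taken from the example above):

```python
import requests

# The Referer header is whatever string the client chooses to send.
# The request comes from the machine running this script, not from the
# domain named in the header -- forging it costs the spammer nothing.
requests.get(
    "http://myprettyponey.com/",
    headers={"Referer": "http://pornvidzlolwut.ru/"},  # forged referrer
    timeout=10,
)
```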
We are still not talking about the same thing :) We understand each other on how HTTP referrer works, and again I am not talking about blocking an IP address. I am talking about adding domains to the blacklist based on their IP addresses. In other words:
You suggest we judge whether a referrer is spam based on the IP address of the client. But the IP address of the client could be shared for many reasons. Here is another example: a spammer runs his script from inside a university network whose external IP is 1.2.3.4, shared with hundreds of legitimate users.

In your logs, you will see requests from 1.2.3.4 with the spammed referrer, but also requests from 1.2.3.4 with valid referrers (the real websites legitimate users come from).

If we follow your methodology, those valid referrer domains would be blacklisted too, simply because they share an IP with the spammer.
Right, in that case there is a conflict, but if a website is hosted on 1.2.3.4, it is unaffected. If a university or similar can't secure its own network and outgoing traffic, I see no problem blocking traffic from it. Anyway, the false positive scenario you describe is possible but very unlikely. We all know the domains mentioned above are spam. The increase in spam I see leaves no doubt that this is a large-scale operation. Soon your Piwik users will wonder why their sites are becoming so popular in Russia ;)
For the record, I've created a "waiting confirmation" tag and tagged issues and pull requests accordingly.
I think blocking IP addresses, or blacklisting other sites that share an IP with a known spammer, is a bad idea. You'll get tons of legitimate domains as false positives because they just happen to be on the same shared host (such as GoDaddy, for example) as a spammer. I also don't see this list as being a real-time instant update, so automated pull requests or additions to the list are a no-go. Entries need to be added and vetted by other administrators. I don't mind if it takes a couple of days for a new domain to be formally added; that won't adversely affect the weekly, monthly, and yearly stats.
This is not what I proposed. Websites on shared hosts do not send requests and are not concerned. As for the false-positive concern: I have added 44 domains since I started my fork 11 days ago, and you can check for yourself, they are all spam, 100% guaranteed 😄
We are currently doing peer review for merging pull requests and it works well; let's close this issue!
Brilliant idea, guys.
What are the requirements for adding a bad referrer to the list? As @mnapoli mentioned in another thread, we don't want to make it too broad.
I'm thinking of a process where new referral spammers are added to the list by peer review, possibly by having other members with significantly large Piwik/Snowplow data sets vouch for them.