chester's blog

technology, travel, comics, books, math, web, software and random thoughts

Should GMail blacklist spam senders?

08 Aug 2013

My friend FZero said today on Facebook:

Dear GMail: when I mark something as spam, you should BLOCK that email address from ever sending me anything again.

and my answer got a bit too long for Facebook, so here it is:

[TL;DR: Statistics say it would not work well for everyone, though it might work for you. Google is about large data sets, not (directly) about you, so if you want it, hack the planet!]

I’m not really up-to-date on the subject, but let’s assume the core of what GMail does behind the scenes is still a well-balanced “naive” Bayesian probabilistic approach for “spam/unsure/ham” classification (possibly fine-tuned by Google’s secret sauce of massive data analytics).

I remember (but can’t find right now) follow-ups to Paul Graham’s original work (which took the technique from a statistical indicator to a “99%+” efficient spam filter) pointing to data and reproducible experiments. They suggested that most “field knowledge” applied to specific email tokens (for example, bumping the significance of the From email address) did not improve the efficiency metrics, and in several cases actually decreased them.

However, a core point was that including headers in both the token database scoring and the composed score for each message (along with careful tuning of token identification to improve the token database hit ratio) *did* improve the efficiency of the classification. So there *may* be some gain in applying unequivocal domain knowledge to the *classification* (i.e., doing exactly what you say), as long as it doesn’t update the token scores with the bodies of such blacklisted e-mails (to avoid the aforementioned performance decreases) *and* the ratio of blacklisted to normal spam messages is low enough that removing them from the process does not significantly shrink the corpus of analyzed messages.
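To make that last condition concrete, here is a minimal sketch of the idea: a toy token-based Bayesian scorer where the blacklist overrides the *classification* but blacklisted messages never touch the token counts. Everything in it (the `TinyBayesFilter` class, whitespace tokenization, Laplace smoothing) is my own assumption for illustration; it is not what GMail does, and it is much cruder than Graham's scheme.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """Toy naive Bayes spam scorer with a hypothetical sender blacklist."""

    def __init__(self):
        self.spam_tokens = Counter()  # token -> number of spam messages containing it
        self.ham_tokens = Counter()   # token -> number of ham messages containing it
        self.spam_msgs = 0
        self.ham_msgs = 0
        self.blacklist = set()        # assumption: a per-user set of sender addresses

    def train(self, sender, body, is_spam):
        # Blacklisted senders are skipped, so their bodies never update token scores.
        if sender in self.blacklist:
            return
        tokens = set(body.lower().split())
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_msgs += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_msgs += 1

    def spam_probability(self, sender, body):
        # Domain knowledge is applied only here, at classification time.
        if sender in self.blacklist:
            return 1.0
        log_odds = math.log((self.spam_msgs + 1) / (self.ham_msgs + 1))
        for token in set(body.lower().split()):
            # Laplace smoothing keeps unseen tokens from zeroing the product.
            p_spam = (self.spam_tokens[token] + 1) / (self.spam_msgs + 2)
            p_ham = (self.ham_tokens[token] + 1) / (self.ham_msgs + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp to avoid overflow in exp() on very long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

The second half of the condition (the blacklisted share of spam staying small) is not enforced here; it is something you would have to measure on real mail.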

To add (yet) more opinionated guesswork, I’d say that Google products in general tend to lean more towards deriving behavior from large data set analysis rather than gut feeling, so I think it is very unlikely they will consider the suggestion. It would not surprise me, however, if they have already A/B tested that (and every other algorithmic variation under the sun).

But I’ll finish this random ramble with at least one assertive comment: you could do it on your own by creating a blacklist filter – in fact, maybe a browser plugin (or something talking to the API? never looked into it too much) could trick the spam button into also adding the sender to such a filter. Looks like an interesting hack to try…
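As a starting point for that hack, here is a small sketch that turns a list of blacklisted senders into a search query you could paste into a GMail filter's criteria. It assumes the `from:` operator and `OR` grouping of GMail's search syntax; the `build_block_query` helper is a made-up name, and the part where the spam button feeds the list is left out entirely.

```python
def build_block_query(senders):
    """Return a GMail-style search query matching any of the given senders."""
    # Normalize and de-duplicate so the same address is not listed twice.
    unique = sorted({s.strip().lower() for s in senders if s.strip()})
    if not unique:
        return ""
    return "from:(" + " OR ".join(unique) + ")"


if __name__ == "__main__":
    print(build_block_query(["Spam@Example.com", "bot@example.net"]))
```

A filter built on that query, with "delete it" as the action, would behave like the blacklist FZero asked for, at least until the list grows too long to manage in a single query.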

Comments


lisias

Not by default.

But it should publish the top 10 (or so) offenders, and ask us if we want to block them.

Moreover, this information should be available to other MTAs too.

Why just the top 10 (or so) offenses?

To avoid blacklisting someone forever. As soon as the source stops spamming, it's not an offender anymore. No need to keep punishing him. Put the ex-bastard on a gray-list for some time, and if no further issue arises, just forget about him.

This mechanism will also promote a race to stay out of the "top 10". Spammers will have to monitor the list and trim their bots to avoid hitting the limit.

As time goes by, the absolute number of offenses in the top 10 will decline, and the spammers will have more and more trouble keeping their count below the radar.

This will not eliminate spam (which is impossible anyway - unless you think the Internet should be regulated), but will keep it at manageable levels.

Please note that the "top 10" list is not about offenders, but "offenses". Take the 10th most common spam being sent, and any MTA that repeatedly releases any message at that volume or more goes on the blacklist.

I think that only Google has the horsepower to do such a thing.

chester

Yeah, if we look at non-Bayesian (and non-individual) measures, this one seems sensible (you could even add "age" as a weighted component of the offenses, so older ones can naturally be replaced by the latest plague).
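A rough sketch of what I mean by age weighting, assuming each spam report loses half its weight every week (the half-life, the function names and the shape of the `reports` mapping are all invented for the example):

```python
import heapq


def decayed_score(report_ages_days, half_life_days=7.0):
    """Sum of reports, each halved in weight for every half-life of age."""
    return sum(0.5 ** (age / half_life_days) for age in report_ages_days)


def top_offenses(reports, n=10, half_life_days=7.0):
    """Rank offenses by decayed score.

    `reports` maps an offense id to the ages (in days) of its spam reports.
    Returns up to n (offense id, score) pairs, highest score first.
    """
    scores = {
        offense: decayed_score(ages, half_life_days)
        for offense, ages in reports.items()
    }
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    sample = {"old-campaign": [30, 30, 30], "new-campaign": [0, 1]}
    for offense, score in top_offenses(sample, n=2):
        print(offense, round(score, 3))
```

With that, a campaign that stops sending drifts out of the list on its own, which gives lisias' "forget about him" behavior without anyone having to maintain a separate gray-list.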

But I doubt Google (or any other large organization) would wade into the legal hot water of publishing such a list - they already have enough trouble with net neutrality questions in more "mundane" things such as inter-networking deals with providers to speed up YouTube: http://arstechnica.com/info...