chester's blog

technology, travel, comics, books, math, web, software and random thoughts

Should GMail blacklist spam senders?

08 Aug 2013

My friend FZero said today on Facebook:

Dear GMail: when I mark something as spam, you should BLOCK that email address from ever sending me anything again.

and my answer got a bit too long for Facebook, so here it is:

[TL;DR: Statistics say it would not work well for everyone, though it might work for you. Google is about large data sets, not (directly) about you, so if you want it, hack the planet!]

I’m not really up-to-date on the subject, but let’s assume the core of what GMail does behind the scenes is still a well-balanced “naive” Bayesian probabilistic approach for “spam/unsure/ham” classification (possibly fine-tuned by Google’s secret sauce of massive data analytics).
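To make the “spam/unsure/ham” idea concrete, here is a minimal sketch of a naive Bayesian classifier. It is only an illustration of the general technique, not what GMail actually runs: the class name, the whitespace tokenizer, the add-one smoothing and the 0.9/0.1 cutoffs are all assumptions of mine.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy naive Bayes spam filter (illustrative assumptions throughout)."""

    def __init__(self, spam_cutoff=0.9, ham_cutoff=0.1):
        self.spam_tokens = Counter()  # token -> number of spam messages containing it
        self.ham_tokens = Counter()   # token -> number of ham messages containing it
        self.spam_msgs = 0
        self.ham_msgs = 0
        self.spam_cutoff = spam_cutoff  # assumed threshold, not a known GMail value
        self.ham_cutoff = ham_cutoff    # assumed threshold, not a known GMail value

    @staticmethod
    def tokenize(text):
        # Naive tokenizer: lowercase words split on whitespace, counted once per message.
        return set(text.lower().split())

    def train(self, text, is_spam):
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_msgs += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_msgs += 1

    def spam_probability(self, text):
        # Sum log-odds over tokens, treating them as independent (the "naive" part).
        log_odds = math.log((self.spam_msgs + 1) / (self.ham_msgs + 1))
        for tok in self.tokenize(text):
            p_spam = (self.spam_tokens[tok] + 1) / (self.spam_msgs + 2)  # add-one smoothing
            p_ham = (self.ham_tokens[tok] + 1) / (self.ham_msgs + 2)
            log_odds += math.log(p_spam / p_ham)
        # Numerically stable logistic function to turn log-odds into a probability.
        if log_odds >= 0:
            return 1 / (1 + math.exp(-log_odds))
        e = math.exp(log_odds)
        return e / (1 + e)

    def classify(self, text):
        p = self.spam_probability(text)
        if p >= self.spam_cutoff:
            return "spam"
        if p <= self.ham_cutoff:
            return "ham"
        return "unsure"
```

Anything that lands between the two cutoffs is reported as “unsure” instead of being forced into one of the two bins.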

I remember (but can’t find right now) follow-ups to Paul Graham’s original work (which took the technique from a statistical indicator to a “99%+” efficient spam filter) pointing to data and reproducible experiments. They suggested that most “field knowledge” applied to specific email tokens (for example, bumping the significance of the From email address) did not improve the efficiency metrics, and in several cases actually decreased efficiency.

However, a core point was that including headers in both the token database scoring and the composed score for each message (along with careful tuning of token identification to improve the token database hit ratio) *did* improve the efficiency of the classification. So there *may* be some gain in applying unequivocal domain knowledge to the *classification* (i.e., doing exactly what you say), as long as two conditions hold: the token scores are not updated with the bodies of such blacklisted e-mails (to avoid the aforementioned performance decreases), *and* the ratio of blacklisted to normal spam messages is low enough that removing them from the process does not significantly shrink the corpus of analyzed messages.
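The two conditions above can be sketched as a thin gate in front of whatever statistical filter is in use. This is a hypothetical design of mine, not anything GMail documents: `BlacklistGate`, its method names and the `classify`/`train` callables it wraps are all invented for illustration.

```python
from email.utils import parseaddr


class BlacklistGate:
    """Hypothetical wrapper: blacklisted senders bypass the statistical filter."""

    def __init__(self, classify, train):
        self._classify = classify  # callable: body -> "spam" / "unsure" / "ham"
        self._train = train        # callable: (body, is_spam) -> None
        self.blacklist = set()
        self.bypassed = 0          # messages blocked by the blacklist alone
        self.scored = 0            # messages that went through the statistical filter

    @staticmethod
    def sender_address(from_header):
        # Extract the bare address from a header such as 'Name <user@host>'.
        return parseaddr(from_header)[1].lower()

    def classify(self, from_header, body):
        if self.sender_address(from_header) in self.blacklist:
            # Condition 1: the body is neither scored nor fed back into training.
            self.bypassed += 1
            return "spam"
        self.scored += 1
        return self._classify(body)

    def report_spam(self, from_header, body):
        # The user pressed "spam": learn from this one body, then block the sender.
        self._train(body, True)
        self.blacklist.add(self.sender_address(from_header))

    def bypassed_share(self):
        # Condition 2: keep an eye on how much mail no longer reaches the corpus.
        total = self.bypassed + self.scored
        return self.bypassed / total if total else 0.0
```

If `bypassed_share()` creeps up, the statistical side is seeing less and less of the spam, which is exactly the corpus-shrinking risk mentioned above.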

To add (yet) more opinionated guesswork, I’d say that Google products in general tend to lean more towards deriving behavior from large data set analysis rather than gut feeling, so I think it is very unlikely they will consider the suggestion. It would not surprise me, however, if they have already A/B tested that (and every other algorithmic variation under the sun).

But I’ll finish this random ramble with at least one assertive comment: you could do it on your own by creating a blacklist filter. In fact, maybe a browser plugin (or something talking to the API? I never looked into it too much) could trick the spam button into also adding the sender to such a filter. Looks like an interesting hack to try…
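As a starting point for such a hack, here is a small sketch that only builds the search string you could paste into the “From” criterion of a hand-made filter. It does not talk to GMail at all; `blacklist_query` is a hypothetical helper, and I am assuming the usual `from:` and `OR` search operators are enough for the job.

```python
def blacklist_query(addresses):
    """Build a 'from:(a OR b)' style query from a collection of sender addresses."""
    # Normalize and de-duplicate so repeated reports do not bloat the query.
    unique = sorted({a.strip().lower() for a in addresses if a.strip()})
    if not unique:
        return ""
    return "from:(" + " OR ".join(unique) + ")"
```

A plugin would then only need to append the reported sender to the list and regenerate the query whenever the spam button is pressed.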