A while back I said that I had made a breakthrough in spam filtering and that I filed a provisional patent on my new method. I’m now ready to reveal how it works. I’m calling it the Evolution Filter. You can read a detailed explanation here.
Basically, if a message says things that show up in ordinary email and that spammers never say, it’s good email. And if it says things that only spammers say, it’s spam.
Most spam filters are based on matching things. Bayesian filters compare the message to known ham and spam. Rule-based filters match the message against rules. This filter is based on NOT MATCHING. To test for ham, we match ham and don’t match spam. To test for spam, we match spam and don’t match ham.
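To make the "match one side, not the other" idea concrete, here is a minimal sketch using plain Python sets. It is an illustration only, not the actual filter code: the function name `one_sided_tokens` and the sample token sets are made up for this example, and a single-word token is assumed for simplicity.

```python
def one_sided_tokens(message_tokens, ham_tokens, spam_tokens):
    """Split a message's tokens into ham-only and spam-only groups.

    Hypothetical helper: a token counts for ham only if it has been seen
    in ham and never in spam, and the reverse for spam.
    """
    message = set(message_tokens)
    ham_only = (message & ham_tokens) - spam_tokens
    spam_only = (message & spam_tokens) - ham_tokens
    return ham_only, spam_only


# Made-up training data for the illustration.
known_ham = {"meeting", "invoice", "article", "free"}
known_spam = {"viagra", "brides", "free"}

ham_only, spam_only = one_sided_tokens(
    ["free", "article", "brides"], known_ham, known_spam
)
print(ham_only)   # {'article'}
print(spam_only)  # {'brides'}
```

Note that "free" appears on both sides, so it ends up in neither group.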
The advantage to not matching is that I’m comparing to an infinite set rather than a finite set. Or comparing to the unknown rather than the known.
I’m getting really close to 100% accuracy and there’s room to improve it. This doesn’t just stop spam, it decimates it. Read my article and you’ll see how I do it.
Also – this is NOT BAYESIAN. I know what Bayesian is. I have Bayesian with SpamAssassin. This is not the same thing. In fact – it’s not even similar.
Is this the spam filter that JCD uses?
Yes.
Marc, if your filter decimated spam, it would remove 10%.
It would put 10% of spammers out of business.
You make yourself look foolish when you say things like matching against an infinite set and not matching. Your filter is comparing what is in the e-mail to a list.
“Maybe he found some approach that happens to work really well for him in some particular situation and got massively over-excited because of it.”
A network security expert who I asked about Marc’s previous post.
In fact I suspect the patent will not be granted, as it is either already filed, or already implemented by others.
I’m filtering 5000 domains. Kind of a wide particular situation.
I defended you when I saw his response, but I do think you are being overexcited. I don’t know what your previous success rate was, but this does look like just an evolutionary step, and indeed a subset of Bayesian.
For that matter, isn’t it just filtering out emails that contain entries x where x is an element of the set spam and x is not an element of the set ham? With adjustments to the sets ham and spam as more e-mails come in?
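The set formulation in this comment, including the part about the sets changing as more mail arrives, could be sketched roughly as below. This is the commenter's reading of the filter, not a confirmed description of it. The class `TokenSets` and its methods are hypothetical names.

```python
class TokenSets:
    """Hypothetical container for the two growing token sets."""

    def __init__(self):
        self.ham = set()
        self.spam = set()

    def learn(self, tokens, is_spam):
        # Add every token of a classified message to the matching set.
        (self.spam if is_spam else self.ham).update(tokens)

    def spam_only(self):
        # x is an element of spam and x is not an element of ham.
        return self.spam - self.ham


sets = TokenSets()
sets.learn(["russian", "brides", "online"], is_spam=True)
seen_before = "russian" in sets.spam_only()

# Once a legitimate message uses the same word, it stops counting as spam-only.
sets.learn(["article", "about", "russian"], is_spam=False)
seen_after = "russian" in sets.spam_only()
```

The design point is that nothing is ever weighted: a token is either exclusive to one side or it drops out.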
It’s not bayesian. Not even close.
A traditional spam filter using Bayesian or hard coded rules about “Russian Brides” might determine that only 1 out of 500 emails mentioning the phrase “Russian Brides” is a good email.
Your filter is Bayesian, but instead of setting “Russian Brides” to 1/500, it leaves the phrase out of the list.
Instead it assigns spam probabilities of 100 to
“Meet hot”
“Meet hot Russian”
“Meet hot Russian Brides”
“hot Russian Brides Online!”
“Russian Brides Online!”
“Brides Online!”
“Online!”
and ham probabilities of 100 to
“I read an article”
“read an article”
“read an article about”
“about Russian”
“an article about”
“in a magazine”
“Brides in a”
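The phrases listed above look like overlapping runs of consecutive words. A minimal sketch of how such phrases could be generated follows. It assumes whitespace tokenization and a maximum phrase length of four words, based only on the examples in this comment. The function name `word_ngrams` is hypothetical.

```python
def word_ngrams(text, max_len=4):
    """Return every run of 1 to max_len consecutive words in text.

    Assumption: words are split on whitespace and punctuation stays
    attached, as in "Online!" above.
    """
    words = text.split()
    grams = []
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams


phrases = word_ngrams("Meet hot Russian Brides Online!")
print(len(phrases))  # 14
```

Five words give 5 + 4 + 3 + 2 = 14 phrases, which is how one short subject line can produce many entries for the spam or ham list.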
Followup after seeing your wiki page from the same expert:
‘ That guy isn’t “over-excited,” he’s delusional. Not only has he not invented anything significant, I remember discussing the very plan he wants to use with people years ago. During discussions where people pointed out the obvious problems that make his approach practically worthless on an industry scale due to glaring vulnerabilities the current popular approaches were designed to counteract.
I wish he would seek professional help. Preferably of a medical sort, but if not that, then perhaps that of somebody who actually understands the field he’s claiming to revolutionize.’
I’d say run this past Steve Gibson first..
if he doesn’t laugh at it you might have something..
Congrats Marc.
I read your link as far as I could before my eyes glazed over.
So, what you’re affirming is that a spamster can read your description and NOT thereafter craft a breakfast of spam? I can see that if your filter only searches the header or address or OP. I can accept that spam is spam and ultimately spam can be filtered out….. in some way.
I am reminded of the constant BS pop-ups I get that address a subject I’m interested in, but when you click on the video (“be sure your sound is turned on”) …… so far…. they never get to the fricking subject. Always talking about what they are going to talk about. I have never had the patience to watch a whole presentation. So much BS.
Oh well. Again–congrats. Competency on display is always a beautiful thing….even when I don’t understand it.
“We now describe the effects of spam campaigns on the Bitcoin network, especially on users who send non-spam transactions, as well as the miners. For the users, we measure the change in transaction fees and transaction delays (i.e. the time between when we first observe a transaction in the Mempool and when the transaction is committed to the blockchain). A large amount of spam is likely to increase the backlog of unconfirmed transactions. As a result, transactions are delayed for longer time periods. With more intense competition, senders pay higher fees, in the hope that their transactions will be included in blocks sooner. For the miners, we measure the corresponding increase in the block reward.”
http://fc16.ifca.ai/bitcoin/papers/BHMW16.pdf
If they can use spam to slow transactions and reduce performance they can charge higher fees. Two bit pirates and racketeers at work.
Nearly everything that matters is a side effect.
http://maradydd.livejournal.com/528043.html
Marketing? Claim it will send spam to hell!
KIRKUS REVIEW
Sleator devotes his considerable talents to a horror story this time. High-school student Nick and his mother live in near-poverty. Because his mother closely monitors use of their home phone, Nick buys a cell phone to talk more frequently to his girlfriend. But this phone brings weird and threatening calls and proves to have a direct connection to Hell. Nick’s life changes—and not for the better. The pitiful and self-pitying Nick, with his limited experience and lack of worldly knowledge, makes a great pawn for the predatory adults he meets when using the cell phone. The many unpleasant characters and the need for a big-time suspension of disbelief (a direct connection to where?) are countered by a dark, involving and fast-moving plot that surprises, shocks and—eventually—terrifies.
https://kirkusreviews.com/book-reviews/william-sleator/hell-phone/
Have Hells Bells ringtones.
I guess we have another idiot spamming the blog again……..
“My plan is to make it free to most everyone and charge a reasonable license fee to the big providers and my competitors.”
Fantastic! Exactly the right way to market this. “a reasonable license fee” to Google, Microsoft, Yahoo, etc. will make you very comfortable.
But why couldn’t they just rip him off? How would Marc ever know they are using it?
Send them ham or spam of course. 😉
Generally there will be a lot of ripping off. I just want to collect from some of them.
Build in a “back door”. When emails with a certain characteristic are encountered, it 100% yields a false positive or false negative contrary to expectations. Only you know what will cause this to occur. You figure out the other details. Then you trap anyone stealing your code, but not anyone who rewrites it.
You don’t need Perkel’s code to implement this.
To:NFS
That’s great. I’m sure Marc gets that too. It’s not my area of expertise.
I think you are right.
Remember Microsoft got a huge boost by giving away MS-DOS. Many bought generic PCs and installed MS-DOS from a disk they got from a buddy. WORD also.
That generated more demand for DOS and WORD in the corporation. Sure, Microsoft didn’t make a dime on the small guy, but they made it up on sales to the corporate world.
What does your filter do if it finds no tokens in either category?
Or if it finds tokens in both spam-ham and ham-spam?
Tokens that don’t match on either side or match on both sides are ignored. What I’m interested in is tokens matching on one side and not matching on the other. But each email generates hundreds of fingerprints and most messages usually match one side or the other.
I assume if there are no matches, you get sent to ham. What happens if you match both spam-ham and ham-spam? Do you count to see which is more?
Tokens that match neither side or both sides don’t count. If there’s no score I have other tests. If there’s no reason to block it then I pass it as undetermined.
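The rules described in this exchange could be sketched as below. Fingerprints that match both sides or neither side are skipped, and a message with no one-sided matches comes back as undetermined. How the real filter weighs ham hits against spam hits is not stated in the thread, so the simple count comparison here is an assumption, as is the function name `classify`.

```python
def classify(fingerprints, ham_set, spam_set):
    """Return "ham", "spam", or "undetermined" for one message."""
    ham_hits = 0
    spam_hits = 0
    for fp in set(fingerprints):
        in_ham = fp in ham_set
        in_spam = fp in spam_set
        if in_ham == in_spam:
            continue  # matches both sides or neither side: ignored
        if in_ham:
            ham_hits += 1
        else:
            spam_hits += 1
    # Assumption: the side with more one-sided hits wins; ties and
    # empty scores are handed on to other tests as "undetermined".
    if spam_hits > ham_hits:
        return "spam"
    if ham_hits > spam_hits:
        return "ham"
    return "undetermined"


verdict = classify(["brides online", "hello"], {"hello", "an article"},
                   {"hello", "brides online"})
print(verdict)  # spam
```

In the usage line, "hello" is on both lists and is ignored, so the single spam-only fingerprint decides the result.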
So couldn’t spammers then just start trying to insert a single ham entry into the e-mail, nullifying your filter (or this portion of it)? True, they would have to change the entry with each message.
The problem that I see here is that spammers can easily modify their subject lines to bypass the logic behind this filter… Or am I missing something?
Yes, what I suggested last week when Marc first suggested it. However the filter will change as well so it is not that easy. This is evolutionary, not revolutionary.
But if spammers make their subjects look like any legitimate message, evolving the filter to detect them is going to be difficult, if not impossible…
They would still have to write their actual spam message. So they would end up back in the lists.