The Adblock Project Forum Index The Adblock Project
Pull up a seat ...stay a while.
 

A thought about adblock using Bayesian probability analysis

 
McMurmel



Joined: 13 Nov 2003
Posts: 19
Location: Germany

Posted: Sun Nov 16, 2003    Post subject: A thought about adblock using Bayesian probability analysis

Bayesian filtering works for e-mail because it takes the content of a mail, splits it into words, and stores those words in a 'good' and a 'bad' hash-tree.
The problem with ads is that we don't want to download the content, so we can't analyse it. And even if we did download it, there's no known way to split image data into 'words'.
What we can do is analyse the path of an image. One problem is that a path is much shorter than a typical e-mail: about 5 to 10 words at most. The next problem is that the relation between an ad and its path is very weak, so it can and will happen that an ad has a 'neutral' path.
With the method Adblock currently uses, the end user has to take the path and create a filter criterion himself. With Bayesian filtering he can only decide by looking at the image whether it is an ad or not. This reduces the efficiency of the method, because of ads with 'neutral' paths. I know many ads that have a 'neutral' path, because they are local copies or the web designer knew how to conceal his ads.
So the question is: how big is the impact? My guess (and I'm very positive about it) is that efficiency will drop below current rates.
So from my point of view this is a nice idea, but not really an option that will gain much, or anything at all.
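To make the good/bad bookkeeping concrete, here is a minimal sketch in javascript. The names tokenizeMail, train, good and bad are made up for illustration, and a plain Map stands in for whatever hash structure a real filter would use.

```javascript
// Hypothetical helper: split a text into lower-case "words".
function tokenizeMail(text) {
  // Split on anything that is not a letter, digit, or apostrophe.
  return text.toLowerCase().split(/[^a-z0-9']+/).filter(Boolean);
}

// Count every word of one training text in the given table.
function train(table, text) {
  for (const word of tokenizeMail(text)) {
    table.set(word, (table.get(word) || 0) + 1);
  }
}

const good = new Map(); // words seen in wanted mail
const bad = new Map();  // words seen in spam

train(good, "Lunch at noon tomorrow?");
train(bad, "Cheap pills, cheap loans, click now");

console.log(bad.get("cheap"), good.has("cheap"));
```

The same two tables would work for image paths once the path has been split into tokens; only the tokenizer changes.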
_________________
The road goes ever on and on - down from the door that it began - now far ahead the road has gone and I must follow if I can...
rue
Developer


Joined: 22 Oct 2003
Posts: 752

Posted: Sun Nov 16, 2003    Post subject:

Stefan:
Those words -- or more accurately "tokens" -- can be parsed out of anything. Their statistical relation to one-another is what's analyzed, not so much their content. Breaking a url into tokens between forward-slashes is an easy one.
.
Many parameters can be obtained from unloaded-content: all aspects of the source, all attributes, and even semantic-relations to surrounding hierarchy: "any image located in a table this size".
.
At the extreme, it could even remove raw-text from a page.
.
Since I haven't actually coded this yet, any talk of benchmarks is premature. The overarching ruleset can be limited to minimize the performance penalty; on faster machines, it should be hardly noticeable at all. Adblock will probably always support manually-entered filters. But an automated method would be exponentially nicer.
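A rough sketch of the "tokens between forward-slashes" idea, in javascript. The function name tokenizeUrl and the example URL are hypothetical; nothing here is taken from Adblock's actual code.

```javascript
// Hypothetical tokenizer: drop the scheme, then split the rest on "/".
function tokenizeUrl(url) {
  return url
    .replace(/^[a-z]+:\/\//i, "") // strip "http://" and similar
    .split("/")
    .filter(Boolean);             // ignore empty pieces from "//" or a trailing "/"
}

const urlTokens = tokenizeUrl("http://ads.example.com/banners/468x60/top.gif");
console.log(urlTokens);
```

Finer splits (on dots, underscores, or query separators) are possible too; which granularity gives useful statistics is an open question in this thread.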
McMurmel



Joined: 13 Nov 2003
Posts: 19
Location: Germany

Posted: Sun Nov 16, 2003    Post subject:

Sorry if I was not clear about 'impact'. I'm German and my English is bullshit. By impact I mean the efficiency with which Adblock would detect ads.

The main question is: Will bayesian analysis make ad-detection better if we use it in adblock?

You may collect data from anywhere in the loaded page, but the question is: how likely is this data to reflect a relation to an ad? If you look at it, the best data would be the image data itself; it's data directly from the ad we are trying to block. We can't use it, so any other data we collect is second-class data or worse. And so, I think, is the efficiency in the end. The method is directly tied to the data used, so if we use data from the environment, we only get statistics about the environment of the ad and not about the ad itself. It's a blurred attempt, and so the results will be... blurred... or less good. The question is still: how bad will things be? With e-mail the hit rate is above 95% with nearly no false hits. At the very least the number of false hits will increase, that's for sure...

The problem of coding it is another story. Some questions that come to my mind about javascript:

- How does javascript handle the following: var x = MyObject ?
- Does it save only a reference in x or a complete 2nd copy of MyObject?
- If the latter is the case it would prevent the use of binary-trees. And sorting would get cumbersome too because we have to swap complete objects and not only references...

It's just a matter of trial and error with var x = MyObject. I haven't checked that out so far. But it's always good to think before you act. The rest is just a question of studying 'A Plan for Spam' and coding it...
rue
Developer


Joined: 22 Oct 2003
Posts: 752

Posted: Sun Nov 16, 2003    Post subject:

Stefan:
Quote:

how likely is this data to reflect a relation to an ad?
[-snip-]
best data would be the imagedata itself, it's data directly from the ad we try to block. We can't use this so any other collected data is but 2nd or less class data.

I don't understand your logic at all.. unless you consider the current incarnations of Adblock (both official and yours) to be highly inferior. The regular-expressions we already come up with can be "auto-generated" by analyzing what's been marked as spam. That's the most basic analysis possible: looking at the src-attribute. And since that's the way it works now, why would you call this "2nd class"?
.
Bayesian analysis for spam doesn't look at image-content. It just checks the text/html. That's all Adblock needs to do, too.
.
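As for obj-handling:
var x = thatObject; // sets x to a reference to 'thatObject'
var y = new Object(thatObject); // no copy either: given an object, new Object() returns that same object, so y is another reference to it
.
I don't think you can ever "directly" access an object in javascript. It's all references, so swapping or sorting only moves references around. To get a real copy you have to copy the properties yourself.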
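For anyone who wants to check the reference question directly, here is a small self-contained test. As far as I know, assignment never copies an object, and new Object(someObject) hands back the same object rather than a copy. Object.assign is used for the shallow copy; it is a later addition to the language (ES2015), so on an older engine a manual property loop would be needed instead. All variable names are illustrative.

```javascript
const thatObject = { hits: 1 };

const a = thatObject;                    // another reference to the same object
const b = new Object(thatObject);        // also the same object, not a copy
const c = Object.assign({}, thatObject); // a separate (shallow) copy

a.hits = 2; // visible through thatObject and b, but not through c

// Sorting an array of objects only rearranges references.
const nodes = [{ key: "b" }, { key: "a" }];
const firstNode = nodes[0];
nodes.sort((x, y) => (x.key < y.key ? -1 : 1));

console.log(thatObject.hits, b.hits, c.hits, nodes[1] === firstNode);
```

So binary trees and sorting are not a problem: nodes can point at each other, and a swap costs the same no matter how large the objects are.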
McMurmel



Joined: 13 Nov 2003
Posts: 19
Location: Germany

Posted: Sun Nov 16, 2003    Post subject:

An example: which of the two is the ad image? And why is it the ad?

1. http://www.adac.de/images/8_65496.gif
2. http://www.adac.de/images/8_59241.jpg

Try to generate a regular expression from these that does not filter more non-ads than ads. If you look at the images it's totally obvious which one is the ad. With only the path it's not, and the path is the information Adblock would analyse.
A human brain may see that there's no reliable way to derive a generally working filter from this information, and stop trying. A Bayesian filter would come to the conclusion that the word 'gif' tends to mark an ad image - hopefully :). What if the user marks too many ads with such a path?
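To see how little the two paths give a filter to work with, this sketch splits both URLs into tokens and lists the ones that differ. The splitting rule (on slashes, dots and underscores) and the function name pathTokens are assumptions made for the example.

```javascript
// Hypothetical tokenizer: strip the scheme, split on "/", "." and "_".
function pathTokens(url) {
  return url.replace(/^[a-z]+:\/\//i, "").split(/[\/._]+/).filter(Boolean);
}

const first = pathTokens("http://www.adac.de/images/8_65496.gif");
const second = pathTokens("http://www.adac.de/images/8_59241.jpg");

// Tokens that appear in one URL but not in the other.
const onlyInFirst = first.filter((t) => !second.includes(t));
const onlyInSecond = second.filter((t) => !first.includes(t));

console.log(onlyInFirst, onlyInSecond);
```

Everything except the file number and the extension is shared, so those two tokens are the only evidence a statistical filter could pick up from this pair.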

The next thing is adding the Bayesian method on top of the current one. Which end user would train the filter only to get 5% fewer ads? It's a whole lot of work in the beginning to mark ads and non-ads, don't you think? Why all that work if Adblock already blocks 70-80% of ads?

The current spam filter in Mozilla looks at e-mails to block spam e-mails. The link is direct, because if we read the mail we see whether it's spam.
If you port this to ad images: a human being recognizes an ad by looking at the image, not at the image path. So the link is indirect and somewhat weaker, don't you think? It is! Because it's only a good guess that the image path always tells you the type of image.

Don't take all this too personally. I'm just thinking about this, and I am full of doubts and too lazy to just try it out. Nobody is harmed if you think it through before you try it. Sometimes thinking shows that work you would otherwise already have done is useless.
rue
Developer


Joined: 22 Oct 2003
Posts: 752

Posted: Sun Nov 16, 2003    Post subject:

Stefan:
I've read a few papers on bayesian metrics, so I'm not totally unaware how the results get ranked.
.
It works on reinforced relationships. That reinforcement comes from the enduser. If you tag multiple images from "adac.de" which end in ".gif", and this is the only relationship they share, then yes: all gifs from this domain will be assigned a higher probability-ranking for spam. By setting the probability-threshold, you determine how much reinforcement is necessary before a particular relationship is blocked.
.
It's quite simple, really. You could even set the threshold for 1:1 and never block anything you didn't explicitly define: in other words, the way Adblock currently works.
.
Oh, and continuing your example: if they were both ".jpeg", then Adblock simply couldn't catch them. As they stand, the relationship would be weak, resulting in a low probability-ranking. You'd need to reinforce it.
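One way the reinforcement and the threshold could fit together, sketched in javascript. This is a simplified variant of the per-token estimate from 'A Plan for Spam', not that formula verbatim; the function names, the counts and the 0.01/0.99 clamp are assumptions for illustration.

```javascript
// Estimate how strongly one token points to "ad", from tagging counts.
// adCount/okCount: how often the token was seen in tagged ads / non-ads.
// totalAds/totalOk: how many ads / non-ads were tagged overall.
function tokenProbability(adCount, okCount, totalAds, totalOk) {
  const adRate = Math.min(1, adCount / Math.max(1, totalAds));
  const okRate = Math.min(1, okCount / Math.max(1, totalOk));
  if (adRate + okRate === 0) return 0.5; // never seen: neutral
  // Clamp so that a single token never counts as absolute certainty.
  return Math.min(0.99, Math.max(0.01, adRate / (adRate + okRate)));
}

function shouldBlock(probability, threshold) {
  return probability >= threshold;
}

const weak = tokenProbability(2, 1, 10, 20);   // few tags: weak relationship
const strong = tokenProbability(9, 1, 10, 20); // reinforced by more tags

console.log(weak, shouldBlock(weak, 0.9), strong, shouldBlock(strong, 0.9));
```

Because of the clamp, a threshold of 1 never blocks anything on statistics alone, which corresponds to leaving all blocking to explicitly defined filters.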
McMurmel



Joined: 13 Nov 2003
Posts: 19
Location: Germany

Posted: Sun Nov 16, 2003    Post subject:

That's the thing I fear. If you've got a long path with, let's say, 10 words, and only 1 word has a probability suggesting it could be an ad, it will surely get outweighed by the other 9 that are more or less neutral.
The threshold for spam filtering in Mozilla Mail is > 0.9, but if ads differ from ordinary images in only 1 word out of 10, they will never be recognised for what they are with this setting. That's for sure one point where you have to tweak things - or better still - let the end user tweak. Nice idea...
rue
Developer


Joined: 22 Oct 2003
Posts: 752

Posted: Mon Nov 17, 2003    Post subject:

Stefan:
Neutral words don't affect the scoring. Not unless you want them to.
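A small numeric check of that claim, assuming the usual naive Bayes combination of per-token probabilities (the one described in 'A Plan for Spam'). The function names and the sample values are made up; a plain average is included only for contrast.

```javascript
// Combine per-token probabilities: prod(p) / (prod(p) + prod(1 - p)).
function combine(probabilities) {
  let adProduct = 1;
  let okProduct = 1;
  for (const p of probabilities) {
    adProduct *= p;
    okProduct *= 1 - p;
  }
  return adProduct / (adProduct + okProduct);
}

// Plain average, shown only to contrast with the combination above.
function average(probabilities) {
  return probabilities.reduce((sum, p) => sum + p, 0) / probabilities.length;
}

const oneStrongToken = [0.99];
const withNineNeutrals = [0.99].concat(Array(9).fill(0.5));

console.log(combine(oneStrongToken), combine(withNineNeutrals), average(withNineNeutrals));
```

A token at 0.5 multiplies both products by the same factor, so it cancels out of the combined score. Dilution only happens if the scores are averaged, which is why it matters which rule is used.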
.
However, even if they did, the situation is easily countered. One idea: you could sandbox the ads the user "tags" to the domain of whatever page he's visiting -- only weighing probabilities based on frequency-averaging within that domain. Anything blocked would become part of the domain's "set".
.
Then, when the machine is idle, Adblock would look for relationships across domain sets, and create some global sets.
.
Rather than playing averages, the global sets would add a fixed-value to any matching item's rank -- a kind of 1:1 matching. In other words, you'd cull real filters from statistics.
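A possible shape for the domain sets and global sets, sketched in javascript. Everything here is hypothetical: the data layout, the function names, the rule "a token seen in at least N domain sets becomes global", and the fixed bonus are one reading of the idea, not a design anyone has committed to.

```javascript
// domain -> Map(token -> number of times the user tagged it on that domain)
const domainSets = new Map();

// Record the tokens of one tagged ad under the domain of the visited page.
function tag(domain, tokens) {
  if (!domainSets.has(domain)) domainSets.set(domain, new Map());
  const set = domainSets.get(domain);
  for (const t of tokens) set.set(t, (set.get(t) || 0) + 1);
}

// Idle-time step: tokens that show up in at least minDomains domain sets.
function buildGlobalSet(minDomains) {
  const seenIn = new Map(); // token -> number of domains containing it
  for (const set of domainSets.values()) {
    for (const t of set.keys()) seenIn.set(t, (seenIn.get(t) || 0) + 1);
  }
  return new Set([...seenIn].filter(([, n]) => n >= minDomains).map(([t]) => t));
}

// Add a fixed bonus to an item's rank for every global token it matches.
function rank(baseScore, tokens, globalSet, bonus) {
  let score = baseScore;
  for (const t of tokens) if (globalSet.has(t)) score += bonus;
  return Math.min(1, score);
}

tag("a.example", ["banners", "468x60"]);
tag("b.example", ["banners", "promo"]);
const globalSet = buildGlobalSet(2);
console.log([...globalSet], rank(0.4, ["banners", "img"], globalSet, 0.3));
```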
McMurmel



Joined: 13 Nov 2003
Posts: 19
Location: Germany

Posted: Mon Nov 17, 2003    Post subject:

Sorry, but I will not follow this any further. I think you are starting to iterate on the original problem. It will gain nothing, but it makes the program more complex, and the handling too. You are trying to make the results better by raising the complexity of the evaluation. The only way to make things better is to use better data.
An example (perhaps a bad one):
We try to recognise a car by its license plate. No matter how much info we collect about the license plate, and however complex it is to gather that info, it will still happen that 10+ cars of the same brand have different license plates. But the thing we need to know is the brand of the car.

Thanks for your support, rue. I think I'd better try it out in a very basic way by myself, or wait for you to code it. No more thoughts about it in my skull...
:-D
rue
Developer


Joined: 22 Oct 2003
Posts: 752

Posted: Mon Nov 17, 2003    Post subject:

Stefan:
You and I are approaching this problem with different goals, that's all.
.
The original idea that led me to Bayesian was: automating the creation of regular-expressions. After reading a few papers on it, I realized we could bypass the expressions entirely, and do "live analysis" on the collected data -- Bayesian.
.
What you're trying to solve for is different: visual-cognition with respect to image-data. I don't understand why you keep coming back to this, since it's vastly beyond the resources a web-browser should employ. Maybe in the year 2015, it would be nice :P
McMurmel



Joined: 13 Nov 2003
Posts: 19
Location: Germany

Posted: Mon Nov 17, 2003    Post subject:

Quote:

What you're trying to solve for is different: visual-cognition with respect to image-data. I don't understand why you keep coming back to this, since it's vastly beyond the resources a web-browser should employ. Maybe in the year 2015, it would be nice :P

I don't want it... but it's needed to create statistics about ads. If we collect data about paths, we get statistics about paths. That describes types of paths, not types of images. That's what I tried to point out. If you want to describe a person, you can't collect data about his clothes or the size of his nose and say it's his personality you are describing. The data is superficial, and so the description is 'superficial'. It's the same with Bayesian filtering. The data we are trying to collect is superficial, and so the results will be. The question is: will it be too superficial to block ad images successfully? I don't think so, but this try will surely be too superficial to beat the current try.

And as in Matrix Revolutions: everything that has a beginning has an end. If you're still not satisfied, go to the first post way up. :P
rue
Developer


Joined: 22 Oct 2003
Posts: 752

Posted: Mon Nov 17, 2003    Post subject:

Quote:

Will it be too superficial to block adimages successfully? I don't think so, but [Bayesian] will surely be too superficial to beat the current try.

Bayesian is the current try -- just in a higher form. Its algorithms effectively create filters for the user. The fact that these filters are mutable via statistics doesn't change what they do: pattern-matching pieces of text-data.
.
Statistical patterns are a superset of regular-expressions. So, they're equally "superficial". For some reason, you think I'm arguing against cognitive-blocking; and I'm not. I'm just refining the current method, using Bayesian analysis.
McMurmel



Joined: 13 Nov 2003
Posts: 19
Location: Germany

Posted: Mon Nov 17, 2003    Post subject:

Let me tell you straight: I think only as deep as I am able to code. What you have in mind is way beyond my coding skills. My simple thought was just a plain adaptation of spam filtering for mail to ad filtering for images. No refinement, no tweaks. All my concerns are based on this simple thing. You seem to be sure that this basic adaptation will work. Not only are you sure about that, you're also sure that you'll be able to code your current ideas. If that's the case - you seem very determined - the only thing left for me is to wait for this new version and learn from how it works.
rue
Developer


Joined: 22 Oct 2003
Posts: 752

Posted: Mon Nov 17, 2003    Post subject:

Stefan:
Nah- your skills are fine. I'd wager any Bayesian-component you'd come up with would be as good as mine.
.
In fact, I hadn't even begun to consider the Bayesian stuff yet -- this thread brought out quite a few ideas. Before we get there, a few minor features need to land: rdf filter-file storage, in-memory rdf filter-pool (for relational access), whitelisting, and filter "last access" date-stamping (to autoprune unused filters). They need to land pretty much in that order.
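The "last access" date-stamping could look roughly like this. The filter record layout, the function names and the 30-day cutoff are all assumptions for the sake of the example; the rdf storage itself is left out.

```javascript
// Hypothetical filter record: { pattern: string, lastAccess: number (ms since epoch) }

// Call whenever a filter matches something, to record that it is still in use.
function touch(filter, now) {
  filter.lastAccess = now;
}

// Keep only the filters that were used within the last maxAgeMs milliseconds.
function pruneUnused(filters, now, maxAgeMs) {
  return filters.filter((f) => now - f.lastAccess <= maxAgeMs);
}

const DAY = 24 * 60 * 60 * 1000;
const now = Date.now();
const filters = [
  { pattern: "/banners/", lastAccess: now - 90 * DAY }, // stale
  { pattern: "/ads/", lastAccess: now - 40 * DAY },     // stale until it matches again
];

touch(filters[1], now); // "/ads/" just matched something
const kept = pruneUnused(filters, now, 30 * DAY);
console.log(kept.map((f) => f.pattern));
```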
.
If you'd like to contribute to any of these areas, I'd greatly appreciate it. If you'd prefer to stick to your version, that's o.k. too. But, for what it's worth: I've quite enjoyed our discourse, thus far.
McMurmel



Joined: 13 Nov 2003
Posts: 19
Location: Germany

Posted: Tue Nov 18, 2003    Post subject:

LOL

There are only 3 reasons why I should torture my brain with this stuff.

1. I get money for it.

2. I think it's worth it.

3. I must have gone totally crazy.

I think the 3rd reason is the one I could accept. :D
The rest of what you need sounds pretty much like MS SQL Server, Oracle, DB2 or at least this free relational database ... I forgot its name. The one that still can't handle sub-selects... ;)
rue
Developer


Joined: 22 Oct 2003
Posts: 752

Posted: Tue Nov 18, 2003    Post subject:

4. I'm learning.

5. I'm an altruist.

That would pretty much sum it up :P
.
As for implementing a 3rd-party database: I'd rather see what we can roll on our own. <-- Make sure you're not drinking anything when you read that.
Guest






Posted: Tue Nov 18, 2003    Post subject: alt and title attributes

My view of what can be fed through the Bayesian filter:
- alt and title attributes (human readable text)
- image url
- link target url
- target frame (might contain info about ads)
- iframes url
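The list above could be turned into tokens along these lines. This is only a sketch: the element is a plain object rather than a real DOM node, the property name frameSrc is invented, and prefixing each token with its source (so that "banner" in an alt text and "banner" in a URL are counted separately) is my own assumption.

```javascript
// Strip a leading scheme and split URLs on "/", "?", "&" and "=".
const URL_SPLIT = /^[a-z]+:\/\/|[\/?&=]+/;

// Collect prefixed tokens from the attributes of one (hypothetical) element.
function collectTokens(el) {
  const tokens = [];
  const add = (prefix, value, splitter) => {
    if (!value) return; // attribute missing or empty
    for (const t of String(value).toLowerCase().split(splitter).filter(Boolean)) {
      tokens.push(prefix + ":" + t);
    }
  };
  add("alt", el.alt, /\W+/);             // human-readable text
  add("title", el.title, /\W+/);
  add("src", el.src, URL_SPLIT);         // image url
  add("href", el.href, URL_SPLIT);       // link target url
  add("target", el.target, /\s+/);       // target frame
  add("frame", el.frameSrc, URL_SPLIT);  // url of the enclosing iframe, if any
  return tokens;
}

const sample = collectTokens({
  alt: "Click here",
  src: "http://ads.example.com/b/1.gif",
  target: "_blank",
});
console.log(sample);
```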
All times are GMT + 1 Hour