The Adblock Project Forum Index The Adblock Project
Pull up a seat ...stay a while.
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Algorithm for pattern matching - developers please read

 
Post new topic   Reply to topic    The Adblock Project Forum Index -> Main
View previous topic :: View next topic  
Author Message
idofeld
Guest





PostPosted: Mon Sep 13, 2004    Post subject: Algorithm for pattern matching - developers please read Reply with quote

Hi,

I have not read the source code of adblock - so please let me know if what I'm about to suggest is already implemented.

I would like to know how the pattern matching is performed. Is every object being compared against any of the filters serially?

The main issue with such approach is that the runnning time is linear with respect to the size of the filters - which might slow things down when using a large number of filters.

What I suggest to do is to create a deterministic minimal finite state machine from the group of filters. This should be done only once (when the filters list is complete). Code for such algorithms (which are quite simple) is widely available, and it has been used vastly in compilers.

After that, comparing an object to the entire list is done in constant time. This could speed up adblock a lot - and make it work fast even with hugh number of filters.


Please let me know what you think.


Ido
Back to top
Guest






PostPosted: Mon Sep 13, 2004    Post subject: Reply with quote

*yawn* someone trying to impress others with what they just learned in their CIS classes?
Back to top
idofeld
Guest





PostPosted: Mon Sep 13, 2004    Post subject: I will disregard the mocking, yet... Reply with quote

Following is the piece of code that compares the patterns - The core of adblock. This loops runs in time n*k - where n is the number of patterns and k is the length of the url.

The algorithm I suggested runs in time k - simply the size of the url. This means that even if the number of patterns is large, the running time will not change.


The same code can be also be used for whitelist checking. Please consider implementing it.



Code:

// This method checks if a picture matches a filter.
// It returns true if the imageurl is caught by one of the filters.
// It returns false if the imageurl is not caught by any filter.
function validate(imageurl) {
   for (var i = 0 ; i < Patterns.length ; i++) {
      if (Patterns[i].test(imageurl))
         return true;
   }   

   return false;
}
Back to top
rue
Developer


Joined: 22 Oct 2003
Posts: 752

PostPosted: Mon Sep 13, 2004    Post subject: Reply with quote

idofeld:
Why not link a few examples of the code you're envisioning?
.
Per Guest's remark, which course are you taking?
Back to top
View user's profile Send private message
Guest






PostPosted: Tue Sep 14, 2004    Post subject: Reply with quote

A very good open-source FSM library is Grail: http://www.csd.uwo.ca/research/grail/

The main idea is to create a FSM from the set of patterns (One time preprocessing).


Now, when an url should be checked - it is checked against the created FSM instead of against all the patterns.



Regarding the course, if you must know: http://webcourse.cs.technion.ac.il/234117/Spring2004/en/staff.html
Back to top
Org



Joined: 23 Oct 2003
Posts: 349

PostPosted: Wed Sep 15, 2004    Post subject: Reply with quote

Grail is written in C++. Compiled binary modules don't fit too well to the platform indepedant nature of Firefox extensions.

Call me pessimist, but I suspect that porting Grail to JS would probably make it slow and difficult to debug and update.

Or would you rather distrubute a Grail library for all possible platforms?
Back to top
View user's profile Send private message
idofeld
Guest





PostPosted: Wed Sep 15, 2004    Post subject: Reply with quote

Grail was just a code example. I haven't looked for JS code.

Anyhow, the algorithm of turning set of RegExp to FSM is well known, and implementing it in JS by a skilled person should not take that long.

I can point you to a book where this algorithm is well explained - http://www.amazon.com/exec/obidos/tg/detail/-/0201100886/104-2184022-7195933?v=glance.


Enjoy.
Back to top
idofeld
Guest





PostPosted: Thu Sep 16, 2004    Post subject: Reply with quote

You know what - I don't think I'll mind coding it myself. It might be a nice primer on JS.

Who can I talk with regarding developing such features?


Ido
Back to top
Guest
Guest





PostPosted: Thu Sep 16, 2004    Post subject: Better pattern matching algorithms Reply with quote

There are better multi-pattern matching algorithms available. See this book for instance.
Back to top
Guest






PostPosted: Fri Sep 17, 2004    Post subject: Reply with quote

This has beend discussed before: e.g. here
I'd say that a few tests should be made to figure out how large the performance impact really is. But if a function for matching against multiple regexps is getting implemented it should be done as A XPCOM module from the start since this has many uses outside of adblock too.
Back to top
ChrisQ
Guest





PostPosted: Fri Sep 17, 2004    Post subject: Please note that Grail+ is not free. Reply with quote

Though adblock is not commercial I would certainly get permission before using Grail. Without knowing there terms it is hard to know whether using it would limit use of adblock on browsers in commercial organisations, etc.

Please note that Grail+ is not free.

We don't charge scholars, students, or researchers for the use of Grail+, and we don't charge people who simply want to play with it to satisfy their own curiosity.

But no commercial use of Grail+ is permitted without our prior, express, written consent. No part of Grail+ may be included in a commercial product or used on a commercial problem without our prior, express, written consent.
Back to top
idofeld
Guest





PostPosted: Fri Sep 17, 2004    Post subject: Another thing that crossed my mind Reply with quote

I was wondering how is the regular expression matching imlemented in javascript.

Maybe it will be faster to check for "RegExp1" or "RegExp2" or ... or "RegExpN" (in one big pattern) instead of checking n time against 1 pattern?

Can someone clarify it?
Back to top
wonkothesane
The Other Developer


Joined: 22 May 2004
Posts: 210

PostPosted: Wed Sep 22, 2004    Post subject: Re: Another thing that crossed my mind Reply with quote

idofeld wrote:
I was wondering how is the regular expression matching imlemented in javascript.

Maybe it will be faster to check for "RegExp1" or "RegExp2" or ... or "RegExpN" (in one big pattern) instead of checking n time against 1 pattern?

Can someone clarify it?


Ah, now you're on the right track. Regular expressions are nothing more than a way of representing a finite-state machine in text. Creating a new RegExp() object causes the RE to be compiled into a form that is analagous to a FSM, so the algorithm you were looking for earlier is essentially var re = new RegExp(reString); ... not difficult at all Wink

Very simplistic testing will show that in the SpiderMonkey implementation of JavaScript, raw-string-based (String.indexOf(s);) and RE-based (regExp.test(s);) string comparison have virtual identical time requirements, as long as the RE is not re-created on every iteration of the loop. The time taken for var re = new RegExp(reString) is roughly equivalent to the time taken for calling re.test(s);

There are a number of optimizations that could concievably be made in Adblock, including creating a "superfilter" (which I actually suggested to rue in mid-August), but in general, they give too little performance for too much effort and complexity, and so haven't and won't be implemented in the near future. With a superfilter, the problem is that to know which filter actually is responsible for blocking (e.g. for the Adblock-able Items dialog) you need to use the slower algorithm anyways, so the superfilter really only serves to speed up the cases when the URL won't be blocked by any of the filters.

If you have so many filters that it's causing a noticeable delay in page-rendering, you're probably not using regular expressions correctly. Anything under two dozen filters should be more than enough for most people, giving a good balance of speed and editability/clarity.

The reason your suggestion of implementing a "deterministic minimal finite state machine" met with such a resounding silence is that you essentially asked us to write a RE library in JavaScript and then distribute it in every single copy of Adblock, raising download sizes by Dijkstra-knows-how-much.

Also, the size of the url is actually not relevant for REs/FSMs, and is not a linear factor in raw-string-based code... in context to which the comparison code runs, k can be treated as 1.


As for getting involved with developing Adblock... all you need is a tool to unzip the XPI... from there, it's all XUL and JavaScript files.


"deterministic finite state machine" : "regular expression" :: "dihydrogen monoxide" : water
Back to top
View user's profile Send private message Send e-mail Visit poster's website AIM Address
Display posts from previous:   
Post new topic   Reply to topic    The Adblock Project Forum Index -> Main All times are GMT + 1 Hour
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group