View previous topic :: View next topic |
Author |
Message |
idofeld Guest
|
Posted: Mon Sep 13, 2004 Post subject: Algorithm for pattern matching - developers please read |
|
|
Hi,
I have not read the source code of adblock - so please let me know if what I'm about to suggest is already implemented.
I would like to know how the pattern matching is performed. Is every object being compared against any of the filters serially?
The main issue with such approach is that the runnning time is linear with respect to the size of the filters - which might slow things down when using a large number of filters.
What I suggest to do is to create a deterministic minimal finite state machine from the group of filters. This should be done only once (when the filters list is complete). Code for such algorithms (which are quite simple) is widely available, and it has been used vastly in compilers.
After that, comparing an object to the entire list is done in constant time. This could speed up adblock a lot - and make it work fast even with hugh number of filters.
Please let me know what you think.
Ido |
|
Back to top |
|
 |
Guest
|
Posted: Mon Sep 13, 2004 Post subject: |
|
|
*yawn* someone trying to impress others with what they just learned in their CIS classes? |
|
Back to top |
|
 |
idofeld Guest
|
Posted: Mon Sep 13, 2004 Post subject: I will disregard the mocking, yet... |
|
|
Following is the piece of code that compares the patterns - The core of adblock. This loops runs in time n*k - where n is the number of patterns and k is the length of the url.
The algorithm I suggested runs in time k - simply the size of the url. This means that even if the number of patterns is large, the running time will not change.
The same code can be also be used for whitelist checking. Please consider implementing it.
Code: |
// This method checks if a picture matches a filter.
// It returns true if the imageurl is caught by one of the filters.
// It returns false if the imageurl is not caught by any filter.
function validate(imageurl) {
for (var i = 0 ; i < Patterns.length ; i++) {
if (Patterns[i].test(imageurl))
return true;
}
return false;
} |
|
|
Back to top |
|
 |
rue Developer
Joined: 22 Oct 2003 Posts: 752
|
Posted: Mon Sep 13, 2004 Post subject: |
|
|
idofeld:
Why not link a few examples of the code you're envisioning?
.
Per Guest's remark, which course are you taking? |
|
Back to top |
|
 |
Guest
|
|
Back to top |
|
 |
Org
Joined: 23 Oct 2003 Posts: 349
|
Posted: Wed Sep 15, 2004 Post subject: |
|
|
Grail is written in C++. Compiled binary modules don't fit too well to the platform indepedant nature of Firefox extensions.
Call me pessimist, but I suspect that porting Grail to JS would probably make it slow and difficult to debug and update.
Or would you rather distrubute a Grail library for all possible platforms? |
|
Back to top |
|
 |
idofeld Guest
|
|
Back to top |
|
 |
idofeld Guest
|
Posted: Thu Sep 16, 2004 Post subject: |
|
|
You know what - I don't think I'll mind coding it myself. It might be a nice primer on JS.
Who can I talk with regarding developing such features?
Ido |
|
Back to top |
|
 |
Guest Guest
|
Posted: Thu Sep 16, 2004 Post subject: Better pattern matching algorithms |
|
|
There are better multi-pattern matching algorithms available. See this book for instance. |
|
Back to top |
|
 |
Guest
|
Posted: Fri Sep 17, 2004 Post subject: |
|
|
This has beend discussed before: e.g. here
I'd say that a few tests should be made to figure out how large the performance impact really is. But if a function for matching against multiple regexps is getting implemented it should be done as A XPCOM module from the start since this has many uses outside of adblock too. |
|
Back to top |
|
 |
ChrisQ Guest
|
Posted: Fri Sep 17, 2004 Post subject: Please note that Grail+ is not free. |
|
|
Though adblock is not commercial I would certainly get permission before using Grail. Without knowing there terms it is hard to know whether using it would limit use of adblock on browsers in commercial organisations, etc.
Please note that Grail+ is not free.
We don't charge scholars, students, or researchers for the use of Grail+, and we don't charge people who simply want to play with it to satisfy their own curiosity.
But no commercial use of Grail+ is permitted without our prior, express, written consent. No part of Grail+ may be included in a commercial product or used on a commercial problem without our prior, express, written consent. |
|
Back to top |
|
 |
idofeld Guest
|
Posted: Fri Sep 17, 2004 Post subject: Another thing that crossed my mind |
|
|
I was wondering how is the regular expression matching imlemented in javascript.
Maybe it will be faster to check for "RegExp1" or "RegExp2" or ... or "RegExpN" (in one big pattern) instead of checking n time against 1 pattern?
Can someone clarify it? |
|
Back to top |
|
 |
wonkothesane The Other Developer
Joined: 22 May 2004 Posts: 210
|
Posted: Wed Sep 22, 2004 Post subject: Re: Another thing that crossed my mind |
|
|
idofeld wrote: | I was wondering how is the regular expression matching imlemented in javascript.
Maybe it will be faster to check for "RegExp1" or "RegExp2" or ... or "RegExpN" (in one big pattern) instead of checking n time against 1 pattern?
Can someone clarify it? |
Ah, now you're on the right track. Regular expressions are nothing more than a way of representing a finite-state machine in text. Creating a new RegExp() object causes the RE to be compiled into a form that is analagous to a FSM, so the algorithm you were looking for earlier is essentially var re = new RegExp(reString); ... not difficult at all
Very simplistic testing will show that in the SpiderMonkey implementation of JavaScript, raw-string-based (String.indexOf(s);) and RE-based (regExp.test(s);) string comparison have virtual identical time requirements, as long as the RE is not re-created on every iteration of the loop. The time taken for var re = new RegExp(reString) is roughly equivalent to the time taken for calling re.test(s);
There are a number of optimizations that could concievably be made in Adblock, including creating a "superfilter" (which I actually suggested to rue in mid-August), but in general, they give too little performance for too much effort and complexity, and so haven't and won't be implemented in the near future. With a superfilter, the problem is that to know which filter actually is responsible for blocking (e.g. for the Adblock-able Items dialog) you need to use the slower algorithm anyways, so the superfilter really only serves to speed up the cases when the URL won't be blocked by any of the filters.
If you have so many filters that it's causing a noticeable delay in page-rendering, you're probably not using regular expressions correctly. Anything under two dozen filters should be more than enough for most people, giving a good balance of speed and editability/clarity.
The reason your suggestion of implementing a "deterministic minimal finite state machine" met with such a resounding silence is that you essentially asked us to write a RE library in JavaScript and then distribute it in every single copy of Adblock, raising download sizes by Dijkstra-knows-how-much.
Also, the size of the url is actually not relevant for REs/FSMs, and is not a linear factor in raw-string-based code... in context to which the comparison code runs, k can be treated as 1.
As for getting involved with developing Adblock... all you need is a tool to unzip the XPI... from there, it's all XUL and JavaScript files.
"deterministic finite state machine" : "regular expression" :: "dihydrogen monoxide" : water |
|
Back to top |
|
 |
|