 |
The Adblock Project Pull up a seat ...stay a while.
|
View previous topic :: View next topic |
Author |
Message |
mcm_ham

Joined: 17 Dec 2004 Posts: 310
|
Posted: Sun Feb 27, 2005 Post subject: Adblock RegExp Speed Question! |
|
|
Ok, I know that you can speed up the filter list by grouping words that start the same like so:
.ad.
.adserver.
.adimg.
/\Wad(server|img)?\W/
But I also know that grouping together words that end the same does not:
.blogad.
.getad.
.onlinead.
/\W(blog|get|online)ad\W/
This is because it starts the pattern matching from the beginning of the url. And since it has to initially test the three or cases this is really just the same as having three different filters. However, in trying to understand the regexp documentation it seems possible to start pattern matching from the end. In which case the above filter will bring about a speed increase. I gather this is done like this:
/\W(blog|get|online)ad\W/g
Where 'g' at the end represents the last index flag. However, AdBlock I've been told automatically handles case sensitive issues by making filters case insensitive and the '/' character appearing in the middle of a filter by escaping like this "\/". Of course if it was intelligent it would only escape the '/' characters that appear between the outer two. If it does escape the last one because there is a character after it then the flag wouldn't work.
So my question is, have I understood the situation correctly and is there a way to start pattern matching from the end for speed improvements?
Thanks for any help. |
|
Back to top |
|
 |
Ken Cooper
Joined: 29 Dec 2004 Posts: 110 Location: Holland, MI USA
|
Posted: Mon Feb 28, 2005 Post subject: Re: Adblock RegExp Speed Question! |
|
|
mcm_ham wrote: |
But I also know that grouping together words that end the same does not:
.blogad.
.getad.
.onlinead.
/\W(blog|get|online)ad\W/
|
How do you know this?
mcm_ham wrote: |
This is because it starts the pattern matching from the beginning of the url. And since it has to initially test the three or cases this is really just the same as having three different filters.
|
I don't believe this to be true, so I will defer to someone who does know for sure.
mcm_ham wrote: |
However, in trying to understand the regexp documentation it seems possible to start pattern matching from the end. In which case the above filter will bring about a speed increase. I gather this is done like this:
/\W(blog|get|online)ad\W/g
|
The 'g' flag is global, and says to continue after the initial find, which Adblock does by default anyway.
mcm_ham wrote: |
However, AdBlock I've been told automatically handles case sensitive issues by making filters case insensitive [...]
|
True.
mcm_ham wrote: |
[...] and the '/' character appearing in the middle of a filter by escaping like this "\/".
|
True.
mcm_ham wrote: |
Of course if it was intelligent it would only escape the '/' characters that appear between the outer two.
|
It does.
mcm_ham wrote: |
If it does escape the last one because there is a character after it then the flag wouldn't work.
|
It doesn't
mcm_ham wrote: |
So my question is, have I understood the situation correctly and is there a way to start pattern matching from the end for speed improvements?
Thanks for any help. |
Forward or backwards it should be the same speed. _________________ My Firefox Page
NOTE: Firefox is spelled “F-i-r-e-f-o-x”; only the first letter capitalized. The preferred abbreviation is “Fx” or “fx”. |
|
Back to top |
|
 |
mcm_ham

Joined: 17 Dec 2004 Posts: 310
|
Posted: Mon Feb 28, 2005 Post subject: |
|
|
rue wrote: | Regexp can compress "patterns" for speed. But, for unique terms, it wouldn't do much.
Ex: ad, adstuff, adworld -> slower than -> /ad(stuff|world)/
:
Ex: un, related, terms -> nearly same -> /un|related|terms/ |
rue wrote: | Anonymous wrote: | maybe I'm not understanding how adblock works but wouldn't a long list slow down page load time? |
Actually it wouldn't. As I explained, the latest builds ignore beginning / ending wildcards in Simple Filters. The list is fine. |
rue wrote: | 3. all pre-existing forward-slashes must be turned into literals. just search+replace '/' with '\/'. |
Ok, a lot of those suppositions were my own so I don't know for sure. The quotes are the best info I can find but don't help much. The third quote could be for an old version though. |
|
Back to top |
|
 |
wonkothesane The Other Developer
Joined: 22 May 2004 Posts: 210
|
Posted: Mon Feb 28, 2005 Post subject: |
|
|
You're free to look at the actual code Adblock uses for details, but this is roughly how things work:
If a filter starts with '/' and ends with '/' (e.g. matches /^\/.*\/$/) then the beginning and ending slashes are stripped off (filter.slice(1,-1) and the resulting filter is passed to the RegExp constructor to create the RegExp object to use to .test() against each URL on the page as passed to shouldLoad. The RegExp constructor automatically escapes slashes for us, which is why you can write filters like /http://example\.com/ and have it work.
However, when you add that 'g' to the end, it makes Adblock think you haven't entered a regexp, but a plain string to be matched against instead (think: /ads/*). Oh, and Ken: no, Adblock does not make RegExps global; not much point, when all we're interested in is a boolean yes-or-no decision on whether a URL matched the pattern.
Non-regexp patterns have "special" characters escaped, leading and trailing asterisks removed, and "internal" asterisks converted to ".*?"; the resulting pattern is passed to the RegExp constructor.
Before making any assumptions about speed, it would behoove you to construct simple iterative tests in the JavaScript Console to get a better handle of how much time it actually takes to call .test(). As it turns out, callling .test() on a RegExp object is just as fast as a string-based indexOf(), which certainly surprised rue and myself. In 0.5, there was a distinction between "simple" filters and regexps, because rue assumed that .indexOf() was faster than .test(); later testing shattered that assumption.
**edit**
In some cases, .test can be much (4x) faster than .indexOf. It's most noticable for short strings; when the test-string gets to ~200 characters, the methods are about equal in speed. Try copy & pasting the following two code snippets in to the JS Console. I get ~760 for .indexOf and ~188 for .test
Code: |
var st =new Date().getTime(); var i = 64000, checkRe = /favicon\.ico/; while(i--) { ('http://aasted.org/adblock/favicon.ico').indexOf('favicon.ico') != -1; } var d = new Date().getTime() ; d-st
var st =new Date().getTime(); var i = 64000, checkRe = /favicon\.ico/; while(--i) { checkRe.test('http://aasted.org/adblock/favicon.ico'); } var d = new Date().getTime() ; d-st;
|
And yes, note that on my machine (2.4 GHz P4) sixty four thousand calls to .test() takes a total of 188 milliseconds, or a fifth of a second. Virtually none of the time Adblock spends executing in shouldLoad is spent on the actual RegExp comparisons; it's all the surrounding code before and after the .test that's the bulk of execution time.
**/edit**
In any case, the length of the string being matched and the complexity of the RegExp do have some impact on execution speed, but not nearly as much as you might think. In general, if you want to make Adblock faster, condense your filterlist into as few filters as possible.
In response to your original question: no, there's no way to tell the RegExp engine what to do.
It is not necessary to escape forward slashes.
A long list will take longer to iterate through than a shorter list. Whether that translates into a noticeable slow-down to the user will depend of more factors than list length alone. |
|
Back to top |
|
 |
Ken Cooper
Joined: 29 Dec 2004 Posts: 110 Location: Holland, MI USA
|
Posted: Mon Feb 28, 2005 Post subject: |
|
|
wonkothesane wrote: | Oh, and Ken: no, Adblock does not make RegExps global; not much point, when all we're interested in is a boolean yes-or-no decision on whether a URL matched the pattern.
|
Once a match is found the same filter can be used over again, making it global?
wonkothesane wrote: |
In some cases, .test can be much (4x) faster than .indexOf. It's most noticable for short strings; when the test-string gets to ~200 characters, the methods are about equal in speed. |
If I understand you correctly, and please confirm or correct me if I am wrong, a filter string in excess of 200 characters is slow, and should be made into two or more shorter filter strings?
What would be the most efficient maximum RegEx filter string length? _________________ My Firefox Page
NOTE: Firefox is spelled “F-i-r-e-f-o-x”; only the first letter capitalized. The preferred abbreviation is “Fx” or “fx”. |
|
Back to top |
|
 |
rue Developer
Joined: 22 Oct 2003 Posts: 752
|
Posted: Mon Feb 28, 2005 Post subject: |
|
|
Ken:
Wonko was referring to the url's length, not the filter's. He also didn't offer a complete view of the performance-difference: the trend wrote: | filter1: /myShortFilter/
filter2: "myShortFilter"
url length-50: filter1 is faster
url length-176: both filters run equal
url length-500: filter2 is faster |
As for forward-slashes, Adblock v.5 was permissive in allowing them unescaped. But the JS RegExp literal syntax requires all forward-slashes be preceeded by a backslash. Since the details that permit the "lack of escape" might change at any point, it's rather advised to stay with proper syntax. |
|
Back to top |
|
 |
Ken Cooper
Joined: 29 Dec 2004 Posts: 110 Location: Holland, MI USA
|
Posted: Mon Feb 28, 2005 Post subject: |
|
|
Okay Rue, a few questions:
If you know most URL's will be <176 characters use a RegEx, otherwise use Simple?
In regards to the forward-slash, are you advocating that all forward-slashes now be escaped to comply with the literal syntax? Will 0.6 require these to be escaped, or can we keep cheating? _________________ My Firefox Page
NOTE: Firefox is spelled “F-i-r-e-f-o-x”; only the first letter capitalized. The preferred abbreviation is “Fx” or “fx”. |
|
Back to top |
|
 |
wonkothesane The Other Developer
Joined: 22 May 2004 Posts: 210
|
Posted: Mon Feb 28, 2005 Post subject: |
|
|
Ken: Being 'global' means a regexp will continue to search a string after it's found a match. For example, /\d/.test('1 plus 2 is 3.'); would (should) stop after the first character has satisfied the RE, but if that re were global (/\d/g) the whole string would be searched, which is unnecessary.
As I noted before: we're talking about a hundredth of a millisecond execution time for re.test(string). It's much faster for Adblock to test one long regexp than three short ones.
Technically, it's not cheating to not escape forward-slashes. Filter strings are not constructed with the literal syntax, which does require escaped forward slashes; instead, the string (minus leading and trailing slashes) is passed to the RegExp constructor. The above method ("details", as rue calls it) is highly unlikely to change.
In future-versions, simple filters will most-likely be transformed into RegExp -- meaning the trend rue posted will remain specific to v.5. Additional Note: it's quite rare to see URLs over 200 characters or so. |
|
Back to top |
|
 |
Ken Cooper
Joined: 29 Dec 2004 Posts: 110 Location: Holland, MI USA
|
Posted: Mon Feb 28, 2005 Post subject: |
|
|
wonkothesane wrote: | Ken: Being 'global' means a regexp will continue to search a string after it's found a match. For example, /\d/.test('1 plus 2 is 3.'); would (should) stop after the first character has satisfied the RE, but if that re were global (/\d/g) the whole string would be searched, which is unnecessary.
|
Thanks for the explanation wonkothesane. My idea of "global" did not take into consideration the continuance of the string. So therefore adding the 'g' flag in this application, would actually slow down Adblock.
wonkothesane wrote: |
It's much faster for Adblock to test one long regexp than three short ones.
|
Noted!
wonkothesane wrote: |
[...] meaning the trend rue posted will remain specific to v.5. [...]
|
Noted! _________________ My Firefox Page
NOTE: Firefox is spelled “F-i-r-e-f-o-x”; only the first letter capitalized. The preferred abbreviation is “Fx” or “fx”. |
|
Back to top |
|
 |
Paulfox Guest
|
Posted: Tue Mar 01, 2005 Post subject: Great Stuff! |
|
|
Well, this is perhaps the best thread EVER - both KenCooper and mcm_ham have taught me LOADS of stuff, and I learned more today here, because the three of us are "tightening the screws" of knowledge with this script.
Just fantastic. The not having to escape forward slashes is the reason my yahoo filter works (and well!); I may post here later with a couple of specifics myself but want to make it as specific, short and sweet as possible.
Many thanks to all here . . . AdBlock is just AMAZING and anyone connected with it should be knighted! |
|
Back to top |
|
 |
mcm_ham

Joined: 17 Dec 2004 Posts: 310
|
Posted: Tue Mar 01, 2005 Post subject: |
|
|
Thanks for all the information, indexOf being slower then RegExp suprises me too. I think you are right wonkothesane writing some simple tests in java instead of making assumptions is the way to go, either that or try and work out the Big Oh notation to see how it scales. |
|
Back to top |
|
 |
MW Guest
|
Posted: Tue Mar 08, 2005 Post subject: |
|
|
rue wrote: | Ken:
Wonko was referring to the url's length, not the filter's. He also didn't offer a complete view of the performance-difference: the trend wrote: | filter1: /myShortFilter/
filter2: "myShortFilter"
url length-50: filter1 is faster
url length-176: both filters run equal
url length-500: filter2 is faster |
As for forward-slashes, Adblock v.5 was permissive in allowing them unescaped. But the JS RegExp literal syntax requires all forward-slashes be preceeded by a backslash. Since the details that permit the "lack of escape" might change at any point, it's rather advised to stay with proper syntax. |
rue: based on the .test() .indexOf() stuff mentioned before and the 200+ url length, does this post still hold true
http://aasted.org/adblock/viewtopic.php?p=1360#1360
rue wrote: | Posted: Mon Jan 12, 2004 Post subject: Reply with quote
MW:
I know you wouldn't inherently know this, but simple-filters which include a complete host-name are matched faster than any other filter. This is because all such filters' hostnames are copied into a hash and key-matched against the element's host.
.
So http://pagead2.googlesyndication.com/ matches faster than /\/pagead2.googlesyndication.com\//. |
|
|
Back to top |
|
 |
Guest
|
Posted: Tue Mar 08, 2005 Post subject: |
|
|
wonkothesane wrote: |
Before making any assumptions about speed, it would behoove you to construct simple iterative tests in the JavaScript Console to get a better handle of how much time it actually takes to call .test(). As it turns out, callling .test() on a RegExp object is just as fast as a string-based indexOf(), which certainly surprised rue and myself. In 0.5, there was a distinction between "simple" filters and regexps, because rue assumed that .indexOf() was faster than .test(); later testing shattered that assumption.
**edit**
In some cases, .test can be much (4x) faster than .indexOf. It's most noticable for short strings; when the test-string gets to ~200 characters, the methods are about equal in speed. Try copy & pasting the following two code snippets in to the JS Console. I get ~760 for .indexOf and ~188 for .test
Code: |
var st =new Date().getTime(); var i = 64000, checkRe = /favicon\.ico/; while(i--) { ('http://aasted.org/adblock/favicon.ico').indexOf('favicon.ico') != -1; } var d = new Date().getTime() ; d-st
var st =new Date().getTime(); var i = 64000, checkRe = /favicon\.ico/; while(--i) { checkRe.test('http://aasted.org/adblock/favicon.ico'); } var d = new Date().getTime() ; d-st;
|
And yes, note that on my machine (2.4 GHz P4) sixty four thousand calls to .test() takes a total of 188 milliseconds, or a fifth of a second. Virtually none of the time Adblock spends executing in shouldLoad is spent on the actual RegExp comparisons; it's all the surrounding code before and after the .test that's the bulk of execution time.
**/edit**
| I actually get the exact opposite results on a slow processor indexOf() is QUICKER than test()
test machine PII 400
.indexOf = ~ 2690
.test = ~ 19390
lowering to 4000 calls results in
.indexOf = ~ 170
.test = ~ 1260
so the comparision test is very dependant on processor speed (duh hehe since we are calculating lol). The actual ratio increased exponentially as i increased the calls. |
|
Back to top |
|
 |
someone
Joined: 06 Mar 2005 Posts: 14
|
Posted: Tue Mar 08, 2005 Post subject: |
|
|
Maybe it could be an option on 0.6
Another suggestion is to have a little stop watch type of thing to show users how much time it is taking to block the ads. |
|
Back to top |
|
 |
wonkothesane The Other Developer
Joined: 22 May 2004 Posts: 210
|
Posted: Wed Mar 09, 2005 Post subject: |
|
|
MW: It's still true for v.5 -- that's all I'm allowed to say.
Guest: That's odd. I wonder if the relative speed difference is actually processor-based or dependent on another factor, like RAM (I've got a healthy 1 GB). Still, though: 4000 .test() calls take just over a second to complete, which means each call is using a fourth of a millisecond of processor time. Since each page is unlikely to have more than 80 calls to .test(), I think 20 ms of processor time per page on a 400 MHz Pentium II devoted to .test() isn't too exorbiant. That's rather more than most pages will generate (where calls = filters * elements).
someone: no, being able to treat all filters as regexp objects far outweighs the minor speed delta between .test and .indexOf. |
|
Back to top |
|
 |
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
Powered by phpBB © 2001, 2005 phpBB Group
|