The Adblock Project Forum Index The Adblock Project
Pull up a seat ...stay a while.
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Adblock RegExp Speed Question!

 
Post new topic   Reply to topic    The Adblock Project Forum Index -> Main
View previous topic :: View next topic  
Author Message
mcm_ham



Joined: 17 Dec 2004
Posts: 310

PostPosted: Sun Feb 27, 2005    Post subject: Adblock RegExp Speed Question! Reply with quote

Ok, I know that you can speed up the filter list by grouping words that start the same like so:

.ad.
.adserver.
.adimg.

/\Wad(server|img)?\W/

But I also know that grouping together words that end the same does not:

.blogad.
.getad.
.onlinead.

/\W(blog|get|online)ad\W/

This is because it starts the pattern matching from the beginning of the url. And since it has to initially test the three or cases this is really just the same as having three different filters. However, in trying to understand the regexp documentation it seems possible to start pattern matching from the end. In which case the above filter will bring about a speed increase. I gather this is done like this:

/\W(blog|get|online)ad\W/g

Where 'g' at the end represents the last index flag. However, AdBlock I've been told automatically handles case sensitive issues by making filters case insensitive and the '/' character appearing in the middle of a filter by escaping like this "\/". Of course if it was intelligent it would only escape the '/' characters that appear between the outer two. If it does escape the last one because there is a character after it then the flag wouldn't work.

So my question is, have I understood the situation correctly and is there a way to start pattern matching from the end for speed improvements?
Thanks for any help.
Back to top
View user's profile Send private message Visit poster's website
Ken Cooper



Joined: 29 Dec 2004
Posts: 110
Location: Holland, MI USA

PostPosted: Mon Feb 28, 2005    Post subject: Re: Adblock RegExp Speed Question! Reply with quote

mcm_ham wrote:

But I also know that grouping together words that end the same does not:

.blogad.
.getad.
.onlinead.

/\W(blog|get|online)ad\W/


How do you know this?

mcm_ham wrote:

This is because it starts the pattern matching from the beginning of the url. And since it has to initially test the three or cases this is really just the same as having three different filters.


I don't believe this to be true, so I will defer to someone who does know for sure.

mcm_ham wrote:

However, in trying to understand the regexp documentation it seems possible to start pattern matching from the end. In which case the above filter will bring about a speed increase. I gather this is done like this:

/\W(blog|get|online)ad\W/g


The 'g' flag is global, and says to continue after the initial find, which Adblock does by default anyway.

mcm_ham wrote:

However, AdBlock I've been told automatically handles case sensitive issues by making filters case insensitive [...]


True.

mcm_ham wrote:

[...] and the '/' character appearing in the middle of a filter by escaping like this "\/".


True.

mcm_ham wrote:

Of course if it was intelligent it would only escape the '/' characters that appear between the outer two.


It does.

mcm_ham wrote:

If it does escape the last one because there is a character after it then the flag wouldn't work.


It doesn't

mcm_ham wrote:

So my question is, have I understood the situation correctly and is there a way to start pattern matching from the end for speed improvements?
Thanks for any help.


Forward or backwards it should be the same speed.
_________________
My Firefox Page
NOTE: Firefox is spelled F-i-r-e-f-o-x; only the first letter capitalized. The preferred abbreviation is Fx or fx.
Back to top
View user's profile Send private message
mcm_ham



Joined: 17 Dec 2004
Posts: 310

PostPosted: Mon Feb 28, 2005    Post subject: Reply with quote

rue wrote:
Regexp can compress "patterns" for speed. But, for unique terms, it wouldn't do much.

Ex: ad, adstuff, adworld -> slower than -> /ad(stuff|world)/
:
Ex: un, related, terms -> nearly same -> /un|related|terms/

rue wrote:
Anonymous wrote:
maybe I'm not understanding how adblock works but wouldn't a long list slow down page load time?

Actually it wouldn't. As I explained, the latest builds ignore beginning / ending wildcards in Simple Filters. The list is fine.

rue wrote:
3. all pre-existing forward-slashes must be turned into literals. just search+replace '/' with '\/'.


Ok, a lot of those suppositions were my own so I don't know for sure. The quotes are the best info I can find but don't help much. The third quote could be for an old version though.
Back to top
View user's profile Send private message Visit poster's website
wonkothesane
The Other Developer


Joined: 22 May 2004
Posts: 210

PostPosted: Mon Feb 28, 2005    Post subject: Reply with quote

You're free to look at the actual code Adblock uses for details, but this is roughly how things work:

If a filter starts with '/' and ends with '/' (e.g. matches /^\/.*\/$/) then the beginning and ending slashes are stripped off (filter.slice(1,-1)Wink and the resulting filter is passed to the RegExp constructor to create the RegExp object to use to .test() against each URL on the page as passed to shouldLoad. The RegExp constructor automatically escapes slashes for us, which is why you can write filters like /http://example\.com/ and have it work.

However, when you add that 'g' to the end, it makes Adblock think you haven't entered a regexp, but a plain string to be matched against instead (think: /ads/*). Oh, and Ken: no, Adblock does not make RegExps global; not much point, when all we're interested in is a boolean yes-or-no decision on whether a URL matched the pattern.

Non-regexp patterns have "special" characters escaped, leading and trailing asterisks removed, and "internal" asterisks converted to ".*?"; the resulting pattern is passed to the RegExp constructor.

Before making any assumptions about speed, it would behoove you to construct simple iterative tests in the JavaScript Console to get a better handle of how much time it actually takes to call .test(). As it turns out, callling .test() on a RegExp object is just as fast as a string-based indexOf(), which certainly surprised rue and myself. In 0.5, there was a distinction between "simple" filters and regexps, because rue assumed that .indexOf() was faster than .test(); later testing shattered that assumption.

**edit**
In some cases, .test can be much (4x) faster than .indexOf. It's most noticable for short strings; when the test-string gets to ~200 characters, the methods are about equal in speed. Try copy & pasting the following two code snippets in to the JS Console. I get ~760 for .indexOf and ~188 for .test

Code:

var st =new Date().getTime(); var i = 64000, checkRe = /favicon\.ico/; while(i--) { ('http://aasted.org/adblock/favicon.ico').indexOf('favicon.ico') != -1; } var d = new Date().getTime() ; d-st

var st =new Date().getTime(); var i = 64000, checkRe = /favicon\.ico/; while(--i) { checkRe.test('http://aasted.org/adblock/favicon.ico'); } var d = new Date().getTime() ; d-st;


And yes, note that on my machine (2.4 GHz P4) sixty four thousand calls to .test() takes a total of 188 milliseconds, or a fifth of a second. Virtually none of the time Adblock spends executing in shouldLoad is spent on the actual RegExp comparisons; it's all the surrounding code before and after the .test that's the bulk of execution time.
**/edit**

In any case, the length of the string being matched and the complexity of the RegExp do have some impact on execution speed, but not nearly as much as you might think. In general, if you want to make Adblock faster, condense your filterlist into as few filters as possible.

In response to your original question: no, there's no way to tell the RegExp engine what to do.

It is not necessary to escape forward slashes.

A long list will take longer to iterate through than a shorter list. Whether that translates into a noticeable slow-down to the user will depend of more factors than list length alone.
Back to top
View user's profile Send private message Send e-mail Visit poster's website AIM Address
Ken Cooper



Joined: 29 Dec 2004
Posts: 110
Location: Holland, MI USA

PostPosted: Mon Feb 28, 2005    Post subject: Reply with quote

wonkothesane wrote:
Oh, and Ken: no, Adblock does not make RegExps global; not much point, when all we're interested in is a boolean yes-or-no decision on whether a URL matched the pattern.


Once a match is found the same filter can be used over again, making it global?

wonkothesane wrote:

In some cases, .test can be much (4x) faster than .indexOf. It's most noticable for short strings; when the test-string gets to ~200 characters, the methods are about equal in speed.


If I understand you correctly, and please confirm or correct me if I am wrong, a filter string in excess of 200 characters is slow, and should be made into two or more shorter filter strings?

What would be the most efficient maximum RegEx filter string length?
_________________
My Firefox Page
NOTE: Firefox is spelled F-i-r-e-f-o-x; only the first letter capitalized. The preferred abbreviation is Fx or fx.
Back to top
View user's profile Send private message
rue
Developer


Joined: 22 Oct 2003
Posts: 752

PostPosted: Mon Feb 28, 2005    Post subject: Reply with quote

Ken:
Wonko was referring to the url's length, not the filter's. He also didn't offer a complete view of the performance-difference:
the trend wrote:
filter1: /myShortFilter/
filter2: "myShortFilter"
url length-50: filter1 is faster
url length-176: both filters run equal
url length-500: filter2 is faster

As for forward-slashes, Adblock v.5 was permissive in allowing them unescaped. But the JS RegExp literal syntax requires all forward-slashes be preceeded by a backslash. Since the details that permit the "lack of escape" might change at any point, it's rather advised to stay with proper syntax.
Back to top
View user's profile Send private message
Ken Cooper



Joined: 29 Dec 2004
Posts: 110
Location: Holland, MI USA

PostPosted: Mon Feb 28, 2005    Post subject: Reply with quote

Okay Rue, a few questions:

If you know most URL's will be <176 characters use a RegEx, otherwise use Simple?

In regards to the forward-slash, are you advocating that all forward-slashes now be escaped to comply with the literal syntax? Will 0.6 require these to be escaped, or can we keep cheating?
_________________
My Firefox Page
NOTE: Firefox is spelled F-i-r-e-f-o-x; only the first letter capitalized. The preferred abbreviation is Fx or fx.
Back to top
View user's profile Send private message
wonkothesane
The Other Developer


Joined: 22 May 2004
Posts: 210

PostPosted: Mon Feb 28, 2005    Post subject: Reply with quote

Ken: Being 'global' means a regexp will continue to search a string after it's found a match. For example, /\d/.test('1 plus 2 is 3.'); would (should) stop after the first character has satisfied the RE, but if that re were global (/\d/g) the whole string would be searched, which is unnecessary.

As I noted before: we're talking about a hundredth of a millisecond execution time for re.test(string). It's much faster for Adblock to test one long regexp than three short ones.

Technically, it's not cheating to not escape forward-slashes. Filter strings are not constructed with the literal syntax, which does require escaped forward slashes; instead, the string (minus leading and trailing slashes) is passed to the RegExp constructor. The above method ("details", as rue calls it) is highly unlikely to change.

In future-versions, simple filters will most-likely be transformed into RegExp -- meaning the trend rue posted will remain specific to v.5. Additional Note: it's quite rare to see URLs over 200 characters or so.
Back to top
View user's profile Send private message Send e-mail Visit poster's website AIM Address
Ken Cooper



Joined: 29 Dec 2004
Posts: 110
Location: Holland, MI USA

PostPosted: Mon Feb 28, 2005    Post subject: Reply with quote

wonkothesane wrote:
Ken: Being 'global' means a regexp will continue to search a string after it's found a match. For example, /\d/.test('1 plus 2 is 3.'); would (should) stop after the first character has satisfied the RE, but if that re were global (/\d/g) the whole string would be searched, which is unnecessary.


Thanks for the explanation wonkothesane. My idea of "global" did not take into consideration the continuance of the string. So therefore adding the 'g' flag in this application, would actually slow down Adblock.

wonkothesane wrote:

It's much faster for Adblock to test one long regexp than three short ones.


Noted!

wonkothesane wrote:

[...] meaning the trend rue posted will remain specific to v.5. [...]


Noted!
_________________
My Firefox Page
NOTE: Firefox is spelled F-i-r-e-f-o-x; only the first letter capitalized. The preferred abbreviation is Fx or fx.
Back to top
View user's profile Send private message
Paulfox
Guest





PostPosted: Tue Mar 01, 2005    Post subject: Great Stuff! Reply with quote

Well, this is perhaps the best thread EVER - both KenCooper and mcm_ham have taught me LOADS of stuff, and I learned more today here, because the three of us are "tightening the screws" of knowledge with this script.

Just fantastic. The not having to escape forward slashes is the reason my yahoo filter works (and well!); I may post here later with a couple of specifics myself but want to make it as specific, short and sweet as possible.

Many thanks to all here . . . AdBlock is just AMAZING and anyone connected with it should be knighted!
Back to top
mcm_ham



Joined: 17 Dec 2004
Posts: 310

PostPosted: Tue Mar 01, 2005    Post subject: Reply with quote

Thanks for all the information, indexOf being slower then RegExp suprises me too. I think you are right wonkothesane writing some simple tests in java instead of making assumptions is the way to go, either that or try and work out the Big Oh notation to see how it scales.
Back to top
View user's profile Send private message Visit poster's website
MW
Guest





PostPosted: Tue Mar 08, 2005    Post subject: Reply with quote

rue wrote:
Ken:
Wonko was referring to the url's length, not the filter's. He also didn't offer a complete view of the performance-difference:
the trend wrote:
filter1: /myShortFilter/
filter2: "myShortFilter"
url length-50: filter1 is faster
url length-176: both filters run equal
url length-500: filter2 is faster

As for forward-slashes, Adblock v.5 was permissive in allowing them unescaped. But the JS RegExp literal syntax requires all forward-slashes be preceeded by a backslash. Since the details that permit the "lack of escape" might change at any point, it's rather advised to stay with proper syntax.

rue: based on the .test() .indexOf() stuff mentioned before and the 200+ url length, does this post still hold true

http://aasted.org/adblock/viewtopic.php?p=1360#1360
rue wrote:
Posted: Mon Jan 12, 2004 Post subject: Reply with quote
MW:
I know you wouldn't inherently know this, but simple-filters which include a complete host-name are matched faster than any other filter. This is because all such filters' hostnames are copied into a hash and key-matched against the element's host.
.
So http://pagead2.googlesyndication.com/ matches faster than /\/pagead2.googlesyndication.com\//.
Back to top
Guest






PostPosted: Tue Mar 08, 2005    Post subject: Reply with quote

wonkothesane wrote:


Before making any assumptions about speed, it would behoove you to construct simple iterative tests in the JavaScript Console to get a better handle of how much time it actually takes to call .test(). As it turns out, callling .test() on a RegExp object is just as fast as a string-based indexOf(), which certainly surprised rue and myself. In 0.5, there was a distinction between "simple" filters and regexps, because rue assumed that .indexOf() was faster than .test(); later testing shattered that assumption.

**edit**
In some cases, .test can be much (4x) faster than .indexOf. It's most noticable for short strings; when the test-string gets to ~200 characters, the methods are about equal in speed. Try copy & pasting the following two code snippets in to the JS Console. I get ~760 for .indexOf and ~188 for .test

Code:

var st =new Date().getTime(); var i = 64000, checkRe = /favicon\.ico/; while(i--) { ('http://aasted.org/adblock/favicon.ico').indexOf('favicon.ico') != -1; } var d = new Date().getTime() ; d-st

var st =new Date().getTime(); var i = 64000, checkRe = /favicon\.ico/; while(--i) { checkRe.test('http://aasted.org/adblock/favicon.ico'); } var d = new Date().getTime() ; d-st;


And yes, note that on my machine (2.4 GHz P4) sixty four thousand calls to .test() takes a total of 188 milliseconds, or a fifth of a second. Virtually none of the time Adblock spends executing in shouldLoad is spent on the actual RegExp comparisons; it's all the surrounding code before and after the .test that's the bulk of execution time.
**/edit**
I actually get the exact opposite results on a slow processor indexOf() is QUICKER than test()

test machine PII 400

.indexOf = ~ 2690
.test = ~ 19390

lowering to 4000 calls results in
.indexOf = ~ 170
.test = ~ 1260

so the comparision test is very dependant on processor speed (duh hehe since we are calculating lol). The actual ratio increased exponentially as i increased the calls.
Back to top
someone



Joined: 06 Mar 2005
Posts: 14

PostPosted: Tue Mar 08, 2005    Post subject: Reply with quote

Maybe it could be an option on 0.6
Another suggestion is to have a little stop watch type of thing to show users how much time it is taking to block the ads.
Back to top
View user's profile Send private message
wonkothesane
The Other Developer


Joined: 22 May 2004
Posts: 210

PostPosted: Wed Mar 09, 2005    Post subject: Reply with quote

MW: It's still true for v.5 -- that's all I'm allowed to say.

Guest: That's odd. I wonder if the relative speed difference is actually processor-based or dependent on another factor, like RAM (I've got a healthy 1 GB). Still, though: 4000 .test() calls take just over a second to complete, which means each call is using a fourth of a millisecond of processor time. Since each page is unlikely to have more than 80 calls to .test(), I think 20 ms of processor time per page on a 400 MHz Pentium II devoted to .test() isn't too exorbiant. That's rather more than most pages will generate (where calls = filters * elements).

someone: no, being able to treat all filters as regexp objects far outweighs the minor speed delta between .test and .indexOf.
Back to top
View user's profile Send private message Send e-mail Visit poster's website AIM Address
Display posts from previous:   
Post new topic   Reply to topic    The Adblock Project Forum Index -> Main All times are GMT + 1 Hour
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group