X-Git-Url: http://www.privoxy.org/gitweb/?p=privoxy.git;a=blobdiff_plain;f=doc%2Fwebserver%2Fuser-manual%2Fappendix.html;h=af7e1319f8259b8945ae8ec743063cc0f2494c55;hp=e5356f36b8b5667998afb95a5785caf955fca7da;hb=60cbbc5f5d7514135bc5afc02d24e77a231c47f4;hpb=7017546f48d1189837d3b8d6a523328195279e57 diff --git a/doc/webserver/user-manual/appendix.html b/doc/webserver/user-manual/appendix.html index e5356f36..af7e1319 100644 --- a/doc/webserver/user-manual/appendix.html +++ b/doc/webserver/user-manual/appendix.html @@ -1,1565 +1,1090 @@ -
Privoxy User Manual
Privoxy can use "regular expressions" - in various config files, assuming support for "pcre" (Perl - Compatible Regular Expressions) is compiled in, which is the default. Such - configuration directives do not require regular expressions, but they can be - used to increase flexibility by matching a pattern with wild-cards against - URLs.
If you are reading this, you probably don't understand what "regular - expressions" are, or what they can do. So this will be a very brief - introduction only. A full explanation would require a book ;-)
"Regular expressions" is a way of matching one character - expression against another to see if it matches or not. One of the - "expressions" is a literal string of readable characters - (letter, numbers, etc), and the other is a complex string of literal - characters combined with wild-cards, and other special characters, called - meta-characters. The "meta-characters" have special meanings and - are used to build the complex pattern to be matched against. Perl Compatible - Regular Expressions is an enhanced form of the regular expression language - with backward compatibility.
To make a simple analogy, we do something similar when we use wild-card - characters when listing files with the dir command in DOS. - *.* matches all filenames. The "special" - character here is the asterisk which matches any and all characters. We can be - more specific and use ? to match just individual - characters. So "dir file?.txt" would match - "file1.txt", "file2.txt", etc. We are pattern - matching, using a similar technique to "regular expressions"!
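The wild-card analogy can be tried out with a few lines of Python. This is only a rough sketch: Python's fnmatch module follows Unix shell rules, which differ from the DOS dir command in some details, and the helper name dos_style_match is made up for this illustration.

```python
from fnmatch import fnmatchcase


def dos_style_match(filename: str, pattern: str) -> bool:
    """Return True if filename matches a shell-style wild-card pattern.

    "?" stands for exactly one character, "*" for any run of characters.
    (Hypothetical helper; Unix shell rules, not the exact DOS rules.)
    """
    return fnmatchcase(filename, pattern)


# "?" matches exactly one character, so a two-digit suffix is rejected.
for name in ("file1.txt", "file2.txt", "file10.txt"):
    print(name, dos_style_match(name, "file?.txt"))
```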
Regular expressions do essentially the same thing, but are much, much more - powerful. There are many more "special characters" and ways of - building complex patterns however. Let's look at a few of the common ones, - and then some examples:
. - Matches any single character, e.g. "a", - "A", "4", ":", or "@". - |
? - The preceding character or expression is matched ZERO or ONE - times. Either/or. - |
+ - The preceding character or expression is matched ONE or MORE - times. - |
* - The preceding character or expression is matched ZERO or MORE - times. - |
\ - The "escape" character denotes that - the following character should be taken literally. This is used where one of the - special characters (e.g. ".") needs to be taken literally and - not as a special meta-character. - |
[] - Characters enclosed in brackets will be matched if - any of the enclosed characters are encountered. - |
() - parentheses are used to group a sub-expression, - or multiple sub-expressions. - |
| - The "bar" character works like an - "or" conditional statement. A match is successful if the - sub-expression on either side of "|" matches. - |
s/string1/string2/g - This is used to rewrite strings of text. - "string1" is replaced by "string2" in this - example. - |
These are just some of the ones you are likely to use when matching URLs with - Privoxy, and are a long way from a definitive - list. This is enough to get us started with a few simple examples which may - be more illuminating:
/.*/banners/.* - A simple example - that uses the common combination of "." and "*" to - denote any character, zero or more times. In other words, any string at all. - So we start with a literal forward slash, then our regular expression pattern - (".*"), another literal forward slash, the string - "banners", another forward slash, and lastly another - ".*". We are building - a directory path here. This will match any file whose path has a - directory named "banners" in it. The ".*" matches - any characters, and this could conceivably be more forward slashes, so it - might expand into a much longer looking path. For example, this could match: - "/eye/hate/spammers/banners/annoy_me_please.gif", or just - "/ads/banners/annoying.html", or an almost infinite number of other - possible combinations, just so it has "banners" in the path - somewhere. Note that a path beginning directly with "/banners/" would not match, since the pattern requires a slash, whatever ".*" matches, and then a second slash before "banners".
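To experiment with this pattern outside of Privoxy, a small Python sketch can help. Python's re module is not PCRE, but it treats this simple pattern the same way. Matching from the start of the path with re.match is an assumption made for this sketch, and the function name is_banner_path is invented for the example.

```python
import re

# The pattern from the text above. re.match only matches at the start of
# the string; that anchoring is an assumption made for this sketch.
BANNERS = re.compile(r"/.*/banners/.*")


def is_banner_path(path: str) -> bool:
    """Return True if the path matches the banners pattern."""
    return BANNERS.match(path) is not None


for path in (
    "/eye/hate/spammers/banners/annoy_me_please.gif",
    "/images/logo.gif",
):
    print(path, is_banner_path(path))
```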
And now something a little more complex:
/.*/adv((er)?ts?|ertis(ing|ements?))?/ - - We have several literal forward slashes again ("/"), so we are - building another expression that is a file path statement. We have another - ".*", so we are matching against any conceivable sub-path, just so - it matches our expression. The only true literal that must - match our pattern is adv, together with - the forward slashes. What comes after the "adv" string is the - interesting part.
Remember the "?" means the preceding expression (either a - literal character or anything grouped with "(...)" in this case) - can exist or not, since this means either zero or one match. So - "((er)?ts?|ertis(ing|ements?))" is optional, as are the - individual sub-expressions: "(er)", - "(ing|ements?)", and the "s". The "|" - means "or". We have two of those. For instance, - "(ing|ements?)", can expand to match either "ing" - OR "ements?". What is being done here, is an - attempt at matching as many variations of "advertisement", and - similar, as possible. So this would expand to match just "adv", - or "advert", or "adverts", or - "advertising", or "advertisement", or - "advertisements". You get the idea. But it would not match - "advertizements" (with a "z"). We could fix that by - changing our regular expression to: - "/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/", which would then match - either spelling.
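The expansion described above can be checked mechanically. The following Python sketch tries both versions of the pattern against a list of directory names. The "/img/" prefix, the helper name matches_dir, and the use of re.match (anchored at the start of the path) are assumptions for illustration only.

```python
import re

# Both patterns are taken from the text; the second one also accepts "z".
ADV = re.compile(r"/.*/adv((er)?ts?|ertis(ing|ements?))?/")
ADV_SZ = re.compile(r"/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/")


def matches_dir(pattern: re.Pattern, directory: str) -> bool:
    """Test a directory name by embedding it in a made-up path."""
    return pattern.match(f"/img/{directory}/") is not None


WORDS = [
    "adv", "advert", "adverts", "advertising",
    "advertisement", "advertisements", "advertizements",
]

for word in WORDS:
    print(word, matches_dir(ADV, word), matches_dir(ADV_SZ, word))
```

Only the last word should differ between the two columns, since it is the only one spelled with a "z".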
/.*/advert[0-9]+\.(gif|jpe?g) - Again - another path statement with forward slashes. Anything in the square brackets - "[]" can be matched. This is using "0-9" as a - shorthand expression to mean any digit zero through nine. It is the same as - saying "0123456789". So any digit matches. The "+" - means one or more of the preceding expression must be included. The preceding - expression here is what is in the square brackets -- in this case, any digit - zero through nine. Then, at the end, we have a grouping: "(gif|jpe?g)". - This includes a "|", so this needs to match the expression on - either side of that bar character also. A simple "gif" on one side, and the other - side will in turn match either "jpeg" or "jpg", - since the "?" means the letter "e" is optional and - can be matched once or not at all. So we are building an expression here to - match GIF or JPEG image files. It must include the literal - string "advert", then one or more digits, and a "." - (which is now a literal, and not a special character, since it is escaped - with "\"), and lastly either "gif", or - "jpeg", or "jpg". Some possible matches would - include: "//advert1.jpg", - "/nasty/ads/advert1234.gif", - "/banners/from/hell/advert99.jpg". It would not match - "advert1.gif" (no leading slash), or - "/adverts232.jpg" (the expression does not include an - "s"), or "/advert1.jsp" ("jsp" is not - in the expression anywhere).
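The matches and non-matches listed above can be reproduced with a short Python sketch. As before, this uses Python's re module rather than PCRE, and anchoring at the start of the path with re.match is an assumption of the sketch. The name is_advert_image is hypothetical.

```python
import re

# Pattern from the text: "advert", one or more digits, a literal dot,
# then "gif", "jpeg" or "jpg".
ADVERT_IMAGE = re.compile(r"/.*/advert[0-9]+\.(gif|jpe?g)")


def is_advert_image(path: str) -> bool:
    """Return True if the path matches the advert image pattern."""
    return ADVERT_IMAGE.match(path) is not None


SAMPLES = [
    "//advert1.jpg",
    "/nasty/ads/advert1234.gif",
    "/banners/from/hell/advert99.jpg",
    "advert1.gif",      # no leading slash
    "/adverts232.jpg",  # extra "s" before the digits
    "/advert1.jsp",     # "jsp" is not one of the alternatives
]

for path in SAMPLES:
    print(path, is_advert_image(path))
```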
s/microsoft(?!.com)/MicroSuck/i - This is - a substitution. "MicroSuck" will replace any occurrence of - "microsoft". The "i" at the end of the expression - means ignore case. The "(?!.com)" is a negative lookahead: it means - the match should fail if "microsoft" is followed by - ".com" (strictly speaking, the unescaped "." matches any character, so "(?!\.com)" would be more precise). In other words, this acts like a "NOT" - modifier. In case this is a hyperlink, we don't want to break it ;-).
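Here is a rough Python equivalent of this substitution, for experimenting with the negative lookahead. It is a sketch, not Privoxy's own filter code: re.sub replaces every match by default, which fits the "any occurrence" description above, and re.IGNORECASE stands in for the trailing "i". The function name rewrite is made up.

```python
import re

# Same expression as in the text, including the unescaped "." in the
# lookahead, which therefore matches any single character before "com".
PATTERN = re.compile(r"microsoft(?!.com)", re.IGNORECASE)


def rewrite(text: str) -> str:
    """Replace "microsoft" unless it is followed by any character + "com"."""
    return PATTERN.sub("MicroSuck", text)


for line in (
    "I use Microsoft Word",
    "see http://www.microsoft.com/ for details",
    "microsoft-com",  # left alone too, because "." also matches "-"
):
    print(rewrite(line))
```

The third sample shows why escaping the dot as "\." is the more precise choice if only a literal ".com" should be protected.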
We are barely scratching the surface of regular expressions here so that you - can understand the default Privoxy - configuration files, and maybe use this knowledge to customize your own - installation. There is much, much more that can be done with regular - expressions. Now that you know enough to get started, you can learn more on - your own :/
More reading on Perl Compatible Regular expressions: - http://www.perldoc.com/perl5.6/pod/perlre.html
Since Privoxy proxies each requested - web page, it is easy for Privoxy to - trap certain special URLs. In this way, we can talk directly to - Privoxy, and see how it is - configured, see how our rules are being applied, change these - rules and other configuration options, and even turn - Privoxy's filtering off, all with - a web browser.
The URLs listed below are the special ones that allow direct access - to Privoxy. Of course, - Privoxy must be running to access these. If - not, you will get a friendly error message. Internet access is not - necessary either.
- Privoxy main page: -
Alternately, this may be reached at http://p.p/, but this - variation may not work as reliably as the above in some configurations. -
- Show information about the current configuration: -
- Show the source code version numbers: -
- Show the client's request headers: -
- Show which actions apply to a URL and why: -
- Toggle Privoxy on or off. In this case, "Privoxy" continues - to run, but only as a pass-through proxy, with no actions taking place: -
Short cuts. Turn off, then on: -
- Edit the actions list file: -
These may be bookmarked for quick reference.
Below are some "bookmarklets" to allow you to easily access a - "mini" version of some of Privoxy's - special pages. They are designed for MS Internet Explorer, but should work - equally well in Netscape, Mozilla, and other browsers which support - JavaScript. They are designed to run directly from your bookmarks - not by - clicking the links below (although that should work for testing).
To save them, right-click the link and choose "Add to Favorites" - (IE) or "Add Bookmark" (Netscape). You will get a warning that - the bookmark "may not be safe" - just click OK. Then you can run the - Bookmarklet directly from your favourites/bookmarks. For even faster access, - you can put them on the "Links" bar (IE) or the "Personal - Toolbar" (Netscape), and run them with a single click.
Toggle Privoxy (Toggles between enabled and disabled) -
Credit: The site which gave me the general idea for these bookmarklets is - www.bookmarklets.com. They - have more information about bookmarklets.
The way Privoxy applies "actions" - and "filters" to any given URL can be complex, and it is not always - easy to understand what is happening. Sometimes we need to be able to - see just what Privoxy is - doing, especially if something Privoxy is doing - is inadvertently causing us a problem. It can be a little daunting to look at - the actions and filters files themselves, since they tend to be filled with - "regular expressions" whose consequences are not always - so obvious. Privoxy provides the - http://config.privoxy.org/show-url-info - page that can show us very specifically how actions - are being applied to any given URL. This is a big help for troubleshooting. -
First, enter one URL (or partial URL) at the prompt, and then - Privoxy will tell us - how the current configuration will handle it. This will not - help with filtering effects from the default.filter file! It - also will not tell you about any other URLs that may be embedded within the - URL you are testing. For instance, images such as ads are expressed as URLs - within the raw page source of HTML pages. So you will only get info for the - actual URL that is pasted into the prompt area -- not any sub-URLs. If you - want to know about embedded URLs like ads, you will have to dig those out of - the HTML source. Use your browser's "View Page Source" option - for this. Or right click on the ad, and grab the URL.
Let's look at an example, google.com, - one section at a time:
System default actions: - - { -add-header -block -deanimate-gifs -downgrade -fast-redirects -filter - -hide-forwarded -hide-from -hide-referer -hide-user-agent -image - -image-blocker -limit-connect -no-compression -no-cookies-keep - -no-cookies-read -no-cookies-set -no-popups -vanilla-wafer -wafer } - - |
This is the top section, and only tells us of the compiled in defaults. This - is basically what Privoxy would do if there - were not any "actions" defined, i.e. it does nothing. Every action - is disabled. This is not particularly informative for our purposes here. OK, - next section:
Matches for http://google.com: - - { -add-header -block +deanimate-gifs -downgrade +fast-redirects - +filter{html-annoyances} +filter{js-annoyances} +filter{no-popups} - +filter{webbugs} +filter{nimda} +filter{banners-by-size} +filter{hal} - +filter{fun} +hide-forwarded +hide-from{block} +hide-referer{forge} - -hide-user-agent -image +image-blocker{blank} +no-compression - +no-cookies-keep -no-cookies-read -no-cookies-set +no-popups - -vanilla-wafer -wafer } - / + + + + + |
Oops, the "/adsl/" is matching "/ads"! But - we did not want this at all! Now we see why we get the blank page. We could - now add a new action below this that explicitly does not - block (-block) pages with "adsl". There are various ways to - handle such exceptions. Example:
{ -block } - /adsl - - |
Now the page displays ;-) Be sure to flush your browser's caches when - making such changes. Or, try using Shift+Reload.
But now what about a situation where we get no explicit matches like - we did with:
{ -block } +
That actually was very telling and pointed us quickly to where the problem - was. If you don't get this kind of match, then it means one of the default - rules in the first section is causing the problem. This would require some - guesswork, and maybe a little trial and error to isolate the offending rule. - One likely cause would be one of the "{+filter}" actions. Try - adding the URL for the site to one of the aliases that turn off "+filter":
Now the page displays ;-) Remember to flush your browser's caches + when making these kinds of changes to your configuration to ensure that + you get a freshly delivered page! Or, try using Shift+Reload. + +But now what about a situation where we get no explicit matches like + we did with: + +
That actually was very helpful and pointed us quickly to where the + problem was. If you don't get this kind of match, then it means one of + the default rules in the first section of default.action is causing the problem. This would + require some guesswork, and maybe a little trial and error to isolate + the offending rule. One likely cause would be one of the "+filter" + actions. These tend to be harder to troubleshoot. Try adding the URL + for the site to one of the aliases that turn off "+filter": + +
"{shop}" is an "alias" that expands to - "{ -filter -no-cookies -no-cookies-keep }". Or you could do - your own exception to negate filtering:
"{ shop }" is an + "alias" that expands to "{ -filter -session-cookies-only + }". Or you could do your own exception to negate + filtering: + +
"{fragile}" is an alias that disables most actions. This can be - used as a last resort for problem sites. Remember to flush caches! If this - still does not work, you will have to go through the remaining actions one by - one to find which one(s) is causing the problem. \ No newline at end of file + developer.ibm.com + localhost + + |
+
This would turn off all filtering for these sites. This is best put + in user.action, for local site exceptions. + Note that when a simple domain pattern is used by itself (without the + subsequent path portion), all sub-pages within that domain are included + automatically in the scope of the action.
+ +Images that are inexplicably being blocked may well be hitting the + "+filter{banners-by-size}" rule, which assumes that + images of certain sizes are ad banners (this works well most of the time since these + tend to be standardized).
+ +"{ fragile }" is + an alias that disables most actions that are the most likely to cause + trouble. This can be used as a last resort for problem sites.
+ +
+ + { fragile } + # Handle with care: easy to break + mail.google. + mybank.example.com ++ |
+
Remember to flush + caches! Note that the mail.google. + pattern lacks the TLD portion (e.g. ".com"). This will effectively match the + mail.google host under any TLD, such as mail.google.de, just as an example.
+ +If this still does not work, you will have to go through the + remaining actions one by one to find which one(s) is causing the + problem.
+