X-Git-Url: http://www.privoxy.org/gitweb/?a=blobdiff_plain;f=doc%2Fwebserver%2Fuser-manual%2Fappendix.html;h=d91495431a6201f253a63dfd88562f3c6db30066;hb=57d8b0e0ca7e2d24bd29b3a4d3f5a648a38ae393;hp=c6896bc3c0d57874da3b0216d5ac9b6e9d9ba01d;hpb=bb351be8595d489bc90f06f300aeef011aa2f8f4;p=privoxy.git diff --git a/doc/webserver/user-manual/appendix.html b/doc/webserver/user-manual/appendix.html index c6896bc3..d9149543 100644 --- a/doc/webserver/user-manual/appendix.html +++ b/doc/webserver/user-manual/appendix.html @@ -1,1454 +1,377 @@ - -
Privoxy 3.0.11 User Manual
Privoxy uses Perl-style "regular - expressions" in its actions - files and filter file, - through the PCRE and - PCRS libraries.
If you are reading this, you probably don't understand what "regular - expressions" are, or what they can do. So this will be a very brief - introduction only. A full explanation would require a book ;-)
Regular expressions provide a language to describe patterns that can be - run against strings of characters (letters, numbers, etc.), to see if they - match the string or not. The patterns are themselves (sometimes complex) - strings of literal characters, combined with wild-cards, and other special - characters, called meta-characters. The "meta-characters" have - special meanings and are used to build complex patterns to be matched against. - Perl Compatible Regular Expressions are an especially convenient - "dialect" of the regular expression language.
To make a simple analogy, we do something similar when we use wild-card - characters when listing files with the dir command in DOS. - *.* matches all filenames. The "special" - character here is the asterisk which matches any and all characters. We can be - more specific and use ? to match just individual - characters. So "dir file?.txt" would match - "file1.txt", "file2.txt", etc. We are pattern - matching, using a similar technique to "regular expressions"!
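The wild-card analogy can be tried out with a small sketch. It uses Python's fnmatch module as a stand-in for the DOS dir command, which is an assumption: fnmatch follows Unix shell rules, so it only approximates DOS behaviour. The file names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing, invented for this example.
names = ["file1.txt", "file2.txt", "file10.txt", "notes.doc"]

# "?" stands for exactly one character, "*" for any run of characters.
single = [n for n in names if fnmatchcase(n, "file?.txt")]
anything = [n for n in names if fnmatchcase(n, "*.*")]

print(single)    # file10.txt is left out: "?" covers one character only
print(anything)  # every name here contains a dot, so all are listed
```

Note that "file10.txt" is not matched by "file?.txt", because "?" accepts one character and not two.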
Regular expressions do essentially the same thing, but are much, much more - powerful. There are many more "special characters" and ways of - building complex patterns however. Let's look at a few of the common ones, - and then some examples:
. - Matches any single character, e.g. "a", - "A", "4", ":", or "@". - |
? - The preceding character or expression is matched ZERO or ONE - times. Either/or. - |
+ - The preceding character or expression is matched ONE or MORE - times. - |
* - The preceding character or expression is matched ZERO or MORE - times. - |
\ - The "escape" character denotes that - the following character should be taken literally. This is used where one of the - special characters (e.g. ".") needs to be taken literally and - not as a special meta-character. Example: "example\.com", makes - sure the period is recognized only as a period (and not expanded to its - meta-character meaning of any single character). - |
[ ] - Characters enclosed in brackets will be matched if - any of the enclosed characters are encountered. For instance, "[0-9]" - matches any numeric digit (zero through nine). As an example, we can combine - this with "+" to match any digit one or more times: "[0-9]+". - |
( ) - Parentheses are used to group a sub-expression, - or multiple sub-expressions. - |
| - The "bar" character works like an - "or" conditional statement. A match is successful if the - sub-expression on either side of "|" matches. As an example: - "/(this|that) example/" uses grouping and the bar character - and would match either "this example" or "that - example", and nothing else. - |
These are just some of the ones you are likely to use when matching URLs with - Privoxy, and are a long way from a definitive - list. This is enough to get us started with a few simple examples which may - be more illuminating:
/.*/banners/.* - A simple example - that uses the common combination of "." and "*" to - denote any character, zero or more times. In other words, any string at all. - So we start with a literal forward slash, then our regular expression pattern - (".*") another literal forward slash, the string - "banners", another forward slash, and lastly another - ".*". We are building - a directory path here. This will match any file with the path that has a - directory named "banners" in it. The ".*" matches - any characters, and this could conceivably be more forward slashes, so it - might expand into a much longer looking path. For example, this could match: - "/eye/hate/spammers/banners/annoy_me_please.gif", or just - "/banners/annoying.html", or almost an infinite number of other - possible combinations, just so it has "banners" in the path - somewhere.
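As a rough check of the pattern above, here is a minimal sketch using Python's re module. This is an assumption: Python's regular expressions stand in for the PCRE library, and re.match (which anchors at the start of the string) is only an approximation of how Privoxy applies path patterns. The second path is made up as a counter-example.

```python
import re

# Pattern copied from the text above.
BANNERS = re.compile(r"/.*/banners/.*")

paths = [
    "/eye/hate/spammers/banners/annoy_me_please.gif",  # from the text
    "/images/logo.gif",                                # hypothetical, no "banners"
]

# re.match anchors the pattern at the start of the path.
results = {p: BANNERS.match(p) is not None for p in paths}

for path, matched in results.items():
    print(f"{path}: {'match' if matched else 'no match'}")
```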
And now something a little more complex:
/.*/adv((er)?ts?|ertis(ing|ements?))?/ - - We have several literal forward slashes again ("/"), so we are - building another expression that is a file path statement. We have another - ".*", so we are matching against any conceivable sub-path, just so - it matches our expression. The only true literal that must - match our pattern is adv, together with - the forward slashes. What comes after the "adv" string is the - interesting part.
Remember the "?" means the preceding expression (either a - literal character or anything grouped with "(...)" in this case) - can exist or not, since this means either zero or one match. So - "((er)?ts?|ertis(ing|ements?))" is optional, as are the - individual sub-expressions: "(er)", - "(ing|ements?)", and the "s". The "|" - means "or". We have two of those. For instance, - "(ing|ements?)", can expand to match either "ing" - OR "ements?". What is being done here, is an - attempt at matching as many variations of "advertisement", and - similar, as possible. So this would expand to match just "adv", - or "advert", or "adverts", or - "advertising", or "advertisement", or - "advertisements". You get the idea. But it would not match - "advertizements" (with a "z"). We could fix that by - changing our regular expression to: - "/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/", which would then match - either spelling.
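The expansion described above can be checked with a short sketch, again assuming Python's re module as a stand-in for PCRE. The leading directory "/img/" is invented for the test paths; both patterns are copied from the text.

```python
import re

# Original pattern and the variant that also accepts the "z" spelling.
ADV = re.compile(r"/.*/adv((er)?ts?|ertis(ing|ements?))?/")
ADV_Z = re.compile(r"/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/")

words = ["adv", "advert", "adverts", "advertising",
         "advertisement", "advertisements", "advertizements"]

def matches(pattern, word):
    # "/img/" is a hypothetical parent directory so the path has two slashes.
    return pattern.match(f"/img/{word}/") is not None

# For each word: (matches original pattern, matches "z" variant)
results = {w: (matches(ADV, w), matches(ADV_Z, w)) for w in words}

for word, (plain, with_z) in results.items():
    print(f"{word:16} original={plain} z-variant={with_z}")
```

Only "advertizements" should differ between the two patterns: the original rejects it and the variant accepts it.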
/.*/advert[0-9]+\.(gif|jpe?g) - Again - another path statement with forward slashes. Anything in the square brackets - "[ ]" can be matched. This is using "0-9" as a - shorthand expression to mean any digit zero through nine. It is the same as - saying "0123456789". So any digit matches. The "+" - means one or more of the preceding expression must be included. The preceding - expression here is what is in the square brackets -- in this case, any digit - zero through nine. Then, at the end, we have a grouping: "(gif|jpe?g)". - This includes a "|", so this needs to match the expression on - either side of that bar character also. A simple "gif" on one side, and the other - side will in turn match either "jpeg" or "jpg", - since the "?" means the letter "e" is optional and - can be matched once or not at all. So we are building an expression here to - match a GIF or JPEG image file. It must include the literal - string "advert", then one or more digits, and a "." - (which is now a literal, and not a special character, since it is escaped - with "\"), and lastly either "gif", or - "jpeg", or "jpg". Some possible matches would - include: "//advert1.jpg", - "/nasty/ads/advert1234.gif", - "/banners/from/hell/advert99.jpg". It would not match - "advert1.gif" (no leading slash), or - "/adverts232.jpg" (the expression does not include an - "s"), or "/advert1.jsp" ("jsp" is not - in the expression anywhere).
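The matches and non-matches listed above can be verified with a small sketch. As before, Python's re module is assumed as a stand-in for PCRE, and re.match is used to anchor the pattern at the start of the path; the test strings are the ones given in the text.

```python
import re

# Pattern copied from the text above.
ADVERT_IMG = re.compile(r"/.*/advert[0-9]+\.(gif|jpe?g)")

should_match = [
    "//advert1.jpg",
    "/nasty/ads/advert1234.gif",
    "/banners/from/hell/advert99.jpg",
]
should_not_match = [
    "advert1.gif",      # no leading slash
    "/adverts232.jpg",  # the pattern has no "s" after "advert"
    "/advert1.jsp",     # "jsp" is not one of the allowed endings
]

hits = [p for p in should_match + should_not_match if ADVERT_IMG.match(p)]
print(hits)
```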
We are barely scratching the surface of regular expressions here so that you - can understand the default Privoxy - configuration files, and maybe use this knowledge to customize your own - installation. There is much, much more that can be done with regular - expressions. Now that you know enough to get started, you can learn more on - your own :/
More reading on Perl Compatible Regular expressions: - http://perldoc.perl.org/perlre.html
For information on regular expression based substitutions and their applications - in filters, please see the filter file tutorial - in this manual.
Since Privoxy proxies each requested - web page, it is easy for Privoxy to - trap certain special URLs. In this way, we can talk directly to - Privoxy, and see how it is - configured, see how our rules are being applied, change these - rules and other configuration options, and even turn - Privoxy's filtering off, all with - a web browser.
The URLs listed below are the special ones that allow direct access - to Privoxy. Of course, - Privoxy must be running to access these. If - not, you will get a friendly error message. Internet access is not - necessary either.
- Privoxy main page: -
There is a shortcut: http://p.p/ (But it - doesn't provide a fall-back to a real page, in case the request is not - sent through Privoxy) -
- Show information about the current configuration, including viewing and - editing of actions files: -
- Show the source code version numbers: -
- Show the browser's request headers: -
- Show which actions apply to a URL and why: -
- Toggle Privoxy on or off. This feature can be turned off/on in the main - config file. When toggled "off", "Privoxy" - continues to run, but only as a pass-through proxy, with no actions taking - place: -
Short cuts. Turn off, then on: -
These may be bookmarked for quick reference. See next.
Below are some "bookmarklets" to allow you to easily access a - "mini" version of some of Privoxy's - special pages. They are designed for MS Internet Explorer, but should work - equally well in Netscape, Mozilla, and other browsers which support - JavaScript. They are designed to run directly from your bookmarks - not by - clicking the links below (although that should work for testing).
To save them, right-click the link and choose "Add to Favorites" - (IE) or "Add Bookmark" (Netscape). You will get a warning that - the bookmark "may not be safe" - just click OK. Then you can run the - Bookmarklet directly from your favorites/bookmarks. For even faster access, - you can put them on the "Links" bar (IE) or the "Personal - Toolbar" (Netscape), and run them with a single click.
Privoxy - Toggle Privoxy (Toggles between enabled and disabled) -
Credit: The site which gave us the general idea for these bookmarklets is - www.bookmarklets.com. They - have more information about bookmarklets.
Let's take a quick look at how some of Privoxy's - core features are triggered, and the ensuing sequence of events when a web - page is requested by your browser:
First, your web browser requests a web page. The browser knows to send - the request to Privoxy, which will in turn, - relay the request to the remote web server after passing the following - tests: -
Privoxy traps any request for its own internal CGI - pages (e.g. http://p.p/) and sends the CGI page back to the browser. -
Next, Privoxy checks to see if the URL - matches any "+block" patterns. If - so, the URL is then blocked, and the remote web server will not be contacted. - "+handle-as-image" - and - "+handle-as-empty-document" - are then checked, and if there is no match, an - HTML "BLOCKED" page is sent back to the browser. Otherwise, if - it does match, an image is returned for the former, and an empty text - document for the latter. The type of image would depend on the setting of - "+set-image-blocker" - (blank, checkerboard pattern, or an HTTP redirect to an image elsewhere). -
Untrusted URLs are blocked. If URLs are being added to the - trust file, then that is done. -
If the URL pattern matches the "+fast-redirects" action, - it is then processed. Unwanted parts of the requested URL are stripped. -
Now the rest of the client browser's request headers are processed. If any - of these match any of the relevant actions (e.g. "+hide-user-agent", - etc.), headers are suppressed or forged as determined by these actions and - their parameters. -
Now the web server starts sending its response back (i.e. typically a web - page). -
First, the server headers are read and processed to determine, among other - things, the MIME type (document type) and encoding. The headers are then - filtered as determined by the - "+crunch-incoming-cookies", - "+session-cookies-only", - and "+downgrade-http-version" - actions. -
If any "+filter" action - or "+deanimate-gifs" - action applies (and the document type fits the action), the rest of the page is - read into memory (up to a configurable limit). Then the filter rules (from - default.filter and any other filter files) are - processed against the buffered content. Filters are applied in the order - they are specified in one of the filter files. Animated GIFs, if present, - are reduced to either the first or last frame, depending on the action - setting. The entire page, which is now filtered, is then sent by - Privoxy back to your browser. -
If neither a "+filter" action - nor a "+deanimate-gifs" action - matches, then Privoxy passes the raw data through - to the client browser as it becomes available. -
As the browser receives the now (possibly filtered) page content, it - reads and then requests any URLs that may be embedded within the page - source, e.g. ad images, stylesheets, JavaScript, other HTML documents (e.g. - frames), sounds, etc. For each of these objects, the browser issues a - separate request (this is easily viewable in Privoxy's - logs). And each such request is in turn processed just as above. Note that a - complex web page will have many, many such embedded URLs. If these - secondary requests are to a different server, then quite possibly a very - differing set of actions is triggered. -
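The request-side checks described above can be summarised in a toy model. This is only a sketch of the order of the checks as the text describes them, not Privoxy's actual code: the function name, the "actions" set (the names of actions assumed to have already matched the URL), the "trusted" flag and the returned labels are all hypothetical.

```python
def classify_request(url, actions, trusted=True):
    """Toy model of the request-side checks; returns a short outcome label.

    `actions` is a hypothetical set of action names that matched `url`.
    """
    # 1. Requests for Privoxy's own CGI pages are answered directly.
    if url.startswith(("http://p.p/", "http://config.privoxy.org/")):
        return "internal CGI page"

    # 2. "+block": the remote server is never contacted. What is sent back
    #    depends on "+handle-as-image" and "+handle-as-empty-document".
    if "block" in actions:
        if "handle-as-image" in actions:
            return "blocked: image"
        if "handle-as-empty-document" in actions:
            return "blocked: empty document"
        return "blocked: HTML BLOCKED page"

    # 3. Untrusted URLs are blocked.
    if not trusted:
        return "blocked: untrusted"

    # 4. "+fast-redirects" strips unwanted parts of the URL before forwarding.
    if "fast-redirects" in actions:
        return "forwarded after fast-redirects"

    # 5. Otherwise the request goes to the remote server (header actions,
    #    which are not modelled here, would be applied on the way).
    return "forwarded"


print(classify_request("http://p.p/", set()))
print(classify_request("http://ads.example.com/x.gif", {"block", "handle-as-image"}))
print(classify_request("http://www.example.com/", set()))
```

The host names in the usage lines are placeholders; only the ordering of the checks is taken from the text.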
NOTE: This is somewhat of a simplistic overview of what happens with each URL - request. For the sake of brevity and simplicity, we have focused on - Privoxy's core features only.
The way Privoxy applies - actions and filters - to any given URL can be complex, and it is not always - easy to understand what is happening. Sometimes we need to be able to - see just what Privoxy is - doing, especially if something Privoxy is doing - is inadvertently causing us a problem. It can be a little daunting to look at - the actions and filters files themselves, since they tend to be filled with - regular expressions whose consequences are not - always so obvious.
One quick test to see if Privoxy is causing a problem - or not, is to disable it temporarily. This should be the first troubleshooting - step. See the Bookmarklets section on a quick - and easy way to do this (be sure to flush caches afterward!). Looking at the - logs is a good idea too. (Note that both the toggle feature and logging are - enabled via config file settings, and may need to be - turned "on".)
Another easy troubleshooting step, if you have done any - customization of your installation, is to revert to the installed - defaults and see if that helps. There are times the developers get complaints - about one thing or another, and the problem turns out to be related to a customized - configuration.
Privoxy also provides the - http://config.privoxy.org/show-url-info - page that can show us very specifically how actions - are being applied to any given URL. This is a big help for troubleshooting.
First, enter one URL (or partial URL) at the prompt, and then - Privoxy will tell us - how the current configuration will handle it. This will not - help with filtering effects (i.e. the "+filter" action) from - one of the filter files since this is handled very - differently and not so easy to trap! It also will not tell you about any other - URLs that may be embedded within the URL you are testing. For instance, images - such as ads are expressed as URLs within the raw page source of HTML pages. So - you will only get info for the actual URL that is pasted into the prompt area - -- not any sub-URLs. If you want to know about embedded URLs like ads, you - will have to dig those out of the HTML source. Use your browser's "View - Page Source" option for this. Or right click on the ad, and grab the - URL.
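The same page can also be queried from a script. The sketch below makes several assumptions that are not stated in this section: that Privoxy listens on 127.0.0.1:8118, and that the page accepts the URL to test in a "url" query parameter. Both function names are hypothetical. Only the URL is built when the script runs; fetch_url_info needs a running Privoxy and is not called here.

```python
import urllib.request
from urllib.parse import urlencode

# Assumption: Privoxy's listen address; adjust to your own configuration.
PROXY = "http://127.0.0.1:8118"


def show_url_info_request(target):
    """Build the show-url-info URL for `target` (query parameter name assumed)."""
    return "http://config.privoxy.org/show-url-info?" + urlencode({"url": target})


def fetch_url_info(target):
    """Fetch the page through Privoxy; requires Privoxy to be running."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": PROXY})
    )
    with opener.open(show_url_info_request(target), timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


request_url = show_url_info_request("www.google.com")
print(request_url)
```

The request has to go through Privoxy itself, since the special URLs are only trapped when Privoxy sees them.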
Let's try an example, google.com, - and look at it one section at a time in a sample configuration (your real - configuration may vary):
Matches for http://www.google.com:
+
Remember to flush caches! Note that the mail.google reference lacks the TLD portion (e.g. ".com"). This will + effectively match mail.google under any TLD, such as mail.google.de, just as an example.
+If this still does not work, you will have to go through the remaining actions one by one to find which one(s) + is causing the problem.
+