X-Git-Url: http://www.privoxy.org/gitweb/?p=privoxy.git;a=blobdiff_plain;f=doc%2Fwebserver%2Fuser-manual%2Fappendix.html;h=38f4c8b0eb2b4832ac7d7cdd622129785236f396;hp=0e3f06291016b478a3f062a0d98925a02b255619;hb=42c361793c45b0d5fc0c116707ca12b2f60f4c52;hpb=d1c39df48bd2a8953ceb49fdbb370b20f3d89422 diff --git a/doc/webserver/user-manual/appendix.html b/doc/webserver/user-manual/appendix.html index 0e3f0629..38f4c8b0 100644 --- a/doc/webserver/user-manual/appendix.html +++ b/doc/webserver/user-manual/appendix.html @@ -1,586 +1,677 @@ - - + -
-Privoxy 3.0.23 User Manual
-Privoxy uses Perl-style "regular expressions" in its actions files and filter file, through the PCRE and PCRS libraries.
-
-If you are reading this, you probably don't understand what "regular expressions" are, or what they can do. So this will be a very brief introduction only. A full explanation would require a book ;-)
-
-Regular expressions provide a language to describe patterns that can be run against strings of characters (letters, numbers, etc.), to see if they match the string or not. The patterns are themselves (sometimes complex) strings of literal characters, combined with wild-cards, and other special characters, called meta-characters. The "meta-characters" have special meanings and are used to build complex patterns to be matched against. Perl Compatible Regular Expressions are an especially convenient "dialect" of the regular expression language.
-
-To make a simple analogy, we do something similar when we use wild-card characters when listing files with the dir command in DOS. *.* matches all filenames. The "special" character here is the asterisk which matches any and all characters. We can be more specific and use ? to match just individual characters. So "dir file?.txt" would match "file1.txt", "file2.txt", etc. We are pattern matching, using a similar technique to "regular expressions"!
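The wild-card analogy can be tried outside of DOS as well. Here is a minimal sketch using Python's standard fnmatch module, whose "*" and "?" behave like the wild-cards described above. The file names are made up for illustration only:

```python
from fnmatch import fnmatch

# Hypothetical file names, used only for illustration.
names = ["file1.txt", "file2.txt", "file10.txt", "notes.txt"]

# "?" stands for exactly one character, "*" for any run of characters.
single = [n for n in names if fnmatch(n, "file?.txt")]
anything = [n for n in names if fnmatch(n, "*.*")]

print(single)
print(anything)
```

Note that "file10.txt" is not picked up by "file?.txt", since "?" only ever stands in for a single character.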
-
-Regular expressions do essentially the same thing, but are much, much more powerful. There are many more "special characters" and ways of building complex patterns, however. Let's look at a few of the common ones, and then some examples:
-
-.      Matches any single character, e.g. "a", "A", "4", ":", or "@".
-?      The preceding character or expression is matched ZERO or ONE times. Either/or.
-+      The preceding character or expression is matched ONE or MORE times.
-*      The preceding character or expression is matched ZERO or MORE times.
-\      The "escape" character denotes that the following character should be taken literally. This is used where one of the special characters (e.g. ".") needs to be taken literally and not as a special meta-character. Example: "example\.com" makes sure the period is recognized only as a period (and not expanded to its meta-character meaning of any single character).
-[ ]    Characters enclosed in brackets will be matched if any of the enclosed characters are encountered. For instance, "[0-9]" matches any numeric digit (zero through nine). As an example, we can combine this with "+" to match any digit one or more times: "[0-9]+".
-( )    Parentheses are used to group a sub-expression, or multiple sub-expressions.
-|      The "bar" character works like an "or" conditional statement. A match is successful if the sub-expression on either side of "|" matches. As an example: "/(this|that) example/" uses grouping and the bar character and would match either "this example" or "that example", and nothing else.
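To get a feel for these meta-characters, they can be tried out in any language with Perl-style regular expressions. The sketch below uses Python's re module, which is not PCRE itself but treats these basic meta-characters the same way. The sample strings (such as "colou?r") are our own and do not come from any Privoxy file:

```python
import re

# Each entry: (pattern, text, whether the whole text should match).
cases = [
    (r".", "@", True),                        # any single character
    (r"colou?r", "color", True),              # "u" zero or one time
    (r"[0-9]+", "2024", True),                # one or more digits
    (r"[0-9]*", "", True),                    # zero or more digits
    (r"example\.com", "exampleXcom", False),  # escaped dot is literal
    (r"(this|that) example", "that example", True),
    (r"(this|that) example", "other example", False),
]

# fullmatch() requires the pattern to cover the entire text.
results = [bool(re.fullmatch(p, t)) for p, t, _ in cases]

for (p, t, expected), got in zip(cases, results):
    print(f"{p!r:28} {t!r:18} match={got}")
```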
+Privoxy 3.0.26 User Manual
-These are just some of the ones you are likely to use when matching URLs with Privoxy, and are a long way from a definitive list. This is enough to get us started with a few simple examples which may be more illuminating:
-
-/.*/banners/.* - A simple example that uses the common combination of "." and "*" to denote any character, zero or more times. In other words, any string at all. So we start with a literal forward slash, then our regular expression pattern (".*"), another literal forward slash, the string "banners", another forward slash, and lastly another ".*". We are building a directory path here. This will match any file whose path has a directory named "banners" in it. The ".*" matches any characters, and this could conceivably be more forward slashes, so it might expand into a much longer looking path. For example, this could match: "/eye/hate/spammers/banners/annoy_me_please.gif", or just "/ads/banners/annoying.html", or an almost infinite number of other possible combinations, just so it has "banners" in the path somewhere.
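The pattern can be checked against a few paths with a short script. This is only a sketch: it uses Python's re module as a stand-in for PCRE, and it assumes the pattern is anchored at the start of the path, which is why re.match() is used. The second path is made up as a counter-example:

```python
import re

# The path pattern from the text.
banners = re.compile(r"/.*/banners/.*")

paths = [
    "/eye/hate/spammers/banners/annoy_me_please.gif",  # from the text
    "/images/logo.gif",                                # hypothetical, no "banners" directory
]

# re.match() only succeeds if the pattern matches from the first character.
hits = {p: bool(banners.match(p)) for p in paths}

for path, hit in hits.items():
    print(f"{path}: {'match' if hit else 'no match'}")
```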
-
-And now something a little more complex:
-
-/.*/adv((er)?ts?|ertis(ing|ements?))?/ - We have several literal forward slashes again ("/"), so we are building another expression that is a file path statement. We have another ".*", so we are matching against any conceivable sub-path, just so it matches our expression. The only true literal that must match our pattern is adv, together with the forward slashes. What comes after the "adv" string is the interesting part.
-
-Remember the "?" means the preceding expression (either a literal character or anything grouped with "(...)" in this case) can exist or not, since this means either zero or one match. So "((er)?ts?|ertis(ing|ements?))" is optional, as are the sub-expression "(er)" and each "s" that is followed by a "?". The group "(ing|ements?)" is not optional by itself: once "ertis" has matched, one of its two alternatives must follow. The "|" means "or". We have two of those. For instance, "(ing|ements?)" can expand to match either "ing" OR "ements?". What is being done here is an attempt at matching as many variations of "advertisement", and similar, as possible. So this would expand to match just "adv", or "advert", or "adverts", or "advertising", or "advertisement", or "advertisements". You get the idea. But it would not match "advertizements" (with a "z"). We could fix that by changing our regular expression to: "/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/", which would then match either spelling.
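The expansions listed above can be verified mechanically. The following sketch again uses Python's re module in place of PCRE and assumes the pattern is anchored at the start of the path. The surrounding path ("/img/.../x.gif") is made up so that the literal slashes have something to match:

```python
import re

# Path patterns from the text; the second one adds the "z" spelling.
original = re.compile(r"/.*/adv((er)?ts?|ertis(ing|ements?))?/")
with_z = re.compile(r"/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/")

words = ["adv", "advert", "adverts", "advertising",
         "advertisement", "advertisements", "advertizements"]

# Wrap each word in a hypothetical path, e.g. "/img/advert/x.gif".
matched = {w: bool(original.match(f"/img/{w}/x.gif")) for w in words}
matched_z = {w: bool(with_z.match(f"/img/{w}/x.gif")) for w in words}

for w in words:
    print(f"{w:16} original={matched[w]!s:6} with_z={matched_z[w]}")
```

Only the last word should differ between the two columns, since the "z" spelling is the one thing the second pattern adds.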
-
-/.*/advert[0-9]+\.(gif|jpe?g) - Again another path statement with forward slashes. Anything in the square brackets "[ ]" can be matched. This is using "0-9" as a shorthand expression to mean any digit zero through nine. It is the same as saying "0123456789". So any digit matches. The "+" means one or more of the preceding expression must be included. The preceding expression here is what is in the square brackets -- in this case, any digit zero through nine. Then, at the end, we have a grouping: "(gif|jpe?g)". This includes a "|", so this needs to match the expression on either side of that bar character also. A simple "gif" on one side, and the other side will in turn match either "jpeg" or "jpg", since the "?" means the letter "e" is optional and can be matched once or not at all. So we are building an expression here to match a GIF or JPEG image file. It must include the literal string "advert", then one or more digits, and a "." (which is now a literal, and not a special character, since it is escaped with "\"), and lastly either "gif", or "jpeg", or "jpg". Some possible matches would include: "//advert1.jpg", "/nasty/ads/advert1234.gif", "/banners/from/hell/advert99.jpg". It would not match "advert1.gif" (no leading slash), or "/adverts232.jpg" (the expression does not include an "s"), or "/advert1.jsp" ("jsp" is not in the expression anywhere).
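The matching and non-matching examples given above can be run through the pattern directly. As before, this is a sketch with Python's re module standing in for PCRE, and with the assumption that matching starts at the beginning of the path:

```python
import re

# The path pattern from the text.
pattern = re.compile(r"/.*/advert[0-9]+\.(gif|jpe?g)")

# The six example paths from the text: three matches, three non-matches.
paths = [
    "//advert1.jpg",
    "/nasty/ads/advert1234.gif",
    "/banners/from/hell/advert99.jpg",
    "advert1.gif",
    "/adverts232.jpg",
    "/advert1.jsp",
]

outcome = {p: bool(pattern.match(p)) for p in paths}

for path, hit in outcome.items():
    print(f"{path:36} {'match' if hit else 'no match'}")
```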
-
-We are barely scratching the surface of regular expressions here so that you can understand the default Privoxy configuration files, and maybe use this knowledge to customize your own installation. There is much, much more that can be done with regular expressions. Now that you know enough to get started, you can learn more on your own :/
-
-More reading on Perl Compatible Regular expressions: http://perldoc.perl.org/perlre.html
-
-For information on regular expression based substitutions and their applications in filters, please see the filter file tutorial in this manual.
+Since Privoxy proxies each requested web page, it is easy for Privoxy to trap certain special URLs. In this way, we can talk directly to Privoxy, and see how it is configured, see how our rules are being applied, change these rules and other configuration options, and even turn Privoxy's filtering off, all with a web browser.
-
-The URLs listed below are the special ones that allow direct access to Privoxy. Of course, Privoxy must be running to access these. If not, you will get a friendly error message. Internet access is not necessary either.
- -Privoxy main page:
- -- -- -
-There is a shortcut: http://p.p/ (But it doesn't provide a fall-back to a real page, in case the request is not sent through Privoxy)
-Show information about the current configuration, including viewing and editing of actions files:
- -- --
Show the source code version numbers:
- -- --
Show the browser's request headers:
- -- --
Show which actions apply to a URL and why:
- -- --
-Toggle Privoxy on or off. This feature can be turned off/on in the main config file. When toggled "off", "Privoxy" continues to run, but only as a pass-through proxy, with no actions taking place:
- -- -- -
Short cuts. Turn off, then on:
- -- -- -
- --
-Let's take a quick look at how some of Privoxy's core features are triggered, and the ensuing sequence of events when a web page is requested by your browser:
-
-First, your web browser requests a web page. The browser knows to send the request to Privoxy, which will, in turn, relay the request to the remote web server after passing the following tests:
-Privoxy traps any request for its own internal CGI pages (e.g. http://p.p/) and sends the CGI page back to the browser.
-Next, Privoxy checks to see if the URL matches any "+block" patterns. If so, the URL is then blocked, and the remote web server will not be contacted. "+handle-as-image" and "+handle-as-empty-document" are then checked, and if there is no match, an HTML "BLOCKED" page is sent back to the browser. Otherwise, if it does match, an image is returned for the former, and an empty text document for the latter. The type of image would depend on the setting of "+set-image-blocker" (blank, checkerboard pattern, or an HTTP redirect to an image elsewhere).
-Untrusted URLs are blocked. If URLs are being added to the trust file, then that is done.
-If the URL pattern matches the "+fast-redirects" action, it is then processed. Unwanted parts of the requested URL are stripped.
-Now the rest of the client browser's request headers are processed. If any of these match any of the relevant actions (e.g. "+hide-user-agent", etc.), headers are suppressed or forged as determined by these actions and their parameters.
-Now the web server starts sending its response back (i.e. typically a web page).
-First, the server headers are read and processed to determine, among other things, the MIME type (document type) and encoding. The headers are then filtered as determined by the "+crunch-incoming-cookies", "+session-cookies-only", and "+downgrade-http-version" actions.
-If any "+filter" action or "+deanimate-gifs" action applies (and the document type fits the action), the rest of the page is read into memory (up to a configurable limit). Then the filter rules (from default.filter and any other filter files) are processed against the buffered content. Filters are applied in the order they are specified in one of the filter files. Animated GIFs, if present, are reduced to either the first or last frame, depending on the action setting. The entire page, which is now filtered, is then sent by Privoxy back to your browser.
-
-If neither a "+filter" action nor a "+deanimate-gifs" action matches, then Privoxy passes the raw data through to the client browser as it becomes available.
-As the browser receives the (possibly filtered) page content, it reads and then requests any URLs that may be embedded within the page source, e.g. ad images, stylesheets, JavaScript, other HTML documents (e.g. frames), sounds, etc. For each of these objects, the browser issues a separate request (this is easily viewable in Privoxy's logs). And each such request is in turn processed just as above. Note that a complex web page will have many, many such embedded URLs. If these secondary requests are to a different server, then quite possibly a very different set of actions is triggered.
-NOTE: This is a somewhat simplified overview of what happens with each URL request. For the sake of brevity and simplicity, we have focused on Privoxy's core features only.
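The order of the main decisions above can be summed up in a toy model. To be clear, this is not Privoxy's code: the Actions class, the handle_request function and the returned labels are all hypothetical, and the trust file, "+fast-redirects" and header processing steps are left out. The text does not say what happens when both "+handle-as-image" and "+handle-as-empty-document" apply, so checking the image case first is our assumption:

```python
from dataclasses import dataclass, field


@dataclass
class Actions:
    """Hypothetical stand-in for the set of actions that apply to one URL."""
    block: bool = False
    handle_as_image: bool = False
    handle_as_empty_document: bool = False
    filters: list = field(default_factory=list)


def handle_request(url: str, actions: Actions) -> str:
    """Return a label describing what the toy proxy does with the request."""
    # Internal CGI pages are answered by the proxy itself.
    if url.startswith(("http://p.p/", "http://config.privoxy.org/")):
        return "internal CGI page"
    # Blocked URLs never reach the remote web server.
    if actions.block:
        if actions.handle_as_image:  # ASSUMPTION: image is checked first
            return "blocker image"
        if actions.handle_as_empty_document:
            return "empty document"
        return "BLOCKED page"
    # Content is buffered only if a filter applies; otherwise it streams through.
    if actions.filters:
        return "buffered and filtered"
    return "passed through unfiltered"


print(handle_request("http://p.p/", Actions()))
print(handle_request("http://example.com/ad.gif",
                     Actions(block=True, handle_as_image=True)))
print(handle_request("http://example.com/", Actions(filters=["some-filter"])))
```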
-The way Privoxy applies actions and filters to any given URL can be complex, and it is not always easy to understand what is happening. And sometimes we need to be able to see just what Privoxy is doing, especially if something Privoxy is doing is inadvertently causing us a problem. It can be a little daunting to look at the actions and filters files themselves, since they tend to be filled with regular expressions whose consequences are not always so obvious.
-
-One quick test to see if Privoxy is causing a problem or not is to disable it temporarily. This should be the first troubleshooting step (be sure to flush caches afterward!). Looking at the logs is a good idea too. (Note that both the toggle feature and logging are enabled via config file settings, and may need to be turned "on".)
-
-Another easy troubleshooting step, if you have done any customization of your installation, is to revert to the installed defaults and see if that helps. There are times the developers get complaints about one thing or another, and the problem is more related to a customized configuration issue.
-
-Privoxy also provides the http://config.privoxy.org/show-url-info page that can show us very specifically how actions are being applied to any given URL. This is a big help for troubleshooting.
-
-First, enter one URL (or partial URL) at the prompt, and then Privoxy will tell us how the current configuration will handle it. This will not help with filtering effects (i.e. the "+filter" action) from one of the filter files since this is handled very differently and is not so easy to trap! It also will not tell you about any other URLs that may be embedded within the URL you are testing. For instance, images such as ads are expressed as URLs within the raw page source of HTML pages. So you will only get info for the actual URL that is pasted into the prompt area -- not any sub-URLs. If you want to know about embedded URLs like ads, you will have to dig those out of the HTML source. Use your browser's "View Page Source" option for this. Or right click on the ad, and grab the URL.
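If you test many URLs, it can be handy to build the link to this page with a small script instead of typing into the prompt. This is a sketch only: the helper name is ours, and it assumes that the page accepts the URL to test in a "url" query parameter. The resulting link only works in a browser that sends its requests through Privoxy:

```python
from urllib.parse import urlencode

# The page address is taken from the text above.
SHOW_URL_INFO = "http://config.privoxy.org/show-url-info"


def show_url_info_link(target: str) -> str:
    """Build a link that asks the show-url-info page about `target`."""
    # ASSUMPTION: the page reads the URL to test from a "url" parameter.
    return f"{SHOW_URL_INFO}?{urlencode({'url': target})}"


link = show_url_info_link("google.com")
print(link)
```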
-
-Let's try an example, google.com, and look at it one section at a time in a sample configuration (your real configuration may vary):
- -
- + |