X-Git-Url: http://www.privoxy.org/gitweb/?p=privoxy.git;a=blobdiff_plain;f=doc%2Fwebserver%2Fuser-manual%2Fappendix.html;h=56b7cf16676cbf8abe5efdcedcdaa3a2b28dfc24;hp=53b99fe1fc28629a90fc179099fe5adca8b18da2;hb=3c890b0540031fa87cc28514b3e4d0e23124fbcd;hpb=03472355cc98c0a5f3e65deb0e4569bd14e0fb54 diff --git a/doc/webserver/user-manual/appendix.html b/doc/webserver/user-manual/appendix.html index 53b99fe1..56b7cf16 100644 --- a/doc/webserver/user-manual/appendix.html +++ b/doc/webserver/user-manual/appendix.html @@ -1,20 +1,23 @@ + Appendix +HREF="../p_doc.css"> Privoxy 3.0.6 User ManualPrivoxy 3.0.27 User Manual14. Appendix14. Appendix

14.1. Regular Expressions

14.1. Regular Expressions

dir command in DOS. +> command in DOS. *.*"special characters" and ways of +> and ways of building complex patterns however. Let's look at a few of the common ones, and then some examples:

"escape" character denotes that - the following character should be taken literally. This is used where one of the + the following character should be taken literally. This is used where one of the special characters (e.g. ".""example\.com", makes - sure the period is recognized only as a period (and not expanded to its +>, makes + sure the period is recognized only as a period (and not expanded to its meta-character meaning of any single character).

"[0-9]" - matches any numeric digit (zero through nine). As an example, we can combine + matches any numeric digit (zero through nine). As an example, we can combine this with "+"

"/(this|that) example/" uses grouping and the bar character +> uses grouping and the bar character and would match either "this example"

These are just some of the ones you are likely to use when matching URLs with +> These are just some of the ones you are likely to use when matching URLs with Privoxy and "*" to +> to denote any character, zero or more times. In other words, any string at all. - So we start with a literal forward slash, then our regular expression pattern + So we start with a literal forward slash, then our regular expression pattern (".*"".*". We are building +>. We are building a directory path here. This will match any file with the path that has a directory named /.*/adv((er)?ts?|ertis(ing|ements?))?/ - +> - We have several literal forward slashes again ("/"), so we are - building another expression that is a file path statement. We have another + building another expression that is a file path statement. We have another ".*""adv" string is the - interesting part.

Remember the "or". We have two of those. For instance, +>. We have two of those. For instance, "(ing|ements?)", can expand to match either "ing" +> "advertisement", and +>, and similar, as possible. So this would expand to match just "adv""advertisements". You get the idea. But it would not match +>. You get the idea. But it would not match "advertizements""z"). We could fix that by - changing our regular expression to: + changing our regular expression to: "/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/"/.*/advert[0-9]+\.(gif|jpe?g) - Again - another path statement with forward slashes. Anything in the square brackets +> - Again + another path statement with forward slashes. Anything in the square brackets "[ ]""+" - means one or more of the preceding expression must be included. The preceding - expression here is what is in the square brackets -- in this case, any digit + means one or more of the preceding expression must be included. The preceding + expression here is what is in the square brackets -- in this case, any digit zero through nine. Then, at the end, we have a grouping: "(gif|jpe?g)". +>. This includes a "|"

More reading on Perl Compatible Regular expressions: +> More reading on Perl Compatible Regular expressions:
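The example patterns discussed earlier in this section can be tried out with a short script. This is only a sketch: it uses Python's re module rather than Privoxy's own PCRE-based matcher, and it requires the whole path to match (fullmatch), which is an assumption made for the demonstration and not a statement about how Privoxy anchors patterns. The sample paths and the helper name "matches" are made up.

```python
import re

# Patterns copied from the examples in this section.
AD_DIR = re.compile(r"/.*/adv((er)?ts?|ertis(ing|ements?))?/")
AD_DIR_Z = re.compile(r"/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/")
AD_IMG = re.compile(r"/.*/advert[0-9]+\.(gif|jpe?g)")


def matches(pattern, path):
    """Return True if the whole (hypothetical) path matches the pattern."""
    return pattern.fullmatch(path) is not None


if __name__ == "__main__":
    words = ["adv", "advert", "adverts", "advertising",
             "advertisement", "advertisements", "advertizements"]
    for word in words:
        path = f"/site/{word}/"
        print(path, matches(AD_DIR, path), matches(AD_DIR_Z, path))

    for path in ["/images/advert12.gif", "/images/advert3.jpeg",
                 "/images/advert.gif"]:
        print(path, matches(AD_IMG, path))
```

Only the "z" variant of the pattern accepts "advertizements", and the image pattern rejects a file name that has no digit after "advert".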

14.2. Privoxy's Internal Pages

14.2. Privoxy's Internal Pages

Since Privoxy proxies each requested +> proxies each requested web page, it is easy for Privoxy to +> to trap certain special URLs. In this way, we can talk directly to Privoxy, and see how it is - configured, see how our rules are being applied, change these +>, and see how it is + configured, see how our rules are being applied, change these rules and other configuration options, and even turn Privoxy's filtering off, all with - a web browser.

filtering off, all with + a web browser.

The URLs listed below are the special ones that allow direct access +> The URLs listed below are the special ones that allow direct access to PrivoxyPrivoxy must be running to access these. If - not, you will get a friendly error message. Internet access is not +> must be running to access these. If + not, you will get a friendly error message. Internet access is not necessary either.

These may be bookmarked for quick reference. See next.

14.2.1. Bookmarklets

Below are some "bookmarklets" to allow you to easily access a - "mini" version of some of Privoxy's - special pages. They are designed for MS Internet Explorer, but should work - equally well in Netscape, Mozilla, and other browsers which support - JavaScript. They are designed to run directly from your bookmarks - not by - clicking the links below (although that should work for testing).

To save them, right-click the link and choose "Add to Favorites" - (IE) or "Add Bookmark" (Netscape). You will get a warning that - the bookmark "may not be safe" - just click OK. Then you can run the - Bookmarklet directly from your favorites/bookmarks. For even faster access, - you can put them on the "Links" bar (IE) or the "Personal - Toolbar" (Netscape), and run them with a single click.

Credit: The site which gave us the general idea for these bookmarklets is - www.bookmarklets.com. They - have more information about bookmarklets.

14.3. Chain of Events14.3. Chain of Events

Let's take a quick look at the basic sequence of events when a web page is - requested by your browser and Let's take a quick look at how some of Privoxy is on duty:

Privoxy's + core features are triggered, and the ensuing sequence of events when a web + page is requested by your browser:

NOTE: This is somewhat of a simplistic overview of what happens with each URL + request. For the sake of brevity and simplicity, we have focused on + Privoxy's core features only.

14.4. Troubleshooting: Anatomy of an Action14.4. Troubleshooting: Anatomy of an Action

The way Privoxy applies +> applies actionsregular expressions whose consequences are not - always so obvious.

One quick test to see if Privoxy is causing a problem - or not, is to disable it temporarily. This should be the first troubleshooting - step. See the Bookmarklets section on a quick - and easy way to do this (be sure to flush caches afterward!). Looking at the - logs is a good idea too.

is causing a problem + or not, is to disable it temporarily. This should be the first troubleshooting + step (be sure to flush caches afterward!). Looking at the + logs is a good idea too. (Note that both the toggle feature and logging are + enabled via config file settings, and may need to be + turned "on".)

Another easy troubleshooting step to try is if you have done any customization of your installation, revert back to the installed @@ -1356,7 +1240,7 @@ HREF="appendix.html#BOOKMARKLETS" > Privoxy also provides the +> also provides the Privoxy will tell us +> will tell us how the current configuration will handle it. This will not help with filtering effects (i.e. the google.com, - and look at it one section at a time in a sample configuration (your real +>, + and look at it one section at a time in a sample configuration (your real configuration may vary):

 Matches for http://google.com:
+> Matches for http://www.google.com:
 
  In file: default.action [ Edit ]
 
- {-add-header
- -block
- -content-type-overwrite
- -crunch-client-header
- -crunch-if-none-match
- -crunch-incoming-cookies
- -crunch-outgoing-cookies
- -crunch-server-header
+ {+change-x-forwarded-for{block}
  +deanimate-gifs {last}
- -downgrade-http-version
  +fast-redirects {check-decoded-url}
- -filter {js-events}
- -filter {content-cookies}
- -filter {all-popups}
- -filter {banners-by-link}
- -filter {tiny-textforms}
- -filter {frameset-borders}
- -filter {demoronizer}
- -filter {shockwave-flash}
- -filter {quicktime-kioskmode}
- -filter {fun}
- -filter {crude-parental}
- -filter {site-specifics}
- -filter {js-annoyances}
- -filter {html-annoyances}
  +filter {refresh-tags}
- -filter {unsolicited-popups}
  +filter {img-reorder}
  +filter {banners-by-size}
  +filter {webbugs}
  +filter {jumping-windows}
  +filter {ie-exploits}
- -filter {google}
- -filter {yahoo}
- -filter {msn}
- -filter {blogspot}
- -filter {xml-to-html}
- -filter {html-to-xml}
- -filter-client-headers
- -filter-server-headers
- -force-text-mode
- -handle-as-empty-document
- -handle-as-image
- -hide-accept-language
- -hide-content-disposition
- +hide-forwarded-for-headers
  +hide-from-header {block}
- -hide-if-modified-since
  +hide-referrer {forge}
- -hide-user-agent
- -inspect-jpegs
- -kill-popups
- -limit-connect
- -overwrite-last-modified
- +prevent-compression
- -redirect
- -send-vanilla-wafer
- -send-wafer
  +session-cookies-only
  +set-image-blocker {pattern}
- -treat-forbidden-connects-like-blocks }
 /
- 
+
  { -session-cookies-only }
  .google.com
 
@@ -1496,13 +1331,12 @@ CLASS="GUIBUTTON"
 CLASS="GUIBUTTON"
 >[ Edit ]
-(no matches in this file)  

This is telling us how we have defined our +> This is telling us how we have defined our "google.com". +>. Displayed is all the actions that are available to us. Remember, the . So some are "on" here, but many +> here, but many are "off" or "mail.google.com". But it would not +>. But it would not match "www.google.de"user.action file, we again have no hits. So there is nothing google-specific that we might have added to our own, local - configuration. If there was, those actions would over-rule any actions from + configuration. If there was, those actions would over-rule any actions from previously processed files, such as default.action is applying all its "actions" +> to "google.com":

:


 Final results:
- 
+> Final results:
+
  -add-header
  -block
+ +change-x-forwarded-for{block}
+ -client-header-filter{hide-tor-exit-notation}
  -content-type-overwrite
  -crunch-client-header
  -crunch-if-none-match
@@ -1663,7 +1498,7 @@ CLASS="SCREEN"
  -crunch-server-header
  +deanimate-gifs {last}
  -downgrade-http-version
- +fast-redirects {check-decoded-url}
+ -fast-redirects
  -filter {js-events}
  -filter {content-cookies}
  -filter {all-popups}
@@ -1689,37 +1524,29 @@ CLASS="SCREEN"
  -filter {yahoo}
  -filter {msn}
  -filter {blogspot}
- -filter {xml-to-html}
- -filter {html-to-xml}
- -filter-client-headers
- -filter-server-headers
+ -filter {no-ping}
  -force-text-mode
  -handle-as-empty-document
  -handle-as-image
  -hide-accept-language
  -hide-content-disposition
- +hide-forwarded-for-headers
  +hide-from-header {block}
  -hide-if-modified-since
  +hide-referrer {forge}
  -hide-user-agent
- -inspect-jpegs
- -kill-popups
  -limit-connect
  -overwrite-last-modified
- +prevent-compression
+ -prevent-compression
  -redirect
- -send-vanilla-wafer
- -send-wafer
+ -server-header-filter{xml-to-html}
+ -server-header-filter{html-to-xml}
  -session-cookies-only
- +set-image-blocker {pattern}
- -treat-forbidden-connects-like-blocks 

Notice the only difference here to the previous listing, is to +> Notice the only difference here to the previous listing, is to "fast-redirects""session-cookies-only", - which are activated specifically for this site in our configuration, + which are activated specifically for this site in our configuration, and thus show in the "Final Results""ad.doubleclick.net":


 { +block }
+> { +block{Domains starts with "ad"} }
   ad*.
 
- { +block }
+ { +block{Domain contains "ad"} }
   .ad.
 
- { +block +handle-as-image }
+ { +block{Doubleclick banner server} +handle-as-image }
   .[a-vx-z]*.doubleclick.net

We'll just show the interesting part here - the explicit matches. It is +> We'll just show the interesting part here - the explicit matches. It is matched three different times. Two "+block" sections, +>"+block{}" sections, and a "+block +handle-as-image""+block{} +handle-as-image", - which is the expanded form of one of our aliases that had been defined as: + which is the expanded form of one of our aliases that had been defined as: "+block-as-image""Aliases" are defined in - the first section of the actions file and typically used to combine more + the first section of the actions file and typically used to combine more than one action.)

Any one of these would have done the trick and blocked this as an unwanted - image. This is unnecessarily redundant since the last case effectively - would also cover the first. No point in taking chances with these guys - though ;-) Note that if you want an ad or obnoxious +> Any one of these would have done the trick and blocked this as an unwanted + image. This is unnecessarily redundant since the last case effectively + would also cover the first. No point in taking chances with these guys + though ;-) Note that if you want an ad or obnoxious URL to be invisible, it should be defined as "ad.doubleclick.net""+block""+block{}" and an +> an "http://www.example.net/adsl/HOWTO/". This one is giving us problems. We are getting a blank page. Hmmm ...


 Matches for http://www.example.net/adsl/HOWTO/:
+> Matches for http://www.example.net/adsl/HOWTO/:
 
  In file: default.action [ Edit ]
 
- {-add-header 
+ {-add-header
   -block
+  +change-x-forwarded-for{block}
+  -client-header-filter{hide-tor-exit-notation}
   -content-type-overwrite
   -crunch-client-header
   -crunch-if-none-match
   -crunch-incoming-cookies
   -crunch-outgoing-cookies
   -crunch-server-header
-  +deanimate-gifs 
-  -downgrade-http-version 
+  +deanimate-gifs
+  -downgrade-http-version
   +fast-redirects {check-decoded-url}
   -filter {js-events}
   -filter {content-cookies}
@@ -1880,37 +1706,29 @@ CLASS="GUIBUTTON"
   -filter {yahoo}
   -filter {msn}
   -filter {blogspot}
-  -filter {xml-to-html}
-  -filter {html-to-xml}
-  -filter-client-headers
-  -filter-server-headers
+  -filter {no-ping}
   -force-text-mode
   -handle-as-empty-document
-  -handle-as-image 
+  -handle-as-image
   -hide-accept-language
-  -hide-content-disposition  
-  +hide-forwarded-for-headers 
-  +hide-from-header{block} 
-  +hide-referer{forge} 
-  -hide-user-agent 
-  -inspect-jpegs
-  -kill-popups 
+  -hide-content-disposition
+  +hide-from-header{block}
+  +hide-referer{forge}
+  -hide-user-agent
   -overwrite-last-modified
-  +prevent-compression 
+  +prevent-compression
   -redirect
-  -send-vanilla-wafer 
-  -send-wafer 
-  +session-cookies-only 
-  +set-image-blocker{blank} 
-  -treat-forbidden-connects-like-blocks }
+  -server-header-filter{xml-to-html}
+  -server-header-filter{html-to-xml}
+  +session-cookies-only
+  +set-image-blocker{blank} }
    /
 
- { +block +handle-as-image }
+ { +block{Path contains "ads".} +handle-as-image }
   /ads

Ooops, the is matching "/ads" in our +> in our configuration! But we did not want this at all! Now we see why we get the - blank page. It is actually triggering two different actions here, and + blank page. It is actually triggering two different actions here, and the effects are aggregated so that the URL is blocked, and Privoxy is told +> is told to treat the block as if it were an image. But this is, of course, all wrong. We could now add a new action below this (or better in our own "adsl" in them (remember, last match in the configuration wins). There are various ways to handle such exceptions. Example:


 { -block }
+> { -block }
   /adsl

Now the page displays ;-) +> Now the page displays ;-) Remember to flush your browser's caches when making these kinds of changes to your configuration to ensure that you get a freshly delivered page! Or, try using Shift+Reload.
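The "/ads" versus "/adsl" case above can be modelled in a few lines. This is a simplified sketch and not Privoxy's implementation: it assumes sections are processed in order, that every matching section's actions are merged into the result, and that a path pattern only needs to match at the start of the path. The SECTIONS data and the function name final_actions are hypothetical.

```python
import re

# Each entry: (actions set by the section, path patterns of the section).
# True stands for "+action", False for "-action". Hypothetical sample data.
SECTIONS = [
    ({"block": False, "handle-as-image": False}, ["/"]),
    ({"block": True, "handle-as-image": True}, ["/ads"]),
    ({"block": False}, ["/adsl"]),  # the exception added last
]


def final_actions(path, sections=SECTIONS):
    """Merge the actions of every matching section; later sections win."""
    result = {}
    for actions, patterns in sections:
        # re.match anchors at the start of the path only (an assumption).
        if any(re.match(pattern, path) for pattern in patterns):
            result.update(actions)
    return result


if __name__ == "__main__":
    print(final_actions("/adsl/HOWTO/"))               # with the exception
    print(final_actions("/adsl/HOWTO/", SECTIONS[:2]))  # without it
    print(final_actions("/ads/banner.gif"))
```

With the exception in place, "/adsl/HOWTO/" still matches "/ads", but the later "-block" overrides the earlier "+block". Dropping the last section reproduces the blank-page situation.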

But now what about a situation where we get no explicit matches like +> But now what about a situation where we get no explicit matches like we did with:


 { +block +handle-as-image }
+> { +block{Path starts with "ads".} +handle-as-image }
  /ads

That actually was very helpful and pointed us quickly to where the problem - was. If you don't get this kind of match, then it means one of the default + was. If you don't get this kind of match, then it means one of the default rules in the first section of default.action"+filter":


 { shop }
+> { shop }
  .quietpc.com
  .worldpay.com   # for quietpc.com
  .jungle.com
@@ -2034,7 +1847,6 @@ CLASS="SCREEN"
 >

is an "alias" that expands to +> that expands to "{ -filter -session-cookies-only }". - Or you could do your own exception to negate filtering:


 { -filter }
+> { -filter }
  # Disable ALL filter actions for sites in this section
  .forbes.com
  developer.ibm.com
@@ -2071,7 +1882,6 @@ CLASS="SCREEN"
 >

This would turn off all filtering for these sites. This is best put in user.action, for local site exceptions. Note that when a simple domain pattern is used by itself (without - the subsequent path portion), all sub-pages within that domain are included - automatically in the scope of the action.

Images that are inexplicably being blocked, may well be hitting the +> Images that are inexplicably being blocked, may well be hitting the "+filter{banners-by-size}" - rule, which assumes - that images of certain sizes are ad banners (works well + rule, which assumes + that images of certain sizes are ad banners (works well " is an alias that disables most actions that are the most likely to cause trouble. This can be used as a - last resort for problem sites.


 { fragile }
+> { fragile }
  # Handle with care: easy to break
  mail.google.
  mybank.example.com

Remember to flush caches! Note that the +> Note that the mail.google reference lacks the TLD portion (e.g. +> reference lacks the TLD portion (e.g. ".com". This will effectively match any TLD with +>). This will effectively match any TLD with google in it, such as mail.google.de, +>mail.google.de., just as an example.

- If this still does not work, you will have to go through the remaining +> If this still does not work, you will have to go through the remaining actions one by one to find which one(s) is causing the problem.
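The domain-pattern behaviour described in this section (".google.com" matching "www.google.com" but not "www.google.de", "mail.google." matching any TLD, and the three patterns that caught "ad.doubleclick.net") can be approximated as follows. This is a rough model under stated assumptions, not Privoxy's matcher: a leading or trailing dot is taken to mean "unanchored on that side", and each label is compared with shell-style globbing from Python's fnmatch module. The function name domain_matches is hypothetical.

```python
from fnmatch import fnmatchcase


def domain_matches(pattern, host):
    """Approximate a domain pattern match, label by label."""
    anchored_left = not pattern.startswith(".")
    anchored_right = not pattern.endswith(".")
    p_labels = [label for label in pattern.lower().split(".") if label]
    h_labels = host.lower().split(".")
    if len(p_labels) > len(h_labels):
        return False

    def labels_match(offset):
        # Compare the pattern labels against host labels starting at offset.
        return all(fnmatchcase(h_labels[offset + i], p_labels[i])
                   for i in range(len(p_labels)))

    if anchored_left and anchored_right:
        return len(p_labels) == len(h_labels) and labels_match(0)
    if anchored_left:
        return labels_match(0)
    if anchored_right:
        return labels_match(len(h_labels) - len(p_labels))
    # Unanchored on both sides: the labels may appear anywhere in the host.
    return any(labels_match(offset)
               for offset in range(len(h_labels) - len(p_labels) + 1))


if __name__ == "__main__":
    checks = [
        (".google.com", "www.google.com"),
        (".google.com", "www.google.de"),
        ("mail.google.", "mail.google.de"),
        ("ad*.", "ad.doubleclick.net"),
        (".ad.", "ad.doubleclick.net"),
        (".[a-vx-z]*.doubleclick.net", "ad.doubleclick.net"),
    ]
    for pattern, host in checks:
        print(pattern, host, domain_matches(pattern, host))
```

Splitting on dots keeps the sketch short. It would mishandle a character class that itself contains a dot, which none of the patterns shown here do.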