7 CONTENT="Modular DocBook HTML Stylesheet Version 1.60"><LINK
9 TITLE="Privoxy User Manual"
10 HREF="index.html"><LINK
13 HREF="seealso.html"><LINK
16 HREF="../p_doc.css"></HEAD
35 >Privoxy User Manual</TH
75 >9.1. Regular Expressions</A
83 >"regular expressions"</SPAN
85 in various config files, assuming support for <SPAN
89 Compatible Regular Expressions) is compiled in, which is the default. Such
90 configuration directives do not require regular expressions, but they can be
91 used to increase flexibility by matching a pattern with wild-cards against
94 > If you are reading this, you probably don't understand what <SPAN
98 > are, or what they can do. So this will be a very brief
99 introduction only. A full explanation would require a book ;-)</P
103 >"Regular expressions"</SPAN
104 > is a way of matching one character
105 expression against another to see if it matches or not. One of the
109 > is a literal string of readable characters
110 (letters, numbers, etc.), and the other is a complex string of literal
111 characters combined with wild-cards, and other special characters, called
112 meta-characters. The <SPAN
114 >"meta-characters"</SPAN
115 > have special meanings and
116 are used to build the complex pattern to be matched against. Perl Compatible
117 Regular Expressions is an enhanced form of the regular expression language
118 with backward compatibility.</P
120 > To make a simple analogy, we do something similar when we use wild-card
121 characters when listing files with the <B
128 > matches all filenames. The <SPAN
132 character here is the asterisk which matches any and all characters. We can be
133 more specific and use <TT
136 > to match just individual
139 >"dir file?.text"</SPAN
147 >, etc. We are pattern
148 matching, using a similar technique to <SPAN
150 >"regular expressions"</SPAN
153 > Regular expressions do essentially the same thing, but are much, much more
154 powerful. There are many more <SPAN
156 >"special characters"</SPAN
158 building complex patterns, however. Let's look at a few of the common ones,
159 and then some examples:</P
170 > - Matches any single character, e.g. <SPAN
203 > - The preceding character or expression is matched ZERO or ONE
221 > - The preceding character or expression is matched ONE or MORE
239 > - The preceding character or expression is matched ZERO or MORE
260 > character denotes that
261 the following character should be taken literally. This is used where one of the
262 special characters (e.g. <SPAN
265 >) needs to be taken literally and
266 not as a special meta-character.
283 > - Characters enclosed in brackets will be matched if
284 any of the enclosed characters are encountered.
301 > - Parentheses are used to group a sub-expression,
302 or multiple sub-expressions.
322 > character works like an
326 > conditional statement. A match is successful if the
327 sub-expression on either side of <SPAN
346 >s/string1/string2/g</I
347 > - This is used to rewrite strings of text.
351 > is replaced by <SPAN
363 > These are just some of the ones you are likely to use when matching URLs with
367 >, and are a long way from a definitive
368 list. This is enough to get us started with a few simple examples which may
369 be more illuminating:</P
378 that uses the common combination of <SPAN
385 denote any character, zero or more times. In other words, any string at all.
386 So we start with a literal forward slash, then our regular expression pattern
390 >) another literal forward slash, the string
394 >, another forward slash, and lastly another
399 a directory path here. This will match any file with the path that has a
400 directory named <SPAN
407 any characters, and this could conceivably be more forward slashes, so it
408 might expand into a much longer looking path. For example, this could match:
411 >"/eye/hate/spammers/banners/annoy_me_please.gif"</SPAN
415 >"//banners/annoying.html"</SPAN
416 >, or almost an infinite number of other
417 possible combinations, just so it has <SPAN
423 > And now something a little more complex:</P
429 >/.*/adv((er)?ts?|ertis(ing|ements?))?/</TT
432 We have several literal forward slashes again (<SPAN
436 building another expression that is a file path statement. We have another
440 >, so we are matching against any conceivable sub-path, just so
441 it matches our expression. The only true literal that <I
445 > our pattern is <SPAN
449 the forward slashes. What comes after the <SPAN
453 interesting part. </P
458 > means the preceding expression (either a
459 literal character or anything grouped with <SPAN
463 can exist or not, since this means either zero or one match. So
466 >"((er)?ts?|ertis(ing|ements?))"</SPAN
467 > is optional, as are the
468 individual sub-expressions: <SPAN
474 >"(ing|ements?)"</SPAN
485 >. We have two of those. For instance,
488 >"(ing|ements?)"</SPAN
489 >, can expand to match either <SPAN
499 >. What is being done here is an
500 attempt at matching as many variations of <SPAN
502 >"advertisement"</SPAN
504 similar, as possible. So this would expand to match just <SPAN
520 >"advertisement"</SPAN
524 >"advertisements"</SPAN
525 >. You get the idea. But it would not match
528 >"advertizements"</SPAN
532 >). We could fix that by
533 changing our regular expression to:
536 >"/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/"</SPAN
537 >, which would then match
544 >/.*/advert[0-9]+\.(gif|jpe?g)</TT
547 another path statement with forward slashes. Anything in the square brackets
551 > can be matched. This is using <SPAN
555 shorthand expression to mean any digit zero through nine. It is the same as
559 >. So any digit matches. The <SPAN
563 means one or more of the preceding expression must be included. The preceding
564 expression here is what is in the square brackets -- in this case, any digit
565 zero through nine. Then, at the end, we have a grouping: <SPAN
569 This includes a <SPAN
572 >, so this needs to match the expression on
573 either side of that bar character also. A simple <SPAN
576 > on one side, and the other
577 side will in turn match either <SPAN
587 > means the letter <SPAN
591 can be matched once or not at all. So we are building an expression here to
592 match GIF or JPEG image files. It must include the literal
596 >, then one or more digits, and a <SPAN
600 (which is now a literal, and not a special character, since it is escaped
604 >), and lastly either <SPAN
614 >. Some possible matches would
617 >"//advert1.jpg"</SPAN
621 >"/nasty/ads/advert1234.gif"</SPAN
625 >"/banners/from/hell/advert99.jpg"</SPAN
626 >. It would not match
630 > (no leading slash), or
633 >"/adverts232.jpg"</SPAN
634 > (the expression does not include an
640 >"/advert1.jsp"</SPAN
645 in the expression anywhere).</P
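><P
> The path expressions above can also be tried out outside of Privoxy. What
follows is a minimal sketch in Python, on the assumption that Python's re
module behaves like PCRE for these simple patterns. The PATTERNS table and
the path_matches helper are names made up for this illustration and are not
part of Privoxy.</P
><PRE
CLASS="PROGRAMLISTING"
>
```python
import re

# Path patterns from the examples above, compiled with Python's re module.
# Assumption: re stands in for the PCRE library that Privoxy uses; for these
# simple patterns the two are expected to behave the same way.
PATTERNS = {
    "banners": re.compile(r"/.*/banners/.*"),
    "adverts": re.compile(r"/.*/adv((er)?ts?|ertis(ing|ements?))?/"),
    "adverts_sz": re.compile(r"/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/"),
    "numbered": re.compile(r"/.*/advert[0-9]+\.(gif|jpe?g)"),
}


def path_matches(name, path):
    """Return True if the named pattern matches from the start of path."""
    return PATTERNS[name].match(path) is not None


if __name__ == "__main__":
    checks = [
        ("banners", "/eye/hate/spammers/banners/annoy_me_please.gif"),
        ("adverts", "/img/advertisements/"),
        ("adverts", "/img/advertizements/"),
        ("adverts_sz", "/img/advertizements/"),
        ("numbered", "/nasty/ads/advert1234.gif"),
        ("numbered", "/advert1.jsp"),
    ]
    # Print each pattern name, the path tried, and whether it matched.
    for name, path in checks:
        print(name, path, path_matches(name, path))
```
</PRE
><P
> Each line of output shows the pattern name, the path that was tried, and
True or False. Adding your own paths to the checks list is a quick way to
test an expression before putting it into a configuration file.</P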
651 >s/microsoft(?!.com)/MicroSuck/i</TT
654 a substitution. <SPAN
657 > will replace any occurrence of
664 > at the end of the expression
665 means ignore case. The <SPAN
669 the match should fail if <SPAN
676 >. In other words, this acts like a <SPAN
680 modifier. In case this is a hyperlink, we don't want to break it ;-).</P
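><P
> The same substitution can be sketched in Python for experimenting. This is
only an approximation under stated assumptions: Python's re module stands in
for PCRE, the dot is escaped so that only a literal period counts, and the
rewrite helper is a hypothetical name used for this illustration.</P
><PRE
CLASS="PROGRAMLISTING"
>
```python
import re

# Python approximation of s/microsoft(?!.com)/MicroSuck/i.
# Assumption: the dot is escaped here so that only a literal ".com" blocks
# the replacement; re.IGNORECASE plays the role of the trailing "i".
PATTERN = re.compile(r"microsoft(?!\.com)", re.IGNORECASE)


def rewrite(text):
    """Replace matches of PATTERN in text, leaving microsoft.com alone.

    Note that re.sub replaces every occurrence, which corresponds to
    adding the "g" modifier to the expression.
    """
    return PATTERN.sub("MicroSuck", text)


if __name__ == "__main__":
    print(rewrite("Visit Microsoft at www.microsoft.com"))
```
</PRE
><P
> The negative lookahead only inspects the text that follows the match and
does not consume it, which is why the host name in the sample sentence is
left untouched.</P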
682 > We are barely scratching the surface of regular expressions here so that you
683 can understand the default <SPAN
687 configuration files, and maybe use this knowledge to customize your own
688 installation. There is much, much more that can be done with regular
689 expressions. Now that you know enough to get started, you can learn more on
692 > More reading on Perl Compatible Regular Expressions:
694 HREF="http://www.perldoc.com/perl5.6/pod/perlre.html"
696 >http://www.perldoc.com/perl5.6/pod/perlre.html</A
708 >'s Internal Pages</A
714 > proxies each requested
715 web page, it is easy for <SPAN
719 trap certain special URLs. In this way, we can talk directly to
724 configured, see how our rules are being applied, change these
725 rules and other configuration options, and even turn
729 > filtering off, all with
730 a web browser. </P
732 > The URLs listed below are the special ones that allow direct access
740 > must be running to access these. If
741 not, you will get a friendly error message. Internet access is not
760 HREF="http://config.privoxy.org/"
762 >http://config.privoxy.org/</A
767 > Alternatively, this may be reached at <A
772 variation may not work as reliably as the above in some configurations.
778 Show information about the current configuration:
788 HREF="http://config.privoxy.org/show-status"
790 >http://config.privoxy.org/show-status</A
798 Show the source code version numbers:
808 HREF="http://config.privoxy.org/show-version"
810 >http://config.privoxy.org/show-version</A
818 Show the client's request headers:
828 HREF="http://config.privoxy.org/show-request"
830 >http://config.privoxy.org/show-request</A
838 Show which actions apply to a URL and why:
848 HREF="http://config.privoxy.org/show-url-info"
850 >http://config.privoxy.org/show-url-info</A
858 Toggle Privoxy on or off. In this case, <SPAN
862 to run, but only as a pass-through proxy, with no actions taking place:
872 HREF="http://config.privoxy.org/toggle"
874 >http://config.privoxy.org/toggle</A
879 > Shortcuts. Turn off, then on:
889 HREF="http://config.privoxy.org/toggle?set=disable"
891 >http://config.privoxy.org/toggle?set=disable</A
903 HREF="http://config.privoxy.org/toggle?set=enable"
905 >http://config.privoxy.org/toggle?set=enable</A
913 Edit the actions list file:
923 HREF="http://config.privoxy.org/edit-actions"
925 >http://config.privoxy.org/edit-actions</A
933 > These may be bookmarked for quick reference. </P
940 >9.2.1. Bookmarklets</A
943 > Here are some bookmarklets to allow you to easily access a
947 > version of this page. They are designed for MS Internet
948 Explorer, but should work equally well in Netscape, Mozilla, and other
949 browsers which support JavaScript. They are designed to run directly from
950 your bookmarks - not by clicking the links below (although that will work for
953 > To save them, right-click the link and choose <SPAN
955 >"Add to Favorites"</SPAN
959 >"Add Bookmark"</SPAN
960 > (Netscape). You will get a warning that
963 >"may not be safe"</SPAN
964 > - just click OK. Then you can run the
965 Bookmarklet directly from your favourites/bookmarks. For even faster access,
966 you can put them on the <SPAN
969 > bar (IE) or the <SPAN
973 > (Netscape), and run them with a single click. </P
981 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y&set=enabled','ijbstatus','width=250,height=100,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
990 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y&set=disabled','ijbstatus','width=250,height=100,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
999 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y&set=toggle','ijbstatus','width=250,height=100,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
1002 > (Toggles between enabled and disabled)
1008 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y','ijbstatus','width=250,height=2,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
1010 >View Privoxy Status</A
1017 > Credit: The site which gave me the general idea for these bookmarklets is
1019 HREF="http://www.bookmarklets.com"
1021 >www.bookmarklets.com</A
1023 have more information about bookmarklets. </P
1032 >9.3. Anatomy of an Action</A
1045 > to any given URL can be complex, and it is not always
1046 easy to understand what is happening. And sometimes we need to be able to
1054 doing, especially if something <SPAN
1058 is causing us a problem inadvertently. It can be a little daunting to look at
1059 the actions and filters files themselves, since they tend to be filled with
1062 >"regular expressions"</SPAN
1063 > whose consequences are not always
1069 HREF="http://config.privoxy.org/show-url-info"
1071 >http://config.privoxy.org/show-url-info</A
1073 page that can show us very specifically how <SPAN
1077 are being applied to any given URL. This is a big help for troubleshooting.
1080 > First, enter one URL (or partial URL) at the prompt, and then
1085 how the current configuration will handle it. This will not
1086 help with filtering effects from the <TT
1090 also will not tell you about any other URLs that may be embedded within the
1091 URL you are testing. For instance, images such as ads are expressed as URLs
1092 within the raw page source of HTML pages. So you will only get info for the
1093 actual URL that is pasted into the prompt area -- not any sub-URLs. If you
1094 want to know about embedded URLs like ads, you will have to dig those out of
1095 the HTML source. Use your browser's <SPAN
1097 >"View Page Source"</SPAN
1099 for this. Or right click on the ad, and grab the URL.</P
1101 > Let's look at an example, <A
1102 HREF="http://google.com"
1106 one section at a time:</P
1116 > System default actions:
1118 { -add-header -block -deanimate-gifs -downgrade -fast-redirects -filter
1119 -hide-forwarded -hide-from -hide-referer -hide-user-agent -image
1120 -image-blocker -limit-connect -no-compression -no-cookies-keep
1121 -no-cookies-read -no-cookies-set -no-popups -vanilla-wafer -wafer }
1129 > This is the top section, and only tells us of the compiled-in defaults. This
1130 is basically what <SPAN
1137 > defined, i.e. it does nothing. Every action
1138 is disabled. This is not particularly informative for our purposes here. OK,
1149 > Matches for http://google.com:
1151 { -add-header -block +deanimate-gifs -downgrade +fast-redirects
1152 +filter{html-annoyances} +filter{js-annoyances} +filter{no-popups}
1153 +filter{webbugs} +filter{nimda} +filter{banners-by-size} +filter{hal}
1154 +filter{fun} +hide-forwarded +hide-from{block} +hide-referer{forge}
1155 -hide-user-agent -image +image-blocker{blank} +no-compression
1156 +no-cookies-keep -no-cookies-read -no-cookies-set +no-popups
1157 -vanilla-wafer -wafer }
1160 { -no-cookies-keep -no-cookies-read -no-cookies-set }
1172 > This is much more informative, and tells us how we have defined our
1176 >, and which ones match for our example,
1180 >. The first grouping shows our default
1181 settings, which would apply to all URLs. If you look at your <SPAN
1185 file, this would be the section just below the <SPAN
1189 near the top. This applies to all URLs as signified by the single forward
1196 > These are the default actions we have enabled. But we can define additional
1197 actions that would be exceptions to these general rules, and then list
1198 specific URLs that these exceptions would apply to. Last match wins.
1199 Just below this then are two explicit matches for <SPAN
1201 >".google.com"</SPAN
1203 The first is negating our various cookie blocking actions (i.e. we will allow
1204 cookies here). The second is allowing <SPAN
1206 >"fast-redirects"</SPAN
1208 that there is a leading dot here -- <SPAN
1210 >".google.com"</SPAN
1212 match any hosts and sub-domains in the google.com domain, such as
1215 >"www.google.com"</SPAN
1216 >. So, apparently, we have these actions defined
1217 somewhere in the lower part of our actions file, and
1221 > is referenced in these sections. </P
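><P
> The "last match wins" rule can be pictured as a simple merge of the matching
sections in file order. The following Python sketch only illustrates that
idea under simplifying assumptions and is not Privoxy's actual
implementation: the names apply_sections and format_state and the sample
sections are hypothetical, and parameterized actions such as
+filter{...} are left out.</P
><PRE
CLASS="PROGRAMLISTING"
>
```python
def apply_sections(sections):
    """Merge matching sections in file order; later settings win.

    Each section is a list of strings such as "+block" or "-block".
    Returns a dict mapping the action name to True (on) or False (off).
    """
    state = {}
    for section in sections:
        for setting in section:
            sign, name = setting[0], setting[1:]
            if sign not in ("+", "-"):
                raise ValueError("setting must start with + or -: " + setting)
            # A later section simply overwrites what an earlier one set.
            state[name] = (sign == "+")
    return state


def format_state(state):
    """Render the merged state in the +name / -name notation."""
    return " ".join(
        ("+" if on else "-") + name for name, on in sorted(state.items())
    )


if __name__ == "__main__":
    # Hypothetical sections, loosely modelled on the example above.
    sections = [
        ["+fast-redirects", "+no-cookies-keep", "-no-cookies-read", "+no-popups"],
        ["-no-cookies-keep", "-no-cookies-read", "-no-cookies-set"],
        ["-fast-redirects"],
    ]
    print(format_state(apply_sections(sections)))
```
</PRE
><P
> Because each later section overwrites earlier settings for the same action,
the order of sections in the actions file decides the outcome.</P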
1223 > And now we pull it all together in the bottom section and summarize how
1227 > is applying all its <SPAN
1244 > Final results:
1246 -add-header -block -deanimate-gifs -downgrade -fast-redirects
1247 +filter{html-annoyances} +filter{js-annoyances} +filter{no-popups}
1248 +filter{webbugs} +filter{nimda} +filter{banners-by-size} +filter{hal}
1249 +filter{fun} +hide-forwarded +hide-from{block} +hide-referer{forge}
1250 -hide-user-agent -image +image-blocker{blank} -limit-connect +no-compression
1251 -no-cookies-keep -no-cookies-read -no-cookies-set +no-popups -vanilla-wafer
1260 > Now another example, <SPAN
1262 >"ad.doubleclick.net"</SPAN
1273 > { +block +image }
1288 > We'll just show the interesting part here, the explicit matches. It is
1289 matched three different times, each as a <SPAN
1291 >"+block +image"</SPAN
1293 which is the expanded form of one of our aliases that had been defined as:
1296 >"+imageblock"</SPAN
1300 > are defined in the
1301 first section of the actions file and typically used to combine more
1302 than one action.)</P
1304 > Any one of these would have done the trick and blocked this as an unwanted
1305 image. This is unnecessarily redundant since the last case effectively
1306 would also cover the first. No point in taking chances with these guys
1307 though ;-) Note that if you want an ad or obnoxious
1308 URL to be invisible, it should be defined as <SPAN
1310 >"ad.doubleclick.net"</SPAN
1312 is done here -- as both a <SPAN
1322 >. The custom alias <SPAN
1324 >"+imageblock"</SPAN
1328 > One last example. Let's try <SPAN
1330 >"http://www.rhapsodyk.net/adsl/HOWTO/"</SPAN
1332 This one is giving us problems. We are getting a blank page. Hmmm...</P
1342 > Matches for http://www.rhapsodyk.net/adsl/HOWTO/:
1344 { -add-header -block +deanimate-gifs -downgrade +fast-redirects
1345 +filter{html-annoyances} +filter{js-annoyances} +filter{no-popups}
1346 +filter{webbugs} +filter{nimda} +filter{banners-by-size} +filter{hal}
1347 +filter{fun} +hide-forwarded +hide-from{block} +hide-referer{forge}
1348 -hide-user-agent -image +image-blocker{blank} +no-compression
1349 +no-cookies-keep -no-cookies-read -no-cookies-set +no-popups
1350 -vanilla-wafer -wafer }
1369 we did not want this at all! Now we see why we get the blank page. We could
1370 now add a new action below this that explicitly does <I
1374 block (-block) pages with <SPAN
1377 >. There are various ways to
1378 handle such exceptions. Example:</P
1397 > Now the page displays ;-) Be sure to flush your browser's caches when
1398 making such changes. Or, try using <TT
1403 > But now what about a situation where we get no explicit matches like
1423 > That actually was very telling and pointed us quickly to where the problem
1424 was. If you don't get this kind of match, then it means one of the default
1425 rules in the first section is causing the problem. This would require some
1426 guesswork, and maybe a little trial and error to isolate the offending rule.
1427 One likely cause would be one of the <SPAN
1431 adding the URL for the site to one of the aliases that turn off <SPAN
1446 .worldpay.com # for quietpc.com
1466 >"{ -filter -no-cookies -no-cookies-keep }"</SPAN
1468 your own exception to negate filtering: </P
1490 > is an alias that disables most actions. This can be
1491 used as a last resort for problem sites. Remember to flush caches! If this
1492 still does not work, you will have to go through the remaining actions one by
1493 one to find which one(s) is causing the problem.</P