7 CONTENT="Modular DocBook HTML Stylesheet Version 1.76b+
10 TITLE="Privoxy 3.0.6 User Manual"
11 HREF="index.html"><LINK
14 HREF="actions-file.html"><LINK
16 TITLE="Privoxy's Template Files"
17 HREF="templates.html"><LINK
21 <LINK REL="STYLESHEET" TYPE="text/css" HREF="p_doc.css">
33 SUMMARY="Header navigation table"
42 >Privoxy 3.0.6 User Manual</TH
50 HREF="actions-file.html"
82 > On-the-fly text substitutions that can be invoked through the
86 HREF="actions-file.html#FILTER"
90 to be defined in a <SPAN
94 can then be invoked as an <SPAN
97 >. Multiple filter files can be
98 defined through the <TT
101 HREF="config.html#FILTERFILE"
104 > config directive. The filters
105 as supplied by the developers will be found in
109 >. It is recommended that any locally
110 defined or modified filters go in a separately defined file such as
117 > Typical reasons for doing these kinds of substitutions are to eliminate
118 common annoyances in HTML and JavaScript, such as pop-up windows,
119 exit consoles, crippled windows without navigation tools, the
120 infamous <BLINK> tag, etc., to suppress images with certain
121 width and height attributes (standard banner sizes or web-bugs),
122 or just to have fun. The possibilities are endless.</P
124 > Filtering works on any text-based document type, including
125 HTML, JavaScript, CSS etc. (all <TT
139 Substitutions are made at the source level, so if you want to <SPAN
143 > filters, you should first be familiar with HTML syntax,
144 and, of course, regular expressions. By default, filters are only applied
145 to the raw document content, but can be extended to the HTTP headers with
146 the supplemental actions:
148 HREF="actions-file.html#FILTER-CLIENT-HEADERS"
149 >filter-client-headers</A
152 HREF="actions-file.html#FILTER-SERVER-HEADERS"
153 >filter-server-headers</A
157 HREF="actions-file.html"
160 filter file is organized in sections, which are called <SPAN
167 here. Each filter consists of a heading line that starts with the
184 >, and a short (one line)
191 > of what it does. Below that line
198 >, i.e. lines that define the actual
199 text substitutions. By convention, the name of a filter
200 should describe what the filter <SPAN
207 comment is used in the <A
208 HREF="http://config.privoxy.org/"
214 > Once a filter called <TT
220 in the filter file, it can be invoked by using an action of the form
224 HREF="actions-file.html#FILTER"
234 HREF="actions-file.html"
238 > A filter header line for a filter called <SPAN
252 >FILTER: foo Replace all "foo" with "bar"</PRE
258 > Below that line, and up to the next header line, come the jobs that
259 define what text replacements the filter executes. They are specified
260 in a syntax that imitates <A
261 HREF="http://www.perl.org/"
268 > operator. If you are familiar with Perl, you
269 will find this to be quite intuitive, and may want to look at the
270 PCRS documentation for the subtle differences to Perl behaviour. Most
271 notably, the non-standard option letter <TT
275 which turns the default to ungreedy matching.</P
279 HREF="http://en.wikipedia.org/wiki/Regular_expressions"
286 >, you might want to take a look at
288 HREF="appendix.html#REGEX"
289 >Appendix on regular expressions</A
292 HREF="http://perldoc.perl.org/perlre.html"
298 HREF="http://perldoc.perl.org/perlop.html"
304 > operator's syntax</A
306 HREF="http://perldoc.perl.org/perlre.html"
311 The examples below might also help you get started.</P
319 >9.1. Filter File Tutorial</H2
321 > Now, let's complete our <SPAN
324 > filter. We have already defined
325 the heading, but the jobs are still missing. Since all it does is replace
332 >, there is only one (trivial) job
349 > But wait! Didn't the comment say that <SPAN
359 > should be replaced? Our current job will only take
360 care of the first <SPAN
363 > on each page. For global substitution,
364 we'll need to add the <TT
383 > Our complete filter now looks like this:</P
393 >FILTER: foo Replace all "foo" with "bar"
400 > Let's look at some real filters for more interesting examples. Here you see
401 a filter that protects against some common annoyances that arise from JavaScript
402 abuse. Let's look at its jobs one after the other:</P
412 >FILTER: js-annoyances Get rid of particularly annoying JavaScript abuse
414 # Get rid of JavaScript referrer tracking. Test page: http://www.randomoddness.com/untitled.htm
416 s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|Usg</PRE
422 > Following the header line and a comment, you see the job. Note that it uses
426 > as the delimiter instead of <TT
430 the pattern contains a forward slash, which would otherwise have to be escaped
436 > Now, let's examine the pattern: it starts with the text <TT
440 enclosed in parentheses. Since the dot matches any character, and <TT
446 >"Match an arbitrary number of the element left of myself"</SPAN
458 it matches the whole page, from the start of the first <script> tag.</P
460 > That's more than we want, but the pattern continues: <TT
462 >document\.referrer</TT
464 matches only the exact string <SPAN
466 >"document.referrer"</SPAN
474 >, i.e. preceded by a backslash, to take away its
475 special meaning as a joker, and make it just a regular dot. So far, the meaning is:
476 Match from the start of the first <script> tag in the page, up to, and including,
479 >"document.referrer"</SPAN
487 in the page (and appear in that order).</P
489 > But there's still more pattern to go. The next element, again enclosed in parentheses,
492 >.*</script></TT
493 >. You already know what <TT
497 means, so the whole pattern translates to: Match from the start of the first <script>
498 tag in a page to the end of the last <script> tag, provided that the text
501 >"document.referrer"</SPAN
502 > appears somewhere in between.</P
504 > This is still not the whole story, since we have ignored the options and the parentheses:
505 The portions of the page matched by sub-patterns that are enclosed in parentheses will be
506 remembered and be available through the variables <TT
510 the substitute. The <TT
513 > option switches to ungreedy matching, which means
517 > in the pattern will only <SPAN
521 text in between <SPAN
533 >"document.referrer"</SPAN
534 >, and that the second <TT
538 only span the text up to the <SPAN
546 >"</script>"</SPAN
548 tag. Furthermore, the <TT
551 > option says that the match may span
552 multiple lines in the page, and the <TT
555 > option again means that the
556 substitution is global.</P
558 > So, to summarize, the pattern means: Match all scripts that contain the text
561 >"document.referrer"</SPAN
562 >. Remember the parts of the script from
563 (and including) the start tag up to (and excluding) the string
566 >"document.referrer"</SPAN
570 >, and the part following
571 that string, up to and including the closing tag, as <TT
576 > Now the pattern is deciphered, but wasn't this about substituting things? So
577 let's look at the substitute: <TT
579 >$1"Not Your Business!"$2</TT
581 easy to read: The text remembered as <TT
587 >"Not Your Business!"</TT
595 the quotation marks!), followed by the text remembered as <TT
599 This produces an exact copy of the original string, with the middle part
602 >"document.referrer"</SPAN
609 > The whole job now reads: Replace <SPAN
611 >"document.referrer"</SPAN
615 >"Not Your Business!"</TT
616 > wherever it appears inside a
617 <script> tag. Note that this job won't break JavaScript syntax,
618 since both the original and the replacement are syntactically valid
619 string objects. The script just won't have access to the referrer
620 information anymore.</P
622 > We'll show you two other jobs from the JavaScript taming department, but
623 this time only point out the constructs of special interest:</P
633 ># The status bar is for displaying link targets, not pointless blahblah
635 s/window\.status\s*=\s*(['"]).*?\1/dUmMy=1/ig</PRE
644 > stands for whitespace characters (space, tab, newline,
645 carriage return, form feed), so that <TT
651 or more whitespace"</SPAN
659 makes this matching of arbitrary text ungreedy. (Note that the <TT
663 option is not set). The <TT
666 > construct means: <SPAN
675 > a double quote"</SPAN
680 a back-reference to the first parenthesis just like <TT
684 with the difference that in the <SPAN
690 >, a backslash indicates
691 a back-reference, whereas in the <SPAN
697 >, it's the dollar.</P
699 > So what does this job do? It replaces assignments of single- or double-quoted
702 >"window.status"</SPAN
703 > object with a dummy assignment
704 (using a variable name that is hopefully odd enough not to conflict with
705 real variables in scripts). Thus, it catches many cases where e.g. pointless
706 descriptions are displayed in the status bar instead of the link target when
707 you move your mouse over links.</P
717 ># Kill OnUnload popups. Yummy. Test: http://www.zdnet.com/zdsubs/yahoo/tree/yfs.html
719 s/(<body [^>]*)onunload(.*>)/$1never$2/iU</PRE
727 HREF="http://www.w3.org/TR/2000/REC-DOM-Level-2-Events-20001113/events.html#Events-eventgroupings-htmlevents"
731 > in the HTML DOM was a <SPAN
738 When I close a browser window, I want it to close and die. Basta.
739 This job replaces the <SPAN
745 >"<body>"</SPAN
746 > tags with the dummy word <TT
753 > option makes the pattern matching
754 case-insensitive. Also note that ungreedy matching alone doesn't always guarantee
755 a minimal match: In the first parenthesis, we had to use <TT
762 > to prevent the match from exceeding the
763 <body> tag if it doesn't contain <SPAN
769 > The last example is from the fun department:</P
779 >FILTER: fun Fun text replacements
781 # Spice the daily news:
783 s/microsoft(?!\.com)/MicroSuck/ig</PRE
792 > part (a so-called negative lookahead)
793 in the job's pattern, which means: Don't match, if the string
797 > appears directly following <SPAN
801 in the page. This prevents links to microsoft.com from being trashed, while
802 still replacing the word everywhere else.</P
812 ># Buzzword Bingo (example for extended regex syntax)
814 s* industry[ -]leading \
816 | customer[ -]focused \
818 | award[ -]winning # Comments are OK, too! \
819 | high[ -]performance \
820 | solutions[ -]based \
824 *<font color="red"><b>BINGO!</b></font> \
834 > option in this job turns on extended syntax, and allows for
835 e.g. the liberal use of (non-interpreted!) whitespace for nicer formatting.</P
837 > You get the idea?</P
844 NAME="PREDEFINED-FILTERS"
846 >9.2. The Pre-defined Filters</H2
848 >The distribution <TT
851 > file contains a selection of
852 pre-defined filters for your convenience:</P
868 > The purpose of this filter is to get rid of particularly annoying JavaScript abuse.
875 > replaces JavaScript references to the browser's referrer information
876 with the string "Not Your Business!". This complements the <TT
879 HREF="actions-file.html#HIDE-REFERRER"
882 > action on the content level.
887 > removes the bindings to the DOM's
889 HREF="http://www.w3.org/TR/2000/REC-DOM-Level-2-Events-20001113/events.html#Events-eventgroupings-htmlevents"
893 > which we feel has no right to exist and is responsible for most <SPAN
895 >"exit consoles"</SPAN
897 nasty windows that pop up when you close another one.
902 > removes code that causes new windows to be opened with undesired properties, such as being
903 full-screen, non-resizeable, without location, status or menu bar etc.
910 > Use with caution. This is an aggressive filter, and can break sites that
911 rely heavily on JavaScript.
924 > This is a very radical measure. It removes virtually all JavaScript event bindings, which
925 means that scripts can no longer react to user actions such as mouse movements or clicks, window
926 resizing, etc. Use with caution!
933 >strongly discourage</I
935 > using this filter as a default since it breaks
936 many legitimate scripts. It is meant for use only on extra-nasty sites (should you really
950 > This filter will undo many common instances of HTML-based abuse.
960 are neutralized (yeah baby!), and browser windows will be created as
961 resizeable (as of course they should be!), and will have location,
962 scroll and menu bars -- even if specified otherwise.
975 > Most cookies are set in the HTTP dialog, where they can be intercepted
980 HREF="actions-file.html#CRUNCH-INCOMING-COOKIES"
981 >crunch-incoming-cookies</A
987 HREF="actions-file.html#CRUNCH-OUTGOING-COOKIES"
988 >crunch-outgoing-cookies</A
991 actions. But web sites increasingly make use of HTML meta tags and JavaScript
992 to sneak cookies to the browser on the content level.
995 > This filter disables most HTML and JavaScript code that reads or sets
996 cookies. It cannot detect all clever uses of these types of code, so it
997 should not be relied on as an absolute fix. Use it wherever you would also
998 use the cookie crunch actions.
1011 > Disable any refresh tags if the interval is greater than nine seconds (so
1012 that redirections done via refresh tags are not destroyed). This is useful
1013 for dial-on-demand setups, or for those who find this HTML feature
1022 >unsolicited-popups</I
1027 > This filter attempts to prevent only <SPAN
1029 >"unsolicited"</SPAN
1031 windows from opening, yet still allow pop-up windows that the user
1032 has explicitly chosen to open. It was added in version 3.0.1,
1033 as an improvement over earlier such filters.
1036 > Technical note: The filter works by redefining the window.open JavaScript
1037 function to a dummy function, <TT
1039 >PrivoxyWindowOpen()</TT
1041 during the loading and rendering phase of each HTML page access, and
1042 restoring the function afterward.
1045 > This is recommended only for browsers that cannot perform this function
1046 reliably themselves. And be aware that some sites require such windows
1047 in order to function normally. Use with caution.
1060 > Attempt to prevent <SPAN
1066 > pop-up windows from opening.
1067 Note that this should be used with even more discretion than the above, since
1068 it is more likely to break some sites that require pop-ups for normal
1069 usage. Use with caution.
1082 > This is a helper filter that has no value if used alone. It makes the
1085 >banners-by-size</TT
1088 >banners-by-link</TT
1090 (see below) filters more effective and should be enabled together with them.
1103 > This filter removes image tags purely based on what size they are. Fortunately
1104 for us, many ads and banner images tend to conform to certain standardized
1105 sizes, which makes this filter quite effective for ad stripping purposes.
1108 > Occasionally this filter will cause false positives on images that are not ads,
1109 but just happen to be of one of the standard banner sizes.
1112 > Recommended only for those who require extreme ad blocking. The default
1113 block rules should catch 95+% of all ads <SPAN
1119 > this filter enabled.
1132 > This is an experimental filter that attempts to kill any banners if
1133 their URLs seem to point to known or suspected click trackers. It is currently
1134 not of much value and is not recommended for use by default.
1147 > Webbugs are small, invisible images (technically 1x1 GIF images) that
1148 are used to track users across websites, and collect information on them.
1149 As an HTML page is loaded by the browser, an embedded image tag causes the
1150 browser to contact a third-party site, disclosing the tracking information
1151 through the requested URL and/or cookies for that third-party domain, without
1152 the user ever becoming aware of the interaction with the third-party site.
1153 HTML-ized spam also uses a similar technique to verify email addresses.
1156 > This filter removes the HTML code that loads such <SPAN
1172 > A rather special-purpose filter that can be used to enlarge textareas (those
1173 multi-line text boxes in web forms) and turn off hard word wrap in them.
1174 It was written for the sourceforge.net tracker system where such boxes are
1175 a nuisance, but it can be handy on other sites, too.
1178 > It is not recommended to use this filter as a default.
1191 > Many consider windows that move or resize themselves to be abusive. This filter
1192 neutralizes the related JavaScript code. Note that some sites might not display
1193 or behave as intended when using this filter. Use with caution.
1201 >frameset-borders</I
1206 > Some web designers seem to assume that everyone in the world will view their
1207 web sites using the same browser brand and version, screen resolution, etc.,
1208 because only that assumption could explain why they'd use static frame sizes,
1209 yet prevent their frames from being resized by the user, should they be too
1210 small to show their whole content.
1213 > This filter removes the related HTML code. It should only be applied to sites
1227 > Many Microsoft products that generate HTML use non-standard extensions (read:
1228 violations) of the ISO 8859-1 aka Latin-1 character set. This can cause those
1229 HTML documents to display with errors on standards-compliant platforms.
1232 > This filter translates the MS-only characters into Latin-1 equivalents.
1233 It is not necessary when using MS products, and will cause corruption of
1234 all documents that use 8-bit character sets other than Latin-1. It's mostly
1235 worthwhile for Europeans on non-MS platforms, if weird garbage characters
1236 sometimes appear on some pages, or user agents that don't correct for this on
1251 > A filter for Shockwave haters. As the name suggests, this filter strips code
1252 out of web pages that is used to embed Shockwave Flash objects.
1262 >quicktime-kioskmode</I
1267 > Change HTML code that embeds QuickTime objects so that kiosk mode, which
1268 prevents saving, is disabled.
1281 > Text replacements for subversive browsing fun. Make fun of your favorite
1282 monopolist or play buzzword bingo.
1295 > A demonstration-only filter that shows how <SPAN
1299 can be used to delete web content on a keyword basis.
1312 > An experimental collection of text replacements to disable malicious HTML and JavaScript
1313 code that exploits known security holes in Internet Explorer.
1316 > Presently, it only protects against Nimda and a cross-site scripting bug, and
1317 would need active maintenance to provide more substantial protection.
1330 > Some web sites have very specific problems, the cure for which doesn't apply
1331 anywhere else, or could even cause damage on other sites.
1334 > This is a collection of such site-specific cures which should only be applied
1335 to the sites they were intended for, which is what the supplied
1339 > file does. Users shouldn't need to change
1340 anything regarding this filter.
1353 > A CSS-based block for Google text ads. Also removes a width limitation
1354 and the toolbar advertisement.
1367 > Another CSS-based block, this time for Yahoo text ads. It also removes
1368 a width limitation.
1381 > Another CSS-based block, this time for MSN text ads. It also removes
1382 tracking URLs and a width limitation.
1395 > Cleans up some Blogspot blogs. Read the fine print before using this one!
1398 > This filter also intentionally removes some navigation stuff and sets the
1399 page width to 100%. As a result, some rounded <SPAN
1403 appear too early or not at all, and as fixing this would require a browser
1404 that understands background-size (CSS3), they are removed instead.
1417 > Header filter to change the Content-Type from xml to html.
1430 > Header filter to change the Content-Type from html to xml.
1443 > Removes the non-standard <TT
1447 anchor and area HTML tags.
1455 >hide-tor-exit-notation</I
1460 > Header filter to remove the <B
1463 > exit node notation
1464 found in Host and Referer headers.
1476 SUMMARY="Footer navigation table"
1487 HREF="actions-file.html"
1505 HREF="templates.html"
1525 >Privoxy's Template Files</TD