1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
5 >The Filter File</TITLE
8 CONTENT="Modular DocBook HTML Stylesheet Version 1.7"><LINK
10 TITLE="Privoxy 3.0.3 User Manual"
11 HREF="index.html"><LINK
14 HREF="actions-file.html"><LINK
17 HREF="templates.html"><LINK
20 HREF="../p_doc.css"></HEAD
31 SUMMARY="Header navigation table"
40 >Privoxy 3.0.3 User Manual</TH
48 HREF="actions-file.html"
77 >9. The Filter File</A
80 > All text substitutions that can be invoked through the
84 HREF="actions-file.html#FILTER"
88 must first be defined in the filter file, which is typically
93 selected through the <VAR
96 HREF="config.html#FILTERFILE"
102 > Typical reasons for doing such substitutions are to eliminate
103 common annoyances in HTML and JavaScript, such as pop-up windows,
104 exit consoles, crippled windows without navigation tools, the
105 infamous <BLINK> tag, etc.; to suppress images with certain
106 width and height attributes (standard banner sizes or web-bugs),
107 or just to have fun. The possibilities are endless.</P
109 > Filtering works on any text-based document type, including
110 HTML, JavaScript, CSS etc. (all <VAR
124 Substitutions are made at the source level, so if you want to <SPAN
128 > filters, you should be familiar with HTML syntax.</P
131 HREF="actions-file.html"
134 filter file is organized in sections, which are called <SPAN
141 here. Each filter consists of a heading line that starts with the
158 >, and a short (one line)
165 > of what it does. Below that line
172 >, i.e. lines that define the actual
173 text substitutions. By convention, the name of a filter
174 should describe what the filter <SPAN
181 comment is used in the <A
182 HREF="http://config.privoxy.org/"
188 > Once a filter called <VAR
192 in the filter file, it can be invoked by using an action of the form
196 HREF="actions-file.html#FILTER"
204 HREF="actions-file.html"
208 > A filter header line for a filter called <SPAN
222 >FILTER: foo Replace all "foo" with "bar"</PRE
228 > Below that line, and up to the next header line, come the jobs that
229 define what text replacements the filter executes. They are specified
230 in a syntax that imitates <A
231 HREF="http://www.perl.org/"
238 > operator. If you are familiar with Perl, you
239 will find this to be quite intuitive, and may want to look at the
241 HREF="http://www.oesterhelt.org/pcrs/pcrs.3.html"
245 for the subtle differences from Perl behaviour. Most notably, the non-standard
249 > is supported, which turns the default
250 to ungreedy matching.</P
252 > If you are new to regular expressions, you might want to take a look at
254 HREF="appendix.html#REGEX"
255 >Appendix on regular expressions</A
258 HREF="http://perldoc.com/perl5.6.1/pod/perl.html"
264 HREF="http://perldoc.com/perl5.6.1/pod/perlop.html#s-PATTERN-REPLACEMENT-egimosx"
270 > operator's syntax</A
272 HREF="http://perldoc.com/perl5.6.1/pod/perlre.html"
277 The examples below might also help you get started.</P
284 >9.1. Filter File Tutorial</A
287 > Now, let's complete our <SPAN
290 > filter. We have already defined
291 the heading, but the jobs are still missing. Since all it does is replace
298 >, there is only one (trivial) job
315 > But wait! Didn't the comment say that <SPAN
325 > should be replaced? Our current job will only take
326 care of the first <SPAN
329 > on each page. For global substitution,
330 we'll need to add the <VAR
349 > Our complete filter now looks like this:</P
359 >FILTER: foo Replace all "foo" with "bar"
366 > Let's look at some real filters for more interesting examples. Here you see
367 a filter that protects against some common annoyances that arise from JavaScript
368 abuse. Let's look at its jobs one after the other:</P
378 >FILTER: js-annoyances Get rid of particularly annoying JavaScript abuse
380 # Get rid of JavaScript referrer tracking. Test page: http://www.randomoddness.com/untitled.htm
382 s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|Usg</PRE
388 > Following the header line and a comment, you see the job. Note that it uses
392 > as the delimiter instead of <VAR
396 the pattern contains a forward slash, which would otherwise have to be escaped
402 > Now, let's examine the pattern: it starts with the text <VAR
406 enclosed in parentheses. Since the dot matches any character, and <VAR
412 >"Match an arbitrary number of the element left of myself"</SPAN
424 it matches the whole page, from the start of the first <script> tag.</P
426 > That's more than we want, but the pattern continues: <VAR
428 >document\.referrer</VAR
430 matches only the exact string <SPAN
432 >"document.referrer"</SPAN
440 >, i.e. preceded by a backslash, to take away its
441 special meaning as a joker, and make it just a regular dot. So far, the meaning is:
442 Match from the start of the first <script> tag in the page, up to, and including,
445 >"document.referrer"</SPAN
453 in the page (and appear in that order).</P
455 > But there's still more pattern to go. The next element, again enclosed in parentheses,
458 >.*</script></VAR
459 >. You already know what <VAR
463 means, so the whole pattern translates to: Match from the start of the first <script>
464 tag in a page to the end of the last <script> tag, provided that the text
467 >"document.referrer"</SPAN
468 > appears somewhere in between.</P
470 > This is still not the whole story, since we have ignored the options and the parentheses:
471 The portions of the page matched by sub-patterns that are enclosed in parentheses will be
472 remembered and be available through the variables <VAR
476 the substitute. The <VAR
479 > option switches to ungreedy matching, which means
483 > in the pattern will only <SPAN
487 text in between <SPAN
499 >"document.referrer"</SPAN
500 >, and that the second <VAR
504 only span the text up to the <SPAN
512 >"</script>"</SPAN
514 tag. Furthermore, the <VAR
517 > option says that the match may span
518 multiple lines in the page, and the <VAR
521 > option again means that the
522 substitution is global.</P
524 > So, to summarize, the pattern means: Match all scripts that contain the text
527 >"document.referrer"</SPAN
528 >. Remember the parts of the script from
529 (and including) the start tag up to (and excluding) the string
532 >"document.referrer"</SPAN
536 >, and the part following
537 that string, up to and including the closing tag, as <VAR
542 > Now the pattern is deciphered, but wasn't this about substituting things? So
543 let's look at the substitute: <VAR
545 >$1"Not Your Business!"$2</VAR
547 easy to read: The text remembered as <VAR
553 >"Not Your Business!"</VAR
561 the quotation marks!), followed by the text remembered as <VAR
565 This produces an exact copy of the original string, with the middle part
568 >"document.referrer"</SPAN
575 > The whole job now reads: Replace <SPAN
577 >"document.referrer"</SPAN
581 >"Not Your Business!"</VAR
582 > wherever it appears inside a
583 <script> tag. Note that this job won't break JavaScript syntax,
584 since both the original and the replacement are syntactically valid
585 string objects. The script just won't have access to the referrer
586 information anymore.</P
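>
<P>The difference the ungreedy option makes can be tried outside Privoxy. The following is only a sketch in Python, whose re module is not PCRS: it has no U option, so ungreedy matching is written out as lazy quantifiers, and re.DOTALL stands in for the s option. The sample page and the name hide_referrer are made up for illustration.</P>

```python
import re

# PCRS job from the text:
#   s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|Usg
# Python has no "U" option, so ungreedy matching is spelled out with ".*?".
# re.DOTALL stands in for the "s" option; subn() replaces every match, like "g".
UNGREEDY = re.compile(r'(<script.*?)document\.referrer(.*?</script>)', re.DOTALL)
GREEDY = re.compile(r'(<script.*)document\.referrer(.*</script>)', re.DOTALL)
SUBSTITUTE = r'\1"Not Your Business!"\2'

# Made-up test page with two scripts that both read the referrer.
PAGE = (
    '<script>var a = document.referrer;</script>\n'
    '<p>plain text</p>\n'
    '<script>\nvar b = 1;\nsend(document.referrer);\n</script>'
)


def hide_referrer(page, pattern=UNGREEDY):
    """Return (filtered_page, number_of_replacements)."""
    return pattern.subn(SUBSTITUTE, page)


ungreedy_page, ungreedy_count = hide_referrer(PAGE)
greedy_page, greedy_count = hide_referrer(PAGE, GREEDY)
```

<P>On this sample page the lazy pattern rewrites both scripts, while the greedy variant matches the page once as a whole and replaces only the last occurrence of document.referrer.</P>
<P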
588 > We'll show you two other jobs from the JavaScript taming department, but
589 this time only point out the constructs of special interest:</P
599 ># The status bar is for displaying link targets, not pointless blahblah
601 s/window\.status\s*=\s*(['"]).*?\1/dUmMy=1/ig</PRE
610 > stands for whitespace characters (space, tab, newline,
611 carriage return, form feed), so that <VAR
617 or more whitespace"</SPAN
625 makes this matching of arbitrary text ungreedy. (Note that the <VAR
629 option is not set). The <VAR
632 > construct means: <SPAN
641 > a double quote"</SPAN
646 a backreference to the first parenthesis just like <VAR
650 with the difference that in the <SPAN
656 >, a backslash indicates
657 a backreference, whereas in the <SPAN
663 >, it's the dollar.</P
665 > So what does this job do? It replaces assignments of single- or double-quoted
668 >"window.status"</SPAN
669 > object with a dummy assignment
670 (using a variable name that is hopefully odd enough not to conflict with
671 real variables in scripts). Thus, it catches many cases where e.g. pointless
672 descriptions are displayed in the status bar instead of the link target when
673 you move your mouse over links.</P
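>
<P>The backreference in this job can be tried with a short Python sketch. Python's re module is used here only as a stand-in for PCRS; the helper name neutralize_status and the two sample strings are invented for the example.</P>

```python
import re

# PCRS job from the text: s/window\.status\s*=\s*(['"]).*?\1/dUmMy=1/ig
# re.IGNORECASE stands in for "i"; sub() replaces all matches, like "g".
# (['"]) remembers which quote opened the string, and \1 requires the same
# quote to close it, so the other kind of quote may appear inside the string.
STATUS_ASSIGNMENT = re.compile(r'''window\.status\s*=\s*(['"]).*?\1''', re.IGNORECASE)


def neutralize_status(text):
    """Replace quoted-string assignments to window.status with a dummy assignment."""
    return STATUS_ASSIGNMENT.sub('dUmMy=1', text)


link = neutralize_status(
    '''<a href="x.html" onmouseover="window.status='Click here!'; return true">'''
)
mixed = neutralize_status('''Window.Status = "it's done";''')
```

<P>In the second sample the apostrophe inside the double-quoted string does not end the match, because the backreference only accepts the quote character that opened the string.</P>
<PRE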
683 ># Kill OnUnload popups. Yummy. Test: http://www.zdnet.com/zdsubs/yahoo/tree/yfs.html
685 s/(<body [^>]*)onunload(.*>)/$1never$2/iU</PRE
693 HREF="http://www.w3.org/TR/2000/REC-DOM-Level-2-Events-20001113/events.html#Events-eventgroupings-htmlevents"
697 > in the HTML DOM was a <SPAN
704 When I close a browser window, I want it to close and die. Basta.
705 This job replaces the <SPAN
711 >"<body>"</SPAN
712 > tags with the dummy word <VAR
719 > option makes the pattern matching
720 case-insensitive. Also note that ungreedy matching alone doesn't always guarantee
721 a minimal match: In the first parenthesis, we had to use <VAR
728 > to prevent the match from exceeding the
729 <body> tag if it doesn't contain <SPAN
735 > The last example is from the fun department:</P
745 >FILTER: fun Fun text replacements
747 # Spice the daily news:
749 s/microsoft(?!\.com)/MicroSuck/ig</PRE
758 > part (a so-called negative lookahead)
759 in the job's pattern, which means: Don't match, if the string
763 > appears directly following <SPAN
767 in the page. This prevents links to microsoft.com from being trashed, while
768 still replacing the word everywhere else.</P
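>
<P>A quick way to see the negative lookahead at work is the following Python sketch. It uses Python's re module rather than PCRS, and the function name spice and the sample sentence are made up for illustration.</P>

```python
import re

# PCRS job from the text: s/microsoft(?!\.com)/MicroSuck/ig
# (?!\.com) is a negative lookahead: the match fails where ".com" follows.
WORD = re.compile(r'microsoft(?!\.com)', re.IGNORECASE)


def spice(text):
    """Replace the word everywhere except where it is directly followed by '.com'."""
    return WORD.sub('MicroSuck', text)


spiced = spice('Microsoft said so at http://www.microsoft.com/ today.')
```

<P>Only the first occurrence in the sample sentence is replaced; the host name in the URL is left alone.</P>
<PRE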
778 ># Buzzword Bingo (example for extended regex syntax)
780 s* industry[ -]leading \
782 | customer[ -]focused \
784 | award[ -]winning # Comments are OK, too! \
785 | high[ -]performance \
786 | solutions[ -]based \
790 *<font color="red"><b>BINGO!</b></font> \
800 > option in this job turns on extended syntax, and allows for
801 e.g. the liberal use of (non-interpreted!) whitespace for nicer formatting. </P
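>
<P>Python's re.VERBOSE flag behaves much like the extended syntax described here, so the layout of the job can be sketched as follows. This is an approximation, not PCRS itself: it lists only the alternatives visible in the example above, and the names BUZZWORDS and bingo are hypothetical.</P>

```python
import re

# Approximation of the Buzzword Bingo job using Python's verbose mode.
# Whitespace in the pattern is ignored, except inside character classes
# such as [ -], which still match a literal space or hyphen.
BUZZWORDS = re.compile(
    r"""
      industry[ -]leading
    | customer[ -]focused
    | award[ -]winning      # comments are allowed in verbose mode
    | high[ -]performance
    | solutions[ -]based
    """,
    re.IGNORECASE | re.VERBOSE,
)

BINGO = '<font color="red"><b>BINGO!</b></font>'


def bingo(text):
    """Replace every listed buzzword with the red BINGO! marker."""
    return BUZZWORDS.sub(BINGO, text)


result = bingo('Our award-winning, Industry Leading widget')
```

<P>Both buzzwords in the sample sentence are replaced, regardless of case and of whether they are written with a space or a hyphen.</P>
<P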
803 > You get the idea?</P
810 NAME="PREDEFINED-FILTERS"
811 >9.2. The Pre-defined Filters</A
814 >The distribution <TT
817 > file contains a selection of
818 pre-defined filters for your convenience:</P
834 > The purpose of this filter is to get rid of particularly annoying JavaScript abuse.
841 > replaces JavaScript references to the browser's referrer information
842 with the string "Not Your Business!". This complements the <VAR
845 HREF="actions-file.html#HIDE-REFERRER"
848 > action on the content level.
853 > removes the bindings to the DOM's
855 HREF="http://www.w3.org/TR/2000/REC-DOM-Level-2-Events-20001113/events.html#Events-eventgroupings-htmlevents"
859 > which we feel has no right to exist and is responsible for most <SPAN
861 >"exit consoles"</SPAN
863 nasty windows that pop up when you close another one.
868 > removes code that causes new windows to be opened with undesired properties, such as being
869 full-screen, non-resizable, without location, status or menu bar etc.
886 > This is a very radical measure. It removes virtually all JavaScript event bindings, which
887 means that scripts can no longer react to user actions such as mouse movements or clicks, window
888 resizing, etc.
895 >strongly discourage</I
897 > using this filter as a default since it breaks
898 many legitimate scripts. It is meant for use only on extra-nasty sites (should you really
912 > This filter will undo many common instances of HTML based abuse.
922 are neutralized (yeah baby!), and browser windows will be created as
923 resizable (as of course they should be!), and will have location,
924 scroll and menu bars -- even if specified otherwise.
937 > Most cookies are set in the HTTP dialogue, where they can be intercepted
942 HREF="actions-file.html#CRUNCH-INCOMING-COOKIES"
943 >crunch-incoming-cookies</A
949 HREF="actions-file.html#CRUNCH-OUTGOING-COOKIES"
950 >crunch-outgoing-cookies</A
953 actions. But web sites increasingly make use of HTML meta tags and JavaScript
954 to sneak cookies to the browser on the content level.
957 > This filter disables HTML and JavaScript code that reads or sets cookies. Use
958 it wherever you would also use the cookie crunch actions.
971 > Disable any refresh tags if the interval is greater than nine seconds (so
972 that redirections done via refresh tags are not destroyed). This is useful
973 for dial-on-demand setups, or for those who find this HTML feature
982 >unsolicited-popups</I
987 > This filter attempts to prevent only <SPAN
991 windows from opening, yet still allow pop-up windows that the user
992 has explicitly chosen to open. It was added in version 3.0.1,
993 as an improvement over earlier such filters.
996 > Technical note: The filter works by redefining the window.open JavaScript
997 function to a dummy function during the loading and rendering phase of each
998 HTML page access, and restoring the function afterwards.
1011 > Attempt to prevent <SPAN
1017 > pop-up windows from opening.
1018 Note that this should be used with more discretion than the above, since it is
1019 more likely to break some sites that require pop-ups for normal usage. Use
1033 > This is a helper filter that has no value if used alone. It makes the
1036 >banners-by-size</VAR
1039 >banners-by-link</VAR
1041 (see below) filters more effective and should be enabled together with them.
1054 > This filter removes image tags purely based on what size they are. Fortunately
1055 for us, many ads and banner images tend to conform to certain standardized
1056 sizes, which makes this filter quite effective for ad stripping purposes.
1059 > Occasionally this filter will cause false positives on images that are not ads,
1060 but just happen to be of one of the standard banner sizes.
1073 > This is an experimental filter that attempts to kill any banners if
1074 their URLs seem to point to known or suspected click trackers. It is currently
1075 not of much value and is not recommended for use by default.
1088 > Webbugs are small, invisible images (technically 1x1 GIF images) that
1089 are used to track users across websites, and collect information on them.
1090 As an HTML page is loaded by the browser, an embedded image tag causes the
1091 browser to contact a third-party site, disclosing the tracking information
1092 through the requested URL and/or cookies for that third-party domain, without
1093 the user ever becoming aware of the interaction with the third-party site.
1094 HTML-ized spam also uses a similar technique to verify email addresses.
1097 > This filter removes the HTML code that loads such <SPAN
1113 > A rather special-purpose filter that can be used to enlarge textareas (those
1114 multi-line text boxes in web forms) and turn off hard word wrap in them.
1115 It was written for the sourceforge.net tracker system where such boxes are
1116 a nuisance, but it can be handy on other sites, too.
1119 > It is not recommended to use this filter as a default.
1132 > Many consider windows that move or resize themselves to be abusive. This filter
1133 neutralizes the related JavaScript code. Note that some sites might not display
1134 or behave as intended when using this filter.
1142 >frameset-borders</I
1147 > Some web designers seem to assume that everyone in the world will view their
1148 web sites using the same browser brand and version, screen resolution, etc.,
1149 because only that assumption could explain why they'd use static frame sizes,
1150 yet prevent their frames from being resized by the user, should they be too
1151 small to show their whole content.
1154 > This filter removes the related HTML code. It should only be applied to sites
1168 > Many Microsoft products that generate HTML use non-standard extensions (read:
1169 violations) of the ISO 8859-1 aka Latin-1 character set. This causes those
1170 HTML documents to display with errors on standards-compliant platforms.
1173 > This filter translates the MS-only characters into Latin-1 equivalents.
1174 It is not necessary when using MS products, and will cause corruption of
1175 all documents that use 8-bit character sets other than Latin-1. It's mostly
1176 worthwhile for Europeans on non-MS platforms, if weird garbage characters
1177 sometimes appear on some pages.
1190 > A filter for shockwave haters. As the name suggests, this filter strips code
1191 out of web pages that is used to embed shockwave flash objects.
1201 >quicktime-kioskmode</I
1206 > Change HTML code that embeds Quicktime objects so that kioskmode, which
1207 prevents saving, is disabled.
1220 > Text replacements for subversive browsing fun. Make fun of your favorite
1221 Monopolist or play buzzword bingo.
1234 > A demonstration-only filter that shows how <SPAN
1238 can be used to delete web content on a keyword basis.
1251 > A collection of text replacements to disable malicious HTML and JavaScript
1252 code that exploits known security holes in Internet Explorer.
1255 > Presently, it only protects against Nimda and a cross-site scripting bug, and
1256 would need active maintenance to provide more substantial protection.
1269 > Some web sites have very specific problems, the cure for which doesn't apply
1270 anywhere else, or could even cause damage on other sites.
1273 > This is a collection of such site-specific cures which should only be applied
1274 to the sites they were intended for, which is what the supplied
1278 > file does. Users shouldn't need to change
1279 anything regarding this filter.
1291 SUMMARY="Footer navigation table"
1302 HREF="actions-file.html"
1320 HREF="templates.html"