4 >The Filter File</TITLE
7 CONTENT="Modular DocBook HTML Stylesheet Version 1.60"><LINK
9 TITLE="Privoxy User Manual"
10 HREF="index.html"><LINK
13 HREF="actions-file.html"><LINK
16 HREF="templates.html"><LINK
19 HREF="../p_doc.css"></HEAD
38 >Privoxy User Manual</TH
46 HREF="actions-file.html"
73 >9. The Filter File</A
76 > All text substitutions that can be invoked through the
80 HREF="actions-file.html#FILTER"
84 must first be defined in the filter file, which is typically
89 selected through the <TT
92 HREF="config.html#FILTERFILE"
98 > Typical reasons for doing such substitutions are to eliminate
99 common annoyances in HTML and JavaScript, such as pop-up windows,
100 exit consoles, crippled windows without navigation tools, the
101 infamous <BLINK> tag, etc., to suppress images with certain
102 width and height attributes (standard banner sizes or web-bugs),
103 or just to have fun. The possibilities are endless.</P
105 > Filtering works on any text-based document type, including plain
106 text, HTML, JavaScript, CSS, etc. (all <TT
110 MIME types). Substitutions are made at the source level, so if
113 >"roll your own"</SPAN
114 > filters, you should be
115 familiar with HTML syntax.</P
118 HREF="actions-file.html"
121 filter file is organized in sections, which are called <I
125 here. Each filter consists of a heading line that starts with the
136 >, and a short (one line)
140 > of what it does. Below that line
144 >, i.e. lines that define the actual
145 text substitutions. By convention, the name of a filter
146 should describe what the filter <I
150 comment is used in the <A
151 HREF="http://config.privoxy.org/"
157 > Once a filter called <TT
163 in the filter file, it can be invoked by using an action of the form
167 HREF="actions-file.html#FILTER"
177 HREF="actions-file.html"
181 > A filter header line for a filter called <SPAN
195 >FILTER: foo Replace all "foo" with "bar"</PRE
201 > Below that line, and up to the next header line, come the jobs that
202 define what text replacements the filter executes. They are specified
203 in a syntax that imitates <A
204 HREF="http://www.perl.org/"
211 > operator. If you are familiar with Perl, you
212 will find this to be quite intuitive, and may want to look at the
214 HREF="http://www.oesterhelt.org/pcrs/pcrs.1.html"
218 for the subtle differences from Perl behaviour. Most notably, the non-standard
222 > is supported, which turns the default
223 to ungreedy matching.</P
225 > If you are new to regular expressions, you might want to take a look at
227 HREF="appendix.html#REGEX"
228 >Appendix on regular expressions</A
231 HREF="http://perldoc.com/perl5.6.1/pod/perl.html"
237 HREF="http://perldoc.com/perl5.6.1/pod/perlop.html#s-PATTERN-REPLACEMENT-egimosx"
243 > operator's syntax</A
245 HREF="http://perldoc.com/perl5.6.1/pod/perlre.html"
250 The examples below might also help you get started.</P
257 >9.1. Filter File Tutorial</A
260 > Now, let's complete our <SPAN
263 > filter. We have already defined
264 the heading, but the jobs are still missing. Since all it does is replace
271 >, there is only one (trivial) job
288 > But wait! Didn't the comment say that <I
295 > should be replaced? Our current job will only take
296 care of the first <SPAN
299 > on each page. For global substitution,
300 we'll need to add the <TT
319 > Our complete filter now looks like this:</P
329 >FILTER: foo Replace all "foo" with "bar"
336 > Let's look at some real filters for more interesting examples. Here you see
337 a filter that protects against some common annoyances that arise from JavaScript
338 abuse. Let's look at its jobs one after the other:</P
348 >FILTER: js-annoyances Get rid of particularly annoying JavaScript abuse
350 # Get rid of JavaScript referrer tracking. Test page: http://www.randomoddness.com/untitled.htm
352 s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|Usg</PRE
358 > Following the header line and a comment, you see the job. Note that it uses
362 > as the delimiter instead of <TT
366 the pattern contains a forward slash, which would otherwise have to be escaped
372 > Now, let's examine the pattern: it starts with the text <TT
376 enclosed in parentheses. Since the dot matches any character, and <TT
382 >"Match an arbitrary number of the element left of myself"</SPAN
391 it matches the whole page, from the start of the first <script> tag.</P
393 > That's more than we want, but the pattern continues: <TT
395 >document\.referrer</TT
397 matches only the exact string <SPAN
399 >"document.referrer"</SPAN
404 >, i.e. preceded by a backslash, to take away its
405 special meaning as a joker, and make it just a regular dot. So far, the meaning is:
406 Match from the start of the first <script> tag in the page, up to, and including,
409 >"document.referrer"</SPAN
414 in the page (and appear in that order).</P
416 > But there's still more pattern to go. The next element, again enclosed in parentheses,
419 >.*</script></TT
420 >. You already know what <TT
424 means, so the whole pattern translates to: Match from the start of the first <script>
425 tag in a page to the end of the last <script> tag, provided that the text
428 >"document.referrer"</SPAN
429 > appears somewhere in between.</P
431 > This is still not the whole story, since we have ignored the options and the parentheses:
432 The portions of the page matched by sub-patterns that are enclosed in parentheses will be
433 remembered and be available through the variables <TT
437 the substitute. The <TT
440 > option switches to ungreedy matching, which means
444 > in the pattern will only <SPAN
448 text in between <SPAN
457 >"document.referrer"</SPAN
458 >, and that the second <TT
462 only span the text up to the <I
467 >"</script>"</SPAN
469 tag. Furthermore, the <TT
472 > option says that the match may span
473 multiple lines in the page, and the <TT
476 > option again means that the
477 substitution is global.</P
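><P
>The effect of the ungreedy option can be checked outside Privoxy. The following is a minimal sketch in Python, not Privoxy code. Python's re module has no counterpart to the U option, so the ungreedy variant is written with explicit *? quantifiers, on the assumption that this behaves the same for this pattern. re.DOTALL stands in for the s option, and re.subn replaces globally like the g option. The sample page is made up for illustration.</P
>

```python
import re

# Made-up sample page with two scripts that both read the referrer.
page = (
    "<script>a = document.referrer;</script>\n"
    "<p>some text</p>\n"
    "<script>b = document.referrer;</script>\n"
)

replacement = r'\1"Not Your Business!"\2'

# Greedy: ".*" grabs as much as possible, so a single match spans both scripts.
greedy = re.compile(r"(<script.*)document\.referrer(.*</script>)", re.DOTALL)

# Ungreedy: ".*?" grabs as little as possible, so each script is matched separately.
ungreedy = re.compile(r"(<script.*?)document\.referrer(.*?</script>)", re.DOTALL)

# subn returns the new text and the number of matches that were replaced.
_, greedy_count = greedy.subn(replacement, page)
_, ungreedy_count = ungreedy.subn(replacement, page)

print(greedy_count, ungreedy_count)
```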
479 > So, to summarize, the pattern means: Match all scripts that contain the text
482 >"document.referrer"</SPAN
483 >. Remember the parts of the script from
484 (and including) the start tag up to (and excluding) the string
487 >"document.referrer"</SPAN
491 >, and the part following
492 that string, up to and including the closing tag, as <TT
497 > Now the pattern is deciphered, but wasn't this about substituting things? So
498 let's look at the substitute: <TT
500 >$1"Not Your Business!"$2</TT
502 easy to read: The text remembered as <TT
508 >"Not Your Business!"</TT
513 the quotation marks!), followed by the text remembered as <TT
517 This produces an exact copy of the original string, with the middle part
520 >"document.referrer"</SPAN
527 > The whole job now reads: Replace <SPAN
529 >"document.referrer"</SPAN
533 >"Not Your Business!"</TT
534 > wherever it appears inside a
535 <script> tag. Note that this job won't break JavaScript syntax,
536 since both the original and the replacement are syntactically valid
537 string objects. The script just won't have access to the referrer
538 information anymore.</P
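><P
>As a rough illustration, the whole job can be imitated in Python. This is a sketch under assumptions, not the way Privoxy itself applies filters: the function name hide_referrer and the sample script are hypothetical, \1 and \2 play the role of $1 and $2, the U option is expressed as *? quantifiers, and re.DOTALL stands in for the s option.</P
>

```python
import re

# Assumed Python counterpart of the job's pattern (U -> "*?", s -> re.DOTALL).
JOB = re.compile(r"(<script.*?)document\.referrer(.*?</script>)", re.DOTALL)


def hide_referrer(page: str) -> str:
    """Replace document.referrer inside scripts with a harmless string literal."""
    # sub() replaces every match, which corresponds to the g option.
    return JOB.sub(r'\1"Not Your Business!"\2', page)


# Hypothetical script that spans several lines.
page = "<script>\nvar ref = document.referrer;\n</script>"
result = hide_referrer(page)
print(result)

# Text outside of a script element is left alone.
untouched = hide_referrer("<p>document.referrer</p>")
```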
540 > We'll show you two other jobs from the JavaScript taming department, but
541 this time only point out the constructs of special interest:</P
551 ># The status bar is for displaying link targets, not pointless blahblah
553 s/window\.status\s*=\s*['"].*?['"]/dUmMy=1/ig</PRE
562 > stands for whitespace characters (space, tab, newline,
563 carriage return, form feed), so that <TT
569 or more whitespace"</SPAN
577 makes this matching of arbitrary text ungreedy. (Note that the <TT
581 option is not set). The <TT
584 > construct means: <SPAN
590 > a double quote"</SPAN
593 > So what does this job do? It replaces assignments of single- or double-quoted
596 >"window.status"</SPAN
597 > object with a dummy assignment
598 (using a variable name that is hopefully odd enough not to conflict with
599 real variables in scripts). Thus, it catches many cases where e.g. pointless
600 descriptions are displayed in the status bar instead of the link target when
601 you move your mouse over links.</P
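><P
>The status bar job can be tried in the same way. This is only a sketch in Python with a made-up link as input. re.IGNORECASE is assumed to correspond to the i option, and re.sub replaces every match like the g option.</P
>

```python
import re

# Same pattern as in the job; the "?" after ".*" keeps the quoted text ungreedy.
STATUS = re.compile(r"""window\.status\s*=\s*['"].*?['"]""", re.IGNORECASE)

# Made-up link that writes a message into the status bar.
link = """<a href="x.html" onmouseover="window.status='Click here!'; return true">"""

cleaned = STATUS.sub("dUmMy=1", link)
print(cleaned)
```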
611 ># Kill OnUnload popups. Yummy. Test: http://www.zdnet.com/zdsubs/yahoo/tree/yfs.html
613 s/(<body .*)onunload(.*>)/$1never$2/iU</PRE
621 HREF="http://www.w3.org/TR/2000/REC-DOM-Level-2-Events-20001113/events.html#Events-eventgroupings-htmlevents"
625 > in the HTML DOM was a <I
629 When I close a browser window, I want it to close and die. Basta.
630 This job replaces the <SPAN
636 >"<body>"</SPAN
637 > tags with the dummy word <TT
644 > option makes the pattern matching
647 > The last example is from the fun department:</P
657 >FILTER: fun Fun text replacements
659 # Spice the daily news:
661 s/microsoft(?!\.com)/MicroSuck/ig</PRE
670 > part (a so-called negative lookahead)
671 in the job's pattern, which means: Don't match, if the string
675 > appears directly following <SPAN
679 in the page. This prevents links to microsoft.com from being messed up, while
680 still replacing the word everywhere else.</P
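><P
>The negative lookahead works the same way in Python's re module, so its effect can be sketched there. The sample sentence is made up, and re.IGNORECASE is assumed to correspond to the i option.</P
>

```python
import re

# (?!\.com) refuses a match when ".com" follows directly.
FUN = re.compile(r"microsoft(?!\.com)", re.IGNORECASE)

text = "Visit www.microsoft.com for Microsoft news"
spiced = FUN.sub("MicroSuck", text)
print(spiced)
```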
690 ># Buzzword Bingo (example for extended regex syntax)
692 s* industry[ -]leading \
694 | award[ -]winning # Comments are OK, too! \
695 | high[ -]performance \
696 | solutions[ -]based \
700 *<font color="red"><b>BINGO!</b></font> \
710 > option in this job turns on extended syntax, and allows for
711 e.g. the liberal use of (non-interpreted!) whitespace for nicer formatting.</P
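><P
>Python offers a comparable extended syntax through re.VERBOSE, which gives a feel for how such a job reads. The sketch below uses only the alternatives visible in the job above. The sample sentence is made up, and the use of re.IGNORECASE is an assumption, not something the job is known to set.</P
>

```python
import re

# In verbose mode, whitespace outside character classes is ignored,
# but the space inside "[ -]" still counts.
BUZZWORDS = re.compile(
    r"""
      industry[ -]leading
    | award[ -]winning      # comments are allowed here, too
    | high[ -]performance
    | solutions[ -]based
    """,
    re.VERBOSE | re.IGNORECASE,
)

BINGO = '<font color="red"><b>BINGO!</b></font>'

text = "Our award-winning, industry leading product"
marked = BUZZWORDS.sub(BINGO, text)
print(marked)
```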
713 > You get the idea?</P
731 HREF="actions-file.html"
747 HREF="templates.html"