4 >The Filter File</TITLE
7 CONTENT="Modular DocBook HTML Stylesheet Version 1.64
10 TITLE="Privoxy User Manual"
11 HREF="index.html"><LINK
14 HREF="actions-file.html"><LINK
17 HREF="templates.html"><LINK
20 HREF="../p_doc.css"></HEAD
39 >Privoxy User Manual</TH
47 HREF="actions-file.html"
74 >9. The Filter File</A
77 > All text substitutions that can be invoked through the
81 HREF="actions-file.html#FILTER"
85 must first be defined in the filter file, which is typically
90 selected through the <TT
93 HREF="config.html#FILTERFILE"
99 > Typical reasons for doing such substitutions are to eliminate
100 common annoyances in HTML and JavaScript, such as pop-up windows,
101 exit consoles, crippled windows without navigation tools, the
102 infamous <BLINK> tag etc.; to suppress images with certain
103 width and height attributes (standard banner sizes or web bugs);
104 or just to have fun. The possibilities are endless.</P
106 > Filtering works on any text-based document type, including plain
107 text, HTML, JavaScript, CSS etc. (all <TT
111 MIME types). Substitutions are made at the source level, so if
114 >"roll your own"</SPAN
115 > filters, you should be
116 familiar with HTML syntax.</P
119 HREF="actions-file.html"
122 filter file is organized in sections, which are called <I
126 here. Each filter consists of a heading line that starts with the
137 >, and a short (one line)
141 > of what it does. Below that line
145 >, i.e. lines that define the actual
146 text substitutions. By convention, the name of a filter
147 should describe what the filter <I
151 comment is used in the <A
152 HREF="http://config.privoxy.org/"
158 > Once a filter called <TT
164 in the filter file, it can be invoked by using an action of the form
168 HREF="actions-file.html#FILTER"
178 HREF="actions-file.html"
182 > A filter header line for a filter called <SPAN
196 >FILTER: foo Replace all "foo" with "bar"</PRE
202 > Below that line, and up to the next header line, come the jobs that
203 define what text replacements the filter executes. They are specified
204 in a syntax that imitates <A
205 HREF="http://www.perl.org/"
212 > operator. If you are familiar with Perl, you
213 will find this to be quite intuitive, and may want to look at the
215 HREF="http://www.oesterhelt.org/pcrs/pcrs.1.html"
219 for the subtle differences from Perl behaviour. Most notably, the non-standard
223 > is supported, which switches the default
224 to ungreedy matching.</P
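To illustrate what ungreedy matching means, here is a small Python sketch. It is only an approximation: Python's re module has no option that inverts the default greediness, so the sketch writes the lazy quantifier .*? explicitly, which corresponds to what a plain .* becomes once the default is switched. The sample text is made up for illustration.

```python
import re

# Made-up sample text with two bold elements.
text = "<b>one</b> and <b>two</b>"

# Greedy (the Perl default): .* grabs as much as possible,
# so a single match spans both elements.
greedy = re.findall(r"<b>.*</b>", text)

# Ungreedy: .*? grabs as little as possible,
# so each element is matched on its own.
ungreedy = re.findall(r"<b>.*?</b>", text)

print(greedy)
print(ungreedy)
```

The greedy pattern yields one long match, while the ungreedy one yields a separate match per element.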
226 > If you are new to regular expressions, you might want to take a look at
228 HREF="appendix.html#REGEX"
229 >Appendix on regular expressions</A
232 HREF="http://perldoc.com/perl5.6.1/pod/perl.html"
238 HREF="http://perldoc.com/perl5.6.1/pod/perlop.html#s-PATTERN-REPLACEMENT-egimosx"
244 > operator's syntax</A
246 HREF="http://perldoc.com/perl5.6.1/pod/perlre.html"
251 The examples below might also help you get started.</P
258 >9.1. Filter File Tutorial</A
261 > Now, let's complete our <SPAN
264 > filter. We have already defined
265 the heading, but the jobs are still missing. Since all it does is replace
272 >, there is only one (trivial) job
289 > But wait! Didn't the comment say that <I
296 > should be replaced? Our current job will only take
297 care of the first <SPAN
300 > on each page. For global substitution,
301 we'll need to add the <TT
320 > Our complete filter now looks like this:</P
330 >FILTER: foo Replace all "foo" with "bar"
337 > Let's look at some real filters for more interesting examples. Here you see
338 a filter that protects against some common annoyances that arise from JavaScript
339 abuse. Let's look at its jobs one after the other:</P
349 >FILTER: js-annoyances Get rid of particularly annoying JavaScript abuse
351 # Get rid of JavaScript referrer tracking. Test page: http://www.randomoddness.com/untitled.htm
353 s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|Usg</PRE
359 > Following the header line and a comment, you see the job. Note that it uses
363 > as the delimiter instead of <TT
367 the pattern contains a forward slash, which would otherwise have to be escaped
373 > Now, let's examine the pattern: it starts with the text <TT
377 enclosed in parentheses. Since the dot matches any character, and <TT
383 >"Match an arbitrary number of the element left of myself"</SPAN
392 it matches the whole page, from the start of the first <script> tag.</P
394 > That's more than we want, but the pattern continues: <TT
396 >document\.referrer</TT
398 matches only the exact string <SPAN
400 >"document.referrer"</SPAN
405 >, i.e. preceded by a backslash, to take away its
406 special meaning as a wildcard, and make it just a regular dot. So far, the meaning is:
407 Match from the start of the first <script> tag in the page, up to, and including,
410 >"document.referrer"</SPAN
415 in the page (and appear in that order).</P
417 > But there's still more pattern to go. The next element, again enclosed in parentheses,
420 >.*</script></TT
421 >. You already know what <TT
425 means, so the whole pattern translates to: Match from the start of the first <script>
426 tag in a page to the end of the last </script> tag, provided that the text
429 >"document.referrer"</SPAN
430 > appears somewhere in between.</P
432 > This is still not the whole story, since we have ignored the options and the parentheses:
433 The portions of the page matched by sub-patterns that are enclosed in parentheses will be
434 remembered and be available through the variables <TT
438 the substitute. The <TT
441 > option switches to ungreedy matching, which means
445 > in the pattern will only <SPAN
449 text in between <SPAN
458 >"document.referrer"</SPAN
459 >, and that the second <TT
463 only span the text up to the <I
468 >"</script>"</SPAN
470 tag. Furthermore, the <TT
473 > option says that the match may span
474 multiple lines in the page, and the <TT
477 > option again means that the
478 substitution is global.</P
480 > So, to summarize, the pattern means: Match all scripts that contain the text
483 >"document.referrer"</SPAN
484 >. Remember the parts of the script from
485 (and including) the start tag up to (and excluding) the string
488 >"document.referrer"</SPAN
492 >, and the part following
493 that string, up to and including the closing tag, as <TT
498 > Now the pattern is deciphered, but wasn't this about substituting things? So
499 let's look at the substitute: <TT
501 >$1"Not Your Business!"$2</TT
503 easy to read: The text remembered as <TT
509 >"Not Your Business!"</TT
514 the quotation marks!), followed by the text remembered as <TT
518 This produces an exact copy of the original string, with the middle part
521 >"document.referrer"</SPAN
528 > The whole job now reads: Replace <SPAN
530 >"document.referrer"</SPAN
534 >"Not Your Business!"</TT
535 > wherever it appears inside a
536 <script> tag. Note that this job won't break JavaScript syntax,
537 since both the original and the replacement are syntactically valid
538 string objects. The script just won't have access to the referrer
539 information anymore.</P
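The behaviour of this job can be tried out with a rough Python approximation. This is a sketch under stated assumptions, not how Privoxy itself applies filters: Python's re module stands in for the filter engine, the ungreedy option is emulated with explicit lazy quantifiers, and the function name hide_referrer and the sample page are hypothetical.

```python
import re

# Approximation of the job
#   s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|Usg
# U -> written out as lazy quantifiers (.*?)
# s -> re.DOTALL, so the dot also matches newlines
# g -> re.sub replaces every match by default
PATTERN = re.compile(r"(<script.*?)document\.referrer(.*?</script>)", re.DOTALL)


def hide_referrer(page: str) -> str:
    """Hypothetical helper: replace document.referrer inside scripts."""
    # \g<1> and \g<2> play the role of $1 and $2 in the filter job.
    return PATTERN.sub(r'\g<1>"Not Your Business!"\g<2>', page)


# Made-up sample page: one multi-line script plus plain text.
page = (
    "<script>\n"
    "var r = document.referrer;\n"
    "</script>\n"
    "<p>document.referrer outside a script is left alone</p>"
)
result = hide_referrer(page)
print(result)
```

Because each match runs through to the closing tag, this approximation rewrites only the first occurrence within any one script.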
541 > We'll show you two other jobs from the JavaScript taming department, but
542 this time only point out the constructs of special interest:</P
552 ># The status bar is for displaying link targets, not pointless blahblah
554 s/window\.status\s*=\s*['"].*?['"]/dUmMy=1/ig</PRE
563 > stands for whitespace characters (space, tab, newline,
564 carriage return, form feed), so that <TT
570 or more whitespace"</SPAN
578 makes this matching of arbitrary text ungreedy. (Note that the <TT
582 option is not set). The <TT
585 > construct means: <SPAN
591 > a double quote"</SPAN
594 > So what does this job do? It replaces assignments of single- or double-quoted
597 >"window.status"</SPAN
598 > object with a dummy assignment
599 (using a variable name that is hopefully odd enough not to conflict with
600 real variables in scripts). Thus, it catches many cases where e.g. pointless
601 descriptions are displayed in the status bar instead of the link target when
602 you move your mouse over links.</P
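As a quick check of what this job matches, the same pattern can be run through Python's re module. This is a sketch only: the helper name tame_status and the two sample snippets are made up, and Python's regex engine is assumed to treat this particular pattern the same way as the filter engine does.

```python
import re

# Approximation of the job
#   s/window\.status\s*=\s*['"].*?['"]/dUmMy=1/ig
# i -> re.IGNORECASE; g -> re.sub replaces every match by default.
STATUS = re.compile(r"""window\.status\s*=\s*['"].*?['"]""", re.IGNORECASE)


def tame_status(script: str) -> str:
    """Hypothetical helper: neutralise quoted status bar assignments."""
    return STATUS.sub("dUmMy=1", script)


# Made-up snippets: spacing, quoting style and letter case all vary.
first = tame_status("window.status = 'Best prices in town!';")
second = tame_status('WINDOW.STATUS="Click me"; return true;')
print(first)
print(second)
```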
612 ># Kill OnUnload popups. Yummy. Test: http://www.zdnet.com/zdsubs/yahoo/tree/yfs.html
614 s/(<body .*)onunload(.*>)/$1never$2/iU</PRE
622 HREF="http://www.w3.org/TR/2000/REC-DOM-Level-2-Events-20001113/events.html#Events-eventgroupings-htmlevents"
626 > in the HTML DOM was a <I
630 When I close a browser window, I want it to close and die. Basta.
631 This job replaces the <SPAN
637 >"<body>"</SPAN
638 > tags with the dummy word <TT
645 > option makes the pattern matching
648 > The last example is from the fun department:</P
658 >FILTER: fun Fun text replacements
660 # Spice the daily news:
662 s/microsoft(?!\.com)/MicroSuck/ig</PRE
671 > part (a so-called negative lookahead)
672 in the job's pattern, which means: Don't match, if the string
676 > appears directly following <SPAN
680 in the page. This prevents links to microsoft.com from being messed up, while
681 still replacing the word everywhere else.</P
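The effect of the negative lookahead is easy to see in a short Python sketch. Python's re module supports the same (?!...) construct. The helper name spice and the sample sentence are invented for illustration.

```python
import re

# Approximation of the job  s/microsoft(?!\.com)/MicroSuck/ig
# (?!\.com) is a negative lookahead: the match is rejected
# when ".com" directly follows the word.
FUN = re.compile(r"microsoft(?!\.com)", re.IGNORECASE)


def spice(text: str) -> str:
    """Hypothetical helper applying the fun replacement."""
    return FUN.sub("MicroSuck", text)


sample = 'Microsoft news at <a href="http://www.microsoft.com/">microsoft.com</a>'
spiced = spice(sample)
print(spiced)
```

Only the standalone word is replaced; both occurrences that are followed by ".com" stay untouched, so the link keeps working.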
691 ># Buzzword Bingo (example for extended regex syntax)
693 s* industry[ -]leading \
695 | award[ -]winning # Comments are OK, too! \
696 | high[ -]performance \
697 | solutions[ -]based \
701 *<font color="red"><b>BINGO!</b></font> \
711 > option in this job turns on extended syntax, and allows for
712 e.g. the liberal use of (non-interpreted!) whitespace for nicer formatting.</P
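Python offers a comparable extended mode through re.VERBOSE, which makes for a convenient way to experiment with this layout style. The sketch below is an approximation under assumptions: it uses only the four alternatives visible in the job above, assumes case-insensitive matching, and the names BUZZWORDS, BINGO and bingo are hypothetical.

```python
import re

# Extended (verbose) syntax: whitespace in the pattern is ignored
# and "#" starts a comment. Whitespace inside a character class,
# as in [ -], is still significant.
BUZZWORDS = re.compile(
    r"""
      industry[ -]leading
    | award[ -]winning      # comments are OK here, too
    | high[ -]performance
    | solutions[ -]based
    """,
    re.IGNORECASE | re.VERBOSE,  # case-insensitivity is an assumption
)

BINGO = '<font color="red"><b>BINGO!</b></font>'


def bingo(text: str) -> str:
    """Hypothetical helper: mark buzzwords in the given text."""
    return BUZZWORDS.sub(BINGO, text)


marked = bingo("Our award-winning, industry leading product")
print(marked)
```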
714 > You get the idea?</P