4 >The Filter File</TITLE
7 CONTENT="Modular DocBook HTML Stylesheet Version 1.76b+
10 TITLE="Privoxy 3.1.1 User Manual"
11 HREF="index.html"><LINK
14 HREF="actions-file.html"><LINK
17 HREF="templates.html"><LINK
20 HREF="../p_doc.css"></HEAD
76 NAME="FILTER-FILE">9. The Filter File</H1
78 > All text substitutions that can be invoked through the
82 HREF="actions-file.html#FILTER"
86 must first be defined in the filter file, which is typically
91 selected through the <TT
94 HREF="config.html#FILTERFILE"
100 > Typical reasons for doing such substitutions are to eliminate
101 common annoyances in HTML and JavaScript, such as pop-up windows,
102 exit consoles, crippled windows without navigation tools, the
infamous <BLINK> tag, etc., to suppress images with certain
104 width and height attributes (standard banner sizes or web-bugs),
105 or just to have fun. The possibilities are endless.</P
107 > Filtering works on any text-based document type, including plain
108 text, HTML, JavaScript, CSS etc. (all <TT
112 MIME types). Substitutions are made at the source level, so if
115 >"roll your own"</SPAN
116 > filters, you should be
117 familiar with HTML syntax.</P
120 HREF="actions-file.html"
123 filter file is organized in sections, which are called <SPAN
here. Each filter consists of a heading line that starts with the
147 >, and a short (one line)
154 > of what it does. Below that line
161 >, i.e. lines that define the actual
162 text substitutions. By convention, the name of a filter
163 should describe what the filter <SPAN
170 comment is used in the <A
171 HREF="http://config.privoxy.org/"
177 > Once a filter called <TT
183 in the filter file, it can be invoked by using an action of the form
187 HREF="actions-file.html#FILTER"
197 HREF="actions-file.html"
201 > A filter header line for a filter called <SPAN
>FILTER: foo Replace all "foo" with "bar"</PRE
221 > Below that line, and up to the next header line, come the jobs that
222 define what text replacements the filter executes. They are specified
223 in a syntax that imitates <A
224 HREF="http://www.perl.org/"
231 > operator. If you are familiar with Perl, you
232 will find this to be quite intuitive, and may want to look at the
234 HREF="http://www.oesterhelt.org/pcrs/pcrs.3.html"
for the subtle differences from Perl behaviour. Most notably, the non-standard
242 > is supported, which turns the default
243 to ungreedy matching.</P
245 > If you are new to regular expressions, you might want to take a look at
247 HREF="appendix.html#REGEX"
248 >Appendix on regular expressions</A
251 HREF="http://perldoc.com/perl5.6.1/pod/perl.html"
257 HREF="http://perldoc.com/perl5.6.1/pod/perlop.html#s-PATTERN-REPLACEMENT-egimosx"
263 > operator's syntax</A
265 HREF="http://perldoc.com/perl5.6.1/pod/perlre.html"
The examples below might also help you get started.</P
276 NAME="AEN3018">9.1. Filter File Tutorial</H2
278 > Now, let's complete our <SPAN
281 > filter. We have already defined
the heading, but the jobs are still missing. Since all it does is replace
289 >, there is only one (trivial) job
306 > But wait! Didn't the comment say that <SPAN
316 > should be replaced? Our current job will only take
317 care of the first <SPAN
320 > on each page. For global substitution,
321 we'll need to add the <TT
340 > Our complete filter now looks like this:</P
350 >FILTER: foo Replace all "foo" with "bar"
357 > Let's look at some real filters for more interesting examples. Here you see
358 a filter that protects against some common annoyances that arise from JavaScript
359 abuse. Let's look at its jobs one after the other:</P
>FILTER: js-annoyances Get rid of particularly annoying JavaScript abuse
# Get rid of JavaScript referrer tracking. Test page: http://www.randomoddness.com/untitled.htm
s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|Usg</PRE
379 > Following the header line and a comment, you see the job. Note that it uses
383 > as the delimiter instead of <TT
387 the pattern contains a forward slash, which would otherwise have to be escaped
393 > Now, let's examine the pattern: it starts with the text <TT
397 enclosed in parentheses. Since the dot matches any character, and <TT
403 >"Match an arbitrary number of the element left of myself"</SPAN
415 it matches the whole page, from the start of the first <script> tag.</P
417 > That's more than we want, but the pattern continues: <TT
419 >document\.referrer</TT
421 matches only the exact string <SPAN
423 >"document.referrer"</SPAN
431 >, i.e. preceded by a backslash, to take away its
432 special meaning as a joker, and make it just a regular dot. So far, the meaning is:
Match from the start of the first <script> tag in the page, up to, and including,
436 >"document.referrer"</SPAN
444 in the page (and appear in that order).</P
446 > But there's still more pattern to go. The next element, again enclosed in parentheses,
449 >.*</script></TT
450 >. You already know what <TT
454 means, so the whole pattern translates to: Match from the start of the first <script>
455 tag in a page to the end of the last <script> tag, provided that the text
458 >"document.referrer"</SPAN
459 > appears somewhere in between.</P
461 > This is still not the whole story, since we have ignored the options and the parentheses:
The portions of the page matched by sub-patterns that are enclosed in parentheses will be
463 remembered and be available through the variables <TT
467 the substitute. The <TT
470 > option switches to ungreedy matching, which means
474 > in the pattern will only <SPAN
478 text in between <SPAN
490 >"document.referrer"</SPAN
491 >, and that the second <TT
495 only span the text up to the <SPAN
503 >"</script>"</SPAN
505 tag. Furthermore, the <TT
508 > option says that the match may span
509 multiple lines in the page, and the <TT
512 > option again means that the
513 substitution is global.</P
515 > So, to summarize, the pattern means: Match all scripts that contain the text
518 >"document.referrer"</SPAN
519 >. Remember the parts of the script from
520 (and including) the start tag up to (and excluding) the string
523 >"document.referrer"</SPAN
527 >, and the part following
528 that string, up to and including the closing tag, as <TT
533 > Now the pattern is deciphered, but wasn't this about substituting things? So
let's look at the substitute: <TT
536 >$1"Not Your Business!"$2</TT
538 easy to read: The text remembered as <TT
544 >"Not Your Business!"</TT
552 the quotation marks!), followed by the text remembered as <TT
556 This produces an exact copy of the original string, with the middle part
559 >"document.referrer"</SPAN
566 > The whole job now reads: Replace <SPAN
568 >"document.referrer"</SPAN
572 >"Not Your Business!"</TT
573 > wherever it appears inside a
574 <script> tag. Note that this job won't break JavaScript syntax,
575 since both the original and the replacement are syntactically valid
576 string objects. The script just won't have access to the referrer
577 information anymore.</P
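The behaviour of this job can be tried out outside Privoxy with a rough Python sketch. It is only an approximation: Python's re module has no ungreedy option, so each ".*" is written as the lazy ".*?" instead, re.DOTALL is used to let the match span multiple lines, and re.sub replaces every match by default. The function name hide_referrer and the sample page are made up for illustration.

```python
import re

# Approximation of: s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|Usg
# Assumption: ".*?" stands in for ".*" under the ungreedy option.
PATTERN = re.compile(r'(<script.*?)document\.referrer(.*?</script>)', re.DOTALL)


def hide_referrer(page: str) -> str:
    # \g<1> and \g<2> play the role of $1 and $2 in the substitute.
    return PATTERN.sub(r'\g<1>"Not Your Business!"\g<2>', page)


page = (
    '<script>\nvar r = document.referrer;\n</script>\n'
    '<p>document.referrer outside a script</p>\n'
    '<script>log(document.referrer)</script>'
)
print(hide_referrer(page))
```

In this sample, both scripts lose access to the referrer, while the mention of the string in ordinary page text between them is left alone.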
579 > We'll show you two other jobs from the JavaScript taming department, but
580 this time only point out the constructs of special interest:</P
># The status bar is for displaying link targets, not pointless blahblah
s/window\.status\s*=\s*(['"]).*?\1/dUmMy=1/ig</PRE
601 > stands for whitespace characters (space, tab, newline,
602 carriage return, form feed), so that <TT
608 or more whitespace"</SPAN
616 makes this matching of arbitrary text ungreedy. (Note that the <TT
620 option is not set). The <TT
623 > construct means: <SPAN
632 > a double quote"</SPAN
637 a backreference to the first parenthesis just like <TT
641 with the difference that in the <SPAN
647 >, a backslash indicates
648 a backreference, whereas in the <SPAN
654 >, it's the dollar.</P
656 > So what does this job do? It replaces assignments of single- or double-quoted
659 >"window.status"</SPAN
660 > object with a dummy assignment
661 (using a variable name that is hopefully odd enough not to conflict with
662 real variables in scripts). Thus, it catches many cases where e.g. pointless
663 descriptions are displayed in the status bar instead of the link target when
664 you move your mouse over links.</P
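To see the backreference at work, the same pattern can be tried in Python, whose re module accepts this expression unchanged. This is a sketch for experimenting, not part of Privoxy; the function name neutralize_status and the sample strings are invented for illustration.

```python
import re

# Same pattern as the job; re.IGNORECASE corresponds to the "i" option,
# and re.sub replaces all matches, like the "g" option.
STATUS_JOB = re.compile(r"""window\.status\s*=\s*(['"]).*?\1""", re.IGNORECASE)


def neutralize_status(script: str) -> str:
    return STATUS_JOB.sub("dUmMy=1", script)


print(neutralize_status("window.status='Buy now';"))
# The single quote inside the double-quoted string does not end the match,
# because \1 must be the same quote character that opened the string.
print(neutralize_status('Window.Status = "it\'s free";'))
# Assignments of anything other than a quoted string are left untouched.
print(neutralize_status("window.status = message;"))
```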
># Kill OnUnload popups. Yummy. Test: http://www.zdnet.com/zdsubs/yahoo/tree/yfs.html
s/(<body [^>]*)onunload(.*>)/$1never$2/iU</PRE
684 HREF="http://www.w3.org/TR/2000/REC-DOM-Level-2-Events-20001113/events.html#Events-eventgroupings-htmlevents"
688 > in the HTML DOM was a <SPAN
695 When I close a browser window, I want it to close and die. Basta.
696 This job replaces the <SPAN
702 >"<body>"</SPAN
703 > tags with the dummy word <TT
710 > option makes the pattern matching
711 case-insensitive. Also note that ungreedy matching alone doesn't always guarantee
712 a minimal match: In the first parenthesis, we had to use <TT
719 > to prevent the match from exceeding the
720 <body> tag if it doesn't contain <SPAN
726 > The last example is from the fun department:</P
>FILTER: fun Fun text replacements
# Spice the daily news:
s/microsoft(?!\.com)/MicroSuck/ig</PRE
749 > part (a so-called negative lookahead)
in the job's pattern, which means: Don't match if the string
754 > appears directly following <SPAN
758 in the page. This prevents links to microsoft.com from being trashed, while
759 still replacing the word everywhere else.</P
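The effect of the negative lookahead can be checked with a short Python sketch, since Python's re module supports the same construct. The function name spice and the sample text are hypothetical.

```python
import re

# Same pattern as the job; (?!\.com) rejects a match that is
# directly followed by ".com".
FUN_JOB = re.compile(r"microsoft(?!\.com)", re.IGNORECASE)


def spice(text: str) -> str:
    return FUN_JOB.sub("MicroSuck", text)


text = 'Microsoft news at <a href="http://www.microsoft.com/">microsoft.com</a>'
print(spice(text))
```

Only the first word of the sample is replaced; both occurrences that are followed by ".com" survive, so the link keeps working.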
># Buzzword Bingo (example for extended regex syntax)
s* industry[ -]leading \
| customer[ -]focused \
| award[ -]winning # Comments are OK, too! \
| high[ -]performance \
| solutions[ -]based \
*<font color="red"><b>BINGO!</b></font> \
791 > option in this job turns on extended syntax, and allows for
792 e.g. the liberal use of (non-interpreted!) whitespace for nicer formatting. </P
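Python offers a comparable mode through re.VERBOSE, which makes it easy to experiment with this style of pattern. The sketch below uses only the alternatives visible in the example above; case-insensitive matching is an assumption, and the function name bingo is made up for illustration.

```python
import re

# In verbose mode, whitespace in the pattern is ignored (except inside
# character classes such as "[ -]"), and "#" starts a comment.
BUZZWORDS = re.compile(
    r"""
      industry[ -]leading
    | customer[ -]focused
    | award[ -]winning      # comments are allowed here, too
    | high[ -]performance
    | solutions[ -]based
    """,
    re.IGNORECASE | re.VERBOSE,  # IGNORECASE is an assumption
)


def bingo(text: str) -> str:
    return BUZZWORDS.sub('<font color="red"><b>BINGO!</b></font>', text)


print(bingo("Our award-winning, High Performance platform"))
```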
794 > You get the idea?</P