9. The Filter File

All text substitutions that can be invoked through the filter action must first be defined in the filter file, which is typically called default.filter and which can be selected through the filterfile config option.

Typical reasons for doing such substitutions are to eliminate common annoyances in HTML and JavaScript (pop-up windows, exit consoles, crippled windows without navigation tools, the infamous <BLINK> tag, etc.), to suppress images with certain width and height attributes (standard banner sizes or web-bugs), or just to have fun. The possibilities are endless.

Filtering works on any text-based document type, including plain text, HTML, JavaScript, CSS etc. (all text/* MIME types). Substitutions are made at the source level, so if you want to "roll your own" filters, you should be familiar with HTML syntax.

Just like the actions files, the filter file is organized in sections, which are called filters here. Each filter consists of a heading line that starts with the keyword FILTER:, followed by the filter's name and a short (one-line) description of what it does. Below that line come the jobs, i.e. lines that define the actual text substitutions. By convention, the name of a filter should describe what the filter eliminates. The description is used in the web-based user interface.

Once a filter called name has been defined in the filter file, it can be invoked by using an action of the form +filter{name} in any actions file.

A filter header line for a filter called "foo" could look like this:

FILTER: foo Replace all "foo" with "bar"

Below that line, and up to the next header line, come the jobs that define what text replacements the filter executes. They are specified in a syntax that imitates Perl's s/// operator. If you are familiar with Perl, you will find this to be quite intuitive, and may want to look at the PCRS man page for the subtle differences to Perl behaviour. Most notably, the non-standard option letter U is supported, which turns the default to ungreedy matching.

If you are new to regular expressions, you might want to take a look at the Appendix on regular expressions, and see the Perl manual for the s/// operator's syntax and Perl-style regular expressions in general. The examples below might also help to get you started.

9.1. Filter File Tutorial

Now, let's complete our "foo" filter. We have already defined the heading, but the jobs are still missing. Since all it does is to replace "foo" with "bar", there is only one (trivial) job needed:

s/foo/bar/

But wait! Didn't the comment say that all occurrences of "foo" should be replaced? Our current job will only take care of the first "foo" on each page. For global substitution, we'll need to add the g option:

s/foo/bar/g

Our complete filter now looks like this:

FILTER: foo Replace all "foo" with "bar"
s/foo/bar/g
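
To see the effect of the g option outside of Privoxy, the two jobs can be approximated with Python's re module. This is only a sketch: Privoxy uses PCRS, not Python, and the sample page text is made up for illustration. In Python, replacing every match is the default, so the job without g corresponds to count=1.

```python
import re

# Hypothetical page content, invented for this illustration.
page = "foo one, foo two, foo three"

# s/foo/bar/   -> only the first match is replaced (count=1 in Python).
first_only = re.sub(r"foo", "bar", page, count=1)

# s/foo/bar/g  -> every match is replaced (Python's default).
all_matches = re.sub(r"foo", "bar", page)

print(first_only)
print(all_matches)
```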

Let's look at some real filters for more interesting examples. Here you see a filter that protects against some common annoyances that arise from JavaScript abuse. Let's look at its jobs one after the other:

FILTER: js-annoyances Get rid of particularly annoying JavaScript abuse

# Get rid of JavaScript referrer tracking. Test page: http://www.randomoddness.com/untitled.htm
#
s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|Usg

Following the header line and a comment, you see the job. Note that it uses | as the delimiter instead of /, because the pattern contains a forward slash, which would otherwise have to be escaped by a backslash (\).

Now, let's examine the pattern: it starts with the text <script.* enclosed in parentheses. Since the dot matches any character, and * means: "Match an arbitrary number of the element left of myself", this matches "<script", followed by any text, i.e. it matches the whole page, from the start of the first <script> tag.

That's more than we want, but the pattern continues: document\.referrer matches only the exact string "document.referrer". The dot needed to be escaped, i.e. preceded by a backslash, to take away its special meaning as a joker, and make it just a regular dot. So far, the meaning is: Match from the start of the first <script> tag in the page, up to, and including, the text "document.referrer", if both are present in the page (and appear in that order).

But there's still more pattern to go. The next element, again enclosed in parentheses, is .*</script>. You already know what .* means, so the whole pattern translates to: Match from the start of the first <script> tag in a page to the end of the last closing </script> tag, provided that the text "document.referrer" appears somewhere in between.

This is still not the whole story, since we have ignored the options and the parentheses: The portions of the page matched by sub-patterns that are enclosed in parentheses will be remembered and be available through the variables $1, $2, ... in the substitute. The U option switches to ungreedy matching, which means that the first .* in the pattern will only "eat up" the text between "<script" and the first occurrence of "document.referrer", and that the second .* will only span the text up to the first "</script>" tag. Furthermore, the s option lets the dot match newlines as well, so that the match may span multiple lines in the page, and the g option again means that the substitution is global.
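
The difference between greedy and ungreedy matching can be tried out with a small Python sketch. Python has no U option, so ungreedy matching is written as .*? here, and re.DOTALL stands in for the s option. The sample page is invented for the demonstration, and Python's re module is only an approximation of the PCRS behaviour.

```python
import re

# Hypothetical page with two scripts; only the first one reads the referrer.
page = "<script>a(document.referrer)</script><p>text</p><script>b()</script>"

# Greedy: .* grabs as much as possible, so the match runs to the last </script>.
greedy = re.search(r"<script.*document\.referrer.*</script>", page, re.DOTALL)

# Ungreedy: .*? grabs as little as possible, so the match stops at the first </script>.
ungreedy = re.search(r"<script.*?document\.referrer.*?</script>", page, re.DOTALL)

print(greedy.group(0))
print(ungreedy.group(0))
```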

So, to summarize, the pattern means: Match all scripts that contain the text "document.referrer". Remember the parts of the script from (and including) the start tag up to (and excluding) the string "document.referrer" as $1, and the part following that string, up to and including the closing tag, as $2.

Now the pattern is deciphered, but wasn't this about substituting things? So let's look at the substitute: $1"Not Your Business!"$2 is easy to read: The text remembered as $1, followed by "Not Your Business!" (including the quotation marks!), followed by the text remembered as $2. This produces an exact copy of the original string, with the middle part (the "document.referrer") replaced by "Not Your Business!".

The whole job now reads: Replace "document.referrer" by "Not Your Business!" wherever it appears inside a <script> element. Note that this job won't break JavaScript syntax, since both the original and the replacement are syntactically valid expressions that evaluate to strings. The script just won't have access to the referrer information anymore.
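
As a rough illustration, the whole job can be approximated in Python. The function name hide_referrer and the sample page are hypothetical; $1 and $2 become \1 and \2, the U option becomes .*?, the s option becomes re.DOTALL, and the g option is Python's default.

```python
import re

# Approximation of: s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|Usg
REFERRER_JOB = re.compile(
    r"(<script.*?)document\.referrer(.*?</script>)", re.DOTALL
)

def hide_referrer(page: str) -> str:
    """Replace document.referrer inside scripts with a harmless string literal."""
    return REFERRER_JOB.sub(r'\1"Not Your Business!"\2', page)

# Hypothetical page: two scripts that both read the referrer.
page = (
    "<script>\nvar r = document.referrer;\n</script>\n"
    "<p>Some text</p>\n"
    "<script>track(document.referrer)</script>"
)
result = hide_referrer(page)
print(result)
```

Since one match consumes a whole script, a script that mentions document.referrer more than once would have only its first occurrence replaced in this sketch.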

We'll show you two other jobs from the JavaScript taming department, but this time only point out the constructs of special interest:

# The status bar is for displaying link targets, not pointless blahblah
#
s/window\.status\s*=\s*['"].*?['"]/dUmMy=1/ig

\s stands for whitespace characters (space, tab, newline, carriage return, form feed), so that \s* means: "zero or more whitespace characters". The ? in .*? makes this matching of arbitrary text ungreedy. (Note that the U option is not set.) The ['"] construct means: "a single or a double quote".

So what does this job do? It replaces assignments of single- or double-quoted strings to the "window.status" property with a dummy assignment (using a variable name that is hopefully odd enough not to conflict with real variables in scripts). Thus, it catches many cases where e.g. pointless descriptions are displayed in the status bar instead of the link target when you move your mouse over links.
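
A quick way to check what this job catches is to run an equivalent pattern in Python. This is a sketch with an invented script snippet; the i option maps to re.IGNORECASE, and g is again Python's default.

```python
import re

# Approximation of: s/window\.status\s*=\s*['"].*?['"]/dUmMy=1/ig
STATUS_JOB = re.compile(r"""window\.status\s*=\s*['"].*?['"]""", re.IGNORECASE)

# Hypothetical script text with one single-quoted and one double-quoted assignment.
script = "window.status = 'Special offer!'; Window.Status=\"Buy now\";"

result = STATUS_JOB.sub("dUmMy=1", script)
print(result)
```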

# Kill OnUnload popups. Yummy. Test: http://www.zdnet.com/zdsubs/yahoo/tree/yfs.html
#
s/(<body .*)onunload(.*>)/$1never$2/iU

Including the OnUnload event binding in the HTML DOM was a CRIME. When I close a browser window, I want it to close and die. Basta. This job replaces the "onunload" attribute in "<body>" tags with the dummy word never. Note that the i option makes the pattern matching case-insensitive.
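
The effect on a <body> tag can be sketched in Python as well. The sample page below is made up; since the job has no g option, the sketch uses count=1, and the U option is again written as .*?.

```python
import re

# Approximation of: s/(<body .*)onunload(.*>)/$1never$2/iU
ONUNLOAD_JOB = re.compile(r"(<body .*?)onunload(.*?>)", re.IGNORECASE)

# Hypothetical page with a mixed-case OnUnload handler.
page = '<html><BODY bgcolor="white" OnUnload="popup()"><p>Hi</p></body></html>'

# No g option in the original job, so only the first match is replaced.
result = ONUNLOAD_JOB.sub(r"\1never\2", page, count=1)
print(result)
```

The browser then sees an unknown attribute called never and ignores it, so the handler is never registered.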

The last example is from the fun department:

FILTER: fun Fun text replacements

# Spice the daily news:
#
s/microsoft(?!\.com)/MicroSuck/ig

Note the (?!\.com) part (a so-called negative lookahead) in the job's pattern, which means: Don't match if the string ".com" appears directly following "microsoft" in the page. This prevents links to microsoft.com from being messed up, while still replacing the word everywhere else.
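
The negative lookahead behaves the same way in Python's re module, which makes it easy to experiment with. The sentence used here is invented for the demonstration.

```python
import re

# Approximation of: s/microsoft(?!\.com)/MicroSuck/ig
FUN_JOB = re.compile(r"microsoft(?!\.com)", re.IGNORECASE)

# Hypothetical text containing both a host name and the plain word.
text = "Visit www.microsoft.com to read what Microsoft says."

result = FUN_JOB.sub("MicroSuck", text)
print(result)
```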

# Buzzword Bingo (example for extended regex syntax)
#
s* industry[ -]leading \
|  cutting[ -]edge \
|  award[ -]winning # Comments are OK, too! \
|  high[ -]performance \
|  solutions[ -]based \
|  unmatched \
|  unparalleled \
|  unrivalled \
*<font color="red"><b>BINGO!</b></font> \
*igx

The x option in this job turns on extended syntax, and allows for e.g. the liberal use of (non-interpreted!) whitespace for nicer formatting.
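
Python offers a comparable extended syntax through re.VERBOSE, which can serve as a rough stand-in for the x option. In this sketch the sample sentence is hypothetical. Whitespace in the pattern is ignored except inside character classes such as [ -], and # starts a comment.

```python
import re

# Approximation of the Buzzword Bingo job; x -> re.VERBOSE, i -> re.IGNORECASE.
BINGO_JOB = re.compile(
    r"""
      industry[ -]leading
    | cutting[ -]edge
    | award[ -]winning      # comments are allowed in verbose mode
    | high[ -]performance
    | solutions[ -]based
    | unmatched
    | unparalleled
    | unrivalled
    """,
    re.IGNORECASE | re.VERBOSE,
)

REPLACEMENT = '<font color="red"><b>BINGO!</b></font>'

# Hypothetical marketing sentence.
text = "Our award-winning, Cutting Edge product is unmatched."
result = BINGO_JOB.sub(REPLACEMENT, text)
print(result)
```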

You get the idea?