Privoxy 3.0.8 User Manual

9. Filter Files

On-the-fly text substitutions need to be defined in a "filter file". Once defined, they can then be invoked as an "action".

Privoxy supports three different filter actions: filter to rewrite the content that is sent to the client, client-header-filter to rewrite headers that are sent by the client, and server-header-filter to rewrite headers that are sent by the server.

Privoxy also supports two tagger actions: client-header-tagger and server-header-tagger. Taggers and filters use the same syntax in the filter files; the difference is that taggers don't modify the text they are filtering, but use a rewritten version of the filtered text as a tag. The tags can then be used to change the applying actions through sections with tag-patterns.

Multiple filter files can be defined through the filterfile config directive. The filters as supplied by the developers are located in default.filter. It is recommended that any locally defined or modified filters go in a separately defined file such as user.filter.

Common tasks for content filters are to eliminate common annoyances in HTML and JavaScript, such as pop-up windows, exit consoles, crippled windows without navigation tools, the infamous <BLINK> tag etc., to suppress images with certain width and height attributes (standard banner sizes or web-bugs), or just to have fun.

Enabled content filters are applied to any content whose "Content-Type" header is recognised as a sign of text-based content, with the exception of text/plain. Use the force-text-mode action to also filter other content.

Substitutions are made at the source level, so if you want to "roll your own" filters, you should first be familiar with HTML syntax, and, of course, regular expressions.

Just like the actions files, the filter file is organized in sections, which are called filters here. Each filter consists of a heading line that starts with one of the keywords FILTER:, CLIENT-HEADER-FILTER: or SERVER-HEADER-FILTER:, followed by the filter's name and a short (one line) description of what it does. Below that line come the jobs, i.e. lines that define the actual text substitutions. By convention, the name of a filter should describe what the filter eliminates. The description is used in the web-based user interface.

Once a filter called name has been defined in the filter file, it can be invoked by using an action of the form +filter{name} in any actions file.

Filter definitions start with a header line that contains the filter type, the filter name and the filter description. A content filter header line for a filter called "foo" could look like this:


FILTER: foo Replace all "foo" with "bar"

Below that line, and up to the next header line, come the jobs that define what text replacements the filter executes. They are specified in a syntax that imitates Perl's s/// operator. If you are familiar with Perl, you will find this to be quite intuitive, and may want to look at the PCRS documentation for the subtle differences to Perl behaviour. Most notably, the non-standard option letter U is supported, which turns the default to ungreedy matching.

If you are new to "Regular Expressions", you might want to take a look at the Appendix on regular expressions, and see the Perl manual for the s/// operator's syntax and Perl-style regular expressions in general. The below examples might also help to get you started.

9.1. Filter File Tutorial

Now, let's complete our "foo" content filter. We have already defined the heading, but the jobs are still missing. Since all it does is to replace "foo" with "bar", there is only one (trivial) job needed:

s/foo/bar/

But wait! Didn't the comment say that all occurrences of "foo" should be replaced? Our current job will only take care of the first "foo" on each page. For global substitution, we'll need to add the g option:

s/foo/bar/g

Our complete filter now looks like this:

FILTER: foo Replace all "foo" with "bar"
s/foo/bar/g
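
The effect of the g option can be tried outside Privoxy. The following is a minimal sketch that uses Python's re module as a stand-in for the PCRS engine (an assumption made for illustration; Privoxy itself does not run Python). The sample text in page is made up.

```python
import re

page = "foo one, foo two, foo three"

# Like s/foo/bar/ : without the g option only the first match is replaced.
first_only = re.sub(r"foo", "bar", page, count=1)

# Like s/foo/bar/g : with g, every match is replaced.
global_sub = re.sub(r"foo", "bar", page)

print(first_only)
print(global_sub)
```

The first call leaves the second and third "foo" untouched, which is why the job needs the g option to live up to its description.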

Let's look at some real filters for more interesting examples. Here you see + a filter that protects against some common annoyances that arise from JavaScript + abuse. Let's look at its jobs one after the other:

FILTER: js-annoyances Get rid of particularly annoying JavaScript abuse

# Get rid of JavaScript referrer tracking. Test page: http://www.randomoddness.com/untitled.htm
#
s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|Usg

Following the header line and a comment, you see the job. Note that it uses | as the delimiter instead of /, because the pattern contains a forward slash, which would otherwise have to be escaped by a backslash (\).

Now, let's examine the pattern: it starts with the text <script.* enclosed in parentheses. Since the dot matches any character, and * means: "Match an arbitrary number of the element left of myself", this matches "<script", followed by any text, i.e. it matches the whole page, from the start of the first <script> tag.

That's more than we want, but the pattern continues: document\.referrer matches only the exact string "document.referrer". The dot needed to be escaped, i.e. preceded by a backslash, to take away its special meaning as a wildcard, and make it just a regular dot. So far, the meaning is: Match from the start of the first <script> tag in the page, up to, and including, the text "document.referrer", if both are present in the page (and appear in that order).

But there's still more pattern to go. The next element, again enclosed in parentheses, is .*</script>. You already know what .* means, so the whole pattern translates to: Match from the start of the first <script> tag in a page to the end of the last </script> tag, provided that the text "document.referrer" appears somewhere in between.

This is still not the whole story, since we have ignored the options and the parentheses: The portions of the page matched by sub-patterns that are enclosed in parentheses will be remembered and be available through the variables $1, $2, ... in the substitute. The U option switches to ungreedy matching, which means that the first .* in the pattern will only "eat up" the text in between "<script" and the first occurrence of "document.referrer", and that the second .* will only span the text up to the first "</script>" tag. Furthermore, the s option says that the match may span multiple lines in the page, and the g option again means that the substitution is global.

So, to summarize, the pattern means: Match all scripts that contain the text "document.referrer". Remember the parts of the script from (and including) the start tag up to (and excluding) the string "document.referrer" as $1, and the part following that string, up to and including the closing tag, as $2.

Now the pattern is deciphered, but wasn't this about substituting things? So let's look at the substitute: $1"Not Your Business!"$2 is easy to read: The text remembered as $1, followed by "Not Your Business!" (including the quotation marks!), followed by the text remembered as $2. This produces an exact copy of the original string, with the middle part (the "document.referrer") replaced by "Not Your Business!".

The whole job now reads: Replace "document.referrer" by "Not Your Business!" wherever it appears inside a <script> tag. Note that this job won't break JavaScript syntax, since both the original and the replacement are syntactically valid string objects. The script just won't have access to the referrer information anymore.
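
The difference that ungreedy matching makes can be checked with a small experiment. This is only a sketch, assuming Python's re module as a stand-in for PCRS: Python has no U option, so ungreedy matching is written as .*? instead, re.S takes the place of the s option, and \1 and \2 take the place of $1 and $2. The sample page is invented.

```python
import re

# Two scripts that both read the referrer, with unrelated markup in between.
page = (
    "<script>a = document.referrer;</script>\n"
    "<p>hi</p>\n"
    "<script>b = document.referrer;</script>"
)

# Ungreedy variant (what the U option gives you) and greedy variant.
ungreedy = re.compile(r"(<script.*?)document\.referrer(.*?</script>)", re.S)
greedy = re.compile(r"(<script.*)document\.referrer(.*</script>)", re.S)

substitute = r'\1"Not Your Business!"\2'

# re.sub replaces every match by default, like the g option.
ungreedy_result = ungreedy.sub(substitute, page)
greedy_result = greedy.sub(substitute, page)

print(ungreedy_result)
print(greedy_result)
```

With ungreedy matching both scripts are rewritten. The greedy variant finds a single match that runs from the first "<script" to the last "</script>", so only the last "document.referrer" is replaced.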

We'll show you two other jobs from the JavaScript taming department, but + this time only point out the constructs of special interest:

# The status bar is for displaying link targets, not pointless blahblah
#
s/window\.status\s*=\s*(['"]).*?\1/dUmMy=1/ig

\s stands for whitespace characters (space, tab, newline, carriage return, form feed), so that \s* means: "zero or more whitespace". The ? in .*? makes this matching of arbitrary text ungreedy. (Note that the U option is not set). The ['"] construct means: "a single or a double quote". Finally, \1 is a back-reference to the first parenthesis just like $1 above, with the difference that in the pattern, a backslash indicates a back-reference, whereas in the substitute, it's the dollar.

So what does this job do? It replaces assignments of single- or double-quoted strings to the "window.status" object with a dummy assignment (using a variable name that is hopefully odd enough not to conflict with real variables in scripts). Thus, it catches many cases where e.g. pointless descriptions are displayed in the status bar instead of the link target when you move your mouse over links.

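The role of the back-reference in the window.status job is easiest to see on a string that contains both kinds of quotes. The sketch below again assumes Python's re module as a stand-in for PCRS; the two sample snippets are made up.

```python
import re

# Same pattern as the job above; re.I corresponds to the i option.
status_job = re.compile(r"""window\.status\s*=\s*(['"]).*?\1""", re.I)

single_quoted = "window.status = 'Free money'; return true"
# The apostrophe inside the double-quoted string must not end the match.
mixed_quotes = 'WINDOW.STATUS="It\'s free";'

result_single = status_job.sub("dUmMy=1", single_quoted)
result_mixed = status_job.sub("dUmMy=1", mixed_quotes)

print(result_single)
print(result_mixed)
```

Because \1 must repeat whichever quote opened the string, the match in the second snippet runs past the apostrophe and ends at the closing double quote.
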
# Kill OnUnload popups. Yummy. Test: http://www.zdnet.com/zdsubs/yahoo/tree/yfs.html
#
s/(<body [^>]*)onunload(.*>)/$1never$2/iU

Including the OnUnload event binding in the HTML DOM was a CRIME. When I close a browser window, I want it to close and die. Basta. This job replaces the "onunload" attribute in "<body>" tags with the dummy word never. Note that the i option makes the pattern matching case-insensitive. Also note that ungreedy matching alone doesn't always guarantee a minimal match: In the first parenthesis, we had to use [^>]* instead of .* to prevent the match from exceeding the <body> tag if it doesn't contain "OnUnload", but the page's content does.
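
The point about [^>]* can be reproduced with two variants of the pattern. This is a sketch only, assuming Python's re module in place of PCRS, with the U option emulated by ungreedy quantifiers (*?) and the missing g option by count=1. The names safe and leaky and both sample pages are invented for the illustration.

```python
import re

# The job as written: the first group cannot leave the <body> tag.
safe = re.compile(r"(<body [^>]*?)onunload(.*?>)", re.I)
# Variant with an ungreedy .* in the first group instead.
leaky = re.compile(r"(<body .*?)onunload(.*?>)", re.I)

with_handler = '<body onUnload="popup()" bgcolor="white">'
without_handler = '<body bgcolor="white"><p>The OnUnload event fires on exit.</p>'

handler_removed = safe.sub(r"\1never\2", with_handler, count=1)
safe_result = safe.sub(r"\1never\2", without_handler, count=1)
leaky_result = leaky.sub(r"\1never\2", without_handler, count=1)

print(handler_removed)
print(safe_result)
print(leaky_result)
```

On the second page the safe pattern finds nothing to change, while the leaky variant reaches out of the <body> tag and rewrites the word in the page text, even though its .* is ungreedy.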

The last example is from the fun department:

FILTER: fun Fun text replacements

# Spice the daily news:
#
s/microsoft(?!\.com)/MicroSuck/ig

Note the (?!\.com) part (a so-called negative lookahead) in the job's pattern, which means: Don't match, if the string ".com" appears directly following "microsoft" in the page. This prevents links to microsoft.com from being trashed, while still replacing the word everywhere else.
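
The negative lookahead can be verified on a link whose URL and link text both contain the word. A minimal sketch, assuming Python's re module as a stand-in for PCRS and an invented sample line:

```python
import re

# Same pattern as the job above; re.I corresponds to the i option,
# and re.sub replaces all matches, like the g option.
fun_job = re.compile(r"microsoft(?!\.com)", re.I)

text = 'Visit <a href="http://www.microsoft.com/">Microsoft</a> today.'
result = fun_job.sub("MicroSuck", text)

print(result)
```

The URL survives because "microsoft" is directly followed by ".com" there, while the link text is replaced.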

The x option, where a job uses it, turns on extended syntax, and allows for e.g. the liberal use of (non-interpreted!) whitespace for nicer formatting.

You get the idea?

9.2. The Pre-defined Filters

The distribution default.filter file contains a selection of pre-defined filters for your convenience:

js-annoyances

The purpose of this filter is to get rid of particularly annoying JavaScript abuse. To that end, it

replaces JavaScript references to the browser's referrer information with the string "Not Your Business!". This complements the hide-referrer action on the content level.
removes the bindings to the DOM's unload event, which we feel has no right to exist and is responsible for most "exit consoles", i.e. nasty windows that pop up when you close another one.
removes code that causes new windows to be opened with undesired properties, such as being full-screen, non-resizable, without location, status or menu bar etc.

Use with caution. This is an aggressive filter, and can break sites that rely heavily on JavaScript.

js-events

This is a very radical measure. It removes virtually all JavaScript event bindings, which means that scripts cannot react to user actions such as mouse movements or clicks, window resizing etc. anymore. Use with caution!

We strongly discourage using this filter as a default since it breaks many legitimate scripts. It is meant for use only on extra-nasty sites (should you really need to go there).

html-annoyances

This filter will undo many common instances of HTML-based abuse.

The BLINK and MARQUEE tags are neutralized (yeah baby!), and browser windows will be created as resizable (as of course they should be!), and will have location, scroll and menu bars, even if specified otherwise.

content-cookies

Most cookies are set in the HTTP dialog, where they can be intercepted by the crunch-incoming-cookies and crunch-outgoing-cookies actions. But web sites increasingly make use of HTML meta tags and JavaScript to sneak cookies to the browser on the content level.

This filter disables most HTML and JavaScript code that reads or sets cookies. It cannot detect all clever uses of these types of code, so it should not be relied on as an absolute fix. Use it wherever you would also use the cookie crunch actions.

refresh-tags

Disable any refresh tags if the interval is greater than nine seconds (so that redirections done via refresh tags are not destroyed). This is useful for dial-on-demand setups, or for those who find this HTML feature annoying.

unsolicited-popups

This filter attempts to prevent only "unsolicited" pop-up windows from opening, yet still allow pop-up windows that the user has explicitly chosen to open. It was added in version 3.0.1, as an improvement over earlier such filters.

Technical note: The filter works by redefining the window.open JavaScript function to a dummy function, PrivoxyWindowOpen(), during the loading and rendering phase of each HTML page access, and restoring the function afterward.

This is recommended only for browsers that cannot perform this function reliably themselves. And be aware that some sites require such windows in order to function normally. Use with caution.

all-popups

Attempt to prevent all pop-up windows from opening. Note this should be used with even more discretion than the above, since it is more likely to break some sites that require pop-ups for normal usage. Use with caution.

img-reorder

This is a helper filter that has no value if used alone. It makes the banners-by-size and banners-by-link (see below) filters more effective and should be enabled together with them.

banners-by-size

This filter removes image tags purely based on what size they are. Fortunately for us, many ads and banner images tend to conform to certain standardized sizes, which makes this filter quite effective for ad stripping purposes.

Occasionally this filter will cause false positives on images that are not ads, but just happen to be of one of the standard banner sizes.

Recommended only for those who require extreme ad blocking. The default block rules should catch 95+% of all ads without this filter enabled.

banners-by-link

This is an experimental filter that attempts to kill any banners if their URLs seem to point to known or suspected click trackers. It is currently not of much value and is not recommended for use by default.

webbugs

Webbugs are small, invisible images (technically 1x1 GIF images) that are used to track users across websites, and collect information on them. As an HTML page is loaded by the browser, an embedded image tag causes the browser to contact a third-party site, disclosing the tracking information through the requested URL and/or cookies for that third-party domain, without the user ever becoming aware of the interaction with the third-party site. HTML-ized spam also uses a similar technique to verify email addresses.

This filter removes the HTML code that loads such "webbugs".

tiny-textforms

A rather special-purpose filter that can be used to enlarge textareas (those multi-line text boxes in web forms) and turn off hard word wrap in them. It was written for the sourceforge.net tracker system where such boxes are a nuisance, but it can be handy on other sites, too.

It is not recommended to use this filter as a default.

jumping-windows

Many consider windows that move or resize themselves to be abusive. This filter neutralizes the related JavaScript code. Note that some sites might not display or behave as intended when using this filter. Use with caution.

frameset-borders

Some web designers seem to assume that everyone in the world will view their web sites using the same browser brand and version, screen resolution etc., because only that assumption could explain why they'd use static frame sizes, yet prevent their frames from being resized by the user, should they be too small to show their whole content.

This filter removes the related HTML code. It should only be applied to sites which need it.

demoronizer

Many Microsoft products that generate HTML use non-standard extensions (read: violations) of the ISO 8859-1 aka Latin-1 character set. This can cause those HTML documents to display with errors on standards-compliant platforms.

This filter translates the MS-only characters into Latin-1 equivalents. It is not necessary when using MS products, and will cause corruption of all documents that use 8-bit character sets other than Latin-1. It's mostly worthwhile for Europeans on non-MS platforms, if weird garbage characters sometimes appear on some pages, or for user agents that don't correct for this on the fly.

shockwave-flash

A filter for Shockwave haters. As the name suggests, this filter strips code out of web pages that is used to embed Shockwave Flash objects.

quicktime-kioskmode

Change HTML code that embeds QuickTime objects so that kioskmode, which prevents saving, is disabled.

fun

Text replacements for subversive browsing fun. Make fun of your favorite Monopolist or play buzzword bingo.

crude-parental

A demonstration-only filter that shows how Privoxy can be used to delete web content on a keyword basis.

ie-exploits

An experimental collection of text replacements to disable malicious HTML and JavaScript code that exploits known security holes in Internet Explorer.

Presently, it only protects against Nimda and a cross-site scripting bug, and would need active maintenance to provide more substantial protection.

site-specifics

Some web sites have very specific problems, the cure for which doesn't apply anywhere else, or could even cause damage on other sites.

This is a collection of such site-specific cures which should only be applied to the sites they were intended for, which is what the supplied default.action file does. Users shouldn't need to change anything regarding this filter.

google

A CSS-based block for Google text ads. Also removes a width limitation and the toolbar advertisement.

yahoo

Another CSS-based block, this time for Yahoo text ads. It removes a width limitation as well.

msn

Another CSS-based block, this time for MSN text ads. It also removes tracking URLs, as well as a width limitation.

blogspot

Cleans up some Blogspot blogs. Read the fine print before using this one!

This filter also intentionally removes some navigation stuff and sets the page width to 100%. As a result, some rounded "corners" would appear too early or not at all, and as fixing this would require a browser that understands background-size (CSS3), they are removed instead.

xml-to-html

Server-header filter to change the Content-Type from xml to html.

html-to-xml

Server-header filter to change the Content-Type from html to xml.

no-ping

Removes the non-standard ping attribute from anchor and area HTML tags.

hide-tor-exit-notation

Client-header filter to remove the Tor exit node notation found in Host and Referer headers.

If Privoxy and Tor are chained and Privoxy is configured to use socks4a, one can use "http://www.example.org.foobar.exit/" to access the host "www.example.org" through the Tor exit node "foobar".

As the HTTP client isn't aware of this notation, it treats the whole string "www.example.org.foobar.exit" as host and uses it for the "Host" and "Referer" headers. From the server's point of view the resulting headers are invalid and can cause problems.

An invalid "Referer" header can trigger "hot-linking" protections; an invalid "Host" header will make it impossible for the server to find the right vhost (several domains hosted on the same IP address).

This client-header filter removes the ".foobar.exit" part in those headers to prevent the mentioned problems. Note that it only modifies the HTTP headers; it doesn't make it impossible for the server to detect your Tor exit node based on the IP address the request is coming from.
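
To make the header rewriting concrete, here is a rough sketch of the idea in Python. The pattern and the helper name strip_exit_notation are hypothetical and written for this illustration; they are not the job shipped in default.filter, and Python's re module only stands in for PCRS.

```python
import re

# Hypothetical pattern: keep everything up to the real host name and drop
# the trailing ".<exit node>.exit" in Host and Referer headers.
EXIT_NOTATION = re.compile(
    r"^((?:Host|Referer):\s*(?:https?://)?[^/\s]*?)\.[^./\s]+\.exit",
    re.I,
)

def strip_exit_notation(header: str) -> str:
    """Return the header line without the Tor exit node notation."""
    return EXIT_NOTATION.sub(r"\1", header, count=1)

host = strip_exit_notation("Host: www.example.org.foobar.exit")
referer = strip_exit_notation(
    "Referer: http://www.example.org.foobar.exit/page.html"
)
untouched = strip_exit_notation("Host: www.example.org")

print(host)
print(referer)
print(untouched)
```

Headers without the notation pass through unchanged, and only the header text is rewritten; nothing here affects which exit node the request actually leaves through.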