X-Git-Url: http://www.privoxy.org/gitweb/?p=privoxy.git;a=blobdiff_plain;f=doc%2Fwebserver%2Fuser-manual%2Ffilter-file.html;h=6689c04b366e4bb3590e66315aa10428bed03b7d;hp=ad6f8087efe5dc372ce4e1d6805fe242f32dddc4;hb=f8dbc81f51ddf04121644ad5da727f94f3ad11a5;hpb=4fef8970e4e382c21598949e7b035e596a4a8048 diff --git a/doc/webserver/user-manual/filter-file.html b/doc/webserver/user-manual/filter-file.html index ad6f8087..6689c04b 100644 --- a/doc/webserver/user-manual/filter-file.html +++ b/doc/webserver/user-manual/filter-file.html @@ -1,51 +1,43 @@ - Filter Files - + - + - -

9. Filter Files

-

On-the-fly text substitutions need to be defined in a "filter file". Once defined, they can then be invoked as an "action".

-

Privoxy supports three different - filter actions: filter to rewrite the content that is sent to the client, client-header-filter to @@ -53,7 +45,6 @@ "LITERAL">server-header-filter to rewrite headers that are sent by the server.

-

Privoxy also supports two tagger actions: client-header-tagger @@ -64,31 +55,32 @@ use a rewritten version of the filtered text as tag. The tags can then be used to change the applying actions through sections with tag-patterns.

- +

Finally, Privoxy supports the + external-filter action to + enable external filters + written in proper programming languages.

Multiple filter files can be defined through the filterfile config directive. The filters as supplied by the developers are located in default.filter. It is recommended that any locally defined or modified filters go in a separately defined file such as user.filter.

-

Common tasks for content filters are to eliminate common annoyances in HTML and JavaScript, such as pop-up windows, exit consoles, crippled windows without navigation tools, the infamous <BLINK> tag etc, to suppress images with certain width and height attributes (standard banner sizes or web-bugs), or just to have fun.

-

Enabled content filters are applied to any content whose "Content Type" header is recognised as a sign of text-based content, with the exception of text/plain. Use the force-text-mode action to also filter other content.

-

Substitutions are made at the source level, so if you want to "roll your own" filters, you should first be familiar with HTML syntax, and, of course, regular expressions.

-

Just like the actions files, the filter file is organized in sections, which are called filters here. Each filter @@ -105,39 +97,48 @@ "emphasis">eliminates. The comment is used in the web-based user interface.

-

Once a filter called name has been defined in the filter file, it can be invoked by using an action of the form +filter{name} in any actions file.

-

Filter definitions start with a header line that contains the filter type, the filter name and the filter description. A content filter header line for a filter called "foo" could look like this:

-
-
-FILTER: foo Replace all "foo" with "bar"
-
+
FILTER: foo Replace all "foo" with "bar"
-

Below that line, and up to the next header line, come the jobs that define what text replacements the filter executes. They are specified in a syntax that imitates Perl's s/// operator. If you are familiar with Perl, you will find this to be quite intuitive, and may want to look at the PCRS documentation for the subtle differences to Perl - behaviour. Most notably, the non-standard option letter +

Most notably, the non-standard option letter U is supported, which turns the default to ungreedy - matching.

- + matching (add ? to quantifiers to turn them + greedy again).
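To make the effect of ungreedy matching concrete, here is a small sketch in Python rather than in filter-file syntax. Python's re module has no equivalent of the U option, so the sketch assumes that writing .*? by hand is a fair stand-in for what U makes the default; the sample string is made up.

```python
import re

page = "<b>one</b> and <b>two</b>"

# Default (greedy) matching: .* grabs as much text as possible.
greedy = re.search(r"<b>.*</b>", page).group(0)

# Ungreedy matching: the quantifier stops at the first possible end.
# In a filter job the U option makes this the default; in Python the
# quantifier needs an explicit ? instead.
ungreedy = re.search(r"<b>.*?</b>", page).group(0)

print(greedy)    # <b>one</b> and <b>two</b>
print(ungreedy)  # <b>one</b>
```

With U set, the roles are swapped: a plain .* behaves like the second pattern, and .*? behaves like the first.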

+

The non-standard option letter D (dynamic) + allows the use of the variables $host, $origin (the IP address the request + came from), $path, $url and $listen-address (the address on which Privoxy + accepted the client request, for example 127.0.0.1:8118). They are + replaced with the value they refer to before the filter is executed.

+

Note that '$' is a bad choice for a delimiter in a dynamic filter as + you might end up with unintended variables if you use a variable name + directly after the delimiter. Variables will be resolved without escaping + anything, therefore you also have to be careful not to choose delimiters + that appear in the replacement text. For example '<' should be safe, + while '?' will sooner or later cause conflicts with $url.

+

The non-standard option letter T (trivial) + prevents parsing for backreferences in the substitute. Use it if you want + to include text like '$&' in your substitute without quoting.

If you are new to "Regular Expressions", you might @@ -149,82 +150,64 @@ FILTER: foo Replace all "foo" with "bar" "http://perldoc.perl.org/perlre.html" target="_top">Perl-style regular expressions in general. The below examples might also help to get you started.

-
-

9.1. Filter File - Tutorial

- +

9.1. + Filter File Tutorial

Now, let's complete our "foo" content filter. We have already defined the heading, but the jobs are still missing. Since all it does is to replace "foo" with "bar", there is only one (trivial) job needed:

-
-
-s/foo/bar/
-
+
s/foo/bar/
-

But wait! Didn't the comment say that all occurrences of "foo" should be replaced? Our current job will only take care of the first "foo" on each page. For global substitution, we'll need to add the g option:

-
-
-s/foo/bar/g
-
+
s/foo/bar/g
-

Our complete filter now looks like this:

-
-
-FILTER: foo Replace all "foo" with "bar"
-s/foo/bar/g
-
+
FILTER: foo Replace all "foo" with "bar"
+s/foo/bar/g
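The difference between the job with and without the g option can be tried out away from Privoxy. The following is only an illustration in Python, whose re.sub replaces every match unless count is given; the sample text is made up.

```python
import re

text = "foo, foo and more foo"

# Like s/foo/bar/ : only the first occurrence is replaced.
first_only = re.sub(r"foo", "bar", text, count=1)

# Like s/foo/bar/g : every occurrence is replaced.
everywhere = re.sub(r"foo", "bar", text)

print(first_only)  # bar, foo and more foo
print(everywhere)  # bar, bar and more bar
```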
-

Let's look at some real filters for more interesting examples. Here you see a filter that protects against some common annoyances that arise from JavaScript abuse. Let's look at its jobs one after the other:

-
-FILTER: js-annoyances Get rid of particularly annoying JavaScript abuse
+            FILTER: js-annoyances Get rid of particularly annoying JavaScript abuse
 
 # Get rid of JavaScript referrer tracking. Test page: http://www.randomoddness.com/untitled.htm
 #
-s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|Usg
-
+s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|Usg
-

Following the header line and a comment, you see the job. Note that it uses | as the delimiter instead of /, because the pattern contains a forward slash, which would otherwise have to be escaped by a backslash (\).

-

Now, let's examine the pattern: it starts with the text <script.* enclosed in parentheses. Since the dot matches any character, and * means: @@ -233,7 +216,6 @@ s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|U followed by any text, i.e. it matches the whole page, from the start of the first <script> tag.

-

That's more than we want, but the pattern continues: document\.referrer matches only the exact string "document.referrer". The dot needed to be @@ -244,7 +226,6 @@ s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|U including, the text "document.referrer", if both are present in the page (and appear in that order).

-

But there's still more pattern to go. The next element, again enclosed in parentheses, is .*</script>. You already know what .* means, so the whole @@ -252,7 +233,6 @@ s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|U tag in a page to the end of the last <script> tag, provided that the text "document.referrer" appears somewhere in between.

-

This is still not the whole story, since we have ignored the options and the parentheses: The portions of the page matched by sub-patterns that are enclosed in parentheses, will be remembered and be available @@ -269,14 +249,12 @@ s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|U s option says that the match may span multiple lines in the page, and the g option again means that the substitution is global.

-

So, to summarize, the pattern means: Match all scripts that contain the text "document.referrer". Remember the parts of the script from (and including) the start tag up to (and excluding) the string "document.referrer" as $1, and the part following that string, up to and including the closing tag, as $2.

-

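Now the pattern is deciphered, but wasn't this about substituting things? So let's look at the substitute: $1"Not Your Business!"$2 is easy to read: The text remembered as $1, followed by "Not Your Business!" (including the quotation marks!), followed by the text remembered as $2. This produces an exact copy of the original string, with the middle part (the "document.referrer") replaced by "Not Your Business!".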

-

The whole job now reads: Replace "document.referrer" by "Not Your Business!" wherever it appears inside a <script> tag. Note that this job won't break JavaScript syntax, since both the original and the replacement are syntactically valid string objects. The script just won't have access to the referrer information anymore.
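The job discussed above can be approximated in Python to watch it work on a small page. This is a sketch, not Privoxy's own code: the U option is emulated with explicit .*? quantifiers, the s option with re.DOTALL, and the sample page is invented.

```python
import re

# Approximation of:
#   s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|Usg
referrer_job = re.compile(
    r"(<script.*?)document\.referrer(.*?</script>)",
    re.DOTALL,  # s option: the dot also matches newlines
)

page = (
    "<script>\nvar r = document.referrer;\n</script>"
    "<p>The text document.referrer outside a script stays.</p>"
)

# re.sub replaces all matches, which corresponds to the g option.
filtered = referrer_job.sub(r'\1"Not Your Business!"\2', page)
print(filtered)
```

The mention of document.referrer inside the paragraph is left alone because no script start tag precedes it after the first match ends.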

-

We'll show you two other jobs from the JavaScript taming department, but this time only point out the constructs of special interest:

-
-# The status bar is for displaying link targets, not pointless blahblah
+            # The status bar is for displaying link targets, not pointless blahblah
 #
-s/window\.status\s*=\s*(['"]).*?\1/dUmMy=1/ig
-
+s/window\.status\s*=\s*(['"]).*?\1/dUmMy=1/ig
-

\s stands for whitespace characters (space, tab, newline, carriage return, form feed), so that \s* means: "zero or more @@ -325,7 +298,6 @@ s/window\.status\s*=\s*(['"]).*?\1/dUmMy=1/ig indicates a back-reference, whereas in the substitute, it's the dollar.

-

So what does this job do? It replaces assignments of single- or double-quoted strings to the "window.status" object with a dummy assignment (using a variable name that is hopefully @@ -333,19 +305,16 @@ s/window\.status\s*=\s*(['"]).*?\1/dUmMy=1/ig catches many cases where e.g. pointless descriptions are displayed in the status bar instead of the link target when you move your mouse over links.
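As a quick check of the back-reference behaviour, the same pattern can be run in Python, where \1 inside the pattern has the same meaning. The HTML snippet is a made-up example; re.IGNORECASE stands in for the i option.

```python
import re

# Approximation of: s/window\.status\s*=\s*(['"]).*?\1/dUmMy=1/ig
status_job = re.compile(r"""window\.status\s*=\s*(['"]).*?\1""", re.IGNORECASE)

html = """<a href="/x" onmouseover="window.status='Free stuff!'; return true">"""

# The closing quote must be the same kind as the opening one (\1).
filtered = status_job.sub("dUmMy=1", html)
print(filtered)  # <a href="/x" onmouseover="dUmMy=1; return true">
```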

-
-# Kill OnUnload popups. Yummy. Test: http://www.zdnet.com/zdsubs/yahoo/tree/yfs.html
+            # Kill OnUnload popups. Yummy. Test: http://www.zdnet.com/zdsubs/yahoo/tree/yfs.html
 #
-s/(<body [^>]*)onunload(.*>)/$1never$2/iU
-
+s/(<body [^>]*)onunload(.*>)/$1never$2/iU
-

Including the OnUnload event binding in the HTML DOM was a @@ -361,35 +330,29 @@ s/(<body [^>]*)onunload(.*>)/$1never$2/iU prevent the match from exceeding the <body> tag if it doesn't contain "OnUnload", but the page's content does.
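The role of [^>]* can be seen in a short Python approximation of the job. The U option is again emulated with explicit ungreedy quantifiers, count=1 mirrors the missing g option, and both sample pages are invented.

```python
import re

# Approximation of: s/(<body [^>]*)onunload(.*>)/$1never$2/iU
onunload_job = re.compile(r"(<body [^>]*?)onunload(.*?>)", re.IGNORECASE)

with_handler = '<body bgcolor="white" onUnload="openAd()">'
without_handler = '<body class="x"><p>no onunload here</p>'

# [^>]* cannot cross the end of the <body> tag, so "onunload" in the
# page content of the second sample is never reached.
print(onunload_job.sub(r"\1never\2", with_handler, count=1))
print(onunload_job.sub(r"\1never\2", without_handler, count=1))
```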

-

The last example is from the fun department:

-
-
-FILTER: fun Fun text replacements
+            
FILTER: fun Fun text replacements
 
 # Spice the daily news:
 #
-s/microsoft(?!\.com)/MicroSuck/ig
-
+s/microsoft(?!\.com)/MicroSuck/ig
-

Note the (?!\.com) part (a so-called negative lookahead) in the job's pattern, which means: Don't match, if the string ".com" appears directly following "microsoft" in the page. This prevents links to microsoft.com from being trashed, while still replacing the word everywhere else.
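Negative lookahead works the same way in Python's re module, so the job's pattern can be tried there unchanged. The sample sentence is made up.

```python
import re

# Approximation of: s/microsoft(?!\.com)/MicroSuck/ig
fun_job = re.compile(r"microsoft(?!\.com)", re.IGNORECASE)

text = "Microsoft news at www.microsoft.com"

# The occurrence directly followed by ".com" is left untouched.
print(fun_job.sub("MicroSuck", text))  # MicroSuck news at www.microsoft.com
```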

-
-# Buzzword Bingo (example for extended regex syntax)
+            # Buzzword Bingo (example for extended regex syntax)
 #
 s* industry[ -]leading \
 |  cutting[ -]edge \
@@ -402,35 +365,27 @@ s* industry[ -]leading \
 |  unparalleled \
 |  unrivalled \
 *<font color="red"><b>BINGO!</b></font> \
-*igx
-
+*igx
-

The x option in this job turns on extended syntax, and allows for e.g. the liberal use of (non-interpreted!) whitespace for nicer formatting.
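Python offers a comparable extended syntax through re.VERBOSE. The sketch below uses a shortened version of the buzzword list and an invented sentence; as in the job above, whitespace inside a character class such as [ -] still counts.

```python
import re

BINGO = '<font color="red"><b>BINGO!</b></font>'

# Whitespace and line breaks in the pattern are ignored,
# except inside character classes like [ -].
buzzwords = re.compile(
    r"""
       industry[ -]leading
    |  cutting[ -]edge
    |  unparalleled
    |  unrivalled
    """,
    re.IGNORECASE | re.VERBOSE,
)

text = "An industry-leading, cutting edge and unrivalled widget."
result = buzzwords.sub(BINGO, text)
print(result)
```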

-

You get the idea?

-

9.2. The Pre-defined Filters

-

The distribution default.filter file contains a selection of pre-defined filters for your convenience:

-
js-annoyances
-

The purpose of this filter is to get rid of particularly annoying JavaScript abuse. To that end, it

-
-

Use with caution. This is an aggressive filter, and can break sites that rely heavily on JavaScript.

-
js-events
-

This is a very radical measure. It removes virtually all JavaScript event bindings, which means that scripts can not react to user actions such as mouse movements or clicks, window resizing etc, anymore. Use with caution!

-

We strongly discourage using this filter as a default since it breaks many legitimate scripts. It is meant for use only on extra-nasty sites (should you really need to go there).

-
html-annoyances
-

This filter will undo many common instances of HTML based abuse.

-

The BLINK and MARQUEE tags are neutralized (yeah baby!), and browser windows will be created as resizeable (as of course they should be!), and will have location, scroll and menu bars -- even if specified otherwise.

-
content-cookies
-

Most cookies are set in the HTTP dialog, where they can be intercepted by the -

This filter disables most HTML and JavaScript code that reads or sets cookies. It cannot detect all clever uses of these types of code, so it should not be relied on as an absolute fix. Use it wherever you would also use the cookie crunch actions.

-
refresh-tags
-

Disable any refresh tags if the interval is greater than nine seconds (so that redirections done via refresh tags are not destroyed). This is useful for dial-on-demand setups, or for those who find this HTML feature annoying.

-
unsolicited-popups
-

This filter attempts to prevent only "unsolicited" pop-up windows from opening, yet still allow pop-up windows that the user has explicitly chosen to open. It was added in version 3.0.1, as an improvement over earlier such filters.

-

Technical note: The filter works by redefining the window.open JavaScript function to a dummy function, PrivoxyWindowOpen(), during the loading and rendering phase of each HTML page access, and restoring the function afterward.

-

This is recommended only for browsers that cannot perform this function reliably themselves. And be aware that some sites require such windows in order to function normally. Use with caution.

-
all-popups
-

Attempt to prevent all pop-up windows from opening. Note this @@ -550,49 +485,39 @@ s* industry[ -]leading \ is more likely to break some sites that require pop-ups for normal usage. Use with caution.

-
img-reorder
-

This is a helper filter that has no value if used alone. It makes the banners-by-size and banners-by-link (see below) filters more effective and should be enabled together with them.

-
banners-by-size
-

This filter removes image tags purely based on what size they are. Fortunately for us, many ads and banner images tend to conform to certain standardized sizes, which makes this filter quite effective for ad stripping purposes.

-

Occasionally this filter will cause false positives on images that are not ads, but just happen to be of one of the standard banner sizes.

-

Recommended only for those who require extreme ad blocking. The default block rules should catch 95+% of all ads without this filter enabled.

-
banners-by-link
-

This is an experimental filter that attempts to kill any banners if their URLs seem to point to known or suspected click trackers. It is currently not of much value and is not recommended for use by default.

-
webbugs
-

Webbugs are small, invisible images (technically 1X1 GIF images), that are used to track users across websites, and @@ -603,37 +528,29 @@ s* industry[ -]leading \ the user ever becoming aware of the interaction with the third-party site. HTML-ized spam also uses a similar technique to verify email addresses.

-

This filter removes the HTML code that loads such "webbugs".

-
tiny-textforms
-

A rather special-purpose filter that can be used to enlarge textareas (those multi-line text boxes in web forms) and turn off hard word wrap in them. It was written for the sourceforge.net tracker system where such boxes are a nuisance, but it can be handy on other sites, too.

-

It is not recommended to use this filter as a default.

-
jumping-windows
-

Many consider windows that move or resize themselves to be abusive. This filter neutralizes the related JavaScript code. Note that some sites might not display or behave as intended when using this filter. Use with caution.

-
frameset-borders
-

Some web designers seem to assume that everyone in the world will view their web sites using the same browser brand and @@ -641,20 +558,16 @@ s* industry[ -]leading \ could explain why they'd use static frame sizes, yet prevent their frames from being resized by the user, should they be too small to show their whole content.

-

This filter removes the related HTML code. It should only be applied to sites which need it.

-
demoronizer
-

Many Microsoft products that generate HTML use non-standard extensions (read: violations) of the ISO 8859-1 aka Latin-1 character set. This can cause those HTML documents to display with errors on standard-compliant platforms.

-

This filter translates the MS-only characters into Latin-1 equivalents. It is not necessary when using MS products, and will cause corruption of all documents that use 8-bit character sets @@ -663,136 +576,104 @@ s* industry[ -]leading \ some pages, or user agents that don't correct for this on the fly.

-
shockwave-flash
-

A filter for shockwave haters. As the name suggests, this filter strips code out of web pages that is used to embed shockwave flash objects.

-
quicktime-kioskmode
-

Change HTML code that embeds Quicktime objects so that kioskmode, which prevents saving, is disabled.

-
fun
-

Text replacements for subversive browsing fun. Make fun of your favorite Monopolist or play buzzword bingo.

-
crude-parental
-

A demonstration-only filter that shows how Privoxy can be used to delete web content on a keyword basis.

-
ie-exploits
-

An experimental collection of text replacements to disable malicious HTML and JavaScript code that exploits known security holes in Internet Explorer.

-

Presently, it only protects against Nimda and a cross-site scripting bug, and would need active maintenance to provide more substantial protection.

-
site-specifics
-

Some web sites have very specific problems, the cure for which doesn't apply anywhere else, or could even cause damage on other sites.

-

This is a collection of such site-specific cures which should only be applied to the sites they were intended for, which is what the supplied default.action file does. Users shouldn't need to change anything regarding this filter.

-
google
-

A CSS based block for Google text ads. Also removes a width limitation and the toolbar advertisement.

-
yahoo
-

Another CSS based block, this time for Yahoo text ads. And removes a width limitation as well.

-
msn
-

Another CSS based block, this time for MSN text ads. And removes tracking URLs, as well as a width limitation.

-
blogspot
-

Cleans up some Blogspot blogs. Read the fine print before using this one!

-

This filter also intentionally removes some navigation stuff and sets the page width to 100%. As a result, some rounded "corners" would appear too early or not at all and as fixing this would require a browser that understands background-size (CSS3), they are removed instead.

-
xml-to-html
-

Server-header filter to change the Content-Type from xml to html.

-
html-to-xml
-

Server-header filter to change the Content-Type from html to xml.

-
no-ping
-

Removes the non-standard ping attribute from anchor and area HTML tags.

-
hide-tor-exit-notation
-

Client-header filter to remove the Tor exit node notation found in Host and Referer headers.

-

If Privoxy and Tor are chained and Privoxy is configured to use socks4a, one @@ -801,20 +682,17 @@ s* industry[ -]leading \ the host "www.example.org" through the Tor exit node "foobar".

-

As the HTTP client isn't aware of this notation, it treats the whole string "www.example.org.foobar.exit" as host and uses it for the "Host" and "Referer" headers. From the server's point of view the resulting headers are invalid and can cause problems.

-

An invalid "Referer" header can trigger "hot-linking" protections, an invalid "Host" header will make it impossible for the server to find the right vhost (several domains hosted on the same IP address).

-

This client-header filter removes the "foo.exit" part in those headers to prevent the mentioned problems. Note that it only modifies the HTTP headers, @@ -825,29 +703,91 @@ s* industry[ -]leading \

-
+
+

9.3. External filter syntax

+

External filters are scripts or programs that can modify the content + in case common filters aren't powerful enough.

+

External filters can be written in any language the platform + Privoxy runs on supports.

+

They are controlled with the external-filter action and + have to be defined in the filterfile first.

+

The header looks like any other filter, but instead of pcrs jobs, + external filters contain a single job which can be a program or a shell + script (which may call other scripts or programs).

+

External filters read the content from STDIN and write the rewritten + content to STDOUT. The environment variables PRIVOXY_URL, PRIVOXY_PATH, + PRIVOXY_HOST, PRIVOXY_ORIGIN, PRIVOXY_LISTEN_ADDRESS can be used to get + some details about the client request.
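The STDIN/STDOUT contract described above could be met by a script along the following lines. This is a minimal sketch, not a filter shipped with Privoxy: the names rewrite and main, the "foo" to "bar" replacement and the trailing HTML comment are all made up for illustration. Only the PRIVOXY_HOST variable comes from the text above.

```python
import os
import sys


def rewrite(content: bytes, env=os.environ) -> bytes:
    """Return the filtered content (hypothetical example logic)."""
    # PRIVOXY_HOST is one of the environment variables named above.
    host = env.get("PRIVOXY_HOST", "unknown")
    note = f"<!-- filtered for {host} -->\n".encode("ascii", "replace")
    return content.replace(b"foo", b"bar") + note


def main() -> None:
    # Work on raw bytes, as the content is not necessarily text.
    data = sys.stdin.buffer.read()
    sys.stdout.buffer.write(rewrite(data))
    sys.stdout.buffer.flush()

# A real script would end with a call to main().
```

Keeping the rewriting logic in its own function makes it possible to test the filter without going through Privoxy, for example by calling rewrite(b"foo", {"PRIVOXY_HOST": "example.org"}).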

+

Privoxy will temporarily store the + content to filter in the temporary-directory.

+ + + + +
+
+            EXTERNAL-FILTER: cat Pointless example filter that doesn't actually modify the content
+/bin/cat
 
+# Incorrect reimplementation of the filter above in POSIX shell.
+#
+# Note that it's a single job that spans multiple lines, the line
+# breaks are not passed to the shell, thus the semicolons are required.
+#
+# If the script isn't trivial, it is recommended to put it into an external file.
+#
+# In general, writing external filters entirely in POSIX shell is not
+# considered a good idea.
+EXTERNAL-FILTER: cat2 Pointless example filter that despite its name may actually modify the content
+while read line; \
+do \
+  echo "$line"; \
+done
+
+EXTERNAL-FILTER: rotate-image Rotate an image by 180 degree. Test filter with limited value.
+/usr/local/bin/convert - -rotate 180 -
+
+EXTERNAL-FILTER: citation-needed Adds a "[citation needed]" tag to an image. The coordinates may need adjustment.
+/usr/local/bin/convert - -pointsize 16 -fill white  -annotate +17+418 "[citation needed]" -
+
+
+ + + + + + + +
Warning
+

Currently external filters are executed with Privoxy's privileges! Only use external + filters you understand and trust.

+
+
+

External filters are experimental and the syntax may change in the + future.

+
+