X-Git-Url: http://www.privoxy.org/gitweb/?p=privoxy.git;a=blobdiff_plain;f=doc%2Fwebserver%2Fuser-manual%2Ffilter-file.html;h=e0de605f812edb6c1748feec413339289d56e5e3;hp=2ca2bb5b84677580d9a25f161ab560384ab9142e;hb=473cfd051580edfa1e2a3f6beeb9a0d09a8253fd;hpb=a5b1999794b4b0faa68812c0b8b2861316ae8341
diff --git a/doc/webserver/user-manual/filter-file.html b/doc/webserver/user-manual/filter-file.html
index 2ca2bb5b..e0de605f 100644
--- a/doc/webserver/user-manual/filter-file.html
+++ b/doc/webserver/user-manual/filter-file.html
@@ -1,23 +1,28 @@
-The Filter File
+Filter Files
-Privoxy 3.1.1 User Manual
+Privoxy 3.0.9 User Manual

9. Filter Files

On-the-fly text substitutions need to be defined in a "filter file". Once defined, they can then be invoked as an "action".

Privoxy supports three different filter actions: filter to rewrite the content that is sent to the client, client-header-filter to rewrite headers that are sent by the client, and server-header-filter to rewrite headers that are sent by the server.

Privoxy also supports two tagger actions: client-header-tagger and server-header-tagger. Taggers and filters use the same syntax in the filter files; the difference is that taggers don't modify the text they are filtering, but use a rewritten version of the filtered text as a tag. The tags can then be used to change the applying actions through sections with tag-patterns.
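The difference between a filter and a tagger can be sketched in a few lines of Python. This is only an illustration, not Privoxy's implementation: the function names, the example header and the pattern are hypothetical, and the sketch assumes that a tagger produces no tag when its pattern does not match.

```python
import re

def apply_header_filter(header, pattern, replacement):
    """A header filter returns the rewritten header in place of the original."""
    return re.sub(pattern, replacement, header)

def apply_header_tagger(header, pattern, replacement):
    """A tagger returns the header unchanged plus the rewritten text as a tag.

    Assumption for this sketch: no tag is produced when the pattern
    does not match.
    """
    rewritten, hits = re.subn(pattern, replacement, header)
    return header, (rewritten if hits else None)

# Hypothetical header and pattern, used for both calls.
header = "User-Agent: ExampleBrowser/1.0"
pattern, replacement = r"^User-Agent:\s*(\w+).*$", r"browser-\1"

filtered = apply_header_filter(header, pattern, replacement)
untouched, tag = apply_header_tagger(header, pattern, replacement)
```

With the same pattern, the filter changes what is sent on, while the tagger leaves the header alone and only yields a tag that later sections could match against.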

Multiple filter files can be defined through the filterfile config directive. The filters as supplied by the developers are located in default.filter. It is recommended that any locally defined or modified filters go in a separately defined file such as user.filter.

Common tasks for content filters are to eliminate common annoyances in HTML and JavaScript, such as pop-up windows, exit consoles, crippled windows without navigation tools, the infamous <BLINK> tag etc, to suppress images with certain width and height attributes (standard banner sizes or web-bugs), or just to have fun.

Enabled content filters are applied to any content whose "Content Type" header is recognised as a sign of text-based content, with the exception of text/plain. Use the force-text-mode action to also filter other content.

Substitutions are made at the source level, so if you want to "roll your own" filters, you should first be familiar with HTML syntax, and, of course, regular expressions.

Just like the [...] filters here. Each filter consists of a heading line that starts with one of the keywords FILTER:, CLIENT-HEADER-FILTER: or SERVER-HEADER-FILTER:, followed by the filter's [...] actions file.

Filter definitions start with a header line that contains the filter type, the filter name and the filter description. A content filter header line for a filter called "foo" could look [...]

@@ -230,31 +319,35 @@

[...] s/// operator. If you are familiar with Perl, you will find this to be quite intuitive, and may want to look at the PCRS man page for the subtle differences to Perl behaviour. Most notably, the non-standard option letter U is supported, which turns the default to ungreedy matching.
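To make "ungreedy" concrete, here is a small Python sketch. Python's re module has no U option, so the ungreedy case is written with the explicit ".*?" quantifier; the sample text and variable names are made up for illustration.

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy (the default in Perl and Python): ".*" grabs as much as possible,
# so a single match runs from the first <b> to the last </b>.
greedy = re.sub(r"<b>.*</b>", "[bold]", html)

# Ungreedy: ".*?" stops at the first possible </b>, giving two matches.
# With the U option described above, a plain ".*" would default to this.
ungreedy = re.sub(r"<b>.*?</b>", "[bold]", html)

print(greedy)    # [bold]
print(ungreedy)  # [bold] and [bold]
```

Greedy matching swallows everything between the first opening and the last closing tag, which is rarely what a content filter wants.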

If you are new to "Regular Expressions", you might want to take a look at the Appendix on regular expressions, and see the Perl manual for the s/// operator's syntax and Perl-style regular expressions.

9.1. Filter File Tutorial

Now, let's complete our "foo" content filter. We have already defined the heading, but the jobs are still missing. Since all it does is to replace [...]

[...] \1 is a backreference to the first parenthesis just like $1 above, [...]

@@ -645,7 +740,7 @@

[...] pattern, a backslash indicates a backreference, whereas in the [...]

You get the idea?

9.2. The Pre-defined Filters

The distribution default.filter file contains a selection of pre-defined filters for your convenience:

js-annoyances

The purpose of this filter is to get rid of particularly annoying JavaScript abuse. To that end, it

  • replaces JavaScript references to the browser's referrer information with the string "Not Your Business!". This complements the hide-referrer action on the content level.

  • removes the bindings to the DOM's unload event which we feel has no right to exist and is responsible for most "exit consoles", i.e. nasty windows that pop up when you close another one.

  • removes code that causes new windows to be opened with undesired properties, such as being full-screen, non-resizeable, without location, status or menu bar etc.

Use with caution. This is an aggressive filter, and can break sites that rely heavily on JavaScript.

js-events

This is a very radical measure. It removes virtually all JavaScript event bindings, which means that scripts can no longer react to user actions such as mouse movements or clicks, window resizing etc. Use with caution!

We strongly discourage using this filter as a default since it breaks many legitimate scripts. It is meant for use only on extra-nasty sites (should you really need to go there).

html-annoyances

This filter will undo many common instances of HTML based abuse.

The BLINK and MARQUEE tags are neutralized (yeah baby!), and browser windows will be created as resizeable (as of course they should be!), and will have location, scroll and menu bars -- even if specified otherwise.

content-cookies

Most cookies are set in the HTTP dialog, where they can be intercepted by the crunch-incoming-cookies and crunch-outgoing-cookies actions. But web sites increasingly make use of HTML meta tags and JavaScript to sneak cookies to the browser on the content level.

This filter disables most HTML and JavaScript code that reads or sets cookies. It cannot detect all clever uses of these types of code, so it should not be relied on as an absolute fix. Use it wherever you would also use the cookie crunch actions.

refresh-tags

Disable any refresh tags if the interval is greater than nine seconds (so that redirections done via refresh tags are not destroyed). This is useful for dial-on-demand setups, or for those who find this HTML feature annoying.
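The nine-second rule can be illustrated with a short Python sketch. The pattern and the function name are hypothetical and deliberately simple (they are not the job shipped in default.filter); the sketch only shows the idea of keeping quick redirects while dropping slow reloads.

```python
import re

# Hypothetical pattern: match <meta http-equiv="refresh" content="N...">
# and capture the interval N in seconds.
REFRESH = re.compile(
    r"""<meta\s+http-equiv\s*=\s*["']?refresh["']?\s+content\s*=\s*["']?(\d+)[^>]*>""",
    re.IGNORECASE,
)

def strip_slow_refresh(html: str) -> str:
    """Remove refresh tags whose interval is greater than nine seconds."""
    def decide(match: re.Match) -> str:
        interval = int(match.group(1))
        # Short intervals are usually redirects, so those tags are kept.
        return "" if interval > 9 else match.group(0)
    return REFRESH.sub(decide, html)

redirect = '<meta http-equiv="refresh" content="0; url=/new">'
reload_tag = '<meta http-equiv="refresh" content="300">'
```

Here strip_slow_refresh(redirect) returns the tag unchanged, while strip_slow_refresh(reload_tag) returns an empty string.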

unsolicited-popups

This filter attempts to prevent only "unsolicited" pop-up windows from opening, yet still allow pop-up windows that the user has explicitly chosen to open. It was added in version 3.0.1, as an improvement over earlier such filters.

Technical note: The filter works by redefining the window.open JavaScript function to a dummy function, PrivoxyWindowOpen(), during the loading and rendering phase of each HTML page access, and restoring the function afterward.

This is recommended only for browsers that cannot perform this function reliably themselves. And be aware that some sites require such windows in order to function normally. Use with caution.

all-popups

Attempt to prevent all pop-up windows from opening. Note this should be used with even more discretion than the above, since it is more likely to break some sites that require pop-ups for normal usage. Use with caution.

img-reorder

This is a helper filter that has no value if used alone. It makes the banners-by-size and banners-by-link (see below) filters more effective and should be enabled together with them.

banners-by-size

This filter removes image tags purely based on what size they are. Fortunately for us, many ads and banner images tend to conform to certain standardized sizes, which makes this filter quite effective for ad stripping purposes.

Occasionally this filter will cause false positives on images that are not ads, but just happen to be of one of the standard banner sizes.

Recommended only for those who require extreme ad blocking. The default block rules should catch 95+% of all ads without this filter enabled.
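Why size-based removal is both effective and prone to false positives can be seen in a small Python sketch. The size list, the helper names and the sample tags are illustrative assumptions; the sizes actually matched by default.filter may differ.

```python
import re

# Illustrative sizes only; the list in default.filter may differ.
BANNER_SIZES = {(468, 60), (728, 90), (120, 600)}

IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)

def _attr(tag: str, name: str):
    """Return the integer value of a width/height attribute, or None."""
    m = re.search(rf"""\b{name}\s*=\s*["']?(\d+)""", tag, re.IGNORECASE)
    return int(m.group(1)) if m else None

def strip_banners_by_size(html: str) -> str:
    """Drop every <img> tag whose width/height pair is a known banner size."""
    def decide(match):
        tag = match.group(0)
        size = (_attr(tag, "width"), _attr(tag, "height"))
        return "" if size in BANNER_SIZES else tag
    return IMG_TAG.sub(decide, html)

ad = '<img src="ad.gif" width="468" height="60">'
photo = '<img src="photo.jpg" width="640" height="480">'
chart = '<img src="chart.png" height=60 width=468>'  # not an ad, same size
```

The function only looks at dimensions, so the harmless chart image is removed along with the ad. That is the false-positive case mentioned above.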

banners-by-link

This is an experimental filter that attempts to kill any banners if their URLs seem to point to known or suspected click trackers. It is currently not of much value and is not recommended for use by default.

webbugs

Webbugs are small, invisible images (technically 1x1 GIF images) that are used to track users across websites and collect information on them. As an HTML page is loaded by the browser, an embedded image tag causes the browser to contact a third-party site, disclosing the tracking information through the requested URL and/or cookies for that third-party domain, without the user ever becoming aware of the interaction with the third-party site. HTML-ized spam also uses a similar technique to verify email addresses.

This filter removes the HTML code that loads such "webbugs".

tiny-textforms

A rather special-purpose filter that can be used to enlarge textareas (those multi-line text boxes in web forms) and turn off hard word wrap in them. It was written for the sourceforge.net tracker system where such boxes are a nuisance, but it can be handy on other sites, too.

It is not recommended to use this filter as a default.

jumping-windows

Many consider windows that move or resize themselves to be abusive. This filter neutralizes the related JavaScript code. Note that some sites might not display or behave as intended when using this filter. Use with caution.

frameset-borders

Some web designers seem to assume that everyone in the world will view their web sites using the same browser brand and version, screen resolution etc, because only that assumption could explain why they'd use static frame sizes, yet prevent their frames from being resized by the user, should they be too small to show their whole content.

This filter removes the related HTML code. It should only be applied to sites which need it.

demoronizer

Many Microsoft products that generate HTML use non-standard extensions (read: violations) of the ISO 8859-1 aka Latin-1 character set. This can cause those HTML documents to display with errors on standard-compliant platforms.

This filter translates the MS-only characters into Latin-1 equivalents. It is not necessary when using MS products, and will cause corruption of all documents that use 8-bit character sets other than Latin-1. It's mostly worthwhile for Europeans on non-MS platforms who sometimes see weird garbage characters on some pages, or whose user agents don't correct for this on the fly.
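The translation can be sketched in Python as a byte-level lookup. The table below is an illustrative subset of the Windows-1252 bytes in the 0x80-0x9F range (control codes in ISO 8859-1), and the plain ASCII stand-ins are assumptions made for this sketch rather than the replacements used by the shipped filter.

```python
# Illustrative subset of Windows-1252 bytes in the 0x80-0x9F range, mapped
# to plain ASCII stand-ins. The replacements chosen here are assumptions.
CP1252_TO_ASCII = {
    0x85: b"...",  # horizontal ellipsis
    0x91: b"'",    # left single quotation mark
    0x92: b"'",    # right single quotation mark
    0x93: b'"',    # left double quotation mark
    0x94: b'"',    # right double quotation mark
    0x96: b"-",    # en dash
    0x97: b"-",    # em dash
}

def demoronize(data: bytes) -> bytes:
    """Replace the mapped bytes and pass everything else through unchanged."""
    out = bytearray()
    for byte in data:
        out += CP1252_TO_ASCII.get(byte, bytes([byte]))
    return bytes(out)

sample = b"\x93Hi\x94 \x96 it\x92s"
cleaned = demoronize(sample)
```

Because the lookup works on raw bytes without knowing the document's character set, the same byte values in any other encoding would be rewritten too. That is the corruption risk described above.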

shockwave-flash

A filter for Shockwave haters. As the name suggests, this filter strips code out of web pages that is used to embed Shockwave Flash objects.

quicktime-kioskmode

Change HTML code that embeds Quicktime objects so that kioskmode, which prevents saving, is disabled.

fun

Text replacements for subversive browsing fun. Make fun of your favorite Monopolist or play buzzword bingo.

crude-parental

A demonstration-only filter that shows how Privoxy can be used to delete web content on a keyword basis.

ie-exploits

An experimental collection of text replacements to disable malicious HTML and JavaScript code that exploits known security holes in Internet Explorer.

Presently, it only protects against Nimda and a cross-site scripting bug, and would need active maintenance to provide more substantial protection.

site-specifics

Some web sites have very specific problems, the cure for which doesn't apply anywhere else, or could even cause damage on other sites.

This is a collection of such site-specific cures which should only be applied to the sites they were intended for, which is what the supplied default.action file does. Users shouldn't need to change anything regarding this filter.

google

A CSS based block for Google text ads. Also removes a width limitation and the toolbar advertisement.

yahoo

Another CSS based block, this time for Yahoo text ads. It also removes a width limitation.

msn

Another CSS based block, this time for MSN text ads. It also removes tracking URLs and a width limitation.

blogspot

Cleans up some Blogspot blogs. Read the fine print before using this one!

This filter also intentionally removes some navigation stuff and sets the page width to 100%. As a result, some rounded "corners" would appear too early or not at all. As fixing this would require a browser that understands background-size (CSS3), they are removed instead.

xml-to-html

Server-header filter to change the Content-Type from xml to html.

html-to-xml

Server-header filter to change the Content-Type from html to xml.

no-ping

Removes the non-standard ping attribute from anchor and area HTML tags.
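As a rough Python illustration of what removing the ping attribute means, the sketch below uses a hypothetical pattern and function name and a made-up tracker URL. It is not the job from default.filter.

```python
import re

# Hypothetical pattern: capture the start of an <a> or <area> tag, then match
# a ping attribute with a double-quoted, single-quoted or unquoted value.
PING_ATTR = re.compile(
    r"""(<(?:a|area)\b[^>]*?)\s+ping\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE,
)

def strip_ping(html: str) -> str:
    """Keep the captured start of the tag and drop the ping attribute."""
    return PING_ATTR.sub(r"\1", html)

link = '<a href="/x" ping="http://tracker.example/log">x</a>'
cleaned = strip_ping(link)
```

Only anchor and area tags are touched; a ping attribute on any other tag is left alone in this sketch.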

hide-tor-exit-notation

Client-header filter to remove the Tor exit node notation found in Host and Referer headers.

If Privoxy and Tor are chained and Privoxy is configured to use socks4a, one can use "http://www.example.org.foobar.exit/" to access the host "www.example.org" through the Tor exit node "foobar".

As the HTTP client isn't aware of this notation, it treats the whole string "www.example.org.foobar.exit" as host and uses it for the "Host" and "Referer" headers. From the server's point of view the resulting headers are invalid and can cause problems.

An invalid "Referer" header can trigger "hot-linking" protections, and an invalid "Host" header will make it impossible for the server to find the right vhost (several domains hosted on the same IP address).

This client-header filter removes the "foobar.exit" part in those headers to prevent the mentioned problems. Note that it only modifies the HTTP headers; it doesn't make it impossible for the server to detect your Tor exit node based on the IP address the request is coming from.
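The header rewrite described above can be sketched in Python. The pattern and function name are hypothetical, and the sketch is a simplification that treats any ".<node>.exit" suffix of a host name in a Host or Referer header as exit notation.

```python
import re

# Hypothetical pattern: ".<node>.exit" at the end of a host name, i.e.
# followed by "/", ":", whitespace or the end of the header value.
EXIT_NOTATION = re.compile(r"\.[^./:\s]+\.exit(?=[/:\s]|$)", re.IGNORECASE)

def hide_exit_notation(header: str) -> str:
    """Strip the exit notation from Host and Referer headers only."""
    name, sep, value = header.partition(":")
    if name.strip().lower() not in ("host", "referer"):
        return header
    return name + sep + EXIT_NOTATION.sub("", value)

host = hide_exit_notation("Host: www.example.org.foobar.exit")
referer = hide_exit_notation("Referer: http://www.example.org.foobar.exit/page")
```

After the rewrite the server sees "www.example.org" in both headers. The connection itself still arrives from the exit node's IP address, which this sketch does not and cannot change.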