X-Git-Url: http://www.privoxy.org/gitweb/?a=blobdiff_plain;f=doc%2Fwebserver%2Fuser-manual%2Ffilter-file.html;h=315f6a19318e18a4832427910c088ebd4daf6c85;hb=fe2dba94b00d45349597603a45cca6456c03fe54;hp=56c70ebe0679fc309b083d1f18261f9eb3bd9304;hpb=0212c18282eaa5f73843cbbec12c9137ea596e1c;p=privoxy.git diff --git a/doc/webserver/user-manual/filter-file.html b/doc/webserver/user-manual/filter-file.html index 56c70ebe..315f6a19 100644 --- a/doc/webserver/user-manual/filter-file.html +++ b/doc/webserver/user-manual/filter-file.html @@ -1,23 +1,25 @@ +
Privoxy User Manual | Privoxy 3.0.8 User Manual|
FILTER: foo Replace all "foo" with "bar" |
Kill those pesky little web-bugs:
Below that line, and up to the next header line, come the jobs that + define what text replacements the filter executes. They are specified + in a syntax that imitates Perl's + s/// operator. If you are familiar with Perl, you + will find this to be quite intuitive, and may want to look at the + PCRS documentation for the subtle differences to Perl behaviour. Most + notably, the non-standard option letter U is supported, + which turns the default to ungreedy matching. If you are new to + "Regular + Expressions", you might want to take a look at + the Appendix on regular expressions, and + see the Perl + manual for + the
+ # webbugs: Squish WebBugs (1x1 invisible GIFs used for user tracking)
- FILTER: webbugs
-
- s/<img\s+[^>]*?(width|height)\s*=\s*['"]?1\D[^>]*?(width|height)\s*=\s*['"]?1(\D[^>]*?)?>/<!-- Squished WebBug -->/sig
-
- Filters are enabled with the "+filter" action from within - one of the actions files. "+filter" requires one parameter, which - should match one of the section identifiers in the filter file itself. Example:
- +filter{html-annoyances}
- This would activate that particular filter. Similarly, "+filter"
Now, let's complete our "foo" content filter. We have already defined + the heading, but the jobs are still missing. Since all it does is to replace + "foo" with "bar", there is only one (trivial) job needed:
s/foo/bar/ |
But wait! Didn't the comment say that all occurrences + of "foo" should be replaced? Our current job will only take + care of the first "foo" on each page. For global substitution, + we'll need to add the g option:
s/foo/bar/g |
Our complete filter now looks like this:
FILTER: foo Replace all "foo" with "bar" +s/foo/bar/g |
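The effect of the g option can be tried outside Privoxy. The following is a minimal sketch that uses Python's re module as a stand-in for PCRS; this is an assumption made for illustration only, and the function name apply_job is hypothetical. In Python, count=1 mimics a job without g, while count=0 mimics a job with g.

```python
import re

def apply_job(text: str, global_: bool) -> str:
    """Mimic s/foo/bar/ (global_=False) or s/foo/bar/g (global_=True)."""
    # count=1 replaces only the first match; count=0 replaces every match.
    return re.sub(r"foo", "bar", text, count=0 if global_ else 1)

page = "foo one, foo two, foo three"
first_only = apply_job(page, global_=False)
everywhere = apply_job(page, global_=True)
```

Without the g behaviour only the first "foo" changes; with it, all three do.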
Let's look at some real filters for more interesting examples. Here you see + a filter that protects against some common annoyances that arise from JavaScript + abuse. Let's look at its jobs one after the other:
FILTER: js-annoyances Get rid of particularly annoying JavaScript abuse + +# Get rid of JavaScript referrer tracking. Test page: http://www.randomoddness.com/untitled.htm +# +s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|Usg |
Following the header line and a comment, you see the job. Note that it uses + | as the delimiter instead of /, because + the pattern contains a forward slash, which would otherwise have to be escaped + by a backslash (\).
Now, let's examine the pattern: it starts with the text <script.* + enclosed in parentheses. Since the dot matches any character, and * + means: "Match an arbitrary number of the element left of myself", this + matches "<script", followed by any text, i.e. + it matches the whole page, from the start of the first <script> tag.
That's more than we want, but the pattern continues: document\.referrer + matches only the exact string "document.referrer". The dot needed to + be escaped, i.e. preceded by a backslash, to take away its + special meaning as a wildcard, and make it just a regular dot. So far, the meaning is: + Match from the start of the first <script> tag in the page, up to, and including, + the text "document.referrer", if both are present + in the page (and appear in that order).
But there's still more pattern to go. The next element, again enclosed in parentheses, + is .*</script>. You already know what .* + means, so the whole pattern translates to: Match from the start of the first <script> + tag in a page to the end of the last </script> tag, provided that the text + "document.referrer" appears somewhere in between.
This is still not the whole story, since we have ignored the options and the parentheses: + The portions of the page matched by sub-patterns that are enclosed in parentheses, will be + remembered and be available through the variables $1, $2, ... in + the substitute. The U option switches to ungreedy matching, which means + that the first .* in the pattern will only "eat up" all + text in between "<script" and the first occurrence + of "document.referrer", and that the second .* will + only span the text up to the first "</script>" + tag. Furthermore, the s option says that the match may span + multiple lines in the page, and the g option again means that the + substitution is global.
So, to summarize, the pattern means: Match all scripts that contain the text + "document.referrer". Remember the parts of the script from + (and including) the start tag up to (and excluding) the string + "document.referrer" as $1, and the part following + that string, up to and including the closing tag, as $2.
Now the pattern is deciphered, but wasn't this about substituting things? So + let's look at the substitute: $1"Not Your Business!"$2 is + easy to read: The text remembered as $1, followed by + "Not Your Business!" (including + the quotation marks!), followed by the text remembered as $2. + This produces an exact copy of the original string, with the middle part + (the "document.referrer") replaced by "Not Your + Business!".
The whole job now reads: Replace "document.referrer" by + "Not Your Business!" wherever it appears inside a + <script> tag. Note that this job won't break JavaScript syntax, + since both the original and the replacement are syntactically valid + string objects. The script just won't have access to the referrer + information anymore.
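To see the job at work, here is a rough Python analog. It is a sketch under stated assumptions, not Privoxy's implementation: Python's re module has no U option, so each .* is written as the lazy .*? instead, re.DOTALL stands in for the s option, and re.sub() replaces all matches, like g. The function name hide_referrer and the sample page are made up for the example.

```python
import re

# Lazy .*? replaces the U option; re.DOTALL replaces the s option.
JOB = re.compile(r"(<script.*?)document\.referrer(.*?</script>)", re.DOTALL)

def hide_referrer(html: str) -> str:
    # \g<1> and \g<2> correspond to $1 and $2 in the Privoxy substitute.
    return JOB.sub(r'\g<1>"Not Your Business!"\g<2>', html)

page = (
    "<p>document.referrer outside a script stays.</p>\n"
    "<script>\nvar r = document.referrer;\n</script>\n"
    "<script>alert(1);</script>"
)
result = hide_referrer(page)
```

The occurrence in the paragraph is left alone because the pattern can only start matching at "<script"; the one inside the script becomes a quoted string.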
We'll show you two other jobs from the JavaScript taming department, but + this time only point out the constructs of special interest:
# The status bar is for displaying link targets, not pointless blahblah +# +s/window\.status\s*=\s*(['"]).*?\1/dUmMy=1/ig |
- can be turned off for selected sites as:
\s stands for whitespace characters (space, tab, newline, + carriage return, form feed), so that \s* means: "zero + or more whitespace". The ? in .*? + makes this matching of arbitrary text ungreedy. (Note that the U + option is not set). The ['"] construct means: "a single or a double quote". Finally, \1 is + a back-reference to the first parenthesis just like $1 above, + with the difference that in the pattern, a backslash indicates + a back-reference, whereas in the substitute, it's the dollar.
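The constructs in this job behave the same way in Python's re module, which makes them easy to experiment with. The sketch below is only an analog (Privoxy uses PCRS, not Python), and the function name neutralize_status and the sample script are invented for the example.

```python
import re

# (['"]) captures the opening quote; \1 requires the same quote to close.
# .*? is lazy, so the match ends at the first matching closing quote.
STATUS_JOB = re.compile(r"""window\.status\s*=\s*(['"]).*?\1""", re.IGNORECASE)

def neutralize_status(script: str) -> str:
    # re.sub() replaces every match, like the g option.
    return STATUS_JOB.sub("dUmMy=1", script)

script = '''window.status = 'It"s here'; window.STATUS="next";'''
cleaned = neutralize_status(script)
```

Note that the double quote inside the single-quoted string does not end the first match, because \1 insists on the quote character that opened it.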
- "-filter{html-annoyances}". Remember too, all actions are off by - default, unless they are explicitly enabled in one of the actions files.
So what does this job do? It replaces assignments of single- or double-quoted + strings to the "window.status" object with a dummy assignment + (using a variable name that is hopefully odd enough not to conflict with + real variables in scripts). Thus, it catches many cases where e.g. pointless + descriptions are displayed in the status bar instead of the link target when + you move your mouse over links.
# Kill OnUnload popups. Yummy. Test: http://www.zdnet.com/zdsubs/yahoo/tree/yfs.html +# +s/(<body [^>]*)onunload(.*>)/$1never$2/iU |
Including the + OnUnload + event binding in the HTML DOM was a CRIME. + When I close a browser window, I want it to close and die. Basta. + This job replaces the "onunload" attribute in + "<body>" tags with the dummy word never. + Note that the i option makes the pattern matching + case-insensitive. Also note that ungreedy matching alone doesn't always guarantee + a minimal match: In the first parenthesis, we had to use [^>]* + instead of .* to prevent the match from exceeding the + <body> tag if it doesn't contain "OnUnload", but the page's + content does.
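The point about ungreedy matching not guaranteeing a minimal match can be demonstrated with a small Python sketch. It assumes Python's re module as a stand-in for PCRS and writes the U option as lazy quantifiers; the names SAFE and LOOSE and the sample page are made up. LOOSE is a deliberately wrong variant that uses .* in the first group.

```python
import re

FLAGS = re.IGNORECASE  # the i option; the job has no s or g option

# Analog of the job: [^>]* cannot leave the <body ...> tag.
SAFE = re.compile(r"(<body [^>]*?)onunload(.*?>)", FLAGS)
# Variant with .* in the first group: lazy, but free to run past the tag.
LOOSE = re.compile(r"(<body .*?)onunload(.*?>)", FLAGS)

page = '<body bgcolor="white"><p onclick="x">OnUnload is a DOM event.</p></body>'

# count=1 because the job lacks the g option.
safe_result = SAFE.sub(r"\g<1>never\g<2>", page, count=1)
loose_result = LOOSE.sub(r"\g<1>never\g<2>", page, count=1)
```

SAFE leaves the page untouched, while LOOSE reaches into the page text and rewrites the word there.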
The last example is from the fun department:
FILTER: fun Fun text replacements + +# Spice the daily news: +# +s/microsoft(?!\.com)/MicroSuck/ig |
Note the (?!\.com) part (a so-called negative lookahead) + in the job's pattern, which means: Don't match, if the string + ".com" appears directly following "microsoft" + in the page. This prevents links to microsoft.com from being trashed, while + still replacing the word everywhere else.
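Negative lookahead works identically in Python's re module, so the behaviour can be checked with a short sketch. Python is used here only as an illustration, and the sample text is invented.

```python
import re

# (?!\.com) refuses a match when ".com" follows directly.
FUN_JOB = re.compile(r"microsoft(?!\.com)", re.IGNORECASE)

text = 'Visit <a href="http://www.microsoft.com/">Microsoft</a> today.'
spiced = FUN_JOB.sub("MicroSuck", text)
```

The link target survives, and only the visible link text is replaced.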
# Buzzword Bingo (example for extended regex syntax) +# +s* industry[ -]leading \ +| cutting[ -]edge \ +| customer[ -]focused \ +| market[ -]driven \ +| award[ -]winning # Comments are OK, too! \ +| high[ -]performance \ +| solutions[ -]based \ +| unmatched \ +| unparalleled \ +| unrivalled \ +*<font color="red"><b>BINGO!</b></font> \ +*igx |
The x option in this job turns on extended syntax, and allows for + e.g. the liberal use of (non-interpreted!) whitespace for nicer formatting.
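Python's re.VERBOSE flag is a close relative of the x option and can serve as a playground. The sketch below is an analog with a shortened word list, not the job itself, and the function name mark_buzzwords is hypothetical. As in the job, whitespace inside a character class such as [ -] still counts.

```python
import re

# re.VERBOSE ignores whitespace and # comments in the pattern,
# except inside a character class such as [ -].
BINGO = re.compile(
    r"""
      industry[ -]leading
    | cutting[ -]edge
    | award[ -]winning     # comments are fine here, too
    | unparalleled
    """,
    re.IGNORECASE | re.VERBOSE,
)

def mark_buzzwords(text: str) -> str:
    return BINGO.sub('<font color="red"><b>BINGO!</b></font>', text)

marked = mark_buzzwords("Our Cutting-Edge, award winning product")
```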
You get the idea?
The distribution default.filter file contains a selection of +pre-defined filters for your convenience:
The purpose of this filter is to get rid of particularly annoying JavaScript abuse. + To that end, it +
replaces JavaScript references to the browser's referrer information + with the string "Not Your Business!". This complements the hide-referrer action on the content level. +
removes the bindings to the DOM's + unload + event which we feel has no right to exist and is responsible for most "exit consoles", i.e. + nasty windows that pop up when you close another one. +
removes code that causes new windows to be opened with undesired properties, such as being + full-screen, non-resizeable, without location, status or menu bar etc. +
Use with caution. This is an aggressive filter, and can break sites that + rely heavily on JavaScript. +
This is a very radical measure. It removes virtually all JavaScript event bindings, which + means that scripts can no longer react to user actions such as mouse movements or clicks, window + resizing etc. Use with caution! +
We strongly discourage using this filter as a default since it breaks + many legitimate scripts. It is meant for use only on extra-nasty sites (should you really + need to go there). +
This filter will undo many common instances of HTML based abuse. +
The BLINK and MARQUEE tags + are neutralized (yeah baby!), and browser windows will be created as + resizeable (as of course they should be!), and will have location, + scroll and menu bars -- even if specified otherwise. +
Most cookies are set in the HTTP dialog, where they can be intercepted + by the + crunch-incoming-cookies + and crunch-outgoing-cookies + actions. But web sites increasingly make use of HTML meta tags and JavaScript + to sneak cookies to the browser on the content level. +
This filter disables most HTML and JavaScript code that reads or sets + cookies. It cannot detect all clever uses of these types of code, so it + should not be relied on as an absolute fix. Use it wherever you would also + use the cookie crunch actions. +
Disable any refresh tags if the interval is greater than nine seconds (so + that redirections done via refresh tags are not destroyed). This is useful + for dial-on-demand setups, or for those who find this HTML feature + annoying. +
This filter attempts to prevent only "unsolicited" pop-up + windows from opening, yet still allow pop-up windows that the user + has explicitly chosen to open. It was added in version 3.0.1, + as an improvement over earlier such filters. +
Technical note: The filter works by redefining the window.open JavaScript + function to a dummy function, PrivoxyWindowOpen(), + during the loading and rendering phase of each HTML page access, and + restoring the function afterward. +
This is recommended only for browsers that cannot perform this function + reliably themselves. And be aware that some sites require such windows + in order to function normally. Use with caution. +
Attempt to prevent all pop-up windows from opening. + Note this should be used with even more discretion than the above, since + it is more likely to break some sites that require pop-ups for normal + usage. Use with caution. +
This is a helper filter that has no value if used alone. It makes the + banners-by-size and banners-by-link + (see below) filters more effective and should be enabled together with them. +
This filter removes image tags purely based on what size they are. Fortunately + for us, many ads and banner images tend to conform to certain standardized + sizes, which makes this filter quite effective for ad stripping purposes. +
Occasionally this filter will cause false positives on images that are not ads, + but just happen to be of one of the standard banner sizes. +
Recommended only for those who require extreme ad blocking. The default + block rules should catch 95+% of all ads without this filter enabled. +
This is an experimental filter that attempts to kill any banners if + their URLs seem to point to known or suspected click trackers. It is currently + not of much value and is not recommended for use by default. +
Webbugs are small, invisible images (technically 1x1 GIF images), that + are used to track users across websites, and collect information on them. + As an HTML page is loaded by the browser, an embedded image tag causes the + browser to contact a third-party site, disclosing the tracking information + through the requested URL and/or cookies for that third-party domain, without + the user ever becoming aware of the interaction with the third-party site. + HTML-ized spam also uses a similar technique to verify email addresses. +
This filter removes the HTML code that loads such "webbugs". +
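The webbugs job quoted among the removed lines of this diff can be exercised with a Python sketch. The pattern is copied from that job; everything else (Python's re module standing in for PCRS, the function name squish_webbugs, the sample tags) is an assumption made for illustration, and the job shipped in a given default.filter may differ.

```python
import re

# The s, i and g options become re.DOTALL, re.IGNORECASE and
# re.sub()'s default replace-all behaviour.
WEBBUG = re.compile(
    r"""<img\s+[^>]*?(width|height)\s*=\s*['"]?1\D[^>]*?"""
    r"""(width|height)\s*=\s*['"]?1(\D[^>]*?)?>""",
    re.DOTALL | re.IGNORECASE,
)

def squish_webbugs(html: str) -> str:
    return WEBBUG.sub("<!-- Squished WebBug -->", html)

bug = '<img src="http://tracker.example/b.gif" width="1" height="1">'
photo = '<img src="photo.png" width="100" height="100">'
squished = squish_webbugs(bug + photo)
```

The \D after each 1 is what keeps images with sizes such as 100 from being mistaken for 1x1 images.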
A rather special-purpose filter that can be used to enlarge textareas (those + multi-line text boxes in web forms) and turn off hard word wrap in them. + It was written for the sourceforge.net tracker system where such boxes are + a nuisance, but it can be handy on other sites, too. +
It is not recommended to use this filter as a default. +
Many consider windows that move, or resize themselves to be abusive. This filter + neutralizes the related JavaScript code. Note that some sites might not display + or behave as intended when using this filter. Use with caution. +
Some web designers seem to assume that everyone in the world will view their + web sites using the same browser brand and version, screen resolution etc, + because only that assumption could explain why they'd use static frame sizes, + yet prevent their frames from being resized by the user, should they be too + small to show their whole content. +
This filter removes the related HTML code. It should only be applied to sites + which need it. +
Many Microsoft products that generate HTML use non-standard extensions (read: + violations) of the ISO 8859-1 aka Latin-1 character set. This can cause those + HTML documents to display with errors on standard-compliant platforms. +
This filter translates the MS-only characters into Latin-1 equivalents. + It is not necessary when using MS products, and will cause corruption of + all documents that use 8-bit character sets other than Latin-1. It's mostly + worthwhile for Europeans on non-MS platforms who sometimes see weird garbage characters + on some pages, or who use user agents that don't correct for this on + the fly. + +
A filter for shockwave haters. As the name suggests, this filter strips code + out of web pages that is used to embed shockwave flash objects. +
Change HTML code that embeds Quicktime objects so that kioskmode, which + prevents saving, is disabled. +
Text replacements for subversive browsing fun. Make fun of your favorite + Monopolist or play buzzword bingo. +
A demonstration-only filter that shows how Privoxy + can be used to delete web content on a keyword basis. +
An experimental collection of text replacements to disable malicious HTML and JavaScript + code that exploits known security holes in Internet Explorer. +
Presently, it only protects against Nimda and a cross-site scripting bug, and + would need active maintenance to provide more substantial protection. +
Some web sites have very specific problems, the cure for which doesn't apply + anywhere else, or could even cause damage on other sites. +
This is a collection of such site-specific cures which should only be applied + to the sites they were intended for, which is what the supplied + default.action file does. Users shouldn't need to change + anything regarding this filter. +
A CSS based block for Google text ads. Also removes a width limitation + and the toolbar advertisement. +
Another CSS based block, this time for Yahoo text ads. It also removes + a width limitation. +
Another CSS based block, this time for MSN text ads. It also removes + tracking URLs and a width limitation. +
Cleans up some Blogspot blogs. Read the fine print before using this one! +
This filter also intentionally removes some navigation stuff and sets the + page width to 100%. As a result, some rounded "corners" would + appear too early or not at all, and as fixing this would require a browser + that understands background-size (CSS3), they are removed instead. +
Server-header filter to change the Content-Type from xml to html. +
Server-header filter to change the Content-Type from html to xml. +
Removes the non-standard ping attribute from + anchor and area HTML tags. +
Client-header filter to remove the Tor exit node notation + found in Host and Referer headers. +
If Privoxy and Tor are chained and Privoxy + is configured to use socks4a, one can use "http://www.example.org.foobar.exit/" + to access the host "www.example.org" through the + Tor exit node "foobar". +
As the HTTP client isn't aware of this notation, it treats the + whole string "www.example.org.foobar.exit" as host and uses it + for the "Host" and "Referer" headers. From the + server's point of view the resulting headers are invalid and can cause problems. +
An invalid "Referer" header can trigger "hot-linking" + protections, an invalid "Host" header will make it impossible for + the server to find the right vhost (several domains hosted on the same IP address). +
This client-header filter removes the "foo.exit" part in those headers + to prevent the mentioned problems. Note that it only modifies + the HTTP headers, it doesn't make it impossible for the server + to detect your Tor exit node based on the IP address + the request is coming from. +
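The header rewrite described above might look roughly like the following Python sketch. The pattern is hypothetical and is not the job shipped with Privoxy; the function name strip_exit_notation is likewise invented. It assumes the exit node nickname contains no dots and that the notation is followed by ":", "/" or the end of the value.

```python
import re

# Hypothetical pattern: a dot, a nickname without dots, then ".exit",
# followed by ":", "/" or the end of the header value.
EXIT_NOTATION = re.compile(r"\.[^./:\s]+\.exit(?=[:/]|$)", re.IGNORECASE)

def strip_exit_notation(header_value: str) -> str:
    # Only the first occurrence is removed.
    return EXIT_NOTATION.sub("", header_value, count=1)

host = strip_exit_notation("www.example.org.foobar.exit")
referer = strip_exit_notation("http://www.example.org.foobar.exit/index.html")
```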