+<para>
+ This is still not the whole story, since we have ignored the options and the parentheses:
+ The portions of the page matched by sub-patterns that are enclosed in parentheses are
+ remembered and made available through the variables <literal>$1, $2, ...</literal> in
+ the substitute. The <literal>U</literal> option switches to ungreedy matching, which means
+ that the first <literal>.*</literal> in the pattern will only <quote>eat up</quote> all
+ text in between <quote><script</quote> and the <emphasis>first</emphasis> occurrence
+ of <quote>document.referrer</quote>, and that the second <literal>.*</literal> will
+ only span the text up to the <emphasis>first</emphasis> <quote></script></quote>
+ tag. Furthermore, the <literal>s</literal> option lets <literal>.</literal> match newlines
+ as well, so that the match may span multiple lines in the page, and the <literal>g</literal>
+ option again means that the substitution is global.
+</para>
+
+<para>
+ So, to summarize, the pattern means: Match all scripts that contain the text
+ <quote>document.referrer</quote>. Remember the parts of the script from
+ (and including) the start tag up to (and excluding) the string
+ <quote>document.referrer</quote> as <literal>$1</literal>, and the part following
+ that string, up to and including the closing tag, as <literal>$2</literal>.
+</para>
+
+<para>
+ Now the pattern is deciphered, but wasn't this about substituting things? So
+ let's look at the substitute: <literal>$1"Not Your Business!"$2</literal> is
+ easy to read: The text remembered as <literal>$1</literal>, followed by
+ <literal>"Not Your Business!"</literal> (<emphasis>including</emphasis>
+ the quotation marks!), followed by the text remembered as <literal>$2</literal>.
+ This produces an exact copy of the original string, with the middle part
+ (the <quote>document.referrer</quote>) replaced by <literal>"Not Your
+ Business!"</literal>.
+</para>
+
+<para>
+ The whole job now reads: Replace <quote>document.referrer</quote> by
+ <literal>"Not Your Business!"</literal> wherever it appears inside a
+ <script> tag. Note that this job won't break JavaScript syntax,
+ since both the original and the replacement are syntactically valid
+ string expressions. The script just won't have access to the referrer
+ information anymore.
+</para>
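+
+<para>
+ As a rough illustration, assuming the job discussed above is applied to a script
+ containing the first line below (the line is made up for this example, and the
+ surrounding script tags are left out), the filtered page would contain the second:
+</para>
+
+<para>
+ <screen>
+# Before filtering (hypothetical page content):
+var from = document.referrer;
+
+# After filtering:
+var from = "Not Your Business!";</screen>
+</para>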
+
+<para>
+ We'll show you two other jobs from the JavaScript taming department, but
+ this time only point out the constructs of special interest:
+</para>
+
+<para>
+ <screen>
+# The status bar is for displaying link targets, not pointless blahblah
+#
+s/window\.status\s*=\s*(['"]).*?\1/dUmMy=1/ig</screen>
+</para>
+
+<para>
+ <literal>\s</literal> stands for whitespace characters (space, tab, newline,
+ carriage return, form feed), so that <literal>\s*</literal> means: <quote>zero
+ or more whitespace</quote>. The <literal>?</literal> in <literal>.*?</literal>
+ makes this matching of arbitrary text ungreedy (note that the <literal>U</literal>
+ option is not set). The <literal>['"]</literal> construct means: <quote>a single
+ <emphasis>or</emphasis> a double quote</quote>. Finally, <literal>\1</literal> is
+ a back-reference to the first parenthesis just like <literal>$1</literal> above,
+ with the difference that in the <emphasis>pattern</emphasis>, a backslash indicates
+ a back-reference, whereas in the <emphasis>substitute</emphasis>, it's the dollar.
+</para>
+
+<para>
+ So what does this job do? It replaces assignments of single- or double-quoted
+ strings to the <quote>window.status</quote> property with a dummy assignment
+ (using a variable name that is hopefully odd enough not to conflict with
+ real variables in scripts). Thus, it catches many cases where e.g. pointless
+ descriptions are displayed in the status bar instead of the link target when
+ you move your mouse over links.
+</para>
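+
+<para>
+ A sketch of the effect, using two made-up script statements rather than code
+ from any real page. Note that the apostrophe inside the second string does not
+ end the match, because <literal>\1</literal> must be the same kind of quote
+ that opened the string:
+</para>
+
+<para>
+ <screen>
+# Before filtering (hypothetical input):
+window.status = 'Best deals in town!';
+window.status="Don't miss out";
+
+# After filtering:
+dUmMy=1;
+dUmMy=1;</screen>
+</para>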
+
+<para>
+ <screen>
+# Kill OnUnload popups. Yummy. Test: http://www.zdnet.com/zdsubs/yahoo/tree/yfs.html
+#
+s/(<body [^>]*)onunload(.*>)/$1never$2/iU</screen>
+</para>
+
+<para>
+ Including the
+ <ulink url="http://www.w3.org/TR/2000/REC-DOM-Level-2-Events-20001113/events.html#Events-eventgroupings-htmlevents">OnUnload
+ event binding</ulink> in the HTML DOM was a <emphasis>CRIME</emphasis>.
+ When I close a browser window, I want it to close and die. Basta.
+ This job replaces the <quote>onunload</quote> attribute in
+ <quote><body></quote> tags with the dummy word <literal>never</literal>.
+ Note that the <literal>i</literal> option makes the pattern matching
+ case-insensitive. Also note that ungreedy matching alone doesn't always guarantee
+ a minimal match: In the first parenthesis, we had to use <literal>[^>]*</literal>
+ instead of <literal>.*</literal> to prevent the match from exceeding the
+ <body> tag if it doesn't contain <quote>OnUnload</quote>, but the page's
+ content does.
+</para>
+
+<para>
+ The last example is from the fun department:
+</para>
+
+<para>
+ <screen>
+FILTER: fun Fun text replacements
+
+# Spice the daily news:
+#
+s/microsoft(?!\.com)/MicroSuck/ig</screen>
+</para>
+
+<para>
+ Note the <literal>(?!\.com)</literal> part (a so-called negative lookahead)
+ in the job's pattern, which means: Don't match if the string
+ <quote>.com</quote> directly follows <quote>microsoft</quote>
+ in the page. This prevents links to microsoft.com from being trashed, while
+ still replacing the word everywhere else.
+</para>
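+
+<para>
+ For instance, given the made-up sentence below, only the first occurrence is
+ rewritten, while the host name is left alone:
+</para>
+
+<para>
+ <screen>
+# Before filtering (hypothetical page text):
+Microsoft announced a new product. Details at www.microsoft.com/products.
+
+# After filtering:
+MicroSuck announced a new product. Details at www.microsoft.com/products.</screen>
+</para>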
+
+<para>
+ <screen>
+# Buzzword Bingo (example for extended regex syntax)
+#
+s* industry[ -]leading \
+| cutting[ -]edge \
+| customer[ -]focused \
+| market[ -]driven \
+| award[ -]winning # Comments are OK, too! \
+| high[ -]performance \
+| solutions[ -]based \
+| unmatched \
+| unparalleled \
+| unrivalled \
+*<font color="red"><b>BINGO!</b></font> \
+*igx</screen>
+</para>
+
+<para>
+ The <literal>x</literal> option in this job turns on extended syntax, and allows for
+ e.g. the liberal use of (non-interpreted!) whitespace for nicer formatting.
+</para>
+
+<para>
+ You get the idea?
+</para>
+</sect2>
+
+<!-- ~~~~~~~~ New section Header ~~~~~~~~~ -->
+
+<sect2 id="predefined-filters"><title>The Pre-defined Filters</title>
+
+<!--
+
+ Note each filter is also listed in the +filter action section above. Please
+ keep these listings in sync.
+
+-->
+
+<para>
+The distribution <filename>default.filter</filename> file contains a selection of
+pre-defined filters for your convenience:
+</para>
+
+<variablelist>
+ <varlistentry>
+ <term><emphasis>js-annoyances</emphasis></term>
+ <listitem>
+ <para>
+ The purpose of this filter is to get rid of particularly annoying JavaScript abuse.
+ To that end, it
+ <itemizedlist>
+ <listitem>
+ <para>
+ replaces JavaScript references to the browser's referrer information
+ with the string "Not Your Business!". This complements the <literal><link
+ linkend="hide-referrer">hide-referrer</link></literal> action on the content level.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ removes the bindings to the DOM's
+ <ulink url="http://www.w3.org/TR/2000/REC-DOM-Level-2-Events-20001113/events.html#Events-eventgroupings-htmlevents">unload
+ event</ulink> which we feel has no right to exist and is responsible for most <quote>exit consoles</quote>, i.e.
+ nasty windows that pop up when you close another one.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ removes code that causes new windows to be opened with undesired properties, such as being
+ full-screen, non-resizeable, without location, status or menu bar etc.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ <para>
+ Use with caution. This is an aggressive filter, and can break sites that
+ rely heavily on JavaScript.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>js-events</emphasis></term>
+ <listitem>
+ <para>
+ This is a very radical measure. It removes virtually all JavaScript event bindings, which
+ means that scripts can no longer react to user actions such as mouse movements, clicks or
+ window resizing. Use with caution!
+ </para>
+ <para>
+ We <emphasis>strongly discourage</emphasis> using this filter as a default since it breaks
+ many legitimate scripts. It is meant for use only on extra-nasty sites (should you really
+ need to go there).
+ </para>
+ </listitem>
+ </varlistentry>
+
+<varlistentry>
+ <term><emphasis>html-annoyances</emphasis></term>
+ <listitem>
+ <para>
+ This filter will undo many common instances of HTML based abuse.
+ </para>
+ <para>
+ The <literal>BLINK</literal> and <literal>MARQUEE</literal> tags
+ are neutralized (yeah baby!), and browser windows will be created as
+ resizeable (as of course they should be!), and will have location,
+ scroll and menu bars -- even if specified otherwise.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>content-cookies</emphasis></term>
+ <listitem>
+ <para>
+ Most cookies are set in the HTTP dialog, where they can be intercepted
+ by the
+ <literal><link linkend="crunch-incoming-cookies">crunch-incoming-cookies</link></literal>
+ and <literal><link linkend="crunch-outgoing-cookies">crunch-outgoing-cookies</link></literal>
+ actions. But web sites increasingly make use of HTML meta tags and JavaScript
+ to sneak cookies to the browser on the content level.
+ </para>
+ <para>
+ This filter disables most HTML and JavaScript code that reads or sets
+ cookies. It cannot detect all clever uses of these types of code, so it
+ should not be relied on as an absolute fix. Use it wherever you would also
+ use the cookie crunch actions.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>refresh-tags</emphasis></term>
+ <listitem>
+ <para>
+ Disables any refresh tags if the interval is greater than nine seconds (so
+ that redirections done via refresh tags are not destroyed). This is useful
+ for dial-on-demand setups, or for those who find this HTML feature
+ annoying.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>unsolicited-popups</emphasis></term>
+ <listitem>
+ <para>
+ This filter attempts to prevent only <quote>unsolicited</quote> pop-up
+ windows from opening, yet still allow pop-up windows that the user
+ has explicitly chosen to open. It was added in version 3.0.1,
+ as an improvement over earlier such filters.
+ </para>
+ <para>
+ Technical note: The filter works by redefining the window.open JavaScript
+ function to a dummy function, <literal>PrivoxyWindowOpen()</literal>,
+ during the loading and rendering phase of each HTML page access, and
+ restoring the function afterward.
+ </para>
+ <para>
+ This is recommended only for browsers that cannot perform this function
+ reliably themselves. And be aware that some sites require such windows
+ in order to function normally. Use with caution.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>all-popups</emphasis></term>
+ <listitem>
+ <para>
+ Attempts to prevent <emphasis>all</emphasis> pop-up windows from opening.
+ Note this should be used with even more discretion than the above, since
+ it is more likely to break some sites that require pop-ups for normal
+ usage. Use with caution.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>img-reorder</emphasis></term>
+ <listitem>
+ <para>
+ This is a helper filter that has no value if used alone. It makes the
+ <literal>banners-by-size</literal> and <literal>banners-by-link</literal>
+ (see below) filters more effective and should be enabled together with them.
+ </para>
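+ <para>
+ A minimal action file sketch of such a combination, assuming it is wanted for a
+ single site only (<literal>.example.com</literal> is a placeholder pattern, not
+ a recommendation):
+ </para>
+ <para>
+ <screen>
+# Hypothetical user.action entry
+{ +filter{img-reorder} +filter{banners-by-size} }
+.example.com</screen>
+ </para>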
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>banners-by-size</emphasis></term>
+ <listitem>
+ <para>
+ This filter removes image tags purely based on what size they are. Fortunately
+ for us, many ads and banner images tend to conform to certain standardized
+ sizes, which makes this filter quite effective for ad stripping purposes.
+ </para>
+ <para>
+ Occasionally this filter will cause false positives on images that are not ads,
+ but just happen to be of one of the standard banner sizes.
+ </para>
+ <para>
+ Recommended only for those who require extreme ad blocking. The default
+ block rules should catch 95+% of all ads <emphasis>without</emphasis> this filter enabled.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>banners-by-link</emphasis></term>
+ <listitem>
+ <para>
+ This is an experimental filter that attempts to kill any banners if
+ their URLs seem to point to known or suspected click trackers. It is currently
+ not of much value and is not recommended for use by default.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>webbugs</emphasis></term>
+ <listitem>
+ <para>
+ Webbugs are small, invisible images (technically 1x1 GIF images) that
+ are used to track users across websites and collect information on them.
+ As an HTML page is loaded by the browser, an embedded image tag causes the
+ browser to contact a third-party site, disclosing the tracking information
+ through the requested URL and/or cookies for that third-party domain, without
+ the user ever becoming aware of the interaction with the third-party site.
+ HTML-ized spam also uses a similar technique to verify email addresses.
+ </para>
+ <para>
+ This filter removes the HTML code that loads such <quote>webbugs</quote>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>tiny-textforms</emphasis></term>
+ <listitem>
+ <para>
+ A rather special-purpose filter that can be used to enlarge textareas (those
+ multi-line text boxes in web forms) and turn off hard word wrap in them.
+ It was written for the sourceforge.net tracker system where such boxes are
+ a nuisance, but it can be handy on other sites, too.
+ </para>
+ <para>
+ It is not recommended to use this filter as a default.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>jumping-windows</emphasis></term>
+ <listitem>
+ <para>
+ Many consider windows that move or resize themselves to be abusive. This filter
+ neutralizes the related JavaScript code. Note that some sites might not display
+ or behave as intended when using this filter. Use with caution.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>frameset-borders</emphasis></term>
+ <listitem>
+ <para>
+ Some web designers seem to assume that everyone in the world will view their
+ web sites using the same browser brand and version, screen resolution etc.,
+ because only that assumption could explain why they'd use static frame sizes,
+ yet prevent their frames from being resized by the user, should they be too
+ small to show their whole content.
+ </para>
+ <para>
+ This filter removes the related HTML code. It should only be applied to sites
+ which need it.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>demoronizer</emphasis></term>
+ <listitem>
+ <para>
+ Many Microsoft products that generate HTML use non-standard extensions (read:
+ violations) of the ISO 8859-1 aka Latin-1 character set. This can cause those
+ HTML documents to display with errors on standard-compliant platforms.
+ </para>
+ <para>
+ This filter translates the MS-only characters into Latin-1 equivalents.
+ It is not necessary when using MS products, and will cause corruption of
+ all documents that use 8-bit character sets other than Latin-1. It's mostly
+ worthwhile for Europeans on non-MS platforms who sometimes see weird garbage
+ characters on some pages, or for user agents that don't correct for this on
+ the fly.
+<!--
+ My version of Mozilla (ancient) shows little square boxes for quote
+ characters, and apostrophes on moronized pages. So many pages have this, I
+ can read them fine now. HB 08/27/06
+-->
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>shockwave-flash</emphasis></term>
+ <listitem>
+ <para>
+ A filter for Shockwave haters. As the name suggests, this filter strips code
+ out of web pages that is used to embed Shockwave Flash objects.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>quicktime-kioskmode</emphasis></term>
+ <listitem>
+ <para>
+ Changes HTML code that embeds QuickTime objects so that kioskmode, which
+ prevents saving, is disabled.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>fun</emphasis></term>
+ <listitem>
+ <para>
+ Text replacements for subversive browsing fun. Make fun of your favorite
+ Monopolist or play buzzword bingo.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>crude-parental</emphasis></term>
+ <listitem>
+ <para>
+ A demonstration-only filter that shows how <application>Privoxy</application>
+ can be used to delete web content on a keyword basis.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>ie-exploits</emphasis></term>
+ <listitem>
+ <para>
+ An experimental collection of text replacements to disable malicious HTML and JavaScript
+ code that exploits known security holes in Internet Explorer.
+ </para>
+ <para>
+ Presently, it only protects against Nimda and a cross-site scripting bug, and
+ would need active maintenance to provide more substantial protection.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>site-specifics</emphasis></term>
+ <listitem>
+ <para>
+ Some web sites have very specific problems, the cure for which doesn't apply
+ anywhere else, or could even cause damage on other sites.
+ </para>
+ <para>
+ This is a collection of such site-specific cures which should only be applied
+ to the sites they were intended for, which is what the supplied
+ <filename>default.action</filename> file does. Users shouldn't need to change
+ anything regarding this filter.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>google</emphasis></term>
+ <listitem>
+ <para>
+ A CSS based block for Google text ads. It also removes a width limitation
+ and the toolbar advertisement.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>yahoo</emphasis></term>
+ <listitem>
+ <para>
+ Another CSS based block, this time for Yahoo text ads. It also removes
+ a width limitation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>msn</emphasis></term>
+ <listitem>
+ <para>
+ Another CSS based block, this time for MSN text ads. It also removes
+ tracking URLs and a width limitation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>blogspot</emphasis></term>
+ <listitem>
+ <para>
+ Cleans up some Blogspot blogs. Read the fine print before using this one!
+ </para>
+ <para>
+ This filter also intentionally removes some navigation stuff and sets the
+ page width to 100%. As a result, some rounded <quote>corners</quote> would
+ appear too early or not at all. Since fixing this would require a browser
+ that understands background-size (CSS3), they are removed instead.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>xml-to-html</emphasis></term>
+ <listitem>
+ <para>
+ Server-header filter to change the Content-Type from xml to html.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>html-to-xml</emphasis></term>
+ <listitem>
+ <para>
+ Server-header filter to change the Content-Type from html to xml.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>no-ping</emphasis></term>
+ <listitem>
+ <para>
+ Removes the non-standard <literal>ping</literal> attribute from
+ anchor and area HTML tags.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><emphasis>hide-tor-exit-notation</emphasis></term>
+ <listitem>
+ <para>
+ Client-header filter to remove the <command>Tor</command> exit node notation
+ found in Host and Referer headers.
+ </para>
+ <para>
+ If &my-app; and <command>Tor</command> are chained and &my-app;
+ is configured to use socks4a, one can use <quote>http://www.example.org.foobar.exit/</quote>
+ to access the host <quote>www.example.org</quote> through the
+ <command>Tor</command> exit node <quote>foobar</quote>.
+ </para>
+ <para>
+ As the HTTP client isn't aware of this notation, it treats the
+ whole string <quote>www.example.org.foobar.exit</quote> as host and uses it
+ for the <quote>Host</quote> and <quote>Referer</quote> headers. From the
+ server's point of view the resulting headers are invalid and can cause problems.
+ </para>
+ <para>
+ An invalid <quote>Referer</quote> header can trigger <quote>hot-linking</quote>
+ protections, and an invalid <quote>Host</quote> header will make it impossible for
+ the server to find the right vhost (several domains hosted on the same IP address).
+ </para>
+ <para>
+ This client-header filter removes the <quote>foobar.exit</quote> part in those headers
+ to prevent the mentioned problems. Note that it only modifies
+ the HTTP headers; it doesn't make it impossible for the server
+ to detect your <command>Tor</command> exit node based on the IP address
+ the request is coming from.
+ </para>
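+ <para>
+ One way to enable it, sketched as an action file entry that applies to all
+ URLs (an assumption for this example; narrow the pattern if only some requests
+ go through <command>Tor</command>):
+ </para>
+ <para>
+ <screen>
+# Hypothetical user.action entry; "/" matches every URL
+{ +client-header-filter{hide-tor-exit-notation} }
+/</screen>
+ </para>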
+ </listitem>
+ </varlistentry>
+
+<!--
+ <varlistentry>
+ <term><emphasis> </emphasis></term>
+ <listitem>
+ <para>
+ </para>
+ <para>
+ </para>
+ </listitem>
+ </varlistentry>
+-->
+</variablelist>
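+
+<para>
+ None of these filters does anything until an action file enables it. The
+ following sketch shows what such entries might look like; the host names are
+ placeholders and the choice of filters is only an example, not a recommended
+ configuration:
+</para>
+
+<para>
+ <screen>
+# Hypothetical user.action entries
+
+# Enable two content filters for one site:
+{ +filter{js-annoyances} +filter{webbugs} }
+.example.com
+
+# Switch an aggressive filter off again where it breaks things:
+{ -filter{js-events} }
+shop.example.org
+
+# Server-header filters are enabled through their own action:
+{ +server-header-filter{xml-to-html} }
+feeds.example.net</screen>
+</para>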