Quickstart to Ad Blocking
+
+
+ Ad blocking is but one of Privoxy's
+ array of features. Many of these features are for the technically minded advanced
+ user. But, ad blocking is surely common ground for everybody.
+
+
+ This section provides a quick overview of ad blocking so
+ you can get up to speed quickly without having to read the more extensive
+ information provided below, though reading it is highly recommended.
+
+
+ First a bit of a warning ... blocking ads is much like blocking SPAM: the
+ more aggressive you are about it, the more likely you are to block a few
+ things that were not intended. So there is a trade off here. If you want
+ extreme ad free browsing, be prepared to deal with more
+ problem sites, and to spend more time adjusting the
+ configuration to solve these unintended consequences.
+
+
+ Secondly, a quick note on Privoxy's
+ actions. Actions, in this context, are
+ the directives we use to tell Privoxy to perform
+ some task relating to HTTP transactions (i.e. web browsing). We tell
+ Privoxy to take some action. Each
+ action has a unique name and function. While there are many potential
+ actions in Privoxy's
+ arsenal, only a few are used for ad blocking. Actions, and action
+ configuration files, are explained in depth below.
+
+
+ Actions are specified in Privoxy's configuration,
+ followed by one or more URLs to which the action should apply. URLs
+ can actually be URL type patterns that use
+ wildcards so they can apply potentially to a range of similar URLs.
+
+
+ When you connect to a website, the full path of the URL will either match one
+ of the actions defined in Privoxy's configuration,
+ or not. If so, then Privoxy will perform the
+ action accordingly. If not, then nothing special happens. Furthermore, web
+ pages may contain embedded, secondary URLs that your web browser will
+ display as it parses the original page's HTML content. An ad image for
+ instance, is just a URL embedded in the page somewhere. The image itself may
+ be on the same server, or a server somewhere else on the Internet. Complex
+ web pages will have many such embedded URLs.
+
+
+
+ The actions we need to know about for ad blocking are: block, handle-as-image, and set-image-blocker.
+
+
+
+
+
+
+
+ block - this action stops
+ any contact between your browser and any URL patterns that match this
+ action's configuration. It can be used for blocking ads, but also anything
+ that is determined to be unwanted. By itself, it simply stops any
+ communication with the remote server. If this is the only action that
+ matches for a particular URL, then Privoxy will
+ display its own BLOCKED page to let you know what has happened.
+
+
+
+
+
+ handle-as-image -
+ forces Privoxy to treat this URL as if it were
+ an image. Privoxy knows about common image
+ types (e.g. GIF), but there are many situations where this does not apply.
+ So we'll force it. This is particularly important for ad blocking, since
+ once we can treat it as an image, we can make more intelligent decisions
+ on how to handle it. There are some limitations to this though. For
+ instance, you can't just force an image substitution for an entire HTML page
+ in most situations.
+
+
+
+
+
+ set-image-blocker -
+ tells Privoxy what to display in place of
+ an ad image that has hit a block rule. For this to come into play,
+ the URL must match a block action somewhere in the configuration.
+ And, it must also either be of a known image type, or
+ match a handle-as-image
+ action.
+
+
+ The configuration options on what to display instead of the ad are:
+
+
+
+ pattern - a checkerboard pattern, so that an ad
+ replacement is obvious. This is the default.
+
+
+
+
+ blank - A very small empty GIF image is displayed.
+ This is the so-called invisible configuration option.
+
+
+
+
+ http://<URL> - A redirect to any URL of the
+ user's choosing.
+
+
+
+
+
+
+
+
+]]>
+
+
+
+
Starting Privoxy
@@ -1146,7 +1305,7 @@ actionsfile
- Default value:
+ Default values:
@@ -1199,7 +1358,7 @@ actionsfile
Specifies:
- The filter file to use
+ The filter file to use
@@ -1220,7 +1379,7 @@ actionsfile
No textual content filtering takes place, i.e. all
- +filter{name}
+ +filter{name}
actions in the actions files are turned neutral.
@@ -1229,13 +1388,25 @@ actionsfile
Notes:
- The default.filter file contains content modification rules
- that use regular expressions. These rules permit powerful
- changes on the content of Web pages, e.g., you could disable your favorite
+ The filter file contains content modification
+ rules that use regular expressions. These rules permit
+ powerful changes on the content of Web pages, e.g., you could disable your favorite
JavaScript annoyances, re-write the actual displayed text, or just have some
fun replacing Microsoft with MicroSuck wherever
it appears on a Web page.
+
+ The
+ +filter{name}
+ actions rely on the relevant filter (name)
+ to be defined in the filter file!
+
+
+ A pre-defined filter file called default.filter that contains
+ a bunch of handy filters for common problems is included in the distribution.
+ See the section on the filter
+ action for a list.
+
@@ -4225,7 +4396,7 @@ ad.doubleclick.net
sense to combine it with any filter action,
since as soon as one filter applies,
the whole document needs to be buffered anyway, which destroys the advantage of
- the kill-popups action over it's filter equivalent.
+ the kill-popups action over its filter equivalent.
Killing all pop-ups is a dangerous business. Many shops and banks rely on
@@ -4884,7 +5055,7 @@ my-internal-testing-server.void
-Sample Actions Files
+Actions Files Tutorial
The above chapters have shown which actions files
there are and how they are organized, how actions are
default.action
-Every config file should start with a short comment stating it's purpose:
+Every config file should start with a short comment stating its purpose:
@@ -5239,9 +5410,9 @@ count*.
#
{ -block }
adv[io]*. # (for advogato.org and advice.*)
-adsl.
+adsl. # (has nothing to do with ads)
ad[ud]*. # (adult.* and add.*)
-.edu # Universities
+.edu # (universities don't host banners (yet!))
.*loads. # (downloads, uploads etc)
# By path:
@@ -5322,7 +5493,9 @@ www.ugu.com/sui/ugu/adv
-crunch-all-cookies = -crunch-incoming-cookies -crunch-outgoing-cookies
mercy-for-cookies = -crunch-all-cookies -session-cookies-only
fragile = -block -crunch-all-cookies -filter -fast-redirects -hide-referer -kill-popups
-shop = mercy-for-cookies -filter{popups} -kill-popups
+shop = mercy-for-cookies -filter{popups} -kill-popups
+allow-ads = -block -filter{banners-by-size} # (see below)
+
@@ -5411,9 +5584,29 @@ another.popular.site.net/more/junk/here/
really shouldn't be filtered, like code on CVS->Web interfaces. Since
user.action has the last word, these exceptions
won't be valid for the fun filtering specified here.
- But you're the boss.
+
+ Finally, you might think about how your favourite free websites are
+ funded, and find that they rely on displaying banner advertisements
+ to survive. So you might want to specifically allow banners for those
+ sites that you feel provide value to you:
+
+
+
+
+{ allow-ads }
+.sourceforge.net
+.slashdot.org
+.osdn.net
+
+
+
+ Note that allow-ads has been aliased to
+ -block
+ -filter{banners-by-size}
+ above.
+
@@ -5427,129 +5620,306 @@ another.popular.site.net/more/junk/here/
The Filter File
+
- Any web page can be dynamically modified with the filter file. This
- modification can be removal, or re-writing, of any web page content,
- including tags and non-visible content. The default filter file is
- oddly enough default.filter, located in the config
- directory.
+ All text substitutions that can be invoked through the
+ filter action
+ must first be defined in the filter file, which is typically
+ called default.filter and which can be
+ selected through the
+ filterfile config
+ option.
- This is potentially a very powerful feature, and requires knowledge of both
- regular expression and HTML in order create custom
- filters. But, there are a number of useful filters included with
- Privoxy for many common situations.
+ Typical reasons for doing such substitutions are to eliminate
+ common annoyances in HTML and JavaScript, such as pop-up windows,
+ exit consoles, crippled windows without navigation tools, the
+ infamous <BLINK> tag etc, to suppress images with certain
+ width and height attributes (standard banner sizes or web-bugs),
+ or just to have fun. The possibilities are endless.
- The included example file is divided into sections. Each section begins
- with the FILTER keyword, followed by the identifier
- for that section, e.g. FILTER: webbugs. Each section performs
- a similar type of filtering, such as html-annoyances.
+ Filtering works on any text-based document type, including plain
+ text, HTML, JavaScript, CSS etc. (all text/*
+ MIME types). Substitutions are made at the source level, so if
+ you want to roll your own filters, you should be
+ familiar with HTML syntax.
- This file uses regular expressions to alter or remove any string in the
- target page. The expressions can only operate on one line at a time. Some
- examples from the included default default.filter:
+ Just like the actions files, the
+ filter file is organized in sections, which are called filters
+ here. Each filter consists of a heading line, that starts with the
+ keyword FILTER:, followed by
+ the filter's name, and a short (one line)
+ description of what it does. Below that line
+ come the jobs, i.e. lines that define the actual
+ text substitutions. By convention, the name of a filter
+ should describe what the filter eliminates. The
+ comment is used in the web-based
+ user interface.
- Stop web pages from displaying annoying messages in the status bar by
- deleting such references:
+ Once a filter called name has been defined
+ in the filter file, it can be invoked by using an action of the form
+ +filter{name}
+ in any actions file.
+
+
+
+ A filter header line for a filter called foo could look
+ like this:
-
-
-
- FILTER: html-annoyances
+ FILTER: foo Replace all "foo" with "bar"
+
- # New browser windows should be resizeable and have a location and status
- # bar. Make it so.
- #
- s/resizable="?(no|0)"?/resizable=1/ig s/noresize/yesresize/ig
- s/location="?(no|0)"?/location=1/ig s/status="?(no|0)"?/status=1/ig
- s/scrolling="?(no|0|Auto)"?/scrolling=1/ig
- s/menubar="?(no|0)"?/menubar=1/ig
+
+ Below that line, and up to the next header line, come the jobs that
+ define what text replacements the filter executes. They are specified
+ in a syntax that imitates Perl's
+ s/// operator. If you are familiar with Perl, you
+ will find this to be quite intuitive, and may want to look at the
+ PCRS man page
+ for the subtle differences to Perl behaviour. Most notably, the non-standard
+ option letter U is supported, which turns the default
+ to ungreedy matching.
+
- # The <BLINK> tag was a crime!
- #
- s*<blink>|</blink>**ig
+
+ If you are new to regular expressions, you might want to take a look at
+ the Appendix on regular expressions, and
+ see the Perl
+ manual for
+ the
+ s/// operator's syntax and Perl-style regular
+ expressions in general.
+ The below examples might also help to get you started.
+
- # Is this evil?
- #
- #s/framespacing="?(no|0)"?//ig
- #s/margin(height|width)=[0-9]*//gi
-
-
-
+
+
+Filter File Tutorial
+
+ Now, let's complete our foo filter. We have already defined
+ the heading, but the jobs are still missing. Since all it does is to replace
+ foo with bar, there is only one (trivial) job
+ needed:
- Just for kicks, replace any occurrence of Microsoft with
- MicroSuck, and have a little fun with topical buzzwords:
+ s/foo/bar/
-
-
-
- FILTER: fun
+ But wait! Didn't the comment say that all occurrences
+ of foo should be replaced? Our current job will only take
+ care of the first foo on each page. For global substitution,
+ we'll need to add the g option:
+
- s/microsoft(?!.com)/MicroSuck/ig
+
+ s/foo/bar/g
+
- # Buzzword Bingo:
- #
- s/industry-leading|cutting-edge|award-winning/<font color=red><b>BINGO!</b></font>/ig
-
-
-
+
+ Our complete filter now looks like this:
+
+
+ FILTER: foo Replace all "foo" with "bar"
+s/foo/bar/g
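The effect of the g option can be tried outside Privoxy. The following is a minimal sketch that uses Python's re module as a stand-in for the filter engine; the sample text and the variable names are made up for illustration, and the mapping (no g option corresponds to count=1, the g option to re.sub's default) is an assumption about equivalent behaviour, not Privoxy's own code.

```python
import re

# Hypothetical page content, for illustration only.
page = "foo, foo and more foo"

# s/foo/bar/ : without the g option only the first match is replaced.
first_only = re.sub(r"foo", "bar", page, count=1)

# s/foo/bar/g : with the g option every match is replaced
# (this is what re.sub does when no count is given).
all_matches = re.sub(r"foo", "bar", page)

print(first_only)   # bar, foo and more foo
print(all_matches)  # bar, bar and more bar
```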
- Kill those pesky little web-bugs:
+ Let's look at some real filters for more interesting examples. Here you see
+ a filter that protects against some common annoyances that arise from JavaScript
+ abuse. Let's look at its jobs one after the other:
+
-
-
-
- # webbugs: Squish WebBugs (1x1 invisible GIFs used for user tracking)
- FILTER: webbugs
+
+FILTER: js-annoyances Get rid of particularly annoying JavaScript abuse
- s/<img\s+[^>]*?(width|height)\s*=\s*['"]?1\D[^>]*?(width|height)\s*=\s*['"]?1(\D[^>]*?)?>/<!-- Squished WebBug -->/sig
-
-
-
+# Get rid of JavaScript referrer tracking. Test page: http://www.randomoddness.com/untitled.htm
+#
+s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|Usg
+
+ Following the header line and a comment, you see the job. Note that it uses
+ | as the delimiter instead of /, because
+ the pattern contains a forward slash, which would otherwise have to be escaped
+ by a backslash (\).
+
-
-
-The +filter Action
- Filters are enabled with the +filter action from within
- one of the actions files. +filter requires one parameter, which
- should match one of the section identifiers in the filter file itself. Example:
+ Now, let's examine the pattern: it starts with the text <script.*
+ enclosed in parentheses. Since the dot matches any character, and *
+ means: Match an arbitrary number of the element left of myself, this
+ matches <script, followed by any text, i.e.
+ it matches the whole page, from the start of the first <script> tag.
-
- +filter{html-annoyances}
-
+
+ That's more than we want, but the pattern continues: document\.referrer
+ matches only the exact string document.referrer. The dot needed to
+ be escaped, i.e. preceded by a backslash, to take away its
+ special meaning as a joker, and make it just a regular dot. So far, the meaning is:
+ Match from the start of the first <script> tag in the page, up to, and including,
+ the text document.referrer, if both are present
+ in the page (and appear in that order).
+
- This would activate that particular filter. Similarly, +filter
- can be turned off for selected sites as:
- -filter{html-annoyances}. Remember
- too, all actions are off by default, unless they are explicitly enabled in one
- of the actions files.
+ But there's still more pattern to go. The next element, again enclosed in parentheses,
+ is .*</script>. You already know what .*
+ means, so the whole pattern translates to: Match from the start of the first <script>
+ tag in a page to the end of the last </script> tag, provided that the text
+ document.referrer appears somewhere in between.
-
+
+ This is still not the whole story, since we have ignored the options and the parentheses:
+ The portions of the page matched by sub-patterns that are enclosed in parentheses, will be
+ remembered and be available through the variables $1, $2, ... in
+ the substitute. The U option switches to ungreedy matching, which means
+ that the first .* in the pattern will only eat up all
+ text in between <script and the first occurrence
+ of document.referrer, and that the second .* will
+ only span the text up to the first </script>
+ tag. Furthermore, the s option says that the match may span
+ multiple lines in the page, and the g option again means that the
+ substitution is global.
+
+
+
+ So, to summarize, the pattern means: Match all scripts that contain the text
+ document.referrer. Remember the parts of the script from
+ (and including) the start tag up to (and excluding) the string
+ document.referrer as $1, and the part following
+ that string, up to and including the closing tag, as $2.
+
+
+
+ Now the pattern is deciphered, but wasn't this about substituting things? So
+ let's look at the substitute: $1"Not Your Business!"$2 is
+ easy to read: The text remembered as $1, followed by
+ "Not Your Business!" (including
+ the quotation marks!), followed by the text remembered as $2.
+ This produces an exact copy of the original string, with the middle part
+ (the document.referrer) replaced by "Not Your
+ Business!".
+
+
+
+ The whole job now reads: Replace document.referrer by
+ "Not Your Business!" wherever it appears inside a
+ <script> tag. Note that this job won't break JavaScript syntax,
+ since both the original and the replacement are syntactically valid
+ string objects. The script just won't have access to the referrer
+ information anymore.
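The difference between greedy and ungreedy matching is easiest to see on a small sample. The sketch below uses Python's re module rather than the filter engine itself, so it rests on a few assumptions: the U option is imitated by writing .*? explicitly, the s option by re.DOTALL, and the g option by re.sub's default of replacing all matches. The sample page is invented.

```python
import re

# Hypothetical page with two scripts that both read the referrer.
page = (
    "<script>var a = document.referrer;</script>\n"
    "<p>hi</p>\n"
    "<script>var b = document.referrer;</script>"
)

replacement = r'\1"Not Your Business!"\2'

# Greedy: the first .* runs on to the LAST document.referrer, so the whole
# page is one single match and only the last occurrence is replaced.
greedy = re.sub(
    r"(<script.*)document\.referrer(.*</script>)",
    replacement, page, flags=re.DOTALL,
)

# Ungreedy (what the U option asks for): each script is matched on its own,
# so every occurrence is replaced.
ungreedy = re.sub(
    r"(<script.*?)document\.referrer(.*?</script>)",
    replacement, page, flags=re.DOTALL,
)

print(greedy.count("document.referrer"))    # 1
print(ungreedy.count("document.referrer"))  # 0
```

With greedy matching one referrer access survives, which is why the job sets the U option.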
+
+
+
+ We'll show you two other jobs from the JavaScript taming department, but
+ this time only point out the constructs of special interest:
+
+
+
+
+# The status bar is for displaying link targets, not pointless blahblah
+#
+s/window\.status\s*=\s*['"].*?['"]/dUmMy=1/ig
+
+
+
+ \s stands for whitespace characters (space, tab, newline,
+ carriage return, form feed), so that \s* means: zero
+ or more whitespace. The ? in .*?
+ makes this matching of arbitrary text ungreedy. (Note that the U
+ option is not set). The ['"] construct means: a single
+ or a double quote.
+
+
+
+ So what does this job do? It replaces assignments of single- or double-quoted
+ strings to the window.status object with a dummy assignment
+ (using a variable name that is hopefully odd enough not to conflict with
+ real variables in scripts). Thus, it catches many cases where e.g. pointless
+ descriptions are displayed in the status bar instead of the link target when
+ you move your mouse over links.
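As a rough check of what this job does to a link, here is a small sketch using Python's re module as a stand-in for the filter engine. The sample link is made up, and the i option is assumed to correspond to re.IGNORECASE.

```python
import re

# Hypothetical link that writes a message into the status bar.
link = """<a href="/news" onmouseover="window.status='Click me!'; return true">News</a>"""

# Same pattern as in the job: optional whitespace around "=", a single or
# double quote, ungreedy arbitrary text, and the next single or double quote.
pattern = r"""window\.status\s*=\s*['"].*?['"]"""

cleaned = re.sub(pattern, "dUmMy=1", link, flags=re.IGNORECASE)

print(cleaned)
# <a href="/news" onmouseover="dUmMy=1; return true">News</a>
```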
+
+
+
+
+# Kill OnUnload popups. Yummy. Test: http://www.zdnet.com/zdsubs/yahoo/tree/yfs.html
+#
+s/(<body .*)onunload(.*>)/$1never$2/iU
+
+
+
+ Including the
+ OnUnload
+ event binding in the HTML DOM was a CRIME.
+ When I close a browser window, I want it to close and die. Basta.
+ This job replaces the onunload attribute in
+ <body> tags with the dummy word never.
+ Note that the i option makes the pattern matching
+ case-insensitive.
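The same job can be sketched with Python's re module, again only as an approximation of the filter engine: the U option is imitated with .*?, the i option with re.IGNORECASE, and the missing g option with count=1. The sample tag is invented.

```python
import re

# Hypothetical body tag that opens a pop-up when the window is closed.
tag = """<BODY bgcolor="white" OnUnload="window.open('ad.html')">"""

# $1 is everything from "<body " up to the attribute name,
# $2 is the rest of the tag up to the first ">".
tamed = re.sub(
    r"(<body .*?)onunload(.*?>)",
    r"\1never\2",
    tag, count=1, flags=re.IGNORECASE,
)

print(tamed)
# <BODY bgcolor="white" never="window.open('ad.html')">
```

The browser ignores the unknown never attribute, so the pop-up code is never run.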
+
+
+
+ The last example is from the fun department:
+
+
+
+
+FILTER: fun Fun text replacements
+
+# Spice the daily news:
+#
+s/microsoft(?!\.com)/MicroSuck/ig
+
+
+
+ Note the (?!\.com) part (a so-called negative lookahead)
+ in the job's pattern, which means: Don't match, if the string
+ .com appears directly following microsoft
+ in the page. This prevents links to microsoft.com from being messed up, while
+ still replacing the word everywhere else.
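A short sketch of the negative lookahead, using Python's re module in place of the filter engine (the sample sentence is made up, and re.IGNORECASE is assumed to play the role of the i option):

```python
import re

# Hypothetical text containing both the plain word and a host name.
text = "Microsoft announced a new product. See www.microsoft.com for details."

# (?!\.com) rejects a match when ".com" follows directly.
spiced = re.sub(r"microsoft(?!\.com)", "MicroSuck", text, flags=re.IGNORECASE)

print(spiced)
# MicroSuck announced a new product. See www.microsoft.com for details.
```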
+
+
+
+
+# Buzzword Bingo (example for extended regex syntax)
+#
+s* industry[ -]leading \
+| cutting[ -]edge \
+| award[ -]winning # Comments are OK, too! \
+| high[ -]performance \
+| solutions[ -]based \
+| unmatched \
+| unparalleled \
+| unrivalled \
+*<font color="red"><b>BINGO!</b></font> \
+*igx
+
+
+
+ The x option in this job turns on extended syntax, and allows for
+ e.g. the liberal use of (non-interpreted!) whitespace for nicer formatting.
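Python's re module has a comparable mode, re.VERBOSE, which makes it a convenient place to experiment with this style. The sketch below is an illustration under that assumption, not the filter engine itself; it uses a shortened buzzword list and an invented sample sentence.

```python
import re

# Whitespace and comments inside the pattern are ignored in verbose mode,
# except inside a character class such as [ -].
buzzwords = re.compile(
    r"""
      industry[ -]leading
    | cutting[ -]edge
    | award[ -]winning      # comments are allowed here
    | high[ -]performance
    """,
    re.IGNORECASE | re.VERBOSE,
)

bingo = '<font color="red"><b>BINGO!</b></font>'
text = "Our award-winning, Cutting Edge product"

print(buzzwords.sub(bingo, text))
```

Note that the space inside [ -] still matches a literal space, which is why "Cutting Edge" is caught as well as "award-winning".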
+
+
+ You get the idea?
+
+
@@ -6521,6 +6891,12 @@ Requests
Temple Place - Suite 330, Boston, MA 02111-1307, USA.
$Log: user-manual.sgml,v $
+ Revision 1.115 2002/05/16 16:25:00 oes
+ Extended the Filter File chapter & minor fixes
+
+ Revision 1.114 2002/05/16 09:42:50 oes
+ More ulink->link, added some hints to Quickstart section
+
Revision 1.113 2002/05/15 21:07:25 oes
Extended and further commented the example actions files