+
+ <div class="SECT2">
+ <h2 class="SECT2"><a name="AF-PATTERNS" id="AF-PATTERNS">8.4.
+ Patterns</a></h2>
+
+ <p>As mentioned, <span class="APPLICATION">Privoxy</span> uses
+ <span class="QUOTE">"patterns"</span> to determine what <span class=
+ "emphasis"><i class="EMPHASIS">actions</i></span> might apply to which
+ sites and pages your browser attempts to access. These <span class=
+ "QUOTE">"patterns"</span> use wild card type <span class=
+ "emphasis"><i class="EMPHASIS">pattern</i></span> matching to achieve a
+ high degree of flexibility. This allows one expression to be expanded
+ and potentially match against many similar patterns.</p>
+
+ <p>Generally, a URL pattern has the form <tt class=
+ "LITERAL">&lt;host&gt;&lt;port&gt;/&lt;path&gt;</tt>, where the
+ <tt class="LITERAL">&lt;host&gt;</tt>, the <tt class=
+ "LITERAL">&lt;port&gt;</tt> and the <tt class=
+ "LITERAL">&lt;path&gt;</tt> are optional. (This is why the special
+ <tt class="LITERAL">/</tt> pattern matches all URLs). Note that the
+ protocol portion of the URL pattern (e.g. <tt class=
+ "LITERAL">http://</tt>) should <span class="emphasis"><i class=
+ "EMPHASIS">not</i></span> be included in the pattern. This is assumed
+ already!</p>
+
+ <p>The pattern matching syntax is different for the host and path parts
+ of the URL. The host part uses a simple globbing type matching
+ technique, while the path part uses more flexible <a href=
+ "http://en.wikipedia.org/wiki/Regular_expressions" target=
+ "_top"><span class="QUOTE">"Regular Expressions"</span></a> (POSIX
+ 1003.2).</p>
+
+ <p>The port part of a pattern is a decimal port number preceded by a
+ colon (<tt class="LITERAL">:</tt>). If the host part contains a
+ numerical IPv6 address, it has to be put into angle brackets
+ (<tt class="LITERAL">&lt;</tt>, <tt class="LITERAL">&gt;</tt>).</p>
+
+ <div class="VARIABLELIST">
+ <dl>
+ <dt><tt class="LITERAL">www.example.com/</tt></dt>
+
+ <dd>
+ <p>is a host-only pattern and will match any request to
+ <tt class="LITERAL">www.example.com</tt>, regardless of which
+ document on that server is requested. So ALL pages in this domain
+ would be covered by the scope of this action. Note that a simple
+ <tt class="LITERAL">example.com</tt> is different and would NOT
+ match.</p>
+ </dd>
+
+ <dt><tt class="LITERAL">www.example.com</tt></dt>
+
+ <dd>
+ <p>means exactly the same. For host-only patterns, the trailing
+ <tt class="LITERAL">/</tt> may be omitted.</p>
+ </dd>
+
+ <dt><tt class="LITERAL">www.example.com/index.html</tt></dt>
+
+ <dd>
+ <p>matches all the documents on <tt class=
+ "LITERAL">www.example.com</tt> whose name starts with <tt class=
+ "LITERAL">/index.html</tt>.</p>
+ </dd>
+
+ <dt><tt class="LITERAL">www.example.com/index.html$</tt></dt>
+
+ <dd>
+ <p>matches only the single document <tt class=
+ "LITERAL">/index.html</tt> on <tt class=
+ "LITERAL">www.example.com</tt>.</p>
+ </dd>
+
+ <dt><tt class="LITERAL">/index.html$</tt></dt>
+
+ <dd>
+ <p>matches the document <tt class="LITERAL">/index.html</tt>,
+ regardless of the domain, i.e. on <span class=
+ "emphasis"><i class="EMPHASIS">any</i></span> web server
+ anywhere.</p>
+ </dd>
+
+ <dt><tt class="LITERAL">/</tt></dt>
+
+ <dd>
+ <p>Matches any URL because there's no requirement for either the
+ domain or the path to match anything.</p>
+ </dd>
+
+ <dt><tt class="LITERAL">:8000/</tt></dt>
+
+ <dd>
+ <p>Matches any URL pointing to TCP port 8000.</p>
+ </dd>
+
+ <dt><tt class="LITERAL">10.0.0.1/</tt></dt>
+
+ <dd>
+ <p>Matches any URL with the host address <tt class=
+ "LITERAL">10.0.0.1</tt>.</p>
+ </dd>
+
+ <dt><tt class="LITERAL">&lt;2001:db8::1&gt;/</tt></dt>
+
+ <dd>
+ <p>Matches any URL with the host address <tt class=
+ "LITERAL">2001:db8::1</tt>. (Note that the real URL uses square
+ brackets, not angle brackets.)</p>
+ </dd>
+
+ <dt><tt class="LITERAL">index.html</tt></dt>
+
+ <dd>
+ <p>matches nothing, since it would be interpreted as a domain
+ name and there is no top-level domain called <tt class=
+ "LITERAL">.html</tt>. So it's a mistake.</p>
+ </dd>
+ </dl>
+ </div>
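+
+ <p>As a rough sketch, the URL pattern forms listed above could be
+ combined in one actions file section like the following. The hosts
+ are placeholders, the reason text is made up for illustration, and
+ the example assumes a <span class="APPLICATION">Privoxy</span>
+ version in which <tt class="LITERAL">+block</tt> accepts a reason
+ parameter:</p>
+
+ <pre class="SCREEN">
+ # Hypothetical section; hosts and reason text are examples only.
+ {+block{Illustrates the URL pattern forms.}}
+ # Host-only pattern: every document on www.example.com
+ www.example.com
+ # Only the single document /index.html on www.example.org
+ www.example.org/index.html$
+ # Any host, but only requests to TCP port 8000
+ :8000/
+ # Numerical IPv6 address, written in angle brackets
+ &lt;2001:db8::1&gt;/
+ </pre>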
+
+ <div class="SECT3">
+ <h3 class="SECT3"><a name="HOST-PATTERN" id="HOST-PATTERN">8.4.1. The
+ Host Pattern</a></h3>
+
+ <p>The matching of the host part offers some flexible options: if the
+ host pattern starts or ends with a dot, it becomes unanchored at that
+ end. The host pattern is often referred to as a domain pattern, as it is
+ usually used to match domain names and not IP addresses. For
+ example:</p>
+
+ <div class="VARIABLELIST">
+ <dl>
+ <dt><tt class="LITERAL">.example.com</tt></dt>
+
+ <dd>
+ <p>matches any domain with first-level domain <tt class=
+ "LITERAL">com</tt> and second-level domain <tt class=
+ "LITERAL">example</tt>. For example <tt class=
+ "LITERAL">www.example.com</tt>, <tt class=
+ "LITERAL">example.com</tt> and <tt class=
+ "LITERAL">foo.bar.baz.example.com</tt>. Note that it wouldn't
+ match if the second-level domain was <tt class=
+ "LITERAL">another-example</tt>.</p>
+ </dd>
+
+ <dt><tt class="LITERAL">www.</tt></dt>
+
+ <dd>
+ <p>matches any domain that <span class="emphasis"><i class=
+ "EMPHASIS">STARTS</i></span> with <tt class="LITERAL">www.</tt>
+ (It also matches the domain <tt class="LITERAL">www</tt> but
+ most of the time that doesn't matter.)</p>
+ </dd>
+
+ <dt><tt class="LITERAL">.example.</tt></dt>
+
+ <dd>
+ <p>matches any domain that <span class="emphasis"><i class=
+ "EMPHASIS">CONTAINS</i></span> <tt class=
+ "LITERAL">.example.</tt>. And, by the way, also included would
+ be any files or documents that exist within that domain since
+ no path limitations are specified. (Correctly speaking: It
+ matches any FQDN that contains <tt class="LITERAL">example</tt>
+ as a domain.) This might be <tt class=
+ "LITERAL">www.example.com</tt>, <tt class=
+ "LITERAL">news.example.de</tt>, or <tt class=
+ "LITERAL">www.example.net/cgi/testing.pl</tt> for instance. All
+ these cases are matched.</p>
+ </dd>
+ </dl>
+ </div>
+
+ <p>Additionally, there are wild-cards that you can use in the domain
+ names themselves. These work similarly to shell globbing type
+ wild-cards: <span class="QUOTE">"*"</span> represents zero or more
+ arbitrary characters (this is equivalent to the <a href=
+ "http://en.wikipedia.org/wiki/Regular_expressions" target=
+ "_top"><span class="QUOTE">"Regular Expression"</span></a> based
+ syntax of <span class="QUOTE">".*"</span>), <span class=
+ "QUOTE">"?"</span> represents any single character (this is
+ equivalent to the regular expression syntax of a simple <span class=
+ "QUOTE">"."</span>), and you can define <span class=
+ "QUOTE">"character classes"</span> in square brackets which is
+ similar to the same regular expression technique. All of this can be
+ freely mixed:</p>
+
+ <div class="VARIABLELIST">
+ <dl>
+ <dt><tt class="LITERAL">ad*.example.com</tt></dt>
+
+ <dd>
+ <p>matches <span class="QUOTE">"adserver.example.com"</span>,
+ <span class="QUOTE">"ads.example.com"</span>, etc., but not
+ <span class="QUOTE">"sfads.example.com"</span>.</p>
+ </dd>
+
+ <dt><tt class="LITERAL">*ad*.example.com</tt></dt>
+
+ <dd>
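+ <p>matches all of the above, and also hosts such as <span class=
+ "QUOTE">"sfads.example.com"</span> whose first label merely
+ contains <span class="QUOTE">"ad"</span>.</p>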
+ </dd>
+
+ <dt><tt class="LITERAL">.?pix.com</tt></dt>
+
+ <dd>
+ <p>matches <tt class="LITERAL">www.ipix.com</tt>, <tt class=
+ "LITERAL">pictures.epix.com</tt>, <tt class=
+ "LITERAL">a.b.c.d.e.upix.com</tt> etc.</p>
+ </dd>
+
+ <dt><tt class="LITERAL">www[1-9a-ez].example.c*</tt></dt>
+
+ <dd>
+ <p>matches <tt class="LITERAL">www1.example.com</tt>,
+ <tt class="LITERAL">www4.example.cc</tt>, <tt class=
+ "LITERAL">wwwd.example.cy</tt>, <tt class=
+ "LITERAL">wwwz.example.com</tt> etc., but <span class=
+ "emphasis"><i class="EMPHASIS">not</i></span> <tt class=
+ "LITERAL">wwww.example.com</tt>.</p>
+ </dd>
+ </dl>
+ </div>
+
+ <p>While flexible, host patterns do not offer the full power of
+ regular expression syntax.</p>
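+
+ <p>A minimal sketch of a section that relies on host patterns only
+ might look like this. The domains are placeholders, and the reason
+ text assumes a <tt class="LITERAL">+block</tt> action that accepts a
+ parameter:</p>
+
+ <pre class="SCREEN">
+ # Hypothetical section; domains are examples only.
+ {+block{Hosts that look like ad servers.}}
+ # Unanchored on the left: example.net and every host below it
+ .example.net
+ # ads.example.com, adserver.example.com, but not sfads.example.com
+ ad*.example.com
+ # www1.example.org through www9.example.org
+ www[1-9].example.org
+ </pre>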
+ </div>
+
+ <div class="SECT3">
+ <h3 class="SECT3"><a name="AEN2914" id="AEN2914">8.4.2. The Path
+ Pattern</a></h3>
+
+ <p><span class="APPLICATION">Privoxy</span> uses <span class=
+ "QUOTE">"modern"</span> POSIX 1003.2 <a href=
+ "http://en.wikipedia.org/wiki/Regular_expressions" target=
+ "_top"><span class="QUOTE">"Regular Expressions"</span></a> for
+ matching the path portion (after the slash), and is thus more
+ flexible.</p>
+
+ <p>There is an <a href="appendix.html#REGEX">Appendix</a> with a
+ brief quick-start into regular expressions. You might also want to
+ have a look at your operating system's documentation on regular
+ expressions (try <tt class="LITERAL">man re_format</tt>).</p>
+
+ <p>Note that the path pattern is automatically left-anchored at the
+ <span class="QUOTE">"/"</span>, i.e. it matches as if it started
+ with a <span class="QUOTE">"^"</span> (regular expression speak for
+ the beginning of a line).</p>
+
+ <p>Please also note that matching in the path is <span class=
+ "emphasis"><i class="EMPHASIS">CASE INSENSITIVE</i></span> by
+ default, but you can switch to case sensitive at any point in the
+ pattern by using the <span class="QUOTE">"(?-i)"</span> switch:
+ <tt class="LITERAL">www.example.com/(?-i)PaTtErN.*</tt> will match
+ only documents whose path starts with <tt class=
+ "LITERAL">PaTtErN</tt> in <span class="emphasis"><i class=
+ "EMPHASIS">exactly</i></span> this capitalization.</p>
+
+ <div class="VARIABLELIST">
+ <dl>
+ <dt><tt class="LITERAL">.example.com/.*</tt></dt>
+
+ <dd>
+ <p>Is equivalent to just <span class=
+ "QUOTE">".example.com"</span>, since any documents within that
+ domain are matched with or without the <span class=
+ "QUOTE">".*"</span> regular expression. The trailing <span class=
+ "QUOTE">".*"</span> is therefore redundant.</p>
+ </dd>
+
+ <dt><tt class="LITERAL">.example.com/.*/index.html$</tt></dt>
+
+ <dd>
+ <p>Will match any page in the domain of <span class=
+ "QUOTE">"example.com"</span> that is named <span class=
+ "QUOTE">"index.html"</span>, and that is part of some path. For
+ example, it matches <span class=
+ "QUOTE">"www.example.com/testing/index.html"</span> but NOT
+ <span class="QUOTE">"www.example.com/index.html"</span> because
+ the regular expression called for at least two <span class=
+ "QUOTE">"/'s"</span>, thus the path requirement. It also would
+ match <span class=
+ "QUOTE">"www.example.com/testing/index_html"</span>, because of
+ the special meta-character <span class="QUOTE">"."</span>.</p>
+ </dd>
+
+ <dt><tt class="LITERAL">.example.com/(.*/)?index\.html$</tt></dt>
+
+ <dd>
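+ <p>Here the directory part <span class="QUOTE">"(.*/)"</span> is
+ made optional by the <span class="QUOTE">"?"</span>, so this
+ pattern matches any page named <span class=
+ "QUOTE">"index.html"</span> regardless of how many <span class=
+ "QUOTE">"/'s"</span> precede it, including <span class=
+ "QUOTE">"www.example.com/index.html"</span>. Because the dot is
+ escaped, the name must contain exactly <span class=
+ "QUOTE">".html"</span>, and because of the <span class=
+ "QUOTE">"$"</span> the path has to end with it.</p>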
+ </dd>
+
+ <dt><tt class=
+ "LITERAL">.example.com/(.*/)(ads|banners?|junk)</tt></dt>
+
+ <dd>
+ <p>This regular expression will match any path of <span class=
+ "QUOTE">"example.com"</span> in which one of the words
+ <span class="QUOTE">"ads"</span>, <span class=
+ "QUOTE">"banner"</span>, <span class="QUOTE">"banners"</span>
+ (because of the <span class="QUOTE">"?"</span>) or <span class=
+ "QUOTE">"junk"</span> directly follows a slash other than the
+ leading one. The path does not have to end in these words;
+ anything may follow them.</p>
+ </dd>
+
+ <dt><tt class=
+ "LITERAL">.example.com/(.*/)(ads|banners?|junk)/.*\.(jpe?g|gif|png)$</tt></dt>
+
+ <dd>
+ <p>This is very much the same as above, except now it must end
+ in either <span class="QUOTE">".jpg"</span>, <span class=
+ "QUOTE">".jpeg"</span>, <span class="QUOTE">".gif"</span> or
+ <span class="QUOTE">".png"</span>. So this one is limited to
+ common image formats.</p>
+ </dd>
+ </dl>
+ </div>
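+
+ <p>Put into an actions file, two of the path patterns discussed in
+ this section could be used roughly as follows. This is only a
+ sketch: the reason text is invented, and it assumes a <tt class=
+ "LITERAL">+block</tt> action that accepts a parameter:</p>
+
+ <pre class="SCREEN">
+ # Hypothetical section built from the patterns discussed above.
+ {+block{Banner images.}}
+ # Image files below an ads, banner(s) or junk directory
+ .example.com/(.*/)(ads|banners?|junk)/.*\.(jpe?g|gif|png)$
+ # Case sensitive from the (?-i) switch onwards
+ www.example.com/(?-i)PaTtErN.*
+ </pre>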
+
+ <p>There are many, many good examples to be found in <tt class=
+ "FILENAME">default.action</tt>, and more tutorials below in <a href=
+ "appendix.html#REGEX">Appendix on regular expressions</a>.</p>
+ </div>
+
+ <div class="SECT3">
+ <h3 class="SECT3"><a name="TAG-PATTERN" id="TAG-PATTERN">8.4.3. The
+ Tag Pattern</a></h3>
+
+ <p>Tag patterns are used to change the applying actions based on the
+ request's tags. Tags can be created with either the <a href=
+ "actions-file.html#CLIENT-HEADER-TAGGER">client-header-tagger</a> or
+ the <a href=
+ "actions-file.html#SERVER-HEADER-TAGGER">server-header-tagger</a>
+ action.</p>
+
+ <p>Tag patterns have to start with <span class="QUOTE">"TAG:"</span>,
+ so <span class="APPLICATION">Privoxy</span> can tell them apart from
+ URL patterns. Everything after the colon, including white space, is
+ interpreted as a regular expression with path pattern syntax, except
+ that tag patterns aren't left-anchored automatically (<span class=
+ "APPLICATION">Privoxy</span> doesn't silently add a <span class=
+ "QUOTE">"^"</span>, you have to do it yourself if you need it).</p>
+
+ <p>To match all requests that are tagged with <span class=
+ "QUOTE">"foo"</span>, your pattern line should be <span class=
+ "QUOTE">"TAG:^foo$"</span>. <span class="QUOTE">"TAG:foo"</span>
+ would work as well, but it would also match requests whose tags
+ contain <span class="QUOTE">"foo"</span> somewhere. <span class=
+ "QUOTE">"TAG: foo"</span> wouldn't work, as it requires white
+ space before the <span class="QUOTE">"foo"</span>.</p>
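+
+ <p>The following sketch shows how a tagger and a tag pattern might
+ work together. It assumes that a client-header tagger named
+ <tt class="LITERAL">user-agent</tt>, which tags each request with
+ its complete User-Agent header, is defined in one of your filter
+ files. The client name in the tag pattern is only an example:</p>
+
+ <pre class="SCREEN">
+ # Assumption: a tagger called "user-agent" exists in a filter file
+ # and creates tags of the form "User-Agent: ...".
+ {+client-header-tagger{user-agent}}
+ /
+
+ # Tagging alone changes nothing; this section applies only to
+ # requests whose tag matches the (left-anchored) pattern.
+ {-hide-user-agent}
+ TAG:^User-Agent: fetch libfetch/
+ </pre>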
+
+ <p>Sections can contain URL and tag patterns at the same time, but
+ tag patterns are checked after the URL patterns and thus always
+ overrule them, even if they are located before the URL patterns.</p>
+
+ <p>Once a new tag is added, Privoxy checks right away if it's matched
+ by one of the tag patterns and updates the action settings
+ accordingly. As a result tags can be used to activate other tagger
+ actions, as long as these other taggers look for headers that haven't
+ already been parsed.</p>
+
+ <p>For example you could tag client requests which use the <tt class=
+ "LITERAL">POST</tt> method, then use this tag to activate another
+ tagger that adds a tag if cookies are sent, and then use a block
+ action based on the cookie tag. This allows the outcome of one
+ action to serve as input for a subsequent action. However, if you
+ reversed the position of the described taggers and activated the
+ method tagger based on the cookie tagger, no method tags would be
+ created. The method tagger would look for the request line, but at
+ the time the cookie tag is created, the request line has already been
+ parsed.</p>
+
+ <p>While this is a limitation you should be aware of, this kind of
+ indirection is seldom needed anyway and even the example doesn't make
+ too much sense.</p>
+ </div>
+
+ <div class="SECT3">
+ <h3 class="SECT3"><a name="NEGATIVE-TAG-PATTERNS" id=
+ "NEGATIVE-TAG-PATTERNS">8.4.4. The Negative Tag Patterns</a></h3>
+
+ <p>To match requests that do not have a certain tag, specify a
+ negative tag pattern by prefixing the tag pattern line with either
+ <span class="QUOTE">"NO-REQUEST-TAG:"</span> or <span class=
+ "QUOTE">"NO-RESPONSE-TAG:"</span> instead of <span class=
+ "QUOTE">"TAG:"</span>.</p>
+
+ <p>Negative tag patterns created with <span class=
+ "QUOTE">"NO-REQUEST-TAG:"</span> are checked after all client headers
+ are scanned; the ones created with <span class=
+ "QUOTE">"NO-RESPONSE-TAG:"</span> are checked after all server
+ headers are scanned. In both cases all the created tags are
+ considered.</p>
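+
+ <p>As an illustration, a negative tag pattern could be used to act
+ on requests that arrive without a particular header. The sketch
+ below again assumes a client-header tagger named <tt class=
+ "LITERAL">user-agent</tt> that creates tags starting with
+ <span class="QUOTE">"User-Agent:"</span>; the reason text is
+ invented:</p>
+
+ <pre class="SCREEN">
+ # Assumption: a tagger called "user-agent" exists in a filter file.
+ {+client-header-tagger{user-agent}}
+ /
+
+ # Checked after all client headers have been scanned; applies to
+ # requests for which no tag starting with "User-Agent:" was created.
+ {+block{No User-Agent header sent.}}
+ NO-REQUEST-TAG:^User-Agent:
+ </pre>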
+ </div>
+ </div>
+
+ <div class="SECT2">
+ <h2 class="SECT2"><a name="ACTIONS" id="ACTIONS">8.5. Actions</a></h2>
+
+ <p>All actions are disabled by default, until they are explicitly
+ enabled somewhere in an actions file. Actions are turned on if preceded
+ with a <span class="QUOTE">"+"</span>, and turned off if preceded with
+ a <span class="QUOTE">"-"</span>. So a <tt class="LITERAL">+action</tt>
+ means <span class="QUOTE">"do that action"</span>, e.g. <tt class=
+ "LITERAL">+block</tt> means <span class="QUOTE">"please block URLs that
+ match the following patterns"</span>, and <tt class=
+ "LITERAL">-block</tt> means <span class="QUOTE">"don't block URLs that
+ match the following patterns, even if <tt class="LITERAL">+block</tt>
+ previously applied."</span></p>
+
+ <p>Again, actions are invoked by placing them on a line, enclosed in
+ curly braces and separated by whitespace, like in <tt class=
+ "LITERAL">{+some-action -some-other-action{some-parameter}}</tt>,
+ followed by a list of URL patterns, one per line, to which they apply.
+ Together, the actions line and the following pattern lines make up a
+ section of the actions file.</p>
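+
+ <p>A short sketch of two such sections is shown below. The hosts and
+ the reason text are placeholders, and the example assumes that
+ <tt class="LITERAL">+block</tt> accepts a reason parameter:</p>
+
+ <pre class="SCREEN">
+ # Actions line: enable two actions for the patterns that follow.
+ {+block{Banner ads.} +handle-as-image}
+ # Pattern lines, one per line.
+ ads.example.com
+ .example.com/banners/
+
+ # A later section can switch an action off again for selected URLs.
+ {-block}
+ ads.example.com/privacy-policy.html
+ </pre>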
+
+ <p>Actions fall into three categories:</p>
+