From: hal9 Date: Thu, 27 Sep 2001 23:50:29 +0000 (+0000) Subject: A few changes. A short section on regular expression in appendix. X-Git-Tag: v_2_9_9~31 X-Git-Url: http://www.privoxy.org/gitweb/?a=commitdiff_plain;h=85a498648ac6227df9b1121194e42ce98731ee81;p=privoxy.git A few changes. A short section on regular expression in appendix. --- diff --git a/doc/source/user-manual.sgml b/doc/source/user-manual.sgml index 8fec37f6..dae003d9 100644 --- a/doc/source/user-manual.sgml +++ b/doc/source/user-manual.sgml @@ -7,7 +7,7 @@ This file belongs into ijbswa.sourceforge.net:/home/groups/i/ij/ijbswa/htdocs/ - $Id: user-manual.sgml,v 1.7 2001/09/24 14:31:36 hal9 Exp $ + $Id: user-manual.sgml,v 1.8 2001/09/25 00:34:59 hal9 Exp $ Written by and Copyright (C) 2001 the SourceForge IJBSWA team. http://ijbswa.sourceforge.net @@ -30,7 +30,7 @@ Hal Burgiss Junkbuster User Manual -$Id: user-manual.sgml,v 1.7 2001/09/24 14:31:36 hal9 Exp $ +$Id: user-manual.sgml,v 1.8 2001/09/25 00:34:59 hal9 Exp $ @@ -83,10 +83,67 @@ You can find the latest version of the user manual at Junkbuster Configuration - For Unix and Linux, all configuraton files are located in + For Unix, *BSD and Linux, all configuraton files are located in /etc/junkbuster/ by default. For MS Windows and OS/2, these are all in the same directory as the Junkbuster executable. The name and number of @@ -344,7 +401,7 @@ configuration section below. The main configuration file is named config - on Linux, Unix, and OS/2, and junkbustr.txt on + on Linux, Unix, BSD, and OS/2, and junkbustr.txt on Windows. @@ -382,7 +439,7 @@ configuration section below. The Main Configuration File Again, the main configuration file is named config on - Linux/Unix and OS/2, and junkbustr.txt on Windows. + Linux/Unix/BSD and OS/2, and junkbustr.txt on Windows. Configuration lines consist of an initial keyword followed by a list of values, all separated by whitespace (any number of spaces or tabs). 
For example: @@ -2445,16 +2502,18 @@ Removed references to Win32. HB 09/23/01 The included default configuration files should give a reasonable starting - point, though may be aggressive in blocking junk. You will probably want to - keep an eye out for sites that require cookies, and add these to - actionsfile as needed. By default, most of these will be - blocked until you add them to the configuration. If you want the browser to - handle this, you will need to edit actionsfile and - disable this feature. + point, though may be somewhat aggressive in blocking junk. You will probably + want to keep an eye out for sites that require cookies, and add these to + actionsfile as needed. By default, most of these will + be blocked until you add them to the configuration. If you want the browser + to handle this, you will need to edit actionsfile and + disable this feature. If you use more than one browser, it would make more + sense to let Junkbuster handle this. In which + case, the browser(s) should be set to accept all cookies. - If you enter counter problems, please verify it is a + If you encounter problems, please verify it is a Junkbuster bug, by disabling Junkbuster, and then trying the same page. Before reporting it as a bug, see if there is not a configuration @@ -2474,8 +2533,8 @@ To be filled. mention the support forums as the primary channel of communication (bugs, feature requests, etc.) --> Feature requests and other questions should be posted to the Support Forums at - SourceForge. There is also an archive there. + url="http://sourceforge.net/tracker/?atid=361118&group_id=11118&func=browse">Feature + request page at SourceForge. There is also an archive there. @@ -2488,6 +2547,9 @@ communication (bugs, feature requests, etc.) Please report bugs, using the form at Sourceforge. + Please try to verify that it is a Junkbuster bug, + and not a browser or site bug first. Also, check to make sure this is not + already a known bug. 
@@ -2532,7 +2594,7 @@ communication (bugs, feature requests, etc.) Waldherr made many improvements, and started the SourceForge project to rekindle development. The last stable release was v2.0.2, which has now - grown whiskers ;-), + grown whiskers ;-). @@ -2555,7 +2617,228 @@ communication (bugs, feature requests, etc.) Regular Expressions - Some expressions are regular, and some are not. + Junkbuster can use regular expressions + in various config files, assuming support for pcre (Perl + Compatible Regular Expressions) is compiled in, which is the default. Such + configuration directives do not require regular expressions, but they can be + used to increase flexibility by matching a pattern with wildcards against + URLs. + + + + If you are reading this, you probably don't understand what regular + expressions are, or what they can do. So this will be a very brief + introduction only. A full explanation would require a book ;-) + + + + Regular expressions are a way of matching one character + expression against another to see if it matches or not. One of the + expressions is a literal string of readable characters + (letters, numbers, etc.), and the other is a complex string of literal + characters combined with wildcards, and other special characters, called + metacharacters. The metacharacters have special meanings and + are used to build the complex pattern to be matched against. Perl Compatible + Regular Expressions is an enhanced form of the regular expression language + with backward compatibility. + + + + To make a simple analogy, we do something similar when we use wildcard + characters when listing files with the dir command in DOS. + *.* matches all filenames. The special + character here is the asterisk which matches any and all characters. We can be + more specific and use ? to match just individual + characters. So dir file?.txt would match + file1.txt, file2.txt, etc. We are pattern + matching, using a similar technique to regular expressions! 
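The wildcard analogy can be tried out with Python's standard `fnmatch` module, which implements shell-style `*` and `?` wildcards. This is only a sketch: the helper name `matching_names` and the sample filenames are made up for illustration, and `fnmatch` is not the DOS `dir` command, so edge cases (such as how `*.*` treats names without a dot) may differ.

```python
from fnmatch import fnmatchcase


def matching_names(pattern, names):
    """Hypothetical helper: keep the names that match a shell-style wildcard."""
    # fnmatchcase compares case-sensitively on every platform, which keeps
    # the result the same on Unix and Windows.
    return [name for name in names if fnmatchcase(name, pattern)]


# Made-up directory listing, for illustration only.
names = ["file1.txt", "file2.txt", "file10.txt", "notes.txt"]

# '?' stands for exactly one character, so file10.txt is left out.
print(matching_names("file?.txt", names))

# '*' stands for any run of characters, including none.
print(matching_names("*.txt", names))
```

Regular expressions express the same ideas with different symbols: the wildcard `?` corresponds roughly to the regular expression `.`, and the wildcard `*` to `.*`.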
+ + + + Regular expressions do essentially the same thing, but are much, much more + powerful. There are many more special characters and ways of + building complex patterns, however. Let's look at a few of the common ones, + and then some examples: + + + + + . - Matches any single character, e.g. a, + A, 4, :, or @. + + + + + + ? - The preceding character or expression is matched ZERO or ONE + times. Either/or. + + + + + + + - The preceding character or expression is matched ONE or MORE + times. + + + + + + * - The preceding character or expression is matched ZERO or MORE + times. + + + + + + \ - The escape character denotes that + the following character should be taken literally. This is used where one of the + special characters (e.g. .) needs to be taken literally and + not as a special metacharacter. + + + + + + [] - Characters enclosed in brackets will be matched if + any of the enclosed characters are encountered. + + + + + + () - Parentheses are used to group a sub-expression, + or multiple sub-expressions. + + + + + + | - The bar character works like an + or conditional statement. A match is successful if the + sub-expression on either side of | matches. + + + + + + s/string1/string2/g - This is used to rewrite strings of text. + string1 is replaced by string2 in this + example. + + + + + These are just some of the ones you are likely to use when matching URLs with + Junkbuster, and are a long way from a definitive + list. This is enough to get us started with a few simple examples which may + be more illuminating: + + + + /.*/banners/.* - A simple example + that uses the common combination of . and * to + denote any character, zero or more times. In other words, any string at all. + So we start with a literal forward slash, then our regular expression pattern + (.*), another literal forward slash, the string + banners, another forward slash, and lastly another + .*. We are building + a directory path here. 
This will match any file whose path has a + directory named banners in it. The .* matches + any characters, and this could conceivably be more forward slashes, so it + might expand into a much longer looking path. For example, this could match: + /eye/hate/spammers/banners/annoy_me_please.gif, or just + /banners/annoying.html, or almost an infinite number of other + possible combinations, just so it has banners in the path + somewhere. + + + + And now something a little more complex: + + + + /.*/adv((er)?ts?|ertis(ing|ements?))?/ - + We have several literal forward slashes again (/), so we are + building another expression that is a file path statement. We have another + .*, so we are matching against any conceivable sub-path, just so + it matches our expression. The only true literal that must + match our pattern is adv, together with + the forward slashes. What comes after the adv string is the + interesting part. + + + + Remember the ? means the preceding expression (either a + literal character or anything grouped with (...) in this case) + can exist or not, since this means either zero or one match. So + ((er)?ts?|ertis(ing|ements?)) is optional, as are the + sub-expression (er) and each s that is + followed by a ?. The | + means or. We have two of those. For instance, + (ing|ements?) can expand to match either ing + OR ements?. What is being done here is an + attempt at matching as many variations of advertisement, and + similar, as possible. So this would expand to match just adv, + or advert, or adverts, or + advertising, or advertisement, or + advertisements. You get the idea. But it would not match + advertizements (with a z). We could fix that by + changing our regular expression to: + /.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/, which would then match + either spelling. + + + + /.*/advert[0-9]+\.(gif|jpe?g) - Again + another path statement with forward slashes. Anything in the square brackets + [] can be matched. 
This is using 0-9 as a + shorthand expression to mean any digit zero through nine. It is the same as + saying 0123456789. So any digit matches. The + + means one or more of the preceding expression must be included. The preceding + expression here is what is in the square brackets -- in this case, any digit + zero through nine. Then, at the end, we have a grouping: (gif|jpe?g). + This includes a |, so it needs to match the expression on + one side or the other of that bar character. A simple gif on one side, and the other + side will in turn match either jpeg or jpg, + since the ? means the letter e is optional and + can be matched once or not at all. So we are building an expression here to + match a GIF or JPEG image file. It must include the literal + string advert, then one or more digits, and a . + (which is now a literal, and not a special character, since it is escaped + with \), and lastly either gif, or + jpeg, or jpg. Some possible matches would + include: //advert1.jpg, + /nasty/ads/advert1234.gif, + /banners/from/hell/advert99.jpg. It would not match + advert1.gif (no leading slash), or + /adverts232.jpg (the expression does not include an + s), or /advert1.jsp (jsp is not + in the expression anywhere). + + + + s/microsoft(?!\.com)/MicroSuck/i - This is + a substitution. MicroSuck will replace any occurrence of + microsoft. The i at the end of the expression + means ignore case. The (?!\.com) means + the match should fail if microsoft is followed by + .com. In other words, this acts like a NOT + modifier. In case this is a hyperlink, we don't want to break it ;-). + + + + We are barely scratching the surface of regular expressions here so that you + can understand the default Junkbuster + configuration files, and maybe use this knowledge to customize your own + installation. There is much, much more that can be done with regular + expressions. 
Now that you know enough to get started, you can learn more on + your own :/ + + + + More reading on Perl Compatible Regular expressions: + http://www.perldoc.com/perl5.6/pod/perlre.html @@ -2583,7 +2866,14 @@ communication (bugs, feature requests, etc.) Temple Place - Suite 330, Boston, MA 02111-1307, USA. $Log: user-manual.sgml,v $ +<<<<<<< user-manual.sgml + +======= + Revision 1.8 2001/09/25 00:34:59 hal9 + Some additions, and re-arranging. + +>>>>>>> 1.8 Revision 1.7 2001/09/24 14:31:36 hal9 Diddling.
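The example patterns from the regular expression appendix added in this commit can be exercised with a short script. This is a minimal sketch using Python's `re` module, which is not pcre but treats the metacharacters used here the same way. The helper names `path_matches` and `replace_microsoft`, the `/images/` prefix, and the choice to anchor each pattern at the start of the path are assumptions made for illustration, not Junkbuster's documented behaviour.

```python
import re

# Patterns copied from the appendix text.
BANNERS = r"/.*/banners/.*"
ADV = r"/.*/adv((er)?ts?|ertis(ing|ements?))?/"
ADV_S_OR_Z = r"/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/"
ADVERT_IMAGE = r"/.*/advert[0-9]+\.(gif|jpe?g)"


def path_matches(pattern, path):
    """Hypothetical helper: True if the pattern matches from the start of path."""
    # Assumption: the pattern is anchored at the start of the path only.
    return re.match(pattern, path) is not None


def replace_microsoft(text):
    """Hypothetical helper mirroring the substitution example."""
    # The dot is escaped so that only a literal ".com" suppresses the
    # replacement; re.IGNORECASE plays the role of the trailing "i".
    return re.sub(r"microsoft(?!\.com)", "MicroSuck", text, flags=re.IGNORECASE)


print(path_matches(BANNERS, "/eye/hate/spammers/banners/annoy_me_please.gif"))

# Compare the original pattern with the s-or-z variant on each spelling.
for word in ["adv", "advert", "adverts", "advertising",
             "advertisement", "advertisements", "advertizements"]:
    path = "/images/" + word + "/"  # made-up prefix for illustration
    print(word, path_matches(ADV, path), path_matches(ADV_S_OR_Z, path))

# Paths the appendix lists as matching and as not matching.
for path in ["//advert1.jpg", "/nasty/ads/advert1234.gif",
             "/banners/from/hell/advert99.jpg", "advert1.gif",
             "/adverts232.jpg", "/advert1.jsp"]:
    print(path, path_matches(ADVERT_IMAGE, path))

print(replace_microsoft("Visit Microsoft today"))
print(replace_microsoft("http://www.microsoft.com/"))
```

Running patterns against a handful of known paths like this is a cheap way to check a new expression before adding it to a configuration file.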