ubh - The Usenet Binary Harvester


NAME

    ubh - The Usenet Binary Harvester

SYNOPSIS

    ubh -c file (-S|-M) -C -d -D -g [(-a|-s)]|[(-f|-l) num] -i [(-I|-X) pattern] -L -r -u -w -y -Z -z

DESCRIPTION

    ubh is the Usenet Binary Harvester, a Perl program which automatically discovers, downloads, and decodes single and multi-part binary Usenet postings.
    ubh decodes single and multi-part uuencoded binaries.
    ubh also decodes single part MIME base64-encoded image, audio, and video attachments, and application/octet-stream attachments. It also combines and decodes multi-part message/partial binaries.
    ubh uses a standard .newsrc file to control which groups to process, and uses the .newsrc to keep track of articles already processed.
    You can specify search filters to select articles to download via Perl regular expression syntax.
    By default ubh will only consider articles with a well-formed Subject header. A well-formed Subject header is one that contains a file name with an extension which matches the extension filter. Multi-part Subject headers will be recognized if they contain a part/total designator of the form [m/n] or (m/n), where m is the part number and n is the total number of parts. Part numbers begin with 1 and may or may not contain leading zeroes. A part number of 0 is ignored.
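    The Subject rules above can be sketched in a few lines. This is an illustrative Python sketch, not ubh's own Perl code: the function name parse_subject, the definition of a file name as a run of non-space characters ending in a known extension, and the choice to skip an article whose part number is 0 are all assumptions made for the example. The extension list is copied from the default filter documented under SETUP.

```python
import re

# Extensions from the default filter in this document; the (?i) prefix used
# there corresponds to re.IGNORECASE here.
EXTENSIONS = "asf|avi|gif|jpg|mov|mpg|mpeg|rm|rar"

# Assumption: a file name is a run of non-space characters ending in ".ext".
NAME_RE = re.compile(r"(\S+\.(?:%s))(?![A-Za-z0-9])" % EXTENSIONS, re.IGNORECASE)

# Part/total designator: [m/n] or (m/n); leading zeroes are allowed.
PART_RE = re.compile(r"\[(\d+)/(\d+)\]|\((\d+)/(\d+)\)")


def parse_subject(subject):
    """Return (filename, part, total), or None if the subject is rejected.

    Single part subjects come back with part and total set to None.
    """
    name = NAME_RE.search(subject)
    if name is None:
        return None  # no file name with a matching extension
    designator = PART_RE.search(subject)
    if designator is None:
        return (name.group(1), None, None)
    # Exactly one of the two alternatives matched; keep its two numbers.
    part, total = (int(g) for g in designator.groups() if g is not None)
    if part == 0:
        return None  # assumption: "ignored" means the article is skipped
    return (name.group(1), part, total)
```

    With this sketch, a subject such as "holiday pics - beach01.jpg (03/12)" is read as part 3 of 12 of beach01.jpg, while a discussion post with no file name is rejected.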
    ubh provides an interactive article preselection option to allow you to preview the Subject: headers for multi-part binaries, and specify which binaries you wish to download.
    ubh runs equally well under Unix-based Perl or Active Perl on Win32 platforms.
    ubh requires Net::NNTP and News::Newsrc (which itself requires Set::IntSpan). See the INSTALLATION section for details.

OPTIONS

    These options apply to single part article processing:

      Option Description
      -S Process only single part articles.
      -g This option ("g" for "greedy") will download and process each unread article even if the subject does not contain a filename which matches the single part extension filter.

    These options apply to multi-part article processing:

      Option Description
      -M Process only multi-part articles.
      -i Interactive preselection of multi-part articles.

    These options apply to both single part and multi-part article processing:

      Option Description
      -C

      Changes all non-alphanumeric characters in a filename to '_'.

      This eliminates spaces in filenames, as well as all other undesirable (and possibly illegal) characters.

      -d

      Diagnostic mode. Downloads and writes all unread articles in raw form. This occurs prior to single and multi-part filtering. It's very useful to look at the raw articles to see why they are failing to be selected for downloading. Helpful for reverse-engineering new or bizarre encoding formats.

      You can also use this to perform your own post-processing directly on the raw articles.

      -D

      Multi-part diagnostic mode. Downloads and writes all selected, complete multi-part articles. Very useful for examining the raw articles to see why they are failing to decode. Helpful for reverse-engineering new or bizarre encoding formats.

      You can also use this to perform your own post-processing directly on the raw articles.

      -c file

      Use file as configuration file, instead of the default.

      On Win32 platforms, the default is ubhrc. On Unix platforms, the default is .ubhrc.

      -a

      Process all articles, disregarding the newsrc (i.e., consider all articles even if they are marked as read in the newsrc, and do not catch up the group when processing of the group ends).

      -f num

      Process the first num articles. Disregards newsrc.

      -l num

      Process the last num articles. Disregards newsrc.

      -s

      Log all subjects to subjects.log. Log multi-part subjects to multiparts.log. Doesn't download anything. Disregards newsrc.

      -I regexp

      Inclusion search filter. regexp is any valid Perl regular expression; enclose it in double quotes on the command line.

      -X regexp

      Exclusion search filter. regexp is any valid Perl regular expression; enclose it in double quotes on the command line.

      -L

      Long filenames - uses the article subject as the filename. This makes life easier because many folks encode their files with terribly vague filenames.

      -r

      Logs rejected subjects to rejects.log in the group directory. Logs rejected single part and multi-part articles. Excellent for quality control to see if ubh is rejecting any binaries, and essential for diagnosing why articles are being rejected. Normally the rejected articles will contain spam or discussion, but occasionally ubh will reject an arcane format or a malformed MIME message.

      Rejected multi-part articles logged with this option are in their assembled state, prefaced by the headers from the first article.

      -u

      Prints out a brief usage summary.

      -w

      Prints out warranty information.

      -y

      Sets the permissions of all output files to 0666 (as with chmod 0666).

      -Z

      Produces lots and lots of logs.

      -z

      Marks articles that don't pass inclusion/exclusion as read. This cleans up the .newsrc dramatically.
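    How the -I and -X filters partition a group's subjects can be sketched as follows. This is an illustrative Python sketch rather than ubh's Perl implementation. The function name select_subjects is hypothetical, and the behavior when both filters are supplied (exclusion wins) is an assumption, since this document does not say how the two combine. The second list returned corresponds to the articles that -z would mark as read.

```python
import re


def select_subjects(subjects, include=None, exclude=None):
    """Split subjects into (selected, filtered_out).

    Patterns are applied as unanchored searches, so they match anywhere
    in the subject, as a Perl m// match would.
    """
    inc = re.compile(include) if include else None
    exc = re.compile(exclude) if exclude else None
    selected, filtered_out = [], []
    for subject in subjects:
        wanted = inc is None or inc.search(subject) is not None
        unwanted = exc is not None and exc.search(subject) is not None
        # Assumption: an exclusion match overrides an inclusion match.
        if wanted and not unwanted:
            selected.append(subject)
        else:
            filtered_out.append(subject)
    return selected, filtered_out


subjects = [
    "Oasis - Wonderwall.mp3 (1/9)",
    "Green Day - Basket Case.mp3 (1/7)",
    "Oasis live bootleg.mp3 (1/4)",
    "Unknown Artist - Track 01.mp3 (1/5)",
]
selected, filtered_out = select_subjects(
    subjects, include=r"(?i)oasis|green.*day", exclude=r"(?i)bootleg"
)
```

    Here the first two subjects are selected; the bootleg is dropped by the exclusion filter and the last subject never passes the inclusion filter.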


INSTALLATION

    ubh requires the following Perl modules: Net::NNTP, News::Newsrc, and Set::IntSpan.

    These modules may already be installed on your system. To determine whether they are already installed, use the following commands:

      perl -e "use Net::NNTP;"
      perl -e "use News::Newsrc;"
      perl -e "use Set::IntSpan;"
          

    Net::NNTP is part of the libnet distribution, which provides an entire family of networking modules.

    If you need to install one or more of these modules, follow the instructions below for your particular platform.

    Win32 / ActivePerl

      You will use the Perl Package Manager (PPM) to download and install the modules. At the PPM prompt, enter the following commands:

        install libnet
        install News-Newsrc
        install Set-IntSpan
              

    Unix

      You will download the modules from your nearest CPAN mirror.

        modules/by-module/Net/libnet-1.0607.tar.gz
        modules/by-module/Set/Set-IntSpan-1.07.tar.gz
        modules/by-module/News/News-Newsrc-1.07.tar.gz
              

      Feel free to use more recent versions of these modules, if available.

      For each of those files, execute the following commands:

        gunzip < module.tar.gz | tar xvf -
        cd module
        perl Makefile.PL
        make
        make test
        make install
              

      For more details, review the installation instructions provided with each module.


SETUP

    You specify which groups to process with a standard newsrc file. The format of a newsrc file is one newsgroup name per line, followed by a colon. ubh will update this file and place the numbers of articles it has already processed for a given group on that group's line in the newsrc file after the colon. The colon character denotes that you are subscribed to the newsgroup. You can un-subscribe from a group by changing the colon to an exclamation point; this will cause ubh to skip over that newsgroup during processing.

    Here is an example newsrc file:

      alt.binaries.sounds.mp3.1990s: 15558-23146
      alt.binaries.sounds.mp3.1980s: 30139-35021
      alt.binaries.pictures.autos: 
      alt.binaries.pictures.cartoons! 63671-64660
      alt.binaries.pictures.hockey: 1-1406
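
    The line format described above can be read with a short parser. This is an illustrative Python sketch; ubh itself relies on the News::Newsrc module for this. The function name parse_newsrc_line and the returned dictionary layout are hypothetical. Comma-separated lists of numbers and ranges are accepted because that is the usual newsrc convention, although the example above shows only single ranges.

```python
import re

# Group name, then ":" (subscribed) or "!" (unsubscribed), then article numbers.
LINE_RE = re.compile(r"^(\S+?)([:!])\s*(.*)$")


def parse_newsrc_line(line):
    """Parse one newsrc line into a dict, or return None if it is malformed."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    group, mark, spec = match.groups()
    ranges = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # nothing processed yet for this group
        first, _, last = item.partition("-")
        # A lone number such as "15" is treated as the range 15-15.
        ranges.append((int(first), int(last or first)))
    return {"group": group, "subscribed": mark == ":", "read": ranges}
```

    For the hockey line above, this yields one processed range, 1 through 1406, with the group marked as subscribed; the cartoons line is marked as unsubscribed.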
          

    You specify program options with a ubhrc file. On Win32 systems, the default ubhrc file name is ubhrc. On Unix systems, the default ubhrc file name is .ubhrc.

    Here are the available options. Specify one option per line, OPTION = value. Comments begin with #.

      Keyword Description
      NNTPSERVER Specifies the name of the news server to connect to. Default is news.
      NNTPRETRIES Number of times to retry NNTP commands.
      NEWSRCNAME The name of the newsrc file. On Win32 systems, the default newsrc file name is newsrc. On Unix systems, the default newsrc file name is .newsrc.
      DATADIR Directory to store downloaded group subdirectories and downloaded binaries. Default is data.
      FORCEDIR Forces output to a specific directory instead of the default data/newsgroup-name. Handy for multiple newsgroups which hold similar stuff.
      MULTI_EXT Multi-part article file extensions. Any valid Perl regular expression. Default is (?i)asf|avi|gif|jpg|mov|mpg|mpeg|rm|rar.
      SINGLE_EXT Single part article file extensions. Any valid Perl regular expression. Default is (?i)asf|avi|gif|jpg|mov|mpg|mpeg|rm|rar.
      EXTENSIONS Sets both MULTI_EXT and SINGLE_EXT.
      ACCOUNT and PASSWORD News server account and password. You must define both of these if your server requires them. By default, ubh will access the server without authentication.
      OPT_g, OPT_i, OPT_d, OPT_D, OPT_a, OPT_f, OPT_l, OPT_r, OPT_s, OPT_S, OPT_M, OPT_I, OPT_X, OPT_C, OPT_L, OPT_z, OPT_Z, OPT_y All of the command-line options (except -u, -c, and -w) can be set in the ubhrc file, using OPT_x where x is the corresponding command-line option letter.

    Here is an example ubhrc file. This can be used to harvest single and multi-part binaries in the pictures and multimedia newsgroups.

      NNTPSERVER = binaries.newsfeeds.com
      NEWSRCNAME = newsrc_nfdc
      DATADIR    = data
      MULTI_EXT  = (?i)asf|avi|gif|jpg|mov|mpg|rm
      SINGLE_EXT = (?i)asf|avi|gif|jpg|mov|mpg|rm
      ACCOUNT = bart
      PASSWORD = the+simpsons
      OPT_g = 1
          

    Here is another example ubhrc file. This demonstrates how to search for MP3s by a particular set of artists:

      NNTPSERVER = mp3.newsfeeds.com
      NEWSRCNAME = newsrc_mp3
      DATADIR    = data
      MULTI_EXT  = (?i)mp3
      SINGLE_EXT = dontcare
      OPT_a = 1
      OPT_M = 1
      OPT_I = (?i)oasis|horton.*heat|smiths|morrissey|bjork|foo.*fighters|green.*day
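
    The "OPTION = value" format used by these files is simple enough to read with a few lines of code. This is an illustrative Python sketch, not ubh's own parser, and the function name parse_ubhrc is hypothetical. It assumes that comments occupy a whole line, because a regular expression value could itself contain a # character, and it splits on the first = only so that values containing = survive intact.

```python
def parse_ubhrc(text):
    """Return a dict of settings from ubhrc-style text."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: only whole-line comments are recognized.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an "OPTION = value" line; skipped in this sketch
        settings[key.strip()] = value.strip()
    return settings


sample = """\
# Harvest pictures and multimedia
NNTPSERVER = binaries.newsfeeds.com
NEWSRCNAME = newsrc_nfdc
DATADIR    = data
OPT_g = 1
"""
settings = parse_ubhrc(sample)
```

    All values are kept as strings; a caller would decide how to interpret flags such as OPT_g.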
          


USAGE EXAMPLES

    Here are some command-line usage examples:

      ubh -i -M -I "(?i)rem|r\.e\.m\.|u2|korn|hanson"
      ubh -S -l 1000
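
    Search patterns like the one in the first example are unanchored, so a short alternative such as rem also matches inside longer words. The following Python sketch illustrates the effect and one way to tighten the pattern with word boundaries (\b, which Perl regular expressions also support). The subjects are made up for the illustration, and the sketch only demonstrates regular expression behavior, not ubh itself.

```python
import re

# The pattern from the usage example above.
loose = re.compile(r"(?i)rem|r\.e\.m\.|u2|korn|hanson")

# Word boundaries around the plain-word alternatives. The dotted form is left
# outside the group because \b does not hold between "." and a space.
strict = re.compile(r"(?i)\b(?:rem|u2|korn|hanson)\b|r\.e\.m\.")

subjects = [
    "R.E.M. - Losing My Religion.mp3 (1/8)",
    "Extreme - More Than Words.mp3 (1/6)",
]
loose_hits = [s for s in subjects if loose.search(s)]
strict_hits = [s for s in subjects if strict.search(s)]
```

    The loose pattern selects both subjects, because "Extreme" contains "rem"; the stricter pattern selects only the first.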
          


CREDITS


WEBPAGE


COPYRIGHT and LICENSE

    Copyright © 2000 Gerard Lanois
    gerard@users.sourceforge.net
    P.O. Box 507264 San Diego, CA 92150-7264

    This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

    This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

    You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

