spugnews - a Usenet Binary Extractor

spugnews is a script intended to extract large binaries from multiple usenet postings. It knows how to reassemble groups of files, and can do rudimentary analysis of articles based on their subject lines to allow you to easily see what's available.

Installation

First, you'll need my Spug Libraries. I also recommend the yenc module, which decodes yenc-encoded posts four times as fast as the built-in pure-Python decoder.

GNUish folks can just do the usual "configure; make; make install" dance. Others should just do whatever is necessary to the spugnews file to make it executable on their systems.

The first time you run spugnews, it will ask you for the name of your usenet server and the directory in which to store files. This information will be stored in a ".spugnewsrc" file in your home directory. This is just a plain python file with some variable definitions in it, so you can directly edit it if you want to change these things.
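
As a rough illustration of what "a plain python file with some variable definitions" means, the sketch below executes a couple of definitions and collects them into a dictionary. The variable names "server" and "storage_dir", the host name, and the load_settings helper are all made up for this example; check the ".spugnewsrc" that spugnews writes for the names it actually uses.

```python
# Sketch only: read plain-Python variable definitions into a dict.
# "server" and "storage_dir" are hypothetical names, not necessarily
# the ones spugnews stores in .spugnewsrc.

def load_settings(source):
    """Execute variable definitions and return them as a dictionary."""
    namespace = {}
    exec(source, {}, namespace)
    return namespace

example_rc = '''
server = "news.example.com"
storage_dir = "~/news"
'''

settings = load_settings(example_rc)
print(settings["server"])
print(settings["storage_dir"])
```

Because the file is ordinary Python, editing it by hand is just a matter of changing the right-hand side of an assignment.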

Usage

spugnews evaluates each of its command line options in sequence. You can specify as many options as you want - some perform actions, some change the program's internal state to affect actions later on.

The "-g" option must be specified prior to any others. This option identifies the usenet group.

"-r" refreshes the article header list for the current usenet group - it retrieves all new headers and deletes information on articles with a reference number earlier than the lower article number stored on the server.

The header list is stored in the file "headers.dat" in a subdirectory of the storage directory named after the usenet group. For example, if you chose the default location for files during installation, the headers.dat file for "alt.binaries.movies" would be "~/news/alt.binaries.movies/headers.dat".
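
The path layout described above can be sketched in a few lines of Python. The function name headers_path is invented for this example, and this is not necessarily how spugnews builds the path internally.

```python
import os.path

def headers_path(storage_dir, group):
    """Return the location of headers.dat for a group, following the
    storage-directory/group-name/headers.dat layout described above."""
    return os.path.join(os.path.expanduser(storage_dir), group, "headers.dat")

# With the storage directory from the example above:
print(headers_path("~/news", "alt.binaries.movies"))
```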

The program maintains an internal "current header set", which is a subset of the header list. Filter operations (such as "-p") modify this subset, allowing you to perform operations on a subset of the available headers.

Listing Headers

The "-l" option lists the article number followed by the subject for each article in the current header set. For example, to list all headers in the group "alt.binaries.movies", you could use:

   spugnews -g alt.binaries.movies -l

The "-p" option allows you to filter the current header set based on a regular expression, so to see all articles with "Happy Gilmore" in the subject line:

   spugnews -g alt.binaries.movies -p 'Happy Gilmore' -l

Note that the -p must come before the -l: this is because command line options are processed in the order that they are specified, and we want to filter before listing.
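
To see why the order matters, here is a small Python sketch of a header set being narrowed and then listed. The sample subjects, the function names, and the choice to match the pattern anywhere in the subject (re.search) are assumptions made for the example, not a description of the spugnews internals.

```python
import re

# Hypothetical header list: article number -> subject line.
headers = {
    1933765: "Happy Gilmore (1/2)",
    1933766: "Happy Gilmore (2/2)",
    1933767: "Some Other Movie (1/1)",
}

def filter_headers(header_set, pattern):
    """Keep only headers whose subject matches the pattern (like -p)."""
    regex = re.compile(pattern)
    return {num: subj for num, subj in header_set.items()
            if regex.search(subj)}

def list_headers(header_set):
    """Format each header as 'article-number subject' (like -l)."""
    return ["%d %s" % (num, subj) for num, subj in sorted(header_set.items())]

# Options are applied one after another to the current header set.
current = headers
current = filter_headers(current, "Happy Gilmore")
lines = list_headers(current)
for line in lines:
    print(line)
```

Listing first and filtering afterwards would print all three subjects, since the listing step would run before the set had been narrowed.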

Trying to determine the status of files encoded in hundreds of articles can be very time-consuming, so you will generally want to use the -a (analyze) option instead.

Analyzing Article Groups

The "-a" option performs a detailed analysis of the current header set and attempts to group together articles that are part of the same file set based on the contents of the subject line. Unfortunately, there is no universal format used for identifying these kinds of file sets, but most people seem to use one of the following formats:

   prefix file_number "/" num_files bridge part_number "/" num_parts suffix
   prefix file_number " of " num_files bridge part_number "/" num_parts suffix
   prefix part_number "/" num_parts suffix

As such, these are the formats that are recognized by spugnews.
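
The three formats can be approximated with regular expressions. The following Python sketch is one possible reading of them: the patterns, the group names, and the parse_subject function are written for this example and are not taken from the spugnews source, which may well match subjects differently.

```python
import re

# Formats 1 and 2: file_number "/" or " of " num_files, then a bridge,
# then part_number "/" num_parts.
FILE_AND_PART = re.compile(
    r"^(?P<prefix>.*?)(?P<file>\d+)(?:/| of )(?P<files>\d+)"
    r"(?P<bridge>.*?)(?P<part>\d+)/(?P<parts>\d+)(?P<suffix>.*)$"
)

# Format 3: only part_number "/" num_parts.
PART_ONLY = re.compile(
    r"^(?P<prefix>.*?)(?P<part>\d+)/(?P<parts>\d+)(?P<suffix>.*)$"
)

def parse_subject(subject):
    """Return the recognized fields as a dict, or None for a subject
    that fits none of the three formats."""
    match = FILE_AND_PART.match(subject)
    if match is None:
        match = PART_ONLY.match(subject)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("file", "files", "part", "parts"):
        if fields.get(key) is not None:
            fields[key] = int(fields[key])
    return fields

print(parse_subject("Happy Gilmore [03/12] - hg.part03.rar (07/40)"))
print(parse_subject("picture.jpg (1/3)"))
```

The more specific pattern is tried first, so a subject that carries both a file count and a part count is not mistaken for a bare part count.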

"-a" first lists "rogue" articles: these are articles that it was unable to group with any others. After this, the "article groups" (sets of articles grouped together based on the information in their subject lines) are listed in order of the article number of their first articles. This ordering scheme causes the more recent groups to appear last.

Following each article group description is a list of any issues that were encountered with the set (if any). Typical problems include missing files and missing parts.
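
Detecting missing files and missing parts comes down to comparing what was seen against the counts announced in the subject lines. Here is a minimal Python sketch, assuming files and parts are numbered from 1; the data layout, the function name, and the wording of the messages are invented for the example and do not reproduce the actual "-a" output.

```python
def find_issues(parts_by_file, num_files):
    """parts_by_file maps a file number to a tuple of
    (set of part numbers seen, announced number of parts)."""
    issues = []
    for file_number in range(1, num_files + 1):
        entry = parts_by_file.get(file_number)
        if entry is None:
            issues.append("file %d is missing" % file_number)
            continue
        seen, num_parts = entry
        missing = [p for p in range(1, num_parts + 1) if p not in seen]
        if missing:
            issues.append("file %d is missing parts %s"
                          % (file_number, missing))
    return issues

# Hypothetical group of 3 files: file 2 lacks part 2, file 3 never showed up.
group = {1: ({1, 2, 3}, 3), 2: ({1, 3}, 3)}
for issue in find_issues(group, 3):
    print(issue)
```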

Downloading

The "-d" option is used to download articles. It downloads all articles in the current header set that have not already been downloaded, so you generally want to use this with a filter. Articles are downloaded into files of the form "news.article-number" in the group directory.

"-d" is often accompanied by "-x" (extract files) to extract and reassemble multi-part files. For example, to download and extract all parts of "Happy Gilmore":

   spugnews -g alt.binaries.movies -p "Happy Gilmore" -dx

Note that options without arguments can be grouped together (i.e. we used "-dx" instead of "-d -x").

spugnews doesn't make any decisions about when to delete the articles that it has downloaded - you'll want to periodically go into the group directory and delete *.news.

Viewing Articles

Sometimes you just want to view a single article, headers and all. The "-v" option allows you to do this. Specify "-v" with the article number to download, and the entire article will be formatted to standard output:

   spugnews -g alt.binaries.movies -v 1933765

Other Info

The complete set of command line options can be viewed with "spugnews -h".

I wrote spugnews because I didn't like any of the other free newsgrabbers that I looked at and because it bothered me that there didn't seem to be one written in Python. It isn't the best grabber in the world: there are subject formats that it doesn't recognize which it arguably should, it doesn't put together rar files, and it would be nice if it had a better user interface and the ability to work more automatically in the background.

However, in spite of these limitations, I find spugnews to be pretty useful. There's a good chance that I'll add features to it as I need them, and there are certainly "big picture" items I'd like to see added (integration into spugmail, perhaps?). However, at this point it is what it is. Hope you like it, and feel free to send me comments and patches.

Release History

1.1 - 2005-11-22

internal yenc decoder (not as fast as the yenc module, but good for the impatient)
workaround for a problem with hex encoding in newer (2.3+) versions of Python that affects the yenc CRC check.
status indicator for header refresh.
incremental header writing (so if there's a problem in the middle of downloading thousands of headers you don't lose what you've got so far)

1.0

Initial release.

Contact Info

Michael A. Muller mmuller@enduden.com

spugnews Web Page