Running Sitescooper


Running the Command Itself

The easiest way to get started with sitescooper is to simply run it:

	perl sitescooper.pl

UNIX users can leave out the perl command at the start of the line, but will have to provide the correct path to the sitescooper.pl script. Linux users who installed from the RPM get sitescooper in their path, so they can just type sitescooper to run it.

The first time you run sitescooper, it will pop up a list of sites in your editor, and ask you to pick which sites you wish to scoop. This creates a file in your temporary directory called site_choices.txt with these choices. Your temporary directory is the .sitescooper subdirectory of your home directory on UNIX, or C:\Windows\TEMP for Windows users; this can be changed by editing the built-in configuration in the script.

Once you've chosen some sites, it'll run through them, retrieve the pages, and convert them into iSilo format. See Changing Output Format if you wish to change this.


The Sites Directory

Versions of sitescooper before 2.0 used a different mechanism to choose sites; instead of picking them from a list, you had to copy them manually from the site_samples directory into a sites directory, and sitescooper would use the site files in that directory. This is still supported in 2.0; if there are any site files in the sites directory, they'll be read and those sites will be downloaded when you run sitescooper. If you're a pre-2.0 user and don't want to keep doing things that way, just delete those files.


Selecting Sites On The Command Line

If you want to scoop from one specific site, use the -site argument. Provide the path to the site file and sitescooper will only read that one site. Multiple sites can be chosen by providing multiple -site arguments, one before each site name, or by using the -sites switch.
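As a rough illustration of the two forms, the following Python sketch assembles the argument list for each without running anything. The site file paths are hypothetical placeholders, and the assumption that -sites is followed by all the site files in one go is inferred from the description above rather than confirmed; check the command-line reference before relying on it.

```python
# Sketch only: builds sitescooper command lines without running them.
# The site file paths used below are hypothetical placeholders.


def with_site_args(script, site_files):
    """One -site argument before each site file."""
    argv = ["perl", script]
    for path in site_files:
        argv += ["-site", path]
    return argv


def with_sites_switch(script, site_files):
    """A single -sites switch followed by every site file (assumed form)."""
    return ["perl", script, "-sites", *site_files]


if __name__ == "__main__":
    files = ["site_samples/first.site", "site_samples/second.site"]
    print(" ".join(with_site_args("sitescooper.pl", files)))
    print(" ".join(with_sites_switch("sitescooper.pl", files)))
```

Keeping the arguments as a list rather than a single string avoids quoting problems if a site file path contains spaces.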


Scooping a URL Without a Site File

Let's say you want to scoop a URL which doesn't have a site file written for it. Run sitescooper with the URL on the command line and it will scoop that URL, tidy up the HTML as best it can without knowledge of the site's layout, and convert it into the chosen format. You can even get a limited form of multi-level scooping this way, using the -levels and -storyurl arguments. Personally, I think this is only handy when prototyping a site file, but it's possible nonetheless.


Stopping Sitescooper From Being Too Clever

Sitescooper includes logic to avoid re-scooping pages you've already read. Sometimes, however, you will want it to do so; if this is the case, use the -refresh argument. This will cause sitescooper to ignore any historical accesses of the site and scoop the lot. However, it will still use any cached pages it has already loaded. This is very handy when you're writing a site file.


Stopping Sitescooper From Being Too Clever, pt. 2

-refresh uses the cached pages where possible. This is not always what you want, as the page may have changed since the last load even though the cached copy has not yet expired. To avoid this, use the -fullrefresh argument. This will cause sitescooper to ignore any historical accesses of the site and scoop the lot, ignoring any cached pages and reloading every page from the source (unless the -fromcache argument is used).
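The difference between a normal run, -refresh and -fullrefresh can be modelled as a small decision function. This is only an illustrative sketch of the behaviour described above, not sitescooper's own code: the function name, the mode names and the return values are made up, cache expiry is not modelled, and the choice to skip a page that is missing from the cache under -fromcache is an assumption.

```python
# Illustrative model of the behaviour described above; this is not
# sitescooper's own code, and the function and mode names are made up.


def plan_page(already_seen, in_cache, mode="normal", from_cache=False):
    """Return 'skip', 'cache' or 'network' for a single page.

    mode is 'normal', 'refresh' (-refresh) or 'fullrefresh' (-fullrefresh).
    """
    # Normal runs avoid re-scooping pages that have already been read.
    if mode == "normal" and already_seen:
        return "skip"
    # -fromcache avoids network accesses entirely (skipping is an assumption).
    if from_cache:
        return "cache" if in_cache else "skip"
    # -fullrefresh ignores cached copies and reloads from the source.
    if mode == "fullrefresh":
        return "network"
    # Otherwise use a cached copy where there is one.
    return "cache" if in_cache else "network"
```

For example, a page that has been read before and is still in the cache is skipped on a normal run, served from the cache with -refresh, and fetched again with -fullrefresh.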


What's This "File Size Limit Exceeded" Message?

Sitescooper imposes a limit of 300 kilobytes on the HTML or text scooped from any one site; otherwise it's quite easy to produce site files which can generate an 800 KB PRC file in one sitting!

By the way, note that the resulting PRC files may be well under 300 KB in size; sitescooper imposes the limit on the raw HTML or text as it goes along, and it's entirely plausible that the conversion tools used might do a great job of compressing the data.

Note also that, when you hit the limit on a site, the missed stories will often simply be scooped next time you run the script. This depends on the site file, though.

If you want to increase the limit, use the -limit argument. For example, running sitescooper with a -limit of 500 will scoop your chosen sites with a limit of 500 KB.
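To make the idea of a limit on the raw text concrete, here is a minimal Python sketch. It is not sitescooper's implementation: the function name is hypothetical, and it assumes that scooping simply stops at the first story that would exceed the limit, with everything after that left for a later run.

```python
# Illustrative sketch, not sitescooper's implementation.


def apply_limit(stories, limit_kb=300):
    """Split stories into (kept, deferred) under a raw-size limit.

    stories is a list of (name, text) pairs in scooping order.
    """
    limit_bytes = limit_kb * 1024
    used = 0
    kept, deferred = [], []
    for name, text in stories:
        size = len(text.encode("utf-8"))
        if not deferred and used + size <= limit_bytes:
            kept.append(name)
            used += size
        else:
            # Once the limit is hit, everything else waits for a later run.
            deferred.append(name)
    return kept, deferred
```

The size is measured on the raw text before any conversion, which is why the converted output can end up much smaller than the limit.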


Changing Output Format

The output formats mentioned elsewhere in this document include text, HTML, M-HTML, DOC, iSilo, M-iSilo and RichReader; HTML and M-HTML, for example, are selected with the -html and -mhtml switches. It's also possible to run any command you like to convert the resulting output; see the documentation for the -pipe switch if you're interested in this.

If you want to convert to multiple output formats, you need to run sitescooper once for each output format, and use a shared cache between the separate invocations. Ask on the mailing list for more information on this.


Selective Scooping Using Profiles

Story profiles are a way of scooping sites for a particular set of words or patterns. If the words in question don't appear in a story, that story will not be scooped. (This functionality was contributed by James Brown <jbrown /at/ burgoyne.com> - thanks James!)

Here's a sample profile file:

	Name: Bay Area Earthquakes
	Score: 10
	Required: san jose, earthquake.*, (CA|California)
	Desired: Richter scale, magnitude, damage, destruction, fire,
		flood, emergency services, shaking, shock wave
	Exclude: soccer

And here's James' description of the format:

A profile contains the following fields: Name, Score, Required, Desired and Exclude. Obviously, one or both of 'Required' or 'Desired' must be present or it won't match anything.

The score is a minimum value that must be matched (basically a percentage of keyword hits vs. number of sentences). The required keywords must be present or the story does not match. The desired keywords give hints about what is interesting. The more desired keywords that match, the higher the story scores. The exclude keywords will cause a story not to match if they are present.

All of the keywords (required, desired, exclude) can be phrases, and all are processed as Perl regular expressions, so they can be quite complex if needed. Keywords are separated by either a comma or a newline. Scouts.nhp is probably the richest example of what can be done with a profile (it includes regular expressions).

I added an "IgnoreProfiles" command to the site file definition to allow users to scoop the entire site rather than just the stories that match.

To turn on Profile mode, use the -grep argument when running sitescooper; any sites that do not contain IgnoreProfiles: 1 will then be searched for the active profiles.

To use a profile, create a directory called profiles, and set the ProfilesDir parameter in the sitescooper.cf file to point to that directory. Now copy in the profiles you are interested in from the profile_samples directory of the distribution. UNIX users who aren't sure where sitescooper has been installed should look in /usr/share/sitescooper or /usr/local/share/sitescooper. Edit the profiles to taste, and run sitescooper with the -grep argument.

You can also use the -profile or -profiles switches to specify individual profiles you wish to use, without requiring a ProfilesDir directory to be set up. These switches have the same semantics as the -site and -sites switches.
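The profile rules described above can be sketched in a few lines of Python. This is an illustration only, not sitescooper's parser: the function names are hypothetical, Python's re module stands in for Perl regular expressions, matching is assumed to be case-insensitive, and the scoring formula (desired-keyword hits as a percentage of the number of sentences) is one guess at what "a percentage of keyword hits vs. number of sentences" means.

```python
import re

# Illustrative sketch of the profile rules described above; it is not
# sitescooper's parser, and the scoring formula is an assumption.

FIELDS = ("Name", "Score", "Required", "Desired", "Exclude")

SAMPLE = """\
Name: Bay Area Earthquakes
Score: 10
Required: san jose, earthquake.*, (CA|California)
Desired: Richter scale, magnitude, damage, destruction, fire,
    flood, emergency services, shaking, shock wave
Exclude: soccer
"""


def parse_profile(text):
    """Parse profile text into a dict; keyword fields become lists."""
    profile = {"Name": "", "Score": 0, "Required": [], "Desired": [], "Exclude": []}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, rest = line.partition(":")
        if sep and key in FIELDS:
            current, line = key, rest.strip()
        if current == "Name":
            profile["Name"] = line
        elif current == "Score":
            profile["Score"] = int(line)
        elif current is not None:
            # Keywords are separated by commas or newlines.
            profile[current] += [k.strip() for k in line.split(",") if k.strip()]
    return profile


def matches(profile, story):
    """Return True if the story passes the profile."""

    def hits(pattern):
        return len(re.findall(pattern, story, re.IGNORECASE))

    # Any exclude keyword rules the story out.
    if any(hits(p) for p in profile["Exclude"]):
        return False
    # Every required keyword must be present.
    if not all(hits(p) for p in profile["Required"]):
        return False
    # With no desired keywords, passing the required ones is enough (assumed).
    if not profile["Desired"]:
        return True
    sentences = max(1, len(re.findall(r"[.!?]+", story)))
    score = 100 * sum(hits(p) for p in profile["Desired"]) / sentences
    return score >= profile["Score"]
```

Note that splitting on commas means a keyword cannot itself contain a comma, so a pattern such as a{2,3} would be broken in two by this sketch.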


What's Going Wrong?

If sitescooper is acting up and not doing what it's supposed to do, try the -debug switch. This will turn on lots of optional debugging output to help you work out what's wrong. This is very handy when writing a site file.

There's also a -nowrite argument which will stop sitescooper from writing cache files, already_seen entries, and output files.

If the worst comes to the worst, you can get sitescooper to copy the HTML of every page accessed to a journal file using the -admin journal switch. This HTML is logged three times: first in its initial form straight from the website, then after the StoryStart and StoryEnd have been stripped from the page, and finally as text. This is handy for debugging a site file, but is definitely not recommended during normal use, as a big site will produce a lot of journal output.

If you have all the files in your cache, use the -fromcache switch and network accesses will be avoided entirely. This is handy for debugging your site offline, or for producing output in multiple formats from the same files, if you have a shared cache set up.


Getting The Output

Normally, the output from sitescooper is written to the installation directory of your Pilot Desktop software, where possible. If you want the output directly, use the -dump switch. This will cause readable formats (text, HTML) to be dumped directly to standard output, i.e. to the screen, or to a file if you've redirected stdout.

-dumpprc does the same thing for the binary formats, such as DOC, iSilo, M-iSilo or RichReader. Note that multi-file formats such as M-HTML don't get dumped either way; the path to the file which contains the first page of the output is printed instead.

Some versions of Windows perl have difficulty redirecting stdout, so the -stdout-to argument allows the same thing to be done from within the script itself.


The HTML Output Index

Scoops generated using -html or -mhtml get a bonus feature: an index page will be generated in the txt sub-folder of the temporary folder, listing all currently-existing HTML output.


Output Renaming

By default, the output files are generated with the current date in the filename and in the title used in the PRC file (when a PRC file is generated). Use the -prctitle and -filename arguments to change this behaviour. More information on this can be found in the command-line reference documentation.

