Web Site Resource Downloader and Analyzer

 

Project Summary

 

Purpose: Research Project

Status: Experimental

Availability: See our download page

Documentation: See user’s manual

Background Information

 

CallipygianGrab is a research project that searches websites for image assets by applying:

·         spam-recognition technology, including Bayes’ Theorem;

·         artificial intelligence technology, including feed-forward neural networks; and

·         innovative techniques to reduce time wasted downloading unwanted material.

 

The software is trainable. When CallipygianGrab starts spidering, you are presented with images that meet your basic criteria for dimension and file size. Through a photo-album user interface, you can rapidly review the images and decide which ones to keep or discard.

 

Images are sorted in the queue in order of predicted relevance. As you rate photos, the predicted relevance may change, so the interface provides a re-sort button to reorder the images that are still pending. A simple traffic-light indicator shows CallipygianGrab’s guess about the relevance of each image: Green is relevant, Yellow is borderline, and Red is not relevant.
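
The queue ordering and traffic-light indicator described above can be sketched roughly as follows. This is only an illustration: the names PendingImage, traffic_light and resort, and the two threshold values, are assumptions made for the example and are not taken from CallipygianGrab itself.

```python
from dataclasses import dataclass

# Hypothetical cut-offs; the thresholds the real program uses are not documented here.
GREEN_MIN = 0.7
YELLOW_MIN = 0.4


@dataclass
class PendingImage:
    url: str
    relevance: float  # predicted probability in [0, 1] that the image is wanted


def traffic_light(relevance: float) -> str:
    """Map a predicted relevance to the indicator colour."""
    if relevance >= GREEN_MIN:
        return "green"
    if relevance >= YELLOW_MIN:
        return "yellow"
    return "red"


def resort(queue, score):
    """Re-score every pending image and return the queue, most relevant first.

    `score` is any callable that takes a URL and returns the current
    relevance estimate (for example, after the user has rated more photos).
    """
    for item in queue:
        item.relevance = score(item.url)
    return sorted(queue, key=lambda item: item.relevance, reverse=True)
```

Re-scoring inside resort reflects the idea that ratings made since the last sort can change the prediction for images that were already queued.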

 

Figure 1 - The "Preview" Panel. The software is being trained to recognize photos related to football

 

Figure 1 above shows the preview panel, one of four main panels in the application. You can scroll through the filmstrip panel at the bottom and select images to keep or discard. CallipygianGrab learns from the images you keep and discard by analyzing features of the images and the content of the page that refers to or includes each image.

 

The software maintains a word list and computes Bayesian probabilities for the words on a page containing images you’re looking for, compared with a page containing images you aren’t looking for. This data is used in two ways:

 

1.      To prioritize images to download

2.      To generate new search engine queries containing highly ranked words
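
As a rough illustration of the word-probability idea, here is a minimal sketch in the style of Bayesian spam filters. It is not the program's actual code: the WordModel class, the add-one smoothing, and the assumption of equal prior odds for kept and discarded pages are all choices made for this example.

```python
import math
from collections import Counter


class WordModel:
    """Tracks how often each word appears on kept vs. discarded pages."""

    def __init__(self):
        self.kept = Counter()       # word -> number of kept pages containing it
        self.discarded = Counter()  # word -> number of discarded pages containing it
        self.n_kept = 0
        self.n_discarded = 0

    def train(self, words, kept):
        unique = set(words)  # count each word once per page
        if kept:
            self.kept.update(unique)
            self.n_kept += 1
        else:
            self.discarded.update(unique)
            self.n_discarded += 1

    def word_probability(self, word):
        """Estimate P(page is wanted | word appears), assuming equal priors."""
        # Add-one smoothing keeps the estimate strictly between 0 and 1.
        p_kept = (self.kept[word] + 1) / (self.n_kept + 2)
        p_disc = (self.discarded[word] + 1) / (self.n_discarded + 2)
        return p_kept / (p_kept + p_disc)

    def page_score(self, words):
        """Combine per-word estimates by summing their log-odds."""
        log_odds = 0.0
        for word in set(words):
            p = self.word_probability(word)
            log_odds += math.log(p) - math.log(1.0 - p)
        log_odds = max(-50.0, min(50.0, log_odds))  # avoid overflow in exp()
        return 1.0 / (1.0 + math.exp(-log_odds))
```

A page score near 1 would move that page's images toward the front of the download queue, and the per-word probabilities are the natural input for choosing query words.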

 

Often, it’s advantageous to let the computer decide what to search for instead of choosing the search terms yourself. Very often, the relevant words are not obvious to a human searcher. The Bayesian algorithm quickly identifies good candidate words with which to perform new queries.
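
One plausible way to turn ranked words into a new query is sketched below. The function name suggest_query, the top_n and min_seen parameters, and the rule of ignoring rarely seen words are assumptions made for illustration; the source does not describe how the program actually selects words.

```python
def suggest_query(word_probs, seen_counts, top_n=3, min_seen=2):
    """Build a search query from the highest-ranked words.

    word_probs  -- word -> estimated probability that a page with it is wanted
    seen_counts -- word -> number of rated pages on which the word appeared
    Words seen on fewer than `min_seen` pages are skipped, since a high
    score based on a single page is weak evidence.
    """
    candidates = [w for w in word_probs if seen_counts.get(w, 0) >= min_seen]
    candidates.sort(key=lambda w: word_probs[w], reverse=True)
    return " ".join(candidates[:top_n])
```

With this rule, a word such as "quarterback" that the user never typed can still surface in a query once enough kept pages contain it.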

 

Other Program Features

The program supports a “minimize” mode where just the filmstrip is visible:

 

 

You can leave this running all the time in the corner of your screen. The software is designed to run continuously, reducing CPU load and network usage when there’s an interactive user on the PC. It has been tested to run for weeks at a time with no memory leaks or crashes. You can also exit the program and later resume the search where you left off. All data and state are stored in XML files.
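
To illustrate how a crawl might be saved to XML and resumed later, here is a minimal sketch using Python's standard library. The element names (crawl_state, pending, visited, url) and the two functions are hypothetical; the actual file layout used by CallipygianGrab is not described in this document.

```python
import xml.etree.ElementTree as ET


def save_state(path, pending_urls, visited_urls):
    """Write the pending queue (in order) and the visited set to an XML file."""
    root = ET.Element("crawl_state")
    pending = ET.SubElement(root, "pending")
    for url in pending_urls:
        ET.SubElement(pending, "url").text = url
    visited = ET.SubElement(root, "visited")
    for url in sorted(visited_urls):
        ET.SubElement(visited, "url").text = url
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


def load_state(path):
    """Read back what save_state wrote; returns (pending list, visited set)."""
    root = ET.parse(path).getroot()
    pending = [e.text for e in root.findall("./pending/url")]
    visited = {e.text for e in root.findall("./visited/url")}
    return pending, visited
```

Keeping the pending queue as an ordered list preserves the relevance ordering across restarts, while the visited set only needs membership tests.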

 

 

 

Your starting location can be a URL (shown above), a file containing lists of URLs, or your browser history.

 

 

 

Some of the options are shown here. You can set a maximum number of files per site, a minimum width and height, and so on. Note that the software does an “early out” if the file doesn’t meet the size limits: it reads only enough of the file to get the dimensions from the JPEG or GIF header. If the image is too small or too big, the download is aborted without wasting time. This makes for very fast spidering, many times faster than existing website downloader products.
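
The “early out” can be sketched as a function that inspects only the first bytes of a download. This is an illustrative reading of the GIF and JPEG header layouts, not the program's own code; the names image_size and meets_minimum are assumptions, and how the leading bytes are fetched (for example, by closing the connection after the first chunk) is left out.

```python
import struct

# JPEG start-of-frame markers, which carry the image dimensions.
SOF_MARKERS = {0xC0, 0xC1, 0xC2, 0xC3, 0xC5, 0xC6, 0xC7,
               0xC9, 0xCA, 0xCB, 0xCD, 0xCE, 0xCF}


def image_size(data: bytes):
    """Return (width, height) from the start of a GIF or JPEG file.

    Returns None if the format is not recognized or the chunk is too
    short to contain the dimensions yet.
    """
    if data[:6] in (b"GIF87a", b"GIF89a"):
        if len(data) < 10:
            return None
        # Logical screen width and height: two little-endian 16-bit values.
        return struct.unpack("<HH", data[6:10])
    if data[:2] == b"\xff\xd8":  # JPEG start-of-image marker
        pos = 2
        while pos + 4 <= len(data):
            if data[pos] != 0xFF:
                return None  # not at a marker; give up in this sketch
            marker = data[pos + 1]
            if marker == 0xFF:  # fill byte before a marker
                pos += 1
                continue
            if marker in SOF_MARKERS:
                if pos + 9 > len(data):
                    return None
                # After marker (2), length (2), precision (1): height, width.
                height, width = struct.unpack(">HH", data[pos + 5:pos + 9])
                return (width, height)
            # Skip this segment: its length field counts itself but not the marker.
            (seg_len,) = struct.unpack(">H", data[pos + 2:pos + 4])
            pos += 2 + seg_len
    return None


def meets_minimum(data: bytes, min_width: int, min_height: int) -> bool:
    """True only if the dimensions are known and satisfy both minimums."""
    size = image_size(data)
    return size is not None and size[0] >= min_width and size[1] >= min_height
```

For a GIF the answer is available after ten bytes; for a JPEG it depends on how much metadata precedes the frame header, so a caller would typically retry with a longer prefix when image_size returns None.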

 

Callipygian 3D Sight and Sound © 2004

© 2004, Robert A. Swirsky. All Rights Reserved