This is a procedure that you might need to use if you need to back up a moderately complex web-site in a hurry. It is not foolproof: it will not adequately handle JavaScript rollovers, nor will it deal with web-sites that use the POST method or forms for navigation. But since most of the time pages link to other pages with simple '<a href=>' links, this method can be quite good. It will work 100% on 80% of websites.

I had to do this earlier today for a client whose previous web programmer decided to take a round-the-world vacation without first handing over the system or providing any documentation. This technique allowed me to back up the site and create an identical-looking site in about 20 minutes.

This is also a technique for publishing scripted pages on free hosts. Supposing your original site makes use of LDAP servers, database queries and other kinds of server-side complexity - the copied site gives a similar appearance without any of the internal complexity.

What we are going to do:

We are going to use wget to traverse every linkable page of the source web-site, and have it copy the HTML and graphical content of each page into static files. Next we are going to use Apache's mod_rewrite and a tiny PHP (or Perl) script to serve up those pages and create the illusion that they are still running dynamically.

What you need:

  1. A functioning Apache 1.3 web server to run the copy site.
  2. Enough disk space to hold static versions of every possible page and image that can appear on the site.
  3. The Unix command-line utility 'wget', for copying the web pages over.
  4. A basic text editor - I like jEdit.
  5. PHP (or similar) installed as an Apache module.

Step 1:

Make a new folder to store the web pages you are about to copy. On a Linux computer this will usually be somewhere within /var/www/html. Change to that directory.
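For example, assuming the copy is going to live in a folder called foo-backup (the name and path here are only placeholders for whatever you choose):

mkdir -p /var/www/html/foo-backup
cd /var/www/html/foo-backup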

Step 2:

Use wget to copy the website over to your computer.

wget -r -t5 http://foo.net/ -o download.log

This means: recursively download everything you can find on foo.net, try any download that fails up to 5 times, and record all the progress in the file called "download.log".
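Depending on the site, a couple of extra options can help; this is only a suggested variant, not the original command. -p also fetches each page's requisites (images and stylesheets that are embedded rather than linked to), and -nH stops wget from nesting everything inside an extra foo.net/ subdirectory:

wget -r -t5 -p -nH http://foo.net/ -o download.log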

Step 3:

Set up your Apache virtual hosts file and your local DNS server (or /etc/hosts file) so that you can see the web-site you have just copied over at a convenient URL on your computer. This makes it easy to find your copied web-site and test the next step.
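As a rough sketch (the hostname copy.foo.local and the /var/www/html/foo-backup paths are placeholders for whatever you set up in Step 1), the httpd.conf entry might look something like this:

# Name-based virtual host for the copied site
NameVirtualHost *

<VirtualHost *>
    ServerName copy.foo.local
    DocumentRoot /var/www/html/foo-backup
    # The rewrite rules in Step 4 live in .htaccess files,
    # so overrides must be allowed for this directory.
    <Directory /var/www/html/foo-backup>
        AllowOverride All
    </Directory>
</VirtualHost>

with a matching line in /etc/hosts:

127.0.0.1    copy.foo.local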

Step 4:

In every folder that contains HTML pages, add a .htaccess file that looks something like this:

#Beginning
RewriteEngine on
Options +FollowSymLinks

RewriteRule (.+) page.php
#End

This says: if anything other than "/" (the default page) is requested from this directory, then rather than attempting to serve the requested file directly, just run a script called page.php.
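One refinement you may want, which is not part of the original snippet: the rule above also sends requests for images and stylesheets in that folder through page.php. A RewriteCond line can let those file types be served directly, something like:

#Beginning
RewriteEngine on
Options +FollowSymLinks

RewriteCond %{REQUEST_URI} !\.(gif|jpe?g|png|css|js)$ [NC]
RewriteRule (.+) page.php
#End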

Page.php needs to be something like this:
<?php

// This is the folder from which all relative URLs are derived.
$base_path="/where/to/find/your/page/";

// This is the filename that I shall retrieve.
if ( isset( $_SERVER["REQUEST_URI"] ) )
    {
    $file = $base_path . $_SERVER["REQUEST_URI"];
    }
else
    {
    die ("No file to get");
    }

// Comment out this next line once you have it working, it's a security risk.
echo "<!-- This content was read from: ".$file." -->";

$fp=fopen($file, "r");
if ( $fp === false )
    {
    die ("Cannot open ".$file);
    }

// Limit the page length to approximately 100kb.
echo fread( $fp, 100000 );
fclose( $fp );
?>
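If the copy is going to stay online for any length of time, it may also be worth refusing requests that use "../" to reach outside $base_path. A rough sketch of a hypothetical helper (not part of the original script) that you could paste into page.php:

<?php
// Returns true only if $file resolves to a real path inside $base_path.
function path_is_inside( $file, $base_path )
    {
    $real      = realpath( $file );
    $real_base = realpath( $base_path );
    return ( $real !== false && $real_base !== false
          && strncmp( $real, $real_base, strlen( $real_base ) ) == 0 );
    }
?>

Called just before the fopen(), e.g. if (! path_is_inside($file, $base_path)) die("No such page");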

Step 5: Restart your Apache server.

On a Red Hat Linux box do:

/etc/init.d/httpd restart
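If you would like to confirm the configuration parses cleanly before restarting, Apache's bundled apachectl can check it for you:

apachectl configtest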

Job done!


Update: I've received a number of messages outlining more elegant solutions than the one I propose. Perl programmers will appreciate w3mir, a package that does almost exactly the same thing but with many extra features.
