Opening Note: Please message me if anything in this writeup is confusing or unclear. I'm always willing to answer individual questions, and would like to make this writeup accessible to everyone--callow newbies and seasoned veterans alike.

PHP: Pretty Hellacious Programming

PHP was my second programming language--my first was BASIC. Edsger Dijkstra said,

"It is practically impossible to teach good programming to students that have had a prior exposure to BASIC: as potential programmers they are mentally mutilated beyond hope of regeneration."
He never said anything about PHP, but it's likely that he would consider my dynamic duo of introductory languages to have turned me into a gruesome parody of a programmer; a twisted, pathetic thing, building sacrilegious code under the auspices of a dark, hidden neurosis.

In practice, PHP is a great language for certain tasks, but teaching you how CGI works is not one of them. When I discovered my second true love, Ruby (my first love having left me to become a lesbian), my first thought was to program web applications with it. However, this wasn't as simple as I thought it would be.

Up until that point, my sole experience with web programming had been PHP with Apache. Pretty much every shared webhost under the sun has mod_php and Apache installed by default, making it very easy to write web applications with PHP: You create a ".php" file, upload it to your website, and whatever your PHP script prints out appears on the page. It was when I tried to apply these assumptions to Ruby that the house of cards came tumbling, tumbling down.

How the Web Works

Let's take a remedial class for a moment. What happens when you type "www.google.com" into your web browser and hit enter?

  1. Your browser finds the nearest DNS (Domain Name System) server and, like a disgruntled spouse on a family vacation gone awry, asks for directions.
  2. The DNS server takes the domain name, finds the IP address of the nameserver responsible for it, and passes the question along so it can go back to watching its soaps.
  3. Google's nameservers look at the domain name--"google.com"--and the subdomain--"www"--to determine which IP address should receive your request, and that address is handed back to your browser.
  4. Your browser sends its request to that address, where one of Google's scrillions of webservers receives it, just an insignificant blink of transient contact from the outside world. It splits up the request and sees that you have asked for "/", or the main directory.
  5. It finds the index page for that directory in its filesystem, and sends your browser the raw HTML.
  6. It is now your browser's responsibility to take the HTML and turn it into the symbol of modern web-based economic success we all know and love.
Whew! All of this just to find a decent picture of Alyson Hannigan in lingerie. (DISCLAIMER: I haven't actually run that search, so I take no responsibility for what you may actually find.)
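
To make step 4 a little more concrete, here is a minimal Ruby sketch of what "splitting up the request" might look like. The helper name parse_request_line is my own invention, and a real webserver deals with far more than this (headers, encodings, malformed input), so treat it as an illustration rather than how any particular server does it.

```ruby
# Hypothetical helper: split the first line of an HTTP request into its parts.
# A request for the main page starts with a line like "GET / HTTP/1.1".
def parse_request_line(line)
  method, target, version = line.strip.split(' ', 3)
  # Anything after a "?" is the query string; the rest is the path.
  path, query = target.split('?', 2)
  { method: method, path: path, query: query, version: version }
end

request = parse_request_line("GET / HTTP/1.1\r\n")
puts "Method: #{request[:method]}, path: #{request[:path]}"
```

The path is the part the webserver cares about in the next step, when it goes looking for something to send back.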

You'll notice that I emphasized #5. This isn't just because I have a distressing fixation on the number five, though I must admit that I sometimes find myself distracted by the sultry dip of its lower curve, and the sharp, almost offensive ninety-degree angle jutting salaciously out of its--oh, dear. Excuse me. So, the number fi-... The item after number four. 90% of the magic of CGI takes place in this step.

Apache

Let's drill down, fearless spelunkers of knowledge that we are. Apache is the de facto standard for webservers, a veritable colossus of feature-rich flexibility, though it has recently been challenged by cheeky upstarts like lighty (lighttpd), and the mad hatters at OKCupid were motivated to code their own webserver entirely.

When there are no scripting languages or URL tomfoolery enabled, Apache simply takes a request URI, finds the corresponding file on its filesystem, and sends the contents to the browser along with a few terse headers. But what about my precious PHP? How does it fit in?
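
The plain-file case can be sketched in a few lines of Ruby. This is not how Apache is implemented; serve_static, the "index.html" default, and the handful of headers are all assumptions I picked for illustration.

```ruby
require 'tmpdir'

# Hypothetical, heavily simplified stand-in for a static webserver:
# map the request URI to a file under doc_root and return the raw response.
def serve_static(doc_root, uri)
  path = uri.split('?', 2).first
  # Assumption: a directory request is answered with its index.html.
  path += 'index.html' if path.end_with?('/')

  root = File.expand_path(doc_root)
  file = File.expand_path(File.join(root, path))

  # Refuse anything that escapes the document root or does not exist.
  unless file.start_with?(root + File::SEPARATOR) && File.file?(file)
    return "HTTP/1.1 404 Not Found\r\nContent-Type: text/plain\r\n\r\nNot found"
  end

  body = File.read(file)
  "HTTP/1.1 200 OK\r\n" \
    "Content-Type: text/html\r\n" \
    "Content-Length: #{body.bytesize}\r\n" \
    "\r\n" + body
end

# Try it against a throwaway directory.
Dir.mktmpdir do |root|
  File.write(File.join(root, 'index.html'), '<h1>Hello</h1>')
  puts serve_static(root, '/')
end
```

Note that nothing gets executed here: the file's contents go out exactly as they sit on disk.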

Apache & PHP (Not CGI)

Suppose Apache receives a request for /generate_erotic_fiction.php (don't judge me!). At this point, mod_php kicks in. mod_php is an Apache module that tells Apache how to handle PHP files. In contrast to mod_cgi, which can "handle" any executable file, mod_php can only "handle" PHP files. So, why use one over the other?

  • mod_php Con: mod_php only enables you to use PHP, not any other language.
  • mod_php Pro: mod_php offloads much of the "integration" work from your PHP file to mod_php itself. This is why it's so easy to get started web programming with PHP: All you have to do is create a new PHP file that echoes some text, and you're done. More on this below.
When Apache receives a request for a .php file, it says, "Let's rock the house with some scripting," calls on the vicarious wisdom of mod_php and evaluates the script, spitting its output back to the browser, so visitors can enjoy the erotic stylings of procedurally generated smut. Simple, huh?

Apache & Not PHP (CGI)

Uh-oh.

It turns out that mod_php happily hides a lot of stuff that goes on under the hood. This is a blessing because it makes web scripting with PHP easy and straightforward. It is a curse, however, because it promotes an incomplete understanding of how CGI works. Any scripting language that wants to generate meaningful websites needs to have an equally meaningful environment set up before it does its stuff.

mod_cgi lets you command Apache to execute files instead of just reading them. Apache gives the program some environment variables to give it some context about who's asking for erotic fiction. It then treats everything the program prints (in plain text) before the first blank line as headers, and if everything's in order, sends the rest of the output on to the browser.
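
Here is a rough Ruby sketch of that hand-off, seen from the webserver's side. The function name run_cgi and the example IP address are made up, and this is only my approximation of the idea, not mod_cgi's actual logic: run a program with some request details in its environment, then split what it prints into headers and content at the first blank line.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical sketch of a CGI host. `command` is an array (program plus
# arguments) and `request_env` is a hash of environment variables to set.
def run_cgi(command, request_env)
  output, status = Open3.capture2(request_env, *command)
  raise 'CGI program failed' unless status.success?

  # Headers come first, then a blank line, then the content.
  head, body = output.split(/\r?\n\r?\n/, 2)
  headers = head.lines.map { |line| line.chomp.split(/:\s*/, 2) }.to_h
  [headers, body.to_s]
end

# A one-line stand-in for a CGI program, run with the current Ruby interpreter.
script = 'print "Content-Type: text/plain\n\nHello, #{ENV["REMOTE_ADDR"]}"'
headers, body = run_cgi([RbConfig.ruby, '-e', script], 'REMOTE_ADDR' => '203.0.113.7')

puts headers['Content-Type']
puts body
```

The program being run never talks to the network at all. It just reads its environment and prints, which is why any executable can be a CGI application.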

The big difference between mod_cgi and non-CGI solutions like mod_php is that, when using mod_cgi to run your scripts, you need to massage the environment into ease of use yourself, or with libraries provided by your scripting language.

The Unix Environment

If you're familiar with Linux/Unix computers, you'll be familiar with environment variables, which are special values that exist in the invisible ether of your operating system. Environment variables can be anything, from system-specified details like where to look when executing programs, to user-specified nonsense like what your favorite text editor is.

When Apache executes a CGI application, it tweaks the environment first, by setting some contextual environment variables. CGI applications can then access information about the web request--like the URL, the IP address of the client, and so on--just by examining some environment variables.
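
As a small illustration, a Ruby CGI script could report on its own request like this. REQUEST_METHOD, QUERY_STRING and REMOTE_ADDR are standard CGI variable names; the function describe_request, and the trick of passing the environment in as a parameter so it can be tried outside a webserver, are my own assumptions.

```ruby
#!/usr/bin/env ruby

# Build a plain-text CGI response describing the current request.
# `env` defaults to the real process environment, which is where a
# webserver puts the request details before running the script.
def describe_request(env = ENV)
  lines = [
    "Method: #{env.fetch('REQUEST_METHOD', 'unknown')}",
    "Query: #{env.fetch('QUERY_STRING', '')}",
    "Client: #{env.fetch('REMOTE_ADDR', 'unknown')}"
  ]
  "Content-Type: text/plain\n\n" + lines.join("\n") + "\n"
end

print describe_request
```

Run from a plain command line, the values fall back to their defaults, because nobody has set those variables for you. That is exactly the work mod_php was quietly doing on your behalf.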

Solutions

Okay, okay. I've covered all of this nonsense about why it's hard to work with CGI when you're not using mod_php. Now, for the six readers still with me, I'll talk about how to begin working with CGI.

You'll be able to follow along best if you have access to a Linux/Unix command line: Try getting a VPS at Slicehost, or running an Ubuntu Linux server on an old computer. The fact is, it's a lot harder to learn the inner workings of CGI on a shared host, because you need fine-grained control over your environment.

Level 1

Simply put, a CGI application is any executable program that outputs headers, then content, most often for webservers. So, let's program an extremely minimal example in C.

#include <stdio.h>

int main(void) {

    // Tell Apache that we're serving HTML, 
    // since CGI applications can serve anything (except pork chops).
    printf( "Content-Type: text/html" );
    // A blank line ends the headers and tells Apache to get ready for our content.
    printf( "\n\n" );
    // Output content.
    printf( "Sup guys?" );
    // In Unix world, ending by returning 0 means the program finished successfully.
    return( 0 );
}
Great! Compile this to a filename ending in ".cgi", which is the extension Apache configurations conventionally map to CGI applications (via AddHandler cgi-script .cgi). Now, Apache is kind of slow sometimes, so we need to specifically tell it to execute any CGI applications it finds. We can do this with the
Options +ExecCGI
directive, which can go in an .htaccess file or the configuration file for your site. Now, pointing Apache at this file will produce a page with our salutation.

Level 2

Okay, that was pretty low-level, and I won't make fun of you if you skipped over it. Let's approach this with a scripting language, like Perl, Python or Ruby. Actually, things are going to work pretty much the same: Just create a script that outputs a header, the delimiter, and some text. Since Apache runs the file directly, it needs a shebang line (like #!/usr/bin/env ruby) at the top and execute permission (chmod +x). Let's say we're writing a Ruby script that ends in .rb. Now, we need a two-line .htaccess file block:

Options +ExecCGI
AddHandler cgi-script .rb
Or, if you want to run your program without an extension:
<FilesMatch "^myprogram$">
    ForceType application/x-httpd-cgi
</FilesMatch>
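
For completeness, the Ruby script itself might look something like the sketch below. The filename hello.rb and the function name cgi_response are hypothetical; the only parts that matter are the header, the blank-line delimiter, and the content, in that order.

```ruby
#!/usr/bin/env ruby
# Minimal CGI script, assuming it is saved as (say) hello.rb, marked
# executable, and served from a directory covered by the config above.

def cgi_response
  header    = "Content-Type: text/html"
  delimiter = "\n\n"                 # the blank line that ends the headers
  content   = "<p>Sup guys?</p>"
  header + delimiter + content + "\n"
end

print cgi_response
```

It is the same three steps as the C version, just with less ceremony.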

Level 3

Jeez, this is tedious! Manually outputting headers, and we haven't even gotten to environment variables yet! Do we really have to reinvent the wheel? No, of course not! I just enjoy making people who want to learn suffer. No, no, no. We started at ground level to get a firm foundation in what it means to program CGI. But the truth is, there are thousands of great libraries and frameworks already out there. For example, Ruby:

#!/usr/bin/env ruby

require 'cgi'

cgi = CGI.new( "html3" )
cgi.out do
    cgi.html do
        cgi.head{ cgi.title{"TITLE"} } +
        cgi.body do
            cgi.div do
                "Fo' sho."
            end
        end
    end
end
Gasp! No headers! Yes, Ruby's CGI module automagically determines things like headers, and can even help with outputting HTML. This is similar to PHP's mod_php functionality. In fact, there's a mod_ruby, but it doesn't really work.

Level 4

Still higher level are frameworks like Camping and Ruby on Rails. Parallels exist for every language imaginable. The choice can be suffocating--pick one blindly and just jump in!

Performance

So, why did I make you sit through all that bunk about Unix environment variables? Here's one reason: We need to understand performance. If a hundred thousand people are hitting your script every hour, mod_php is going to be much faster than a script running as CGI. Why?

  • mod_php is grafted onto Apache.
    • Apache is always running.
  • mod_cgi runs a program every time Apache gets a request.
    • mod_cgi has to set up the environment variables (and, in the case of scripting languages, the interpreter) every single time someone hits the page.
What now? We need a solution like mod_php for every language, or the handy FastCGI, which lets CGI applications run as persistent processes that exist beyond the purview of mod_cgi itself. This is its own fiasco, but it's good to know.
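
If you want to feel the difference yourself, here is a rough Ruby sketch that compares the two styles. It is not a benchmark of Apache, mod_php, or FastCGI; the function names are made up, and all it shows is the cost of starting a fresh interpreter for every "request" versus calling code in a process that is already running.

```ruby
require 'benchmark'
require 'open3'
require 'rbconfig'

# The "application": the same work in both styles.
def handle_request(name)
  "Hello, #{name}"
end

# CGI style: start a brand new interpreter for every single request.
def cgi_style(name)
  out, _status = Open3.capture2(RbConfig.ruby, '-e', 'print "Hello, #{ARGV[0]}"', name)
  out
end

requests = 5
spawned  = Benchmark.realtime { requests.times { cgi_style('world') } }
resident = Benchmark.realtime { requests.times { handle_request('world') } }

puts format('New process per request: %.4f seconds', spawned)
puts format('Already-running process: %.6f seconds', resident)
```

The exact numbers will vary from machine to machine, but the per-request process should come out noticeably slower, since most of its time goes into starting the interpreter rather than doing any work.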