Common Gateway Interface. CGI does not have to be programmed in Perl; it can be done in C, C++, shell scripts, batch files, Pascal, or (your favourite programming language goes here). Perl is most commonly used for it, and as such has a specialised module to make your life easier (CGI.pm).
If you want to do it in another language, make sure that you pass a valid header back, or else you won't get very far. The header for an HTML page looks like this: Content-type: text/html, followed by a blank line (that is, two line endings: \r\n\r\n).

CGI works by processing the data which you send to it, usually in a form or by some user action.
Commonly the CGI script will communicate with a database for more complex applications, for example, Everything 2 talks to a MySQL database using DBI. (Technically, E2 is not using CGI as it uses mod_perl, which is a Perl API to the Apache webserver.)

For more information on CGI in Perl, see perldoc CGI or go to www.perlmonks.org. If you ever see Matt's Script Archive, stay well clear! This code is terribly written (in Perl 4) and contains many bugs and exploits. London.PM have rewritten these scripts, and you can find them at london.pm.org and Sourceforge as "nms".

Dynamic web pages can also be created using JavaScript or Java applets; however, this is usually a different sort of "dynamic" (i.e., dancing pixies on your web browser as opposed to personalised pages from the website).
You could also interface with a CGI script using a Java applet (or other client-side program) instead of plain HTML.

What xriso is talking about is taint checking. This can be summed up by saying "never implicitly trust user input". In Perl, use the -T flag to make it check for tainted data. You should always verify user input before doing anything with it.
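
The "verify before use" idea is language-neutral. Here is a minimal sketch in Ruby rather than Perl, assuming a form field that is supposed to hold a username. The helper name `safe_username` and the allowed pattern are made up for illustration; neither comes from CGI.pm or any library.

```ruby
# Hypothetical helper: accept a username only if it matches a strict whitelist.
# The rule (1 to 16 letters, digits or underscores) is an assumption.
def safe_username(raw)
  allowed = /\A[A-Za-z0-9_]{1,16}\z/
  # Reject nil, non-strings, and anything with characters outside the whitelist.
  return nil unless raw.is_a?(String) && raw.match?(allowed)
  raw
end
```

Anything that comes back as nil should be refused outright rather than "cleaned up" and used anyway.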

The Common Gateway Interface system was developed fairly early in the history of the original NCSA webserver. The goal was to allow for "interactive" web pages in concert with the newly created "HTML forms" feature of Mosaic, and to do so in a roughly generalizable way. The markup and communication protocols created at the time to govern this process have remained almost entirely unchanged to this day.

Ordinary webservers had until then simply received filenames and transmitted files - rather simple. The NCSA team looked at several alternatives whereby a web page could be generated on the fly, via code, and settled on the quickest and dirtiest solution:

A request for a file arrives. If that file is determined to be a CGI, then rather than being sent directly to the client, it is executed and provided with the data the client collected from the HTML form (in an encoded format), either via its stdin or an environment variable. That "CGI Program" does whatever it likes with the data - it is in all respects a normal executable of the web server's host operating system - and then produces output on its stdout channel which is, more or less, sent directly back to the client's web browser.
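
As a rough sketch of the "stdin or environment variable" part, the Ruby function below fetches the still-encoded form data either way. The function name `raw_form_data` is an assumption for illustration; the variable names REQUEST_METHOD, CONTENT_LENGTH and QUERY_STRING are the ones the CGI convention defines.

```ruby
# Return the raw, still-encoded form data for one request.
# env is a Hash-like set of CGI environment variables; input is an IO (stdin).
def raw_form_data(env, input)
  if env["REQUEST_METHOD"] == "POST"
    # POST: the server passes the body on stdin and its size in CONTENT_LENGTH.
    length = env.fetch("CONTENT_LENGTH", "0").to_i
    length > 0 ? input.read(length).to_s : ""
  else
    # GET: the encoded form data arrives in the QUERY_STRING variable.
    env.fetch("QUERY_STRING", "")
  end
end
```

In a real CGI program this would be called as `raw_form_data(ENV, $stdin)`.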

The entire gamut of executable types has been exploited to operate as CGIs at one time or other: C and C++ programs, shell scripts, and Perl scripts are among the most common.

As idoru mentions, the first thing in that output is formalized to be the MIME type of the following data, the accidental omission of which confuses many first-time CGI authors. This is an intelligent design decision by the system's designers, as CGIs may generate not just a new web page, but images, audio, video, or any other kind of data which the web browser can handle, on the fly.
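
To make the point about non-HTML output concrete, here is a small Ruby sketch that labels a generated image with its MIME type. Both helper names (`cgi_response`, `red_circle_svg`) and the image itself are invented for this example.

```ruby
# Build a complete CGI response: a Content-Type header, a blank line, the body.
def cgi_response(content_type, body)
  "Content-Type: #{content_type}\r\n\r\n#{body}"
end

# A hypothetical image generated on the fly: a red circle as SVG markup.
def red_circle_svg
  '<svg xmlns="http://www.w3.org/2000/svg" width="40" height="40">' \
    '<circle cx="20" cy="20" r="15" fill="red"/></svg>'
end
```

A CGI program would then simply `print cgi_response("image/svg+xml", red_circle_svg)`; swap the type and body to serve anything else the browser understands.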

CGIs and forms represent a turning point of the World Wide Web as a medium. They quickly became ubiquitous, and the core of that mechanism has become something like the heart of the functioning web.

As site designers and developers quickly discovered, the mechanism of the CGI itself, which refers specifically to the interface between the standalone executable and the webserver, was inefficient to use on a large scale, since processing each page request would require a fork and an exec call, churning OS resources to start and then eventually terminate an (often large) executable process over and over again for each hit, filling process tables and swap space, and in general making poor use of the way Unix allocates resources.

Many web application designers would address this problem by going on to create custom webservers which would directly perform the specific kind of automation they required; many of the web's most popular sites are the result of this custom ground-up programming.

As general purpose webserver designs matured, however, almost all of them drifted towards a "module" system as a kind of middle ground. Thus, the standalone CGI executable was replaced by a library, integrated either statically at compile time or on demand at runtime. These module interfaces typically included a more sophisticated interface to the webserver's various resources, especially shared memory and persistence. A majority of the automation now operating on the internet works via modules - a significant scalability win. Module-driven automation is not technically a CGI at all, since it does not at any time refer to the Common Gateway Interface; however, the term "CGI" has become synonymous with web-based applications, and is often misapplied to refer to them all.

It is also worth mentioning the evolution of the CGI-based script into the incarnation it enjoys today - for instance, on this site. Shell scripts in general and Perl scripts in particular suffered from the efficiency problems of CGI invocation; the repeated invocation of Perl's often large runtime interpreter to handle each page request was a scalability nightmare that was bringing many sites to their knees under load. However, Perl, as well as a number of other scripting languages (some, like PHP, were designed more recently, explicitly for creating CGIs), was too cheap and convenient to give up easily.

The solution eventually settled on by the industry has been the script interpreter web server module, containing a single persistent interpreter which handles all the transactions. Only one copy (per server process) need be kept in memory, it is initialized only once, at server start, and it can theoretically afford variable continuity between transactions. Apache's mod_perl is an excellent example. The vast majority of web automation written today is written in a scripting language against this or a similar type of system (PHP, ColdFusion, etc).

What you must remember when writing CGI: the client can give you anything. Never put the query string in some sort of "eval" call, because it can be easily exploited to execute evil code. Treat the query as if it is a live bomb. Don't just drop it into your environment. You must carefully take it apart, & by & (And remember to always cut the red wire, and never the green one).

Once you have taken the query apart, and have put all the names and values in their individual strings, you must then go to each string and decode the percent signs (%2A -> hex 2A, which is decimal 42 -> '*').
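
The two steps above (split on &, then decode each piece) can be sketched in Ruby as follows. The function names `percent_decode` and `parse_query` are assumptions for illustration, and the sketch also turns '+' into a space, which is how HTML forms encode spaces. In practice a library routine such as Ruby's `URI.decode_www_form` does the same job.

```ruby
# Decode one URL-encoded component: '+' means space, %XX is one byte in hex.
def percent_decode(text)
  decoded = text.b.tr("+", " ").gsub(/%([0-9A-Fa-f]{2})/) { $1.hex.chr }
  decoded.force_encoding("UTF-8")
end

# Split a query string on '&', then each pair on its first '=', then decode.
# Nothing is ever eval'ed; names and values stay plain strings.
def parse_query(query)
  query.split("&").each_with_object({}) do |pair, params|
    next if pair.empty?
    name, value = pair.split("=", 2)
    params[percent_decode(name)] = percent_decode(value.to_s)
  end
end
```

Note that the splitting happens before the decoding, so an encoded & or = inside a value cannot be mistaken for a separator.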

Also, in a "comments" system, the comment can contain nasty little surprises (e.g. <img src="http://olsentwins.com/photogallery/images/120_small.jpg">). These profane comments are the reason that you must disallow many types of tags. Always beware the evil query.
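
One blunt way to defuse such a comment is to escape every markup character before printing it, so that no tag survives at all. This is stricter than disallowing only some tags, and it is a substitute technique, not necessarily what any particular site does. The helper name `escape_html` is an assumption for this sketch.

```ruby
# Replace the characters that let text break out into markup.
def escape_html(text)
  replacements = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;",
                   '"' => "&quot;", "'" => "&#39;" }
  text.gsub(/[&<>"']/) { |char| replacements[char] }
end
```

The offending <img> tag then shows up on the page as harmless literal text instead of loading the image.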

Opening Note: Please message me if anything in this writeup is confusing or unclear. I'm always willing to answer individual questions, and would like to make this writeup accessible to everyone--callow newbies and seasoned veterans alike.

PHP: Pretty Hellacious Programming

PHP was my second programming language--my first was BASIC. Edsger Dijkstra said,

"The teaching of BASIC should be rated as a criminal offense: it mutilates the mind beyond recovery."
He never said anything about PHP, but it's likely that he would consider my dynamic duo of introductory languages to have turned me into a gruesome parody of a programmer; a twisted, pathetic thing, building sacrilegious code under the auspices of a dark, hidden neurosis.

In practice, PHP is a great language for certain tasks, but understanding CGI is not one of them. When I discovered my second true love, Ruby (my first love having left me to become a lesbian), my first thought was to program web applications with it. However, this wasn't as simple as I thought it would be.

Up until that point, my sole experience with web programming had been PHP with Apache. Pretty much every shared webhost under the sun has mod_php and Apache installed by default, making it very easy to write web applications with PHP: You create a ".php" file, upload it to your website, and whatever your PHP script prints out appears on the page. It was when I tried to apply these assumptions to Ruby that the house of cards came tumbling, tumbling down.

How the Web Works

Let's take a remedial class for a moment. What happens when you type "www.google.com" into your web browser and hit enter?

  1. Your browser finds the nearest DNS, or Domain Name Server, and, like a disgruntled spouse on a family vacation gone awry, asks for directions.
  2. The DNS takes the domain name and finds its corresponding nameserver's IP address, and sends the request that way so it can go back to watching its soaps.
  3. Google's nameservers look at the domain name--"google", and the subdomain--"www", to determine which IP address should receive the request.
  4. One of Google's scrillions of webservers receives the request, just an insignificant blink of transient contact from the outside world. It splits up the request and sees that you have asked for "/", or the main directory.
  5. It finds the index page for that directory in its filesystem, and sends your browser the raw HTML.
  6. It is now your browser's responsibility to take the HTML and turn it into the symbol of modern web-based economic success we all know and love.
Whew! All of this just to find a decent picture of Alyson Hannigan in lingerie. (DISCLAIMER: I haven't actually run that search, so I take no responsibility for what you may actually find.)

You'll notice that I emphasized #5. This isn't just because I have a distressing fixation on the number five, though I must admit that I sometimes find myself distracted by the sultry dip of its lower curve, and the sharp, almost offensive ninety-degree angle jutting salaciously out of its--oh, dear. Excuse me. So, the number fi-... The item after number four. 90% of the magic of CGI takes place in this step.


Let's drill down, fearless spelunkers of knowledge that we are. Apache is the de facto standard for webservers, a veritable colossus of feature-rich flexibility, though it has recently been challenged by cheeky upstarts like lighty, and the mad hatters at OKCupid were motivated to code their own webserver entirely.

When there are no scripting languages or URL tomfoolery enabled, Apache simply takes a request URI, finds the corresponding file on its filesystem, and sends the contents to the browser along with a few terse headers. But what about my precious PHP? How does it fit in?
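
Here is a rough Ruby sketch of that "URI in, file path out" step. It is not how Apache is actually implemented; the function name `resolve_static_path`, the index file name `index.html` and the Unix-style paths are all assumptions, and percent-decoding of the path is left out to keep it short.

```ruby
# Map a request path such as "/docs/" onto a file under the document root.
# Returns nil when the path would escape the root (for example via "..").
def resolve_static_path(docroot, request_path)
  root = File.expand_path(docroot)
  path = request_path.split("?", 2).first.to_s   # drop any query string
  path += "index.html" if path.end_with?("/")    # assumed index file name
  full = File.expand_path(File.join(root, path))
  full.start_with?(root + "/") ? full : nil
end
```

A server built on this would read the returned file and send its contents, or answer with an error when the result is nil.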

Apache & PHP (Not CGI)

Suppose Apache receives a request for /generate_erotic_fiction.php (don't judge me!). At this point, mod_php kicks in. mod_php is an Apache module that tells Apache how to handle PHP files. In contrast to mod_cgi, which can "handle" any executable file, mod_php can only "handle" PHP files. So, why use one over the other?

  • mod_php Con: mod_php only enables you to use PHP, not any other language.
  • mod_php Pro: mod_php offloads much of the "integration" work from your PHP file to mod_php itself. This is why it's so easy to get started web programming with PHP: All you have to do is create a new PHP file that echoes some text, and you're done. More on this below.
When Apache receives a request for a .php file, it says, "Let's rock the house with some scripting," calls on the vicarious wisdom of mod_php and evaluates the script, spitting its output back to the browser, so visitors can enjoy the erotic stylings of procedurally generated smut. Simple, huh?

Apache & Not PHP (CGI)


It turns out that mod_php happily hides a lot of stuff that goes on under the hood. This is a blessing because it makes web scripting with PHP easy and straightforward. It is a curse, however, because it promotes an incomplete understanding of how CGI works. Any scripting language that wants to generate meaningful websites needs to have an equally meaningful environment set up before it does its stuff.

mod_cgi lets you command Apache to execute files instead of just reading them. Apache gives the program some environment variables to give it some context about who's asking for erotic fiction. It then uses the first few lines of the program's (plain text) output as its headers, and if everything's in order, gives the rest of the output back to the browser.
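
To picture what "uses the first few lines of the output as its headers" means, here is a Ruby sketch that splits a program's output at the first blank line. It only illustrates the idea and is not mod_cgi's real code; the function name `split_cgi_output` is made up.

```ruby
# Split raw CGI program output into a header Hash and the body String.
def split_cgi_output(output)
  # Everything before the first blank line is headers; the rest is content.
  head, body = output.split(/\r?\n\r?\n/, 2)
  headers = head.to_s.split(/\r?\n/).each_with_object({}) do |line, parsed|
    name, value = line.split(":", 2)
    parsed[name.strip] = value.to_s.strip
  end
  [headers, body.to_s]
end
```

If the program forgets the blank line, everything ends up in the header half and there is no body to send, which is roughly why a missing delimiter produces a server error instead of a page.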

The big difference between mod_cgi and non-CGI solutions like mod_php is that, when using mod_cgi to run your scripts, you need to massage the environment into ease of use yourself, or with libraries provided by your scripting language.

The Unix Environment

If you're familiar with Linux/Unix computers, you'll be familiar with environment variables, which are special values that exist in the invisible ether of your operating system. Environment variables can be anything, from system-specified details like where to look when executing programs, to user-specified nonsense like what your favorite text editor is.

When Apache executes a CGI application, it tweaks the environment first, by setting some contextual environment variables. CGI applications can then access information about the web request--like the URL, the IP address of the client, and so on--just by examining some environment variables.
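
For example, a Ruby CGI program might gather the request details like this. The variable names are the standard CGI ones; the function name `request_context` and the keys of the returned Hash are assumptions for illustration.

```ruby
# Collect the request details a CGI program can read from its environment.
# env defaults to the real process environment; pass a Hash when testing.
def request_context(env = ENV)
  {
    method:    env["REQUEST_METHOD"],   # e.g. "GET" or "POST"
    path:      env["SCRIPT_NAME"],      # URL path of the script itself
    query:     env["QUERY_STRING"],     # everything after the "?"
    client_ip: env["REMOTE_ADDR"],      # address of the requesting client
    agent:     env["HTTP_USER_AGENT"]   # request headers arrive as HTTP_*
  }
end
```

Run outside a webserver, most of these come back nil, which is a handy reminder that the webserver is the one filling them in.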


Okay, okay. I've covered all of this nonsense about why it's hard to work with CGI when you're not using mod_php. Now, for the six readers still with me, I'll talk about how to begin working with CGI.

You'll be able to follow along best if you have access to a Linux/Unix command line: Try getting a VPS at Slicehost, or running an Ubuntu Linux server on an old computer. The fact is, it's a lot harder to learn the inner workings of CGI on a shared host, because you need fine-grained control over your environment.

Level 1

Simply put, a CGI application is any executable program that outputs headers, then content, most often for webservers. So, let's program an extremely minimal example in C.

#include <stdio.h>

int main(void) {

    // Tell Apache that we're serving HTML, 
    // since CGI applications can serve anything (except pork chops).
    printf( "Content-Type: text/html" );
    // Tell Apache to get ready for our content.
    printf( "\n\n" );
    // Output content.
    printf( "Sup guys?" );
    // In Unix world, ending by returning 0 means the program finished successfully.
    return 0;
}
Great! Compile this to a filename ending in ".cgi", which is what Apache will, by default, recognize as a CGI application. Now, Apache is kind of slow sometimes, so we need to specifically tell it to execute any CGI applications it finds. We can do this with the
Options +ExecCGI
directive, which can go in an .htaccess file or the configuration file for your site. Now, pointing Apache at this file will produce a page with our salutation.

Level 2

Okay, that was pretty low-level, and I won't make fun of you if you skipped over it. Let's approach this with a scripting language, like Perl, Python or Ruby. Actually, things are going to work pretty much the same: Just create a script that outputs a header, the delimiter, and some text. Let's say we're writing a Ruby script that ends in .rb. Now, we need a two-line .htaccess file block:

Options +ExecCGI
AddHandler cgi-script .rb
Or, if you want to run your program without an extension:
<FilesMatch "^myprogram$">
    ForceType application/x-httpd-cgi
</FilesMatch>
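
For reference, the Ruby script described at the start of this level might look something like the sketch below: a header, the blank-line delimiter, and some text. The method name `page` is an assumption, and the file would need to be marked executable before Apache can run it.

```ruby
#!/usr/bin/env ruby
# Minimal CGI script: header, blank-line delimiter, then the content.

def page
  "Content-Type: text/html\r\n" \
    "\r\n" \
    "Sup guys?\n"
end

# Only print when run directly (as Apache would run it), not when loaded.
print page if $PROGRAM_NAME == __FILE__
```

Wrapping the output in a method is only there so the script can be loaded and checked without printing; a bare `print` of the same string works just as well.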

Level 3

Jeez, this is tedious! Manually outputting headers, and we haven't even gotten to environment variables yet! Do we really have to reinvent the wheel? No, of course not! I just enjoy making people who want to learn suffer. No, no, no. We started at ground level to get a firm foundation in what it means to program CGI. But the truth is, there are thousands of great libraries and frameworks already out there. For example, Ruby:

#!/usr/bin/env ruby

require 'cgi'

cgi = CGI.new( "html3" )
cgi.out do
    cgi.html do
        cgi.head{ cgi.title{"TITLE"} } +
        cgi.body do
            cgi.div do
                "Fo' sho."
            end
        end
    end
end
Gasp! No headers! Yes, Ruby's CGI module automagically determines things like headers, and can even help with outputting HTML. This is similar to PHP's mod_php functionality. In fact, there's a mod_ruby, but it doesn't really work.

Level 4

Still higher level are frameworks like Camping and Ruby on Rails. Parallels exist for every language imaginable. The choice can be suffocating--pick one blindly and just jump in!


So, why did I make you sit through all that bunk about Unix environment variables? Here's one reason: We need to understand performance. If a hundred thousand people are hitting your script every hour, mod_php is going to be much faster than a script running as CGI. Why?

  • mod_php is grafted onto Apache.
    • Apache is always running.
  • mod_cgi runs a program every time Apache gets a request.
    • mod_cgi has to set up the environment variables (and, in the case of scripting languages, the interpreter) every single time someone hits the page.
What now? We need a solution like mod_php for every language, or the handy FastCGI, which lets CGI applications run as constant processes that exist beyond the purview of mod_cgi itself. This is its own fiasco, but it's good to know.
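
Here is a toy Ruby model of the difference, not a benchmark. The "setup" entry in the log stands in for starting an interpreter and loading the script, and all three function names are invented for this sketch.

```ruby
# "Setup" stands in for starting an interpreter and loading the script.
def new_handler(setup_log)
  setup_log << :setup            # record that the expensive part ran
  hits = 0                       # state that lives as long as the handler
  ->(path) { hits += 1; "served #{path} (hit #{hits})" }
end

# CGI style: a fresh handler (a new process) for every single request.
def serve_cgi_style(paths, setup_log)
  paths.map { |path| new_handler(setup_log).call(path) }
end

# mod_php / FastCGI style: one long-lived handler answers every request.
def serve_persistent(paths, setup_log)
  handler = new_handler(setup_log)
  paths.map { |path| handler.call(path) }
end
```

Serving three requests CGI style logs three setups and every response says "hit 1"; the persistent version logs one setup and its counter climbs to 3, which is both the speed win and the "state survives between requests" side effect.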
