A Guide to Web Authentication Alternatives

Copyright Dec 1997, Jan Wolter
Last Updated Oct 2003.

3. Do-It-Yourself Authentication Options

Site developers are by no means limited to using the standard authentication protocols. At one extreme, you could have the browser load a Java Applet which would open a connection back to your site and use any protocol you care to invent. Presumably Active-X tools and plug-ins could be devised to do similar things.

However, in this document I'm going to presume that we are using only the basic built-in capabilities that nearly all browsers share - forms and cookies and such-like. This still leaves many alternatives. Though none are significantly more secure than basic authentication, they do have other advantages.

Considerable caution is needed, however, to ensure that your authentication system isn't far less secure than basic authentication.

3.1. Where to Do Authentication

The first question is which program on the server is going to do the actual authentication. Authentication can be checked either directly by the HTTP server daemon or by a CGI program run by the HTTP server daemon.

We will first discuss the tradeoffs between checking authentication in CGI programs and checking it in the HTTP server itself. Then we will turn to the relatively simple problem of getting a login and password from the user and checking that they are valid. Finally we will discuss the more difficult problem of tracking the user from page to page.

3.1.1. Authentication in HTTP Server Daemon

Normally only basic authentication and (maybe) digest authentication are built into HTTP servers. But with modular servers like Apache, it is possible to add special purpose functionality into your HTTP server. This is really the place where authentication logically belongs. From the HTTP server, you can control access not just to CGI programs, but to static HTML pages as well. However, although Apache is very well designed and comparatively easy to modify in this way, it is still a fairly challenging programming task.

One problem with this alternative is that your password database must be readable to the HTTP server. In most installations all CGI programs run as the same user as the HTTP server, so your password database will be readable to every CGI on your system. If just one has a major security hole, you could be compromised (and it is astonishingly easy to put a major security hole into a CGI program).

Another problem with authenticating through the HTTP server is that the outcome of a login attempt is strictly binary - success or failure. In some applications, you may want to handle different failure modes differently. For example, you may want to give the user different kinds of feedback if the login was incorrect, or the password was incorrect or both were correct but the account has expired, etc. This can not be easily done with HTTP-based authentication.

3.1.2. Authentication in CGI Programs

CGI programs are easy to write, and putting authentication into them is easier than modifying your HTTP server. If you do this, however, you must put authentication testing into all your CGI programs, and you must access everything through CGI programs. If you have static HTML pages that you want to restrict, then you will need to move them outside the document root and write a CGI program that checks authentication and sends the file only if the authentication works. This may not be especially difficult, but it isn't as clean a design as letting the HTTP server do the checking.
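As a rough illustration of the "CGI program that sends the file" idea, here is a minimal Python sketch. The directory name PROTECTED_DIR, the function names, and the is_authenticated flag (standing in for whatever check your CGI really does) are all my own inventions for the example, not part of any standard interface.

```python
import os

# Hypothetical directory, outside the document root, holding restricted pages.
PROTECTED_DIR = "/var/www-private"

def resolve_protected(name, base=PROTECTED_DIR):
    """Map a requested name to a path under base.

    Returns None if the name would escape base (for example via "..").
    """
    base = os.path.realpath(base)
    path = os.path.realpath(os.path.join(base, name))
    if os.path.commonpath([base, path]) != base:
        return None
    return path

def serve(name, is_authenticated, base=PROTECTED_DIR):
    """Return a complete CGI response (headers plus body) as bytes."""
    if not is_authenticated:
        return (b"Status: 403 Forbidden\r\n"
                b"Content-Type: text/plain\r\n\r\nLogin required.\n")
    path = resolve_protected(name, base)
    if path is None or not os.path.isfile(path):
        return (b"Status: 404 Not Found\r\n"
                b"Content-Type: text/plain\r\n\r\nNo such page.\n")
    with open(path, "rb") as f:
        body = f.read()
    return b"Content-Type: text/html\r\n\r\n" + body
```

The path check matters as much as the authentication check: without it, a request for something like "../../etc/passwd" would turn the script into a general file server.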

Although CGI programs are generally easy to write, any program that handles sensitive information like passwords must be written with great care. Having your CGI programs handle passwords means they become sensitive from a security viewpoint, and this adds significantly to development cost. On some systems, there can also be security risks involved in passing authentication information from the HTTP server to the CGI program for checking.

3.2. Collecting and Checking Login Information

Most do-it-yourself authentication systems are set up so that if an un-authenticated user tries to run any CGI program, the user gets re-directed to a page containing a login form. Typically the login form includes a text input box for the user ID, a password input box for the password, and a submit button. The form can include other controls, like switches to select different options. It can be a much prettier and more informative page than the pop-up box used with basic authentication.

3.2.1. Designing the Login Form

The ACTION field of the <FORM> directive on the login page should point to a CGI program that will test authentication. It's important that the <FORM> directive on a login form specify METHOD="POST". Without this the default "GET" method will be used. This has three problems:

Normally you will use a TYPE="text" input box for the user's login and a TYPE="password" input box for the user's password. You can improve security slightly by not giving these input boxes obvious names like "login" and "passwd". If someone is sniffing packets on the net someplace, then they are likely looking for keywords like those.

It is important to make sure that the user's browser does not cache pages that have login/password information on them. If you let the login page be cached, then after the user has logged off and left his console, someone else may be able to use the browser's "back" function to return to his login screen, which will still have his login and password in the text boxes, and hit submit again to resend the same data. Telling the browser not to cache this page prevents that. Many newer browsers automatically erase password boxes when you go back to that screen. This is a nice (though sometimes annoying) security feature, but you can't count on it happening on all browsers.

You can set pages not to be cached by using the following three HTTP response headers:

  Expires: Thu, 01 Jan 1970 00:00:00 GMT
  Pragma: no-cache
  Cache-Control: no-cache

None of these alone is sufficient to convince all browsers not to cache, but the three together work for most browsers. You can also set these via the HTML META tags (e.g. <META HTTP-EQUIV="Cache-Control" CONTENT="no-cache">) but many browsers are flaky about handling these and most proxies ignore them completely. I have a very old page giving more detail on this, which badly needs updating.
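For what it's worth, here is a small Python sketch of how a CGI script might emit those three headers ahead of its page. The function names are made up for the example; the header values are the ones listed above.

```python
def no_cache_headers():
    """The three response headers discussed above, as (name, value) pairs."""
    return [
        ("Expires", "Thu, 01 Jan 1970 00:00:00 GMT"),
        ("Pragma", "no-cache"),
        ("Cache-Control", "no-cache"),
    ]

def cgi_response(body, content_type="text/html", extra_headers=()):
    """Build a CGI response: headers, a blank line, then the page body."""
    headers = [("Content-Type", content_type)]
    headers += no_cache_headers()
    headers += list(extra_headers)
    head = "".join(f"{name}: {value}\r\n" for name, value in headers)
    return head + "\r\n" + body
```

A login script would then write something like cgi_response(login_form_html) to standard output, where login_form_html is whatever page text it has built.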

While we are on the subject of caching, if any of the pages displayed during the user's session contain really sensitive information, then you should consider setting no-cache on those pages too. If you are using some form of session IDs to track users, then it will be impossible for someone who comes along after the user has logged off to go back and reload the page, because the session ID will have been invalidated after logout.

Preferably the page with the login form should be kept simple, so that users aren't too annoyed by the fact that it has to be reloaded from scratch every time they go back to it. If you want to put lots of other stuff on the same page with the login box, consider putting the login box in a separate frame, and having only that frame uncached, instead of the whole page.

Modern browsers go beyond the mere caching of form data - they provide an autocomplete feature that can remember what you put into a form, and automatically fill in the same values for you again the next time you visit. The browser will typically prompt you to ask if you want the login and password to a site saved. This is a considerable convenience, and an obvious security problem. Anyone who can access that person's machine can also log in as him. Sites with important data to secure should probably discourage this by adding an "autocomplete=off" option to the login form, like this:

  <form action="login.cgi" method="post" autocomplete="off">

This will cause most (but not all) browsers that do autocompletion of passwords not to do it for this login form. For some browsers this behavior can be further discouraged by varying the URL of the page containing the form (perhaps adding random extra path information) or by varying the field names (so the login field name might be a random string starting with "l" and the password field would be a random string starting with "p"). Berkeley's Workstation Support Services has a good page about this.

3.2.2. Checking the Login

Note that when the user submits the form, the password will be sent over the network in clear text, just as it is in basic authentication. There is really no way around this except to do something like using a Java Applet for the login screen, or to run on top of a secure protocol like SSL.

The CGI program that checks logins should look up the login ID given by the user in some kind of user database and, if there is such a user, check the password supplied by the user against the one stored there.

For security reasons, you should not store the clear text password in the password database. Instead, store the result of running some one-way encryption algorithm (like the Unix crypt() function or MD5) on the password, and check the password by encrypting the one given by the user with the same algorithm and comparing the result to the value in the database. You should also include some other information, like the login name or some random salt, in the encryption, so that two users with the same password don't end up with the same encrypted string. Don't try to invent your own encryption algorithm unless you precisely understand the mathematical basis of the standard ones. Amateur cryptosystems always stink.
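Here is a minimal Python sketch of the store-and-compare scheme just described. Note that I have substituted PBKDF2-HMAC-SHA256 from Python's standard hashlib module for the crypt() or MD5 functions named above; that choice, the iteration count, and the "salt$digest" storage format are assumptions made for the example, not anything this guide prescribes.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor; tune for your own hardware

def hash_password(password, salt=None):
    """Return "salt$digest" (both hex) for storage in the user database."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt for each stored password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt.hex() + "$" + digest.hex()

def check_password(password, stored):
    """Re-hash the supplied password with the stored salt and compare."""
    salt_hex, _digest_hex = stored.split("$")
    candidate = hash_password(password, bytes.fromhex(salt_hex))
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(candidate, stored)
```

Because the salt is random, two users with the same password end up with different stored strings, which is the property asked for above.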

Even if passwords are encrypted, the password database must be protected carefully, because many encryption algorithms can be cracked and all are vulnerable to dictionary attacks. Never store the database under the document root. On Unix systems, it should be readable only to whatever user the CGI process runs as, which should not be user "nobody." (Though it is common, it is not a good idea to run httpd as "nobody". Several standard programs, like fingerd, run as "nobody" and assume that the "nobody" account is one with no special privileges. As a general rule, no file should be owned by "nobody". OK, maybe I'm the last person who still believes this.) Preferably the CGI should run as some ID different from the one the httpd runs as, either because it is an suid program, or because something like suExec or cgiwrap is being used. The more specialized the account you use for this is, the better. It could be an account used only by the CGI programs that authenticate out of that database and those that maintain it.

Any program that handles passwords should be written with great paranoia. For example, steps should be taken on Unix systems to ensure that it doesn't leave core dump files when it crashes, because such core dumps contain an image of everything that was in memory, possibly including some password or buffered fragments of the password database the program read. (Normally core dump files aren't publicly readable, but on some Unixes they are - this is a bug.) For added paranoia, overwrite any password you had stored in any variable with zeros, as soon as you are done using it.
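On Unix, both precautions can be sketched in a few lines of Python using the standard resource module. The function names are mine. Bear in mind that Python strings are immutable and cannot be overwritten, so this only helps if you keep the password in a bytearray from the start, and even then the interpreter may hold other copies.

```python
import resource  # Unix only

def disable_core_dumps():
    """Set the core file size limit to zero so a crash leaves no dump."""
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))

def wipe(buf):
    """Overwrite a mutable byte buffer with zeros, in place."""
    for i in range(len(buf)):
        buf[i] = 0

# Typical use: call disable_core_dumps() first thing in the program,
# keep the password in a bytearray, and wipe it when done.
disable_core_dumps()
password = bytearray(b"WhatEver")
# ... check the password here ...
wipe(password)
```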

In addition to security issues, there are some efficiency issues. If you have a large number of users, some form of hashed user database may be essential to allow good performance.

3.2.3. Reload-Proofing the Next Page

Login complete, you now display the next page to the user. If possible, you will want to arrange things so that reloading this page does not re-execute the login. Otherwise, if your user logs off but leaves the browser running, another person could use the browser's "Back" button to return to the first page after the login and reload it. Reloading it will pop up a "Repost form data?" box, where the "form data" is the login and password which the browser has carefully cached. An earlier version of this document said this could be solved by not caching the first page displayed after the login. That turns out to be false. Even when the page is not cached, the browser remembers the request that resulted in the page, so it can always be reloaded.

If you are tracking users by passing the login and password to each page, then there is no way to fix this. However if you are using some form of session IDs (see below), so that the login and password are submitted only once, then a fix is possible.

The best solution I've found is to make sure that the response to a successful login form submission is always a tiny page containing an HTTP Location redirect header to bounce the user to the real first page. It does not pass the user's password on to the next page, but instead passes along only a session ID either by putting it in the CGI arguments in the URL redirected to, or by setting a cookie. Reloading the page redirected to won't re-execute the login, because the password wasn't passed to it, and the browser won't let you go "Back" to the location redirect page.

However, there is another pitfall here. You need to be sure that you are doing an external redirect, not an internal redirect. Suppose we have a page named "http://hostname/path/oldpage" which sends one of the following redirect headers:

  Location: http://hostname/path/newpage     <- Preferred
  Location: /path/newpage                    <- Don't use this!
  Location: ../path/newpage                  <- Questionable.
  Location: newpage                          <- Questionable.

Only the first of these is actually legal under the HTTP/1.1 specifications, but all will generally result in the newpage page being displayed. In the second case, however, it doesn't matter that it isn't legal HTTP, because most HTTP servers never send it to the browser. Instead, they handle it internally. The server just starts the new request and sends the result back to the browser as if it were the response to the original request. Thus the browser's "Location" box will still have oldpage in it instead of newpage, and if you do a reload it does re-execute the login.

The last two cases are normally not trapped by the server, and every browser I've tried treats them exactly like the first one. However they are not technically legal, and I've heard that some older browsers are confused about them, so you should probably avoid them.

The first form of redirect also has some pitfalls however. Most web sites can be accessed at more than one URL. You can usually give the IP address instead of the hostname, and frequently more than one hostname is possible, if only "mycompany.com" and "www.mycompany.com". If you always do the redirect to one particular form of the domain name, then you may be changing the domain name in the user's browser window. Besides some risk of confusing the user, this can have annoying side effects. Any cookies set on the old domain name may not be accessible from the new domain name. Because of this, it is usually desirable to get the domain name used in external redirects from the HTTP_HOST environment variable, which should contain the host name from the user's request. (If your site can be accessed both by https and http you will want to check the HTTPS environment variable to see which prefix to use in the redirect.) However, I've seen at least one site where the HTTP_HOST environment variable was munged up - apparently rewritten by some proxy to give the LAN address of the host instead of the external IP address. In this miserable circumstance it may be best to use one of the questionable relative URLs above.
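Building the external redirect from the request's own host name might look like the following Python sketch. The function names are invented for the example, and it assumes the server sets HTTPS to "on" for secure requests (Apache does; other servers may differ) and falls back to SERVER_NAME if HTTP_HOST is missing.

```python
def redirect_url(environ, path):
    """Build an absolute URL using the host name the user actually typed."""
    # Assumption: the server sets HTTPS=on for SSL requests.
    scheme = "https" if environ.get("HTTPS", "").lower() == "on" else "http"
    # HTTP_HOST comes from the request; SERVER_NAME is a fallback.
    host = environ.get("HTTP_HOST") or environ.get("SERVER_NAME", "localhost")
    return f"{scheme}://{host}{path}"

def redirect_response(environ, path):
    """A CGI response consisting only of an external Location redirect."""
    return f"Location: {redirect_url(environ, path)}\r\n\r\n"
```

In a real CGI you would pass os.environ as the first argument, for example redirect_response(os.environ, "/path/newpage").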

3.3. Preserving Login Information

Once a user has logged in, and we have confirmed that the login is correct, we need to find a way to preserve that information so that the user doesn't have to log in again to access the next page. This is one of the hardest parts of building an authentication system. In normal HTTP transactions, the server treats each request for a page as an atomic operation, and forgets all about it before processing the next request. The server in general has no way to reliably tell if two successive requests come from the same browser or not.

The server does know the IP address from which a request comes, but there may be many different users running browsers at the same time at the same IP address. It is presumably also possible to forge IP addresses, though I don't know how difficult this is. Worse, there are some ISPs, notably AOL and WebTV, which use farms of proxy servers. The IP address you receive is the IP address of the proxy, and successive requests from the same user may come to you through different proxies, so for these users IP addresses are not constant within a session. (You can ask the WebTV folks not to do this on your site, but I don't know if AOL is equally accommodating, and I'm sure lots of other ISPs neither you nor I know about do similar things.) Some proxies give you the IP address of the end-user as something other than REMOTE_ADDR (variously HTTP_X_FORWARDED_FOR, HTTP_PROXY_USER, or HTTP_FORWARDED), but many do not. The upshot of all this is that IP addresses are not always obtainable, and not always constant, and thus mostly useless as identifiers unless you are willing to support only those users with normal software.

The primary way to track state information, like a user's login status, is thus to store it in the browser. But the browser cannot be trusted. Source code to perfectly good browsers is widely available, and hostile users can easily program a browser to behave in any way they want it to behave. It's simple to write small C or perl programs that emulate a browser. Thus, any piece of information that is passed to a browser could be tampered with before it is sent back to the server, and must be revalidated after it is sent back to the server.

3.3.1. Methods for Passing Authentication Information

There are two primary methods by which information can be passed from page to page. Either you can include additional parameters on every link or you can pass it in a cookie.

3.3.1.1. Passing information through link parameters

To use link parameters to pass the information, we have the currently running CGI script insert the data into every link and form it generates. For example, if we are simply passing the login name and password from page to page, and we had a link like

  <A HREF="http://www.host.com/next.cgi">

we could change it to

  <A HREF="http://www.host.com/next.cgi?login=joe&password=WhatEver">

Similarly, in any forms whose action is another CGI script where authentication is needed, we could add hidden inputs:

  <INPUT TYPE="hidden" NAME="login" VALUE="joe">
  <INPUT TYPE="hidden" NAME="password" VALUE="WhatEver">

(Again all such forms should use METHOD=POST to keep this data from being logged and to hide it from the Unix ps command.) Thus each CGI program can get the needed information along with its other parameters and recheck it to decide if the user is legitimate.
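Since this rewriting has to be done for every link and form, it is worth having helper functions do it. Here is one way it might be sketched in Python using only the standard library; the helper names add_params and hidden_inputs are my own. The helpers take care of URL-encoding and HTML-escaping the values, which is easy to forget when pasting strings together by hand.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def add_params(url, params):
    """Return url with the given parameters appended to its query string."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += list(params.items())
    return urlunsplit(parts._replace(query=urlencode(query)))

def hidden_inputs(params):
    """Return hidden <INPUT> tags carrying the same parameters in a form."""
    return "\n".join(
        f'<INPUT TYPE="hidden" NAME="{escape(name)}" VALUE="{escape(value)}">'
        for name, value in params.items()
    )
```

For example, add_params("http://www.host.com/next.cgi", {"login": "joe", "password": "WhatEver"}) produces the second link shown above.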

The main problem with this is that it has to be done on every single link and form within the site. This can be cumbersome. Also, for HREF links (but not forms), the user's browser will be displaying the parameters on his screen. That's undesirable if that information includes the user's password as in the examples above (we'll see ways to use session IDs to avoid this later).

Another problem is that, as with basic authentication, there is no way to log a user off without exiting the browser. As long as the browser is still running, it might be possible to use the browser's "back" function to return to one of your pages. This will always work, because the browser stores the query parameters (with the authentication information) with the URL for each page in the history list. Using "Pragma: No-cache" headers seems sufficient to keep the browser from caching POST arguments, but I don't think there is a way to delete GET arguments short of exiting the browser. This problem too can be solved by using session IDs.

Finally, if the user bookmarks a page, his authentication information is likely to be included in the bookmark, which is not necessarily desirable.

3.3.1.2. Passing information with cookies

The popular alternative is to use cookies, which are supported by all browsers these days. A cookie can be set in the header of any response that your server sends to the user's browser. A cookie has a name, and some arbitrary content string. In addition, each cookie can have domain and path specifications and an expiration date.

The domain and path specifications tell which web sites the browser will pass the cookie information to. Any time the user accesses another page whose URL says it is on a host matching the domain given in the cookie, and in a file matching the path given in the cookie, the browser will automatically include the cookie in the request for the page. This means that if we place our authentication information into a cookie, and if we set up the path and domain right, then it will automatically be passed back to every other CGI in our web site. This is much easier than having to explicitly insert that information in every link among our pages.

Note that we should be careful how we set the cookie's domain and path, especially if there are other web sites on the same server. Otherwise the user's browser will happily send our authentication data to other web sites, which may or may not be a security problem.

Any cookie can have an expiration date and time attached to it. After this time, the browser will stop sending the cookie out and destroy it. Cookies without expiration dates persist only until the user exits the browser. If a cookie has an expiration set, most browsers will save it in a file on the user's computer. This may be a problem if you put very sensitive information (like the user's password) into the cookie. For authentication purposes, it generally makes more sense just to use cookies without expiration dates, so they will only be kept in memory, and the user will be logged off when the browser exits, just as with basic authentication. Many security conscious users configure their browser to accept only non-persistent cookies - those without expiration dates.

Your site can destroy any cookie it set simply by resending a new value for the same cookie name, but setting an expiration date in the past. This means that if we store our authentication information in a cookie, we do have a way to log off. We just kill the cookie. Even if the "back" function is used by the user to return to a page that was previously accessed with the cookie, it won't work again without logging in again, because the cookie is gone. You still should make sure the login form (with the user's password on it) and the first page shown (which was passed the user's password as a query argument and originally set the cookie) aren't cached, otherwise resubmitting the former or reloading the latter will result in a new cookie being generated.
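The set-and-kill pattern can be sketched with Python's standard http.cookies module. The function names and the default path of "/" are assumptions for the example; you would choose the path (and possibly a domain) to suit your own site.

```python
from http.cookies import SimpleCookie

def set_cookie_header(name, value, path="/"):
    """Set-Cookie header for a cookie with no expiration date, so the
    browser keeps it only until it exits."""
    cookie = SimpleCookie()
    cookie[name] = value
    cookie[name]["path"] = path
    return cookie.output()

def delete_cookie_header(name, path="/"):
    """Set-Cookie header that kills the cookie: same name and path,
    empty value, expiration date in the past."""
    cookie = SimpleCookie()
    cookie[name] = ""
    cookie[name]["path"] = path
    cookie[name]["expires"] = "Thu, 01 Jan 1970 00:00:00 GMT"
    return cookie.output()
```

The login script would print the first header along with its response, and the logout script would print the second. The path on the deletion must match the path the cookie was set with, or the browser will treat it as a different cookie.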

Like any other part of an HTTP request, cookies are vulnerable to sniffers on the net. The obvious solution is to use a secure lower-level protocol, like SSL, in which everything sent or received is encrypted. This is not an absolute solution however. If the browser ever connects to the same site using the unencrypted HTTP protocol, the cookie will be sent in the clear and will be sniffable. A well-designed site shouldn't contain any such links, but enterprising hackers can usually find a way to send a user to a site. For example, hackers have been known to send users of SSL-encrypted sites forged emails which contain URLs for the site that have an http: prefix instead of an https: prefix. If the user clicks on that link, the cookie will be sent in the clear, allowing the sniffer to catch it. Slipping an http: reference to an image on the target site into another web site would also do it. On the whole, it's probably best not to trust too much in the privacy of cookies, even with SSL connections.

Cookies that the browser sends to the server are passed from the HTTP server to your CGI program in the HTTP_COOKIE environment variable. On Unix systems where the "ps" command can be used to display environment variables, this may allow other users on your system to easily see the contents of cookies, just as they can see parameters to "GET" queries. This can be a big security problem if not all users on your server are trusted. Pretty much the only way around this problem is to have the cookies processed inside the HTTP server daemon and have the server censor them before passing them to the CGI, but this requires a modified HTTP server.
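On the receiving side, pulling one cookie out of HTTP_COOKIE is short work with Python's standard http.cookies module. The function name get_cookie is made up for this sketch.

```python
from http.cookies import SimpleCookie

def get_cookie(environ, name):
    """Return the value of the named cookie from HTTP_COOKIE, or None."""
    jar = SimpleCookie()
    jar.load(environ.get("HTTP_COOKIE", ""))
    morsel = jar.get(name)
    return morsel.value if morsel is not None else None
```

A CGI would call it as get_cookie(os.environ, "auth"). Whatever comes back is still untrusted input from the browser and has to be revalidated.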

Cookies also suffer from bad press. Whether or not there is any justification at all for the poor reputation of cookies, it should be clear that using cookies instead of query parameters in authentication applications is only an implementation difference - both methods maintain exactly the same data. So there is no rational reason for the user to worry more about one than the other. Yet if you use cookies, there are still some users who will worry about privacy violations and some who will avoid your site entirely. Because of cookie-related security holes like this, some users routinely surf with cookies disabled (though disabling javascript is probably the more sensible solution). Whether justified or not, this fear of cookies is a factor that must be weighed when choosing an authentication method.

Some users disable cookies in their browsers. Obviously this would cripple your authentication system. Other users set their browsers to display an alert box each time a cookie is set. This would be bad if you put passwords in the cookies, because then you are displaying the user's password on the screen at un-predictable intervals. Not nice if other people are looking over the user's shoulder.

Another problem arises if the user wants to log in as two different users at once. This is mainly common with administrators, who may want to use their personal account and their admin account at the same time in different browser windows. In many cases, cookies will be shared between all browser windows, so they can't have two different values at the same time. CGI parameters can be kept separate in different browser windows. On the other hand, in a web site that uses frames heavily, it can be useful that authentication information set by logging in in one frame is available to all frames.

Regardless of whether we use cookies or link parameters, we gain an efficiency advantage over basic authentication. With basic authentication, each request had to be done twice, since the authentication information isn't sent on the first attempt. But with both these methods the authentication information is always sent by the browser on the first request. This can reduce the overhead associated with authentication. Well, not really. Most browsers these days seem to be pretty smart about anticipating when to send basic authentication information (actually, I've seen some that seemed slightly promiscuous about it, sending it to different authentication realms on the same server).

3.3.2. Methods for Encoding Authentication Information

In the examples above, we simply passed the login and password that the first CGI page received from the user on to later CGI pages. Each new execution of a CGI program would have to repeat the same authentication checks as the original CGI page did, looking the user up in the user database and comparing the passwords.

This has several important disadvantages. We are sending the user's clear text password around the net pretty liberally, which can never be a good idea. There are risks of the password popping up on the user's screen or being logged in various places, all of which can compromise the security of the user's password. Also looking up the user in the password database every time a page is accessed can become quite expensive if you have a large number of users and thus a large password database.

Two basic alternatives are widely used. Either we can encrypt the data that we send back to the user, or we can save the data in a local database and give the user only a pointer to that information. We'll consider each of these methods in detail.

3.3.2.1. Using encrypted user IDs

Suppose the CGI login program has authenticated a user and wishes to create a certificate by which the user can be identified more easily in the future. One way to construct such a certificate would be to concatenate together the user's login name, the IP address the request came from, a time stamp, and some constant string (the same items that get checked when the certificate comes back). We encrypt the resulting string using a secret key we store on the server, using some two-way algorithm like DES. (Note: some systems use some clever home-brew encryption methods of the author's own invention, like XORing the data with a constant string. Don't do this. Unless you are up to date on all the latest cryptographic research and have a brain the size of a watermelon, you are far better off using one of the established algorithms for which library functions are readily available.) This will give some mysterious string of garbage characters. We send this to the user's browser either as a cookie or a link parameter.

When the user accesses the next page, the certificate will be passed in as a link parameter. We decrypt it using our secret key. The result should be just what we originally put in. We check that the constant string is still there, that the IP address matches the one the new request came from (actually this could cause problems with AOL and WebTV users, as noted above), and possibly that the time stamp isn't too far in the past. If everything looks OK, we accept the login name from the decrypted certificate as the identity of the user.

This has the advantage of being very fast. Decrypting such small chunks of data doesn't take much time. After the original login, we never again have to look the user up in any databases.

Having an expiration time on certificates is important for several reasons. For one, suppose you delete a user account. That user might still be able to continue connecting with a certificate that you issued while that user had an account even after the account was gone. If the same login name was then given to another user, the first user could log in to the new user's account with the old certificate. Clearly certificates must have expiration dates.

Of course, legitimate users might then find their logins expiring during long sessions. Your site should probably be designed to offer an apologetic screen that gives them a chance to log in again and continue where they left off.

There is always some danger that some snooper on the net will see one of your user's certificates and try using it themselves. It would be somewhat difficult to do so, because the certificate includes the original owner's IP address, so the snooper would have to arrange to appear to connect from the same IP address. Furthermore, a stolen certificate would only be usable for a limited time because the time-stamp in it would eventually expire. Thus sending these kinds of certificates over the net is somewhat less of a security problem than sending clear-text passwords. Of course, if we are sending a clear-text password over the net during the original authentication, then the snooper presumably has that too, so the whole question of the security of the encrypted certificate is moot. Still the expiration and IP-address tricks might slow down villains who merely saw the certificate string pop up as a cookie alert on another user's screen.

Another big concern is that someone might forge a certificate. With the scheme described above, this should be hard to do, because you'd need to know our secret key to be able to make certificates that would look valid to us. Algorithms like DES are designed to make it extremely difficult to figure out the key with which a message was encrypted, even if you know the contents of the message. Difficult doesn't mean impossible, of course. The NSA could probably crack your DES keys in their sleep. But it's good enough for most purposes (and there are plenty of other algorithms that are more secure than DES).

Note that with this scheme, the security of your secret key is very important. It's unpleasant if someone figures out one of your users' passwords, but that compromises only one user. If someone gets your secret key, then they can forge certificates identifying themselves as any user on your system. In fact, since we don't look the login names up after decrypting them, they could forge certificates identifying themselves as users that don't even exist on your system.

It's important that the secret key be long and strange. A brute force crack would be to get a certificate and then try encrypting the data with all possible secret keys until you got one that matched the certificate. A fast modern computer can run through an awful lot of guesses at the secret key very fast. So the secret key should be long (at least, say, 256 characters), and it should not be some simple text string, but rather a hunk of random garbage culled from a pool rich in randomness.

It is probably worthwhile to change your secret key regularly. Of course, every time you do so, all users with current legitimate certificates would suddenly find that they don't work any more, so they would have to re-login. If you do this regularly, it can even be used as an alternative method to force certificates to expire.

However, although changing your secret key regularly might stop someone from cracking it (since by the time they had it cracked, it might no longer be valid), it might not help much if they got it by somehow accessing the file where it is stored. If they did that once, they can likely do it again.

This algorithm, as described above, requires the use of a two-way encryption program. Some countries have restrictions on the export of such algorithms, so distributing software based on two-way encryption can be a problem. As it happens, you can do this equally well using a one-way hash algorithm like MD5.

To do this, we concatenate together the following: our secret key, our realm name, the user's IP address, the login name, and the login time.

We compute the checksum of this string with MD5 or a similar algorithm. We place that checksum, together with clear text copies of the user name and the login time in the cookie or in the query parameters.

When the next CGI page receives this information back from the browser, it has not only the MD5 checksum, but all the data that went into originally creating it: it knows the secret key, its realm name, and the IP address the request came from, and the login name and login time are passed to it with the MD5 checksum. So it strings all that data together again and recomputes the checksum. If it matches the checksum given, then the whole set of data must be legitimate, and we can accept that the login name and login time that were in the cookie are right. If the login time is too long ago, we still reject it, but otherwise, we can accept the user as authenticated.
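As a rough sketch of the checksum variant just described, something like the following Python would do. The names SECRET_KEY, REALM and MAX_AGE, the two-hour lifetime, the order of the fields, and the NUL separator between them are all my assumptions for illustration, not part of any standard. A system built today would more likely use an HMAC with a newer hash than a bare MD5 over a concatenated string.

```python
import hashlib
import hmac
import time

# All three values are assumptions for this sketch.
SECRET_KEY = b"replace-with-long-random-bytes"  # would be read from a protected file
REALM = "example-realm"                         # hypothetical realm name
MAX_AGE = 2 * 60 * 60                           # certificate lifetime in seconds


def checksum(login, login_time, ip):
    """MD5 over secret key, realm, IP address, login name, and login time."""
    fields = [REALM, ip, login, str(login_time)]
    # A NUL separator (assumed) keeps "ab"+"c" distinct from "a"+"bc".
    data = SECRET_KEY + b"\0" + "\0".join(fields).encode("utf-8")
    return hashlib.md5(data).hexdigest()


def make_certificate(login, ip, now=None):
    """Return the values to place in the cookie or query parameters."""
    login_time = int(time.time() if now is None else now)
    return {"login": login, "time": login_time,
            "sum": checksum(login, login_time, ip)}


def check_certificate(cert, ip, now=None):
    """Return the login name if the certificate checks out, else None."""
    now = time.time() if now is None else now
    expected = checksum(cert["login"], cert["time"], ip)
    if not hmac.compare_digest(expected, cert["sum"]):
        return None   # forged, altered, or sent from another address
    if now - cert["time"] > MAX_AGE:
        return None   # login time is too long ago
    return cert["login"]
```

Note that the secret key and the IP address never travel in the cookie. The checking side supplies them itself, which is what makes the checksum hard to forge.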

This variant of the original algorithm has pretty much the same strengths and weaknesses. It is still fast and reasonably secure, but its security still depends on the security of the secret key.

3.3.2.2. Using session IDs

An obvious variation on the previous method is to save copies of all the certificates we issue, and then check each one that comes in to ensure that it is, in fact, one of the ones that we sent out. This would be a bit slower, because we have to do database lookups for each page. But the certificate database would only have to have entries for the users using the system at any given time, not for all registered users. Thus it could be a small database, and searching it might be much faster than searching a full user database.

Once we bring such a database into the picture, then it is no longer necessary that the certificate contain any actual information. All it has to be is some mysterious random key that may or may not be in our database. Any information we want to attach to it can be stored in the database instead of in the certificate.

If we do this, the certificate is usually called a "session ID" and the database is usually called a "session database".

After our login CGI program has successfully authenticated a user against the user database, we generate some random string to use as a session ID. It should be quite long, maybe 30 or 40 ASCII characters, and it should be different from any other currently active session ID (if it's a random string that long, chances are slim that it isn't unique). We store this, along with the user's login name, login time, and IP address, in the session database. Ideally this would be a hashed database indexed by the session ID. Then we pass the session ID (only) back to the user's browser in a cookie or in link parameters.

When the next CGI program runs, it gets the session ID back from the browser. It looks that session ID up in the session database. If it isn't there then we reject the login. If it is there, but the expiration date has passed, or the request didn't come from the IP address logged there, we also reject it. Otherwise, we accept the user as being the owner of the login ID stored there. (Note that you may have to skip the IP address check if you want to be able to support AOL users).
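The issue-and-check cycle described above might look roughly like this in Python. This is only a sketch: the in-memory dictionary stands in for a real hashed database on disk, and the names create_session, check_session and SESSION_LIFETIME, along with the two-hour lifetime, are assumptions of mine.

```python
import secrets
import time

SESSION_LIFETIME = 2 * 60 * 60   # assumed lifetime in seconds
sessions = {}                    # stand-in for a hashed database indexed by session ID


def create_session(login, ip, now=None):
    """Call after the password check succeeds; returns the ID to send to the browser."""
    now = time.time() if now is None else now
    sid = secrets.token_hex(20)      # 40 hex characters from the OS random source
    while sid in sessions:           # collisions are wildly unlikely, but cheap to rule out
        sid = secrets.token_hex(20)
    sessions[sid] = {"login": login, "login_time": now, "ip": ip}
    return sid


def check_session(sid, ip, now=None, check_ip=True):
    """Return the login name for a valid session ID, else None."""
    now = time.time() if now is None else now
    entry = sessions.get(sid)
    if entry is None:
        return None                  # never issued, or already deleted
    if now - entry["login_time"] > SESSION_LIFETIME:
        return None                  # expired
    if check_ip and entry["ip"] != ip:
        return None                  # request came from a different address
    return entry["login"]
```

The check_ip flag is there for sites that decide to skip the IP address check. Only the session ID itself ever goes to the browser.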

To keep the session database from getting large, we should delete expired entries from the database (they don't have to be deleted immediately - it would be good enough to train the program that adds new entries to the session database to overwrite any expired entries before creating any new ones). We should also delete the entry of any user who logs off. This enables us to log off users even if we are passing authentication information in link parameters - the user may still have a session ID, but if it isn't in our database anymore, it isn't worth anything. If you are careful about killing all session entries belonging to a user when you kill an account, you also solve the problem of having valid authentication certificates hanging around after the account has been deleted.

Expiration is thus less important as a security feature, though more important as a performance feature, than it was with encrypted keys. It is reasonable to update a user's time stamp in the session database on each transaction, so that the session expires only if the user does nothing for a few hours.
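The housekeeping described above (refreshing the time stamp on each transaction, logging users off, purging stale entries, and killing every session of a deleted account) could be sketched as follows. Again the dictionary is a stand-in for a real database, and IDLE_LIMIT and the function names are assumptions made up for this example.

```python
import time

IDLE_LIMIT = 2 * 60 * 60   # assumed idle timeout in seconds
sessions = {}              # session ID -> {"login": ..., "last_seen": ...}


def touch(sid, now=None):
    """Refresh the time stamp on a transaction; False if the session is gone or idle too long."""
    now = time.time() if now is None else now
    entry = sessions.get(sid)
    if entry is None or now - entry["last_seen"] > IDLE_LIMIT:
        sessions.pop(sid, None)      # drop the stale entry while we are here
        return False
    entry["last_seen"] = now
    return True


def logout(sid):
    """Forget the session; the browser's copy of the ID becomes worthless."""
    sessions.pop(sid, None)


def purge_expired(now=None):
    """Delete every entry idle longer than IDLE_LIMIT; return how many were removed."""
    now = time.time() if now is None else now
    stale = [sid for sid, e in sessions.items() if now - e["last_seen"] > IDLE_LIMIT]
    for sid in stale:
        del sessions[sid]
    return len(stale)


def kill_account_sessions(login):
    """Call when an account is deleted so no valid session outlives it."""
    for sid in [s for s, e in sessions.items() if e["login"] == login]:
        del sessions[sid]
```

Running purge_expired from the code that adds new entries is enough to keep the database small. It doesn't need to run on every request.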

If you are passing session IDs around in CGI parameters instead of cookies, then after a session ID expires, it expires separately in all windows. The user would typically have to re-login in each currently open window. Also if he uses his back button to go back before the re-login, he will be back in a document with the old session ID and any links he follows forward again will require a re-login. Kind of annoying. A nice trick is to try to re-use the same session ID the user had before when he logs in again after his session expires. Then the session IDs in all his windows and cached pages become valid again. With cookies this isn't an issue, since all windows use the same copy of the cookie.

The key security risk is that some villain might get some user's session ID. The villain could then use that key to pretend to be that user on your system for as long as the session lasts (once the session is expired, the session ID couldn't be used again).

One way a villain could get another user's session ID would be by snooping on the net. We are no better protected against this than with other schemes, but as before, if someone is snooping on the net, and we did the original login with passwords sent in clear text, then presumably the snooper has the user's actual password, so having the session ID too is irrelevant. If a villain observes the session ID in some other way, for example, when a cookie alert pops up on another user's screen, then that ID could be used to steal the owner's current session, if the villain can also manage to come in from the same IP address. But the villain won't be able to use it to start future sessions (unless he changes the user's password during the current session, which is why any password reset form should ask for the user's old password).

If the string is 30 or 40 random characters, then a person would not be likely to be able to get another user's session ID simply by guessing it. The chances of guessing the user's actual password are probably better. Since a password is a much more valuable item than a session ID, sensible villains would presumably focus their guessing games on passwords, not session IDs.

Another way he could get a user's session ID would be if he found a way to read our session database. Clearly we must be cautious to protect the security of the session database. However, it isn't quite as sensitive as the secret key file in the previously described method. Getting a one-time copy of the session database would let you impersonate any user currently logged on for as long as the session lasts, but it wouldn't let you pretend to be any user you choose, as the secret key did.

The other way a person might get another user's session ID would be if he understood the algorithm your program uses to generate session IDs well enough to be able to predict what session ID numbers it will assign to future users. In fact, some badly designed sites simply assign session IDs sequentially to users, making it trivial to guess other users' session IDs. Even if some attempt is made to randomize session IDs, it may still be possible to guess them. Computers are deterministic machines, so the so-called "random number generators" generally available are actually quite predictable.

So how do we generate hard-to-predict session ID strings? Probably the best choice is to use a good pool of randomness provided by your operating system. Many modern Unixes provide devices called /dev/random and /dev/urandom from which you can just read the number of bytes you need. These pools are constantly stirred by low-level device driver activity in the OS, and are thus pretty hard to predict.
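In Python, for instance, reading the operating system's randomness pool takes only a couple of lines. os.urandom draws on /dev/urandom or the platform's equivalent. The function name and the choice of 20 bytes (which gives 40 hex characters) are just assumptions for this sketch.

```python
import os


def new_session_id(nbytes=20):
    """Read nbytes from the OS randomness pool and return them as hex text."""
    return os.urandom(nbytes).hex()
```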

However, many OSes do not provide /dev/random, so homebrew methods must often be used. Most packages I've seen claim to build them by taking miscellaneous data, like the current time, the current process ID number, the number of milliseconds it took to read a particular file, and suchlike stuff, and hashing them all together with some algorithm like MD5 to create a random string. Unless the villain knows all those pieces of information, he probably won't be able to predict the session ID. However, a lot of this information isn't so very unpredictable. If you can't guess it exactly, you can guess it within a narrow range. For example, most mailers use the timestamp and process ID to generate message IDs for every email sent, so villains should be able to guess the likely range those values could have had around the time the email was sent.

When using low quality values like this, it's good to use a file or other database to accumulate randomness. This is called an entropy pool. Keep a file with a number in it. Each time you generate a session ID, read the file, add the current system time (or other values) to it, and write it out. Include that value in your MD5 hash. Then your session ID depends not only on the time of the current login session, but on the times of all past login sessions. Much harder to predict. Unless, of course, the villain can read your file.
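A minimal sketch of such a pool file in Python follows. The file name entropy.pool, the function name, and the wrap at 128 bits are assumptions of mine. A real version would also need file locking, since several CGI processes may run at once, and the file must of course not be readable by other users.

```python
import hashlib
import os
import time

POOL_FILE = "entropy.pool"   # hypothetical path; keep it unreadable to others


def session_id_from_pool(pool_file=POOL_FILE):
    """Stir the current time into the pool file and hash the result into a session ID."""
    try:
        with open(pool_file) as f:
            pool = int(f.read().strip() or "0")
    except FileNotFoundError:
        pool = 0                                 # first run: start an empty pool
    # Add the current time (nanoseconds) to the running number; wrap at 128 bits.
    pool = (pool + time.time_ns()) % 2**128
    with open(pool_file, "w") as f:
        f.write(str(pool))
    # The hash covers the pool value plus the usual low-quality inputs.
    material = "{}:{}:{}".format(pool, time.time_ns(), os.getpid())
    return hashlib.md5(material.encode("ascii")).hexdigest()
```

Because the pool number carries over from every earlier call, guessing one login's time and process ID is no longer enough to reproduce the ID.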

In extreme cases, I've done things to dig large amounts of transient data out of the process table and kernel state, using an suid-root program. This tends not to be very portable. A simpler and more portable method is just to run a command like "ps -avxww" and feed the output through MD5. This gives a value that depends on the current state of every process currently running on the system - pretty tough to guess, especially on a nice busy server. However, especially on a busy server, a "ps" command can be rather slow, and the options that give the most lively output differ in different implementations of "ps".

A tempting idea is to include the user's password (clear-text or encrypted, whichever is handy) in the hash for the session ID, along with some other data (at least the date and time, to ensure that the user gets a different session ID on each login). With this, to guess the session ID that will be given to a particular user, you need to know his password. But if you know his password, then you can login legitimately, and don't need to guess his session ID.

The risk, of course, is that someone might decrypt the session ID and get the password. This is only interesting if he got the session ID in some way other than snooping on the net (because if he was snooping on the net, he presumably saw your clear-text password when you originally logged in, so he doesn't need to decrypt a cookie to get it). I've never quite had enough confidence in the idea to try it. Note that using just a few characters out of the plain-text password would be a distinctly bad idea. If your session IDs were made from three letters of the plain text password, plus time and process ID, then the hacker just has to guess the time and pid within some reasonably narrow range (often not too hard) and cycle through maybe 64^3 (about 262,000) possible combinations for the three password characters. Once he's got three letters of the password, guessing the rest may be very easy. It's bad to give crackers a chance to attack passwords piecewise. (Using a couple of characters from the user's encrypted password, however, would be substantially safer.)

A fringe benefit of using a session database is that it can potentially store other information that you'd like to remember about the user through the session in the same database, like what page he last looked at, and so forth. This kind of information can also be stored in cookies sent to the user's browser, but cookies are limited in size, subject to tampering, and annoy some users.

4. Conclusions

On a server that can only be accessed by trusted users, or has no way for untrusted users to spy on environment variables, I usually use session ID databases.

For servers with untrusted users who can spy on environment variables, this only works if session IDs are passed as CGI variables, not cookies, and where all queries are POSTs, not GETs. This is too restrictive for most applications, so I usually just stick with Basic Authentication.

I've been thinking of trying to build an Apache module that would convert all GET queries aimed at CGIs into POST queries, and also convert all cookies into CGI variables, so that if the browser sent a cookie named SID, it would turn into a CGI variable named cookie_SID with the same value and be passed in with the other POST data. This would never pass any interesting info to the CGI via environment variables, thus keeping all requests secure. However, I've been thinking about it for many years and still haven't done it.

5. Acknowledgements

Thanks to the many deviously paranoid programmers who have pointed out security pitfalls I never thought of. Several of the issues raised here were first pointed out to me by Marcus Watts. Thanks to Stewart Rap for ideas on the naming of fields in login forms.
Last update: Tue Oct 7 22:06:18 EDT 2003