01 October 2010

Detecting URL Rewriting (part 1)

[edit 2010-10-02]: I realized after replying to cdman's comment that I had neglected to include the goals of this project in this post, but had included them in another one instead. I've edited the beginning here to include the first part of that post.

As I mentioned earlier: I’ve been pondering URL rewriting for the past couple of days, trying to come up with a way for a client of a web site to, first, determine whether URL rewriting is occurring on a given web server and, second, in cases where it is used, determine what the rewrite rules are.

I started this process by doing some homework to learn more about how URL rewriting occurs. I’ve used Apache’s mod_rewrite in the past to accomplish some basic tasks like redirecting incoming http:// requests to their https:// counterpart to enforce SSL usage, but I had never done much beyond that.

I decided (as I often do) that the best way to learn was to play. To determine whether URL rewriting is in use, and to try to map the rules, means that I need to have a portion of a web site that is using URL rewriting, and one that is not (so I can compare the two). I further need to have some rewrite rules. Coming up with a random set of rules is difficult, so I gave myself what was, in my mind, a likely scenario:

The Bar, Inc. marketing dept. has realized that their ‘litterbox’ product line has a name which creates a negative impression. It’s decided that ‘sandbox’ is a much better brand for the products. Of course, with the rebranding, the web site has to be updated; it simply won’t do to have links going to bar.com/litterbox/ now that the name has changed.

Begrudgingly, the developers of the Bar, Inc. website put in a ton of overtime to change all the links in the code. Then someone realizes that all the Bar, Inc. customers and business partners also have links that are going to break. The developers can’t do anything about that; it’s outside their control. It now falls to the sysadmin to make sure that no critical third party links get broken.

As the sysadmin, my task is simple: take any requests for /litterbox/whatever and have them go to /sandbox/whatever instead.


Excellent! I now have an interesting story to keep me from getting bored. (OK, fine… interesting is subjective ;-)

More importantly, the fictitious set of requirements dictated in the scenario means that I have a framework established for how to approach setting up this research project.

That means it’s time to get to work.

Preparing The Environment


To get this set up in a way that meets the criteria of the scenario, I first need to have a website. I have a Linux box handy, so I decide to do my testing using Apache. The specific version and OS I’m using is Apache 2.2.9 on Debian Linux, with the Suhosin Patch. In other words, I’m using the default apache2 (mpm-prefork) package on Debian 'lenny'.

I create a directory named sandbox in the Apache web root (which is /var/www on Debian). I then create 4 files in that directory: bar1.php, bar2.php, bar3.php, and bar4.php. Next I edit each of these files to contain some generic code similar to the following (changing the title and h1 tags to correspond to the file name):
<html>
<head>
<title>bar1</title>
</head>
<body>
<h1>bar1</h1>
<div><a href="bar1.php">bar1</a></div>
<div><a href="bar2.php">bar2</a></div>
<div><a href="bar3.php">bar3</a></div>
<div><a href="bar4.php">bar4</a></div>
<hr />
<?php
// Print every key/value pair in the $_SERVER array.
foreach($_SERVER as $key_name => $key_value) {
    print $key_name . " = " . $key_value . "<br>";
}
?>
</body>
</html>


The PHP code in these files simply prints the key/value pairs of PHP's $_SERVER array to the page. That array holds server and request environment information, including the HTTP request headers. This may prove useful to review, so I'm including it in each page.
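
Since the four pages differ only in their title and heading, they could also be generated rather than edited by hand. The following is a minimal Python sketch of that idea, not what I actually did; the function name write_pages and the template are my own illustration, modelled on the listing above.

```python
from pathlib import Path

# Hypothetical template modelled on the page shown above. {name} and
# {links} are filled in per file; doubled braces are literal braces
# that survive str.format() and end up in the PHP code.
PAGE_TEMPLATE = """<html>
<head>
<title>{name}</title>
</head>
<body>
<h1>{name}</h1>
{links}
<hr />
<?php
foreach($_SERVER as $key_name => $key_value) {{
    print $key_name . " = " . $key_value . "<br>";
}}
?>
</body>
</html>
"""


def write_pages(directory, count=4):
    """Write bar1.php .. bar<count>.php into directory and return their paths."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    names = [f"bar{i}" for i in range(1, count + 1)]
    # Every page links to every page, as in the hand-written version.
    links = "\n".join(
        f'<div><a href="{n}.php">{n}</a></div>' for n in names
    )
    paths = []
    for name in names:
        path = target / f"{name}.php"
        path.write_text(PAGE_TEMPLATE.format(name=name, links=links))
        paths.append(path)
    return paths
```

Calling write_pages("/var/www/sandbox") would populate the directory used in this post, assuming the account running it is allowed to write there.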

Now that I have the Bar, Inc. "website" in place it’s time to contemplate how to proceed. I have at least four options:
Edit 2010-10-04: I'd neglected to consider the Apache Alias directive. I've added that to the list.
  1. I can enable the FollowSymLinks option and create a symbolic link from litterbox to sandbox.
  2. I can use mod_rewrite to change requests for litterbox to sandbox.
  3. I can use mod_rewrite to send an HTTP 302 response redirecting requests to the new location.
  4. I can use the Apache Alias directive to map requests for litterbox to a specific path on the file system.

After considering these for a bit, I decide that leaving a bunch of stale links lying around the directory tree is a BadThing. For similar reasons, I decide not to use the Alias directive, so that future sysadmins don't become confused. Accordingly, I select mod_rewrite as the way to go. (Thankfully, since that’s the whole point of this project ;-)

Setting up mod_rewrite


The first thing I need is for the mod_rewrite module to be loaded in the Apache configuration. How this occurs varies based on the installation of Apache. In Debian it’s extremely simple to accomplish this task: a single command (and later, a reload of the Apache server) will suffice:
# a2enmod rewrite


Now that the module is enabled, I need to define some rules. This can be done by editing the configuration file that defines the web site. In Debian, this means editing the file /etc/apache2/sites-available/<site-name>. Because I’m just using the default configuration, I place my changes in /etc/apache2/sites-available/default.

The syntax for mod_rewrite can be quite complex, and there are some very powerful features that it provides. However, the scenario I set for myself dictates what I need to establish as far as the rewrite rules… that is, I need to change "litterbox" to "sandbox". Configuring this in Apache is easy enough; it looks like this:
RewriteEngine on
RewriteRule    /litterbox/(.*)  /sandbox/$1


The first line turns on the RewriteEngine. The second establishes that any request whose path contains "/litterbox/" followed by zero or more characters should be rewritten to "/sandbox/" followed by whatever characters came after "/litterbox/" in the original request.
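
To get a feel for which request paths this rule catches, the pattern can be tried out away from the server. The Python sketch below is only an approximation of mod_rewrite: the function name apply_rule is mine, and it assumes (as I understand mod_rewrite's behaviour) that a match anywhere in the path causes the whole path to be replaced by the substitution.

```python
import re

# The pattern from the RewriteRule above. It is not anchored with ^,
# so it may match anywhere in the path. For a pattern this simple,
# Python's re module and Apache's PCRE agree.
PATTERN = re.compile(r"/litterbox/(.*)")


def apply_rule(url_path):
    """Return the rewritten path, or None if the rule does not match.

    Assumption: on a match the whole path becomes the substitution
    "/sandbox/$1", not just the part that the pattern matched.
    """
    match = PATTERN.search(url_path)
    if match is None:
        return None
    return "/sandbox/" + match.group(1)


for path in ("/litterbox/bar1.php", "/litterbox/", "/litterbox",
             "/sandbox/litterbox/bar1.php"):
    print(path, "->", apply_rule(path))
```

In this sketch /litterbox/bar1.php and /sandbox/litterbox/bar1.php both come out as /sandbox/bar1.php, /litterbox/ becomes /sandbox/, and /litterbox (without the trailing slash) is not matched at all.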

That single rule should accomplish the goal of my scenario. However, I still have one choice left to make: whether mod_rewrite should accomplish this task via an HTTP redirect or by rewriting the requests internally.

The difference between these two is not trivial.
Before I go any further, I need to gain a better understanding of how URL rewriting works in Apache.
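
One part of the difference is already visible from the client side: an HTTP redirect is answered with a 3xx status and a Location header, while a request that is rewritten inside the server is simply answered with content. The Python sketch below illustrates that distinction using the standard http.client module, which does not follow redirects on its own. The function names and the host bar.example are hypothetical, and this is a starting point rather than a detection method.

```python
import http.client

# Status codes that tell the client to go and look elsewhere.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def classify(status, location):
    """Label a response as seen by a client that does not follow redirects."""
    if status in REDIRECT_CODES and location:
        return "external redirect to " + location
    # A plain hit and an internal rewrite look the same from here.
    return "no redirect visible (direct hit or internal rewrite)"


def probe(host, path):
    """Send one GET request to host and classify the server's answer."""
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return classify(response.status, response.getheader("Location"))
    finally:
        conn.close()
```

Something like probe("bar.example", "/litterbox/bar1.php") would report a redirect if the server answered with a 302; telling an internal rewrite apart from an ordinary page is the harder problem.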


[to be continued]

2 comments:

  1. It has been some time since I last configured mod_rewrite, but shouldn't the rule be:

    RewriteRule ^/litterbox(.*) /sandbox$1

    ?

    Otherwise it may match /sandbox/litterbox/... (which you might want to do, but most probably not). Also, it won't match /litterbox, which is a valid URL.

  2. You are completely correct cdman, thanks for pointing that out!

    In the case of this particular example, I deliberately left the rule more permissive than it should be. The reason for that is the second stated goal for this project: to try to ascertain what the rewrite rules are.

    My thought process is this:
    Leaving the rule as inclusive as possible (for example, allowing the match on /sandbox/litterbox), is going to be useful once I get to the "what rules are in place" point.

    As far as I can tell at the moment, figuring out the rule set is going to involve making a large number of requests with slight variations and observing the results.

    Having an open ended rule allows me the freedom to create quite a few different request scenarios without having to revisit the mod_rewrite configuration throughout the process.

    As I develop my process and begin refining it based on what I learn, I'll be adding more "correct" rules, to verify whether my thought process is holding true or flawed.

    I'm not sure if that's sound research theory to be honest; but it's how I'm approaching this project regardless =)
