08 October 2010

Detecting URL Rewriting (part 2)

This post continues documenting the process I'm going through to find a way for a client of a web site to, first, determine whether URL rewriting is occurring on a given web server and, second, in cases where it is used, determine what the rewrite rules are.

I left off with Apache configured, and a simple rule established for mod_rewrite. I now need to decide whether to use mod_rewrite to handle the rewrite using a redirect (via an HTTP 302 response), or to process it internally. As I mentioned, the difference between these two methods is quite large.

For example, if I choose to send a redirect (e.g. by amending our rule to include an [R] flag), like so ...
RewriteRule    /litterbox/(.*)  /sandbox/$1 [R]
... the rewrite rule will cause an incoming request for http://bar.com/litterbox/bar1.php to be redirected to http://bar.com/sandbox/bar1.php instead, using HTTP response headers.

Examining the relevant portion of the HTTP request and response headers associated with this process, the conversation looks like this:

Initial request:
GET /litterbox/bar1.php HTTP/1.1
Host: bar.com

Initial response:
HTTP/1.1 302 Found
Date: Wed, 06 Oct 2010 04:50:18 GMT
Server: Apache/2.2.9 (Debian) PHP/5.2.6-1+lenny9 with Suhosin-Patch
Location: http://bar.com/sandbox/bar1.php

In the response above, notice that the server has returned an HTTP 302 status response, and included a Location: header which contains the URL to the content. The browser receives this, and sends a new request to that location:

Redirected request:
GET /sandbox/bar1.php HTTP/1.1
Host: bar.com

This request is met with the final response, which includes the content at /sandbox/bar1.php:
HTTP/1.1 200 OK
Date: Wed, 06 Oct 2010 04:50:29 GMT

This is how I've used mod_rewrite in the past. The rules I've set to enforce SSL have been very similar to the one given in the example. At first glance, it seems that it will be easy to tell when rewriting is occurring... all that's required is to look for the 302 response!
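
A minimal sketch of that naive check in Ruby, assuming the 302 exchange shown above. Net::HTTP.get_response does not follow redirects on its own, so the 302 reaches the script untouched. The helper name redirect_target is hypothetical, chosen only for this illustration.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: returns the Location header when the response
# is a 3xx redirect, and nil otherwise.
def redirect_target(response)
  return nil unless response.is_a?(Net::HTTPRedirection)

  response['Location']
end

# Usage (needs a reachable server, so it is left commented out):
#   res = Net::HTTP.get_response(URI.parse('http://bar.com/litterbox/bar1.php'))
#   puts redirect_target(res) || "no redirect (status #{res.code})"
```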

Not so fast

There are a couple of problems with this theory. The first is that there are other mechanisms which can be used to produce this same HTTP response code. For example, the following PHP code will cause an HTTP 302 response to be sent by the server:
<?php
header("Location: http://bar.com/sandbox/bar1.php");
?>

When I put that code into a file located at http://bar.com/redir.php, the response to a GET request for that file looks pretty much exactly like the one generated natively by Apache above:

HTTP/1.1 302 Found
Date: Wed, 06 Oct 2010 05:38:09 GMT
Server: Apache/2.2.9 (Debian) PHP/5.2.6-1+lenny9 with Suhosin-Patch
X-Powered-By: PHP/5.2.6-1+lenny9
Location: http://bar.com/sandbox/bar1.php

From this, it would seem that there is no way to distinguish between a redirect coming from mod_rewrite, and one stemming from some other mechanism.
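
To make the comparison concrete, here is a small sketch that diffs the header names of the two 302 responses captured above. The helper name header_name_diff is made up for illustration, and the hashes simply copy the captured headers. The only name that differs is X-Powered-By, which says that PHP handled the request rather than which mechanism issued the redirect, and PHP can be configured not to send it (the expose_php setting), so it is not a dependable signal.

```ruby
# Hypothetical helper: header names present in one capture but not the other.
# Names are compared case-insensitively, as HTTP header names are.
def header_name_diff(first, second)
  names_first = first.keys.map(&:downcase)
  names_second = second.keys.map(&:downcase)
  { only_in_first: names_first - names_second,
    only_in_second: names_second - names_first }
end

# Headers copied from the mod_rewrite [R] response above.
apache_redirect = {
  'Date' => 'Wed, 06 Oct 2010 04:50:18 GMT',
  'Server' => 'Apache/2.2.9 (Debian) PHP/5.2.6-1+lenny9 with Suhosin-Patch',
  'Location' => 'http://bar.com/sandbox/bar1.php'
}

# Headers copied from the PHP header() response above.
php_redirect = {
  'Date' => 'Wed, 06 Oct 2010 05:38:09 GMT',
  'Server' => 'Apache/2.2.9 (Debian) PHP/5.2.6-1+lenny9 with Suhosin-Patch',
  'X-Powered-By' => 'PHP/5.2.6-1+lenny9',
  'Location' => 'http://bar.com/sandbox/bar1.php'
}

p header_name_diff(apache_redirect, php_redirect)
```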

More importantly though, and a bigger blow to my high hopes for an easy answer, is that the [R] flag is optional. By default, a redirect header isn't returned by Apache at all when mod_rewrite is used. Looking up how Apache handles rewriting, I found a fair amount of documentation on the process specific to the 2.2 version of Apache I'm using.

The nutshell version is this: requests which are rewritten without sending a 302 response to the client are processed entirely within the Apache kernel. There's no indication given to the client that a rewrite has occurred.

In fact, it appears that the only way an application hosted on the server can know that it has been reached via a rewritten request is by checking for the presence of one or both of two server variables which only appear when Apache has processed a rewrite ... they do not appear on a redirect, despite their name =)

(Recall that I can see these because the PHP script I wrote includes a printout of every server variable. It seems that doing this was a good idea indeed!):

REDIRECT_STATUS = 200
REDIRECT_URL = /litterbox/bar1.php

Note that these are different from the ones the Apache documentation says it adds. I'm not sure why that is, but since they are only available as server variables, they are completely outside the reach of a client accessing a given URL on the host.
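
For illustration, a server-side check for those two variables might look like the sketch below. It assumes a Ruby CGI script running under Apache, where such variables would show up in the process environment; the post only observed them from PHP, so treat that as an assumption. The helper name reached_via_rewrite? is hypothetical.

```ruby
# Hypothetical helper: true when either REDIRECT_URL or REDIRECT_STATUS
# is present. env defaults to the process environment, which is where a
# CGI script would see them (an assumption, see the text above).
def reached_via_rewrite?(env = ENV)
  env.key?('REDIRECT_URL') || env.key?('REDIRECT_STATUS')
end

# Usage inside a CGI script:
#   puts "rewritten from #{ENV['REDIRECT_URL']}" if reached_via_rewrite?
```

None of this helps a client, of course: the check can only run on the server itself.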

That sucks.

At this point, I give up on the 302 response and Location: header theory: it's both misleading (in that a 302 response may not be the result of a URL rewrite), and inconsistent in that rewritten URLs may not provide a 302 response at all.

I start thinking of other mechanisms I could use. One that comes immediately to mind is the Referer header. This is an HTTP header which is provided to a web server when, for example, a user clicks a link. The destination host the link resolves to receives the request for a URL, along with where the user came from. An example of this can be seen here:

Initial Request:
GET /litterbox/bar1.php HTTP/1.1
Host: bar.com

Initial Response:
HTTP/1.1 200 OK
Date: Fri, 08 Oct 2010 05:51:54 GMT
[content]
  <div><a href="bar2.php">bar2</a></div>
[more-content]

The content served in the response contains a link to bar2.php. When I click that link, the fact that I'm coming from the bar1.php page is sent in the request, as shown below:

Request to bar2.php:
GET /sandbox/bar2.php HTTP/1.1
Host: bar.com
Referer: http://bar.com/litterbox/bar1.php


That's all well and good, but as you can see, the Referer still shows /litterbox as the URL I was coming from. That's because the referer is specified by the user agent (a browser in this case). Since the browser didn't receive any indication that the content it is being served has come from a different location than it requested, it thinks it's still at /litterbox and so sends that location in the headers.
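
The point that the Referer is entirely client-supplied can be sketched in Ruby: the value is just a header the client sets before sending. The helper name get_with_referer is hypothetical, and the URLs are the ones from the example above.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds a GET request carrying whatever Referer
# value the client chooses. The server has no say in this header.
def get_with_referer(url, referer)
  request = Net::HTTP::Get.new(URI.parse(url))
  request['Referer'] = referer
  request
end

# Usage (needs a reachable server, so it is left commented out):
#   req = get_with_referer('http://bar.com/sandbox/bar2.php',
#                          'http://bar.com/litterbox/bar1.php')
#   Net::HTTP.start('bar.com', 80) { |http| puts http.request(req).code }
```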

So much for using that as a way to detect rewriting. What's next...

So far, I've tried a couple of different ideas to determine whether a client can tell if URL rewriting is in use. I've ruled out using a 302 response and accompanying Location: header as being unfit for this purpose. I've also briefly played with the idea of using Referer, and quickly ruled that out as an option as well. I need to come up with some more creative way to tell.

How about timing?

Thinking about this problem a bit, it occurs to me that, since the Apache kernel has to map rewritten URLs internally to come up with a computed URL to serve content from, I may be able to use how long a request takes to load as an indicator.

To test this theory out, I'm going to use Ruby, because I'm familiar with it, and it allows me to quickly throw together some proof-of-concept code.

Since I have the advantage in this case of knowing for sure what is being rewritten and what is not, I can use the Benchmark module in Ruby to measure the time it takes to get a file where rewriting is occurring, and where it is not. I can then compare the two to see if this theory bears further investigation.

For the initial test, I decide to use the bmbm method of the Benchmark module for two reasons: 1) it automatically gives me two iterations to compare, and, more importantly, 2) it initializes the environment and tries to minimize skewed results by going through a rehearsal process before benchmarking "for reals". Once I decided that, I came up with the following script:

#!/usr/bin/env ruby
require 'net/http'
require 'uri'
require 'benchmark'
include Benchmark

bmbm do |test|
  test.report("rewrite:") do
    Net::HTTP.get_response URI.parse('http://bar.com/litterbox/bar1.php')
  end
  test.report("non-rewrite:") do
    Net::HTTP.get_response URI.parse('http://bar.com/sandbox/bar1.php')
  end
end

I've created two labels in this benchmark: one for the known rewritten URL, and one for the known non-rewritten URL. When I run this script, I get the following results:

Rehearsal ------------------------------------------------
rewrite:       0.010000   0.000000   0.010000 (  0.001429)
non-rewrite:   0.000000   0.000000   0.000000 (  0.000876)
--------------------------------------- total: 0.010000sec

                   user     system      total        real
rewrite:       0.000000   0.000000   0.000000 (  0.001105)
non-rewrite:   0.000000   0.000000   0.000000 (  0.000907)

That's pretty interesting! When I run this on the same host the web server is located at, I can definitely tell a difference between rewritten and non-rewritten content!

I need to look into this further. First, I need to perform these requests many more times and look at the timing. A single request is useful for a quick "is there merit to this" check, but the apparent difference could just be a fluke of those particular requests at that particular time. I need to increase the number of times I perform this test and establish whether, statistically, there is a difference in the time it takes to serve a rewritten URL vs a non-rewritten one.
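
One way to collect many samples and summarize them is sketched below. This is not the script used in the post; the helper names time_requests and summarize are hypothetical, and the choice of mean, median, and sample standard deviation as summary statistics is an assumption about what would be useful to compare.

```ruby
require 'net/http'
require 'uri'
require 'benchmark'

# Wall-clock time, in seconds, for each of `count` GET requests to `url`.
def time_requests(url, count)
  uri = URI.parse(url)
  Array.new(count) { Benchmark.realtime { Net::HTTP.get_response(uri) } }
end

# Mean, median, and sample standard deviation of a list of timings.
def summarize(samples)
  raise ArgumentError, 'need at least two samples' if samples.size < 2

  sorted = samples.sort
  n = sorted.size
  mean = sorted.sum / n.to_f
  mid = n / 2
  median = n.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
  variance = sorted.sum { |s| (s - mean)**2 } / (n - 1)
  { mean: mean, median: median, stddev: Math.sqrt(variance) }
end

# Usage (needs a reachable server, so it is left commented out):
#   p summarize(time_requests('http://bar.com/litterbox/bar1.php', 1000))
#   p summarize(time_requests('http://bar.com/sandbox/bar1.php', 1000))
```

Comparing the medians as well as the means helps here, since a handful of slow outliers can drag the mean around without saying anything about rewriting.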

I also need to look at what factors may affect the results. Some immediate considerations that come to mind are:
  1. Is the Apache server caching content, causing it to be served faster the second time?
  2. If so, am I able to prevent that?
  3. On a local machine, this may work, but what happens across a LAN?
  4. What happens to the timing when requests go across the Internet?
  5. How much does "heavy" content (video, images, etc.) affect the timing?
  6. Can I time just getting the HTTP headers, to avoid loading content?

I need to answer some of these before testing, and some of these will be answered as the testing progresses.
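
For the last question in the list, timing only the headers could be sketched with a HEAD request, which asks the server for the response headers without a body. The helper names head_request_for and time_head are hypothetical. Note that the measured time in this sketch includes opening the TCP connection, which may matter more than the rewrite itself.

```ruby
require 'net/http'
require 'uri'
require 'benchmark'

# Hypothetical helper: builds a HEAD request for `url`.
def head_request_for(url)
  Net::HTTP::Head.new(URI.parse(url))
end

# Hypothetical helper: times one HEAD request, connection setup included.
# Returns [seconds, status code as a string].
def time_head(url)
  uri = URI.parse(url)
  response = nil
  seconds = Benchmark.realtime do
    Net::HTTP.start(uri.host, uri.port) do |http|
      response = http.request(head_request_for(url))
    end
  end
  [seconds, response.code]
end

# Usage (needs a reachable server, so it is left commented out):
#   seconds, code = time_head('http://bar.com/litterbox/bar1.php')
#   puts "#{code} in #{seconds}s"
```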

[to be continued]

2 comments:

  1. I'm not sure your timing test would be valid for larger websites. Usually there would be some kind of application load balancer (probably doing some rewrite work as well) as well as caching going on.

    Possibly a decent heuristic for small direct sites though.

  2. good point about the load balancing, that's not one of the items i have listed above, but is something that definitely needs to be taken into account.

    that said, i don't think that URL rewriting can be detected with any real degree of accuracy via timing or any other method i've been able to come up with (there's actually at least one - two more posts of things i've tried and code...it's taking me longer to blog them than to test them).

    These posts are really more just a documentation of the process, and the resulting failures, rather than any "i found a way to do it" =)
