Handling relative URLs and multiple forward slashes

Apache, Nginx and Microsoft IIS web servers will transparently combine multiple forward slashes into a single forward slash when matching URLs to file locations. The advantage is that if a user requests “/docs///page.html”, the server can match that to “/docs/page.html” and send it back without extra configuration. Unfortunately, today’s web browsers don’t perform the same feat of magic when they resolve relative URLs, which creates real problems for anyone who uses them.

A document that the author expects to be served from “/docs/page.html” can also show up at the address “/docs///page.html” or some other variation with multiple slashes. A relative URL on this page may link to “../products/nuts.html”, with the author fully expecting it to be interpreted as “/products/nuts.html”. Instead, browsers will interpret it as “/docs//products/nuts.html” and the link will be broken. The problem here is that the author’s expectations are not guaranteed: web servers take it upon themselves to do non-standards-compliant (as I understand RFC 2396) “smart” matching of locations, where multiple forward slashes are interpreted internally as a single forward slash. Web browsers stick to the standards and don’t do any such smart character deduplication. For example, URLs containing base64-encoded paths may explicitly require support for multiple forward slashes in a row, so browsers can’t just copy the web servers’ default behavior on this one.
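To see the mismatch for yourself, here is a minimal sketch that resolves the same relative link against both addresses. It assumes a runtime with the standard URL class (current browsers and Node.js); the helper name resolveLink and the host www.example.com are only there for illustration.

```javascript
// Resolve a relative link against a page address using the standard URL
// class, which follows the same resolution rules browsers apply to links.
function resolveLink(relative, pageUrl) {
  return new URL(relative, pageUrl).pathname;
}

const link = "../products/nuts.html";

// Page requested at the address the author expected.
const expected = resolveLink(link, "http://www.example.com/docs/page.html");

// The same page requested with extra slashes in the path.
const actual = resolveLink(link, "http://www.example.com/docs///page.html");

console.log(expected); // "/products/nuts.html"
console.log(actual);   // "/docs//products/nuts.html"
```

Each empty segment between two slashes counts as a directory level of its own, so “..” only climbs out of one of them.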

This problem can be tackled by website administrators at two ends: by using canonicalizing redirects on the server and by adding a base element to each of the pages. Let’s look at both approaches in turn:

1. Redirect addresses with double-slashes to their single-slash equivalent

Please note that the rules below remove only one occurrence of doubled slashes per redirect. An address path like “/this//fine///example.html” will require three separate redirects to end up at “/this/fine/example.html”. This will slow down page loading considerably, as it requires four round trips between the browser and the server before the browser finally starts loading the page. This scenario is not very likely and will only affect a small subset of users.
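The hop count can be checked with a small simulation. This is only a sketch: it assumes the server applies the greedy pattern ^(.*)//(.*)$ from this section's rewrite rules once per request, and the function name redirectChain is made up for the example.

```javascript
// Same greedy pattern as the rewrite rules: the first group grabs as much as
// it can, so the LAST "//" in the path is the one that gets collapsed.
const DOUBLE_SLASH = /^(.*)\/\/(.*)$/;

// Follow the redirects one hop at a time and record each redirect target.
function redirectChain(path) {
  const hops = [];
  let current = path;
  while (DOUBLE_SLASH.test(current)) {
    current = current.replace(DOUBLE_SLASH, "$1/$2");
    hops.push(current);
  }
  return hops;
}

const hops = redirectChain("/this//fine///example.html");
console.log(hops);
// [ "/this//fine//example.html",
//   "/this//fine/example.html",
//   "/this/fine/example.html" ]
console.log(hops.length + 1); // 4 round trips: 3 redirects plus the final page
```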

Quick clarification before proceeding: I’m not talking about protocol-neutral addresses here, that is, URLs that start with two slashes instead of specifying the protocol as HTTP or HTTPS. This article only touches on the path section of a URL.

Configuration example for nginx

The merge_slashes directive in this example can be set in an http or server block; the location block itself must be placed inside a server block.

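merge_slashes off;
# replace merge_slashes' behavior with "redirect_slashes"
location ~* "\/\/" {
	# collapse one doubled slash per request (greedy match: the last one in the path)
	rewrite ^(.*)\/\/(.*)$ $1/$2;
	# answer with a permanent redirect to the rewritten address
	rewrite ^ $uri permanent;
}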

By default, Nginx makes it awkward to redirect double-slashed addresses: it merges multiple slashes into a single slash for internal location matching before you get a chance to match the original address and handle it differently. The “merge_slashes” option must be switched from on to off to allow manual overriding. The above example shows how to replace merge_slashes’ transparent merging with a redirect that turns each duplicated slash into a single slash. Don’t disable this option without restoring similar functionality as shown above.

Configuration example for Apache

This code example can be applied in a VirtualHost block or server-wide.

RewriteEngine On
# %1 and %2 refer to the groups captured by the RewriteCond pattern
RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
RewriteRule . %1/%2 [R=301,L]

Redirecting incorrect variants of a page’s address to its preferred canonical address also helps ensure that any links shared afterwards use the preferred address.

If you cannot make changes to the server configuration, or don’t want the advantages or overhead of using redirects, you can look into making changes to the page’s links instead:

2. Manually setting the base URL in each document

Far at the back of the HTML toolkit, we can find the base element. You hardly ever see this technique used out on the real web, but it’s widely supported and dead useful. In fact, it’s actually meant to solve this exact problem.

The base element should be placed inside the head element before any elements containing links. Setting its href attribute changes the base URL used to resolve relative URLs on the current page. For example, a page opened as “/docs//page.html” that loads its stylesheet from “../assets/trend.css” will have that address understood by browsers as “/docs/assets/trend.css”, preventing the stylesheet from being loaded. Setting the base element to the canonical link, “http://www.example.com/docs/page.html”, will cause browsers to correctly interpret the address as “/assets/trend.css” and load it.

<!DOCTYPE html>
    <base href="http://www.example.com/docs/page.html" />

The URL used in the href attribute should match the absolute canonical URL for the current page. If you’re already generating a canonical link element, you already have this address and can reuse it for the base element.
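The effect of the base element can be sketched with the standard URL class as well. The helper resolveOnPage below is hypothetical and only imitates the order in which a browser works: the base href is resolved against the document’s address first, and relative links are then resolved against the result.

```javascript
// Imitate how a browser picks its base URL: a <base href> is itself resolved
// against the document's address, and links resolve against that result.
// Without a base element, the document's own address is the base.
function resolveOnPage(relative, documentUrl, baseHref) {
  const base = baseHref ? new URL(baseHref, documentUrl) : new URL(documentUrl);
  return new URL(relative, base).pathname;
}

const requested = "http://www.example.com/docs//page.html";
const canonical = "http://www.example.com/docs/page.html";

const withoutBase = resolveOnPage("../assets/trend.css", requested);
const withBase = resolveOnPage("../assets/trend.css", requested, canonical);

console.log(withoutBase); // "/docs/assets/trend.css"
console.log(withBase);    // "/assets/trend.css"
```

Because the base href here is an absolute URL, the address the page was actually requested from no longer influences the result.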

It can be a good idea to include a base element in your pages even if you’re redirecting all double-slashes. The base element will help user agents make sense of your relative links even if a user downloads a copy of your HTML and opens it up from their own computer later. It can also work as a signal to flag more serious issues on your site: if something breaks after including a base element in all your pages, you were doing something sketchy before and should address that problem properly. Test your links thoroughly after adding a base element for the first time.

If you don’t want to slow down to wait for redirects, you can conditionally use the JavaScript window.history.replaceState API to remove redundant forward slashes from users’ address fields. This is only a make-up job to reduce the chance of the user passing on this address to others and must still be used in combination with the base element.
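A minimal sketch of that clean-up could look like the following. The function names collapseSlashes and tidyAddressBar are my own, and the script assumes it runs on page load in a browser that supports window.history.replaceState; anywhere else it does nothing.

```javascript
// Collapse every run of two or more slashes in a path into a single slash.
function collapseSlashes(pathname) {
  return pathname.replace(/\/{2,}/g, "/");
}

// Rewrite the address bar without a reload. Returns true only when the
// address was actually changed.
function tidyAddressBar() {
  if (typeof window === "undefined" || !window.history || !window.history.replaceState) {
    return false; // not a browser, or the History API is unavailable
  }
  const loc = window.location;
  const cleaned = collapseSlashes(loc.pathname);
  if (cleaned === loc.pathname) {
    return false; // nothing to tidy
  }
  // Keep the current state object, query string and fragment untouched.
  window.history.replaceState(window.history.state, "", cleaned + loc.search + loc.hash);
  return true;
}

tidyAddressBar();
```

Only the path is touched, so a query string that legitimately contains doubled slashes is left alone.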

There are two other approaches that I’ll mention for the sake of completeness:

  1. Use only absolute URLs in the page

The page author has already chosen to use relative URLs, but that could be changed to use absolute URLs for all links. This avoids all problems with relative URLs but requires more substantial changes to the link architecture in the page and possibly throughout the website.

  2. Use only root-relative URLs in the page

One step removed from using absolute URLs, every URL could be changed to start with a forward slash and be relative to the root instead of the current page. In practice this has the same drawbacks as switching to absolute URLs. Some clients, especially dumb spiders, may have trouble resolving root-relative URLs in deeply nested paths.
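If you go down either of these two routes, the conversion itself is mechanical. Here is a rough sketch of two helpers that could run in a build step; the names toAbsolute and toRootRelative are invented for the example, and the canonical address is assumed to be known for each page.

```javascript
// Turn a relative href into an absolute URL, resolved against the address
// the author intends for the page (its canonical URL).
function toAbsolute(href, canonicalUrl) {
  return new URL(href, canonicalUrl).href;
}

// Turn a relative href into a root-relative one. Only meaningful for links
// that stay on the same site, because the host is dropped.
function toRootRelative(href, canonicalUrl) {
  const url = new URL(href, canonicalUrl);
  return url.pathname + url.search + url.hash;
}

const pageAddress = "http://www.example.com/docs/page.html";

console.log(toAbsolute("../products/nuts.html", pageAddress));
// "http://www.example.com/products/nuts.html"
console.log(toRootRelative("../assets/trend.css?v=2#top", pageAddress));
// "/assets/trend.css?v=2#top"
```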
