Turns out URL canonicalization has benefits beyond SEO. Last week, I spent some time analyzing our IIS logs to help track down a performance issue. What I found surprised me. Our CDN (Akamai) didn't appear to be working properly.
How many requests should you expect from Akamai?
Akamai's architecture is pretty straightforward*. You make a request for msnbc.com content that's cached by Akamai (say http://www.msnbc.msn.com/id/42324795/ns/world_news-asiapacific/). Your request goes to one of Akamai's edge servers. If the content is already cached at the edge server and hasn't expired, it returns the content to you. If not, the request goes to one of Akamai's midgress servers. If the content is cached at the midgress server and hasn't expired, it returns the content to the edge server, which then hands it off to you. Otherwise, the request goes all the way to the msnbc.com data center, and the response is returned to the midgress server, which returns it to the edge server.
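The lookup chain described above can be sketched as a toy model. This is only an illustration of the edge, midgress, origin flow under the simplifications already noted, not Akamai's actual implementation; the CacheTier class, the 60-second TTLs, and the explicit "now" clock are all assumptions made for the example.

```python
class CacheTier:
    """One cache layer (edge or midgress) with a fixed TTL in seconds."""

    def __init__(self, name, ttl_seconds, parent_fetch):
        self.name = name
        self.ttl_seconds = ttl_seconds
        self.parent_fetch = parent_fetch  # callable(url, now) -> content
        self.store = {}                   # url -> (content, expires_at)

    def fetch(self, url, now):
        entry = self.store.get(url)
        if entry is not None and now < entry[1]:
            return entry[0]               # cached and not expired
        # Miss or expired: ask the next tier up, then cache the result.
        content = self.parent_fetch(url, now)
        self.store[url] = (content, now + self.ttl_seconds)
        return content


origin_hits = []  # every request that reaches the data center


def origin_fetch(url, now):
    origin_hits.append((url, now))
    return f"content for {url}"


midgress = CacheTier("midgress", 60, origin_fetch)
edge = CacheTier("edge", 60, midgress.fetch)

url = "http://www.msnbc.msn.com/id/42324795/ns/world_news-asiapacific/"
edge.fetch(url, now=0)    # miss at both tiers, so the origin is hit
edge.fetch(url, now=30)   # served from the edge cache
edge.fetch(url, now=61)   # both copies have expired, origin is hit again
print(len(origin_hits))
```

Note that the cache key here is the exact URL string, which is the detail that matters later in this post.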
Based on this architecture, there is a theoretical upper bound on the number of requests that I should see at my origin servers for a given piece of content, since each midgress server should need to fetch that content at most once per TTL. Akamai won't reveal how many midgress servers they have, but it's safe to assume it's somewhere between 50 and 100. For the time being, let's assume it's 75. Also, assume your cache TTL is measured in minutes. For a 24 hour period, the maximum number of requests our servers see should be:
Max requests at origin = 1440 minutes per day / TTL * 75 midgress servers
If your TTL is 1 minute, you could see up to 108,000 (roughly 100K) requests per day for a given piece of content.
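To make the arithmetic concrete, here is a minimal sketch of the same formula. The function name is made up for illustration, and the midgress server count of 75 is the same guess used above, not a figure from Akamai.

```python
MINUTES_PER_DAY = 24 * 60  # 1440


def max_origin_requests_per_day(ttl_minutes, midgress_servers):
    """Upper bound: each midgress server refetches at most once per TTL."""
    return MINUTES_PER_DAY / ttl_minutes * midgress_servers


for ttl in (1, 5, 15):
    print(ttl, max_origin_requests_per_day(ttl, midgress_servers=75))
# 1 108000.0
# 5 21600.0
# 15 7200.0
```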
What were we seeing?
For some content, I was seeing 5x the number of requests we should have been seeing. Before I lost my sanity asking our operations team for the 10th time to verify the Akamai configuration, one of my colleagues made an interesting observation (in this case, making an interesting observation is a euphemism for pointing out my folly).
My analysis was based on content ID. If you looked at requests per URL (which is how Akamai caches things), there were 4 major variations and dozens of minor variations of the URL. Each variation is cached separately, so each one gets its own share of origin requests. The two variations at the top of the list? The same URL with and without the trailing slash.
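A rough sketch of the kind of log analysis that exposes this: group request paths by content ID, then count the distinct URLs under each ID. The sample paths are made up for illustration (only the first comes from this post), and the /id/<digits> pattern is an assumption based on the example URL above.

```python
import re
from collections import Counter, defaultdict

CONTENT_ID = re.compile(r"/id/(\d+)")  # assumed URL shape


def variants_by_content_id(paths):
    """Map each content ID to a Counter of the distinct paths requesting it."""
    grouped = defaultdict(Counter)
    for path in paths:
        match = CONTENT_ID.search(path)
        if match:
            grouped[match.group(1)][path] += 1
    return grouped


# Illustrative sample, not real log data.
sample_paths = [
    "/id/42324795/ns/world_news-asiapacific/",
    "/id/42324795/ns/world_news-asiapacific",
    "/id/42324795/ns/world_news-asiapacific/",
    "/id/42324795/",
]

for content_id, counts in variants_by_content_id(sample_paths).items():
    print(content_id, len(counts), "distinct URLs")
    for path, hits in counts.most_common():
        print("  ", hits, path)
```

Any content ID with more than one distinct URL is a candidate for exceeding the per-content upper bound, because each URL is its own cache entry.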
Problems like these have a two part fix:
1. Issue a 301 redirect for any URL that isn't the canonical URL
2. Consolidate your URL generation code - this is challenging if you're working on a sprawling legacy codebase
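The first step might look something like the sketch below. The canonical form chosen here (lowercase host, exactly one trailing slash, no fragment) is an assumption for illustration; the real rules depend on your site. Both function names are hypothetical, and hooking the result into IIS or your web framework is left out.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url):
    """Normalize a URL to one assumed canonical form."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") + "/"   # exactly one trailing slash
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), path, parts.query, "")
    )


def redirect_for(url):
    """Return (301, canonical_url) if url is not canonical, else None."""
    target = canonical_url(url)
    return None if target == url else (301, target)


print(redirect_for("http://www.msnbc.msn.com/id/42324795/ns/world_news-asiapacific"))
print(redirect_for("http://www.msnbc.msn.com/id/42324795/ns/world_news-asiapacific/"))
```

Using the same canonical_url helper in the code that generates links covers the second step as well, since pages then never emit a non-canonical URL in the first place.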
*I've simplified the explanation a bit. First, Akamai employs a pre-fetching mechanism, which can make your TTL seem much shorter. Also, I suspect Akamai actually serves stale content from the midgress and edge servers instead of blocking the end-user request while the expired content is refreshed.