Too many requests from the CDN? Try URL canonicalization

Turns out URL canonicalization has benefits beyond SEO. Last week, I spent some time analyzing our IIS logs to help track down a performance issue. What I found surprised me. Our CDN (Akamai) didn't appear to be working properly.

How many requests should you expect from Akamai?

Akamai's architecture is pretty straightforward*. You make a request for msnbc.com content that's cached by Akamai (say http://www.msnbc.msn.com/id/42324795/ns/world_news-asiapacific/). Your request goes to one of Akamai's edge servers. If the content is already cached at the edge server and hasn't expired, it returns the content to you. If not, the request goes to one of Akamai's midgress servers. If the content is cached at the midgress server and hasn't expired, it returns the content you want to the edge server, which then hands it off to you. Otherwise, the request goes all the way to the msnbc.com data center (and is subsequently returned to the midgress server which returns it to the edge server).
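To make the lookup chain concrete, here is a minimal sketch of a two-tier cache in front of an origin. It is a toy model, not Akamai's implementation: the `CacheTier` and `FakeClock` classes, the `origin` function, and the 60-second TTL are all assumptions made for illustration.

```python
class FakeClock:
    """Manually advanced clock so the example does not depend on real time."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


class CacheTier:
    """One cache layer (edge or midgress) with a fixed TTL and a parent to ask on a miss."""

    def __init__(self, ttl_seconds, parent_fetch, clock):
        self.ttl = ttl_seconds
        self.parent_fetch = parent_fetch  # called on a miss or after expiry
        self.clock = clock
        self.store = {}  # url -> (content, expires_at)

    def fetch(self, url):
        entry = self.store.get(url)
        now = self.clock()
        if entry is not None and entry[1] > now:
            return entry[0]  # cached and not expired: serve from this tier
        content = self.parent_fetch(url)  # otherwise go one level up
        self.store[url] = (content, now + self.ttl)
        return content


origin_hits = []


def origin(url):
    """Stand-in for the data center; records every request that reaches it."""
    origin_hits.append(url)
    return f"content for {url}"


URL = "/id/42324795/ns/world_news-asiapacific/"
clock = FakeClock()
midgress = CacheTier(ttl_seconds=60, parent_fetch=origin, clock=clock)
edge = CacheTier(ttl_seconds=60, parent_fetch=midgress.fetch, clock=clock)

for _ in range(1000):
    edge.fetch(URL)
print(len(origin_hits))  # 1: only the first request reached the origin

clock.now += 61  # let both tiers expire
edge.fetch(URL)
print(len(origin_hits))  # 2: one more origin request after expiry
```

Note that the cache key here is the exact URL string, which matters later in this post.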

Based on this architecture, there is a theoretical upper bound on the number of requests that I should see at my origin servers for a given piece of content, since each midgress server should need to refetch it at most once per TTL. Akamai won't reveal how many midgress servers they have, but it's safe to assume it's somewhere between 50 and 100. For the time being, let's assume it's 70. Also, assume your cache TTL is measured in minutes. For a 24-hour period, the maximum number of requests our servers see should be:

Max requests at origin = (1440 minutes per day / TTL in minutes) * 70 midgress servers

If your TTL is 1 minute, you could see up to about 100K requests per day (1440 * 70 = 100,800) for a given piece of content.
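The arithmetic is easy to play with for other TTLs. The sketch below just evaluates the formula above; the function name and the default of 70 midgress servers are assumptions taken from this post's estimate, not a figure Akamai publishes.

```python
def max_origin_requests_per_day(ttl_minutes, midgress_servers=70):
    """Upper bound on daily origin requests for one cached item.

    Assumes each midgress server refetches the item at most once per TTL.
    """
    minutes_per_day = 24 * 60  # 1440
    refreshes_per_server = minutes_per_day / ttl_minutes
    return refreshes_per_server * midgress_servers


for ttl in (1, 5, 15):
    print(ttl, max_origin_requests_per_day(ttl))
# 1 100800.0
# 5 20160.0
# 15 6720.0
```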

What were we seeing?

For some content, I was seeing 5x the number of requests we should have been seeing. Before I lost my sanity asking our operations team for the 10th time to verify the Akamai configuration, one of my colleagues made an interesting observation (in this case, making an interesting observation is a euphemism for pointing out my folly).

My analysis was based on content ID. If you looked at requests per URL (which is how Akamai caches things), there were 4 major variations and dozens of minor variations of the URL. Each variation is cached separately, so each one generates its own stream of origin requests. The two variations at the top of the list differed only by the trailing slash:

/id/42324795/ns/world_news-asiapacific/

/id/42324795/ns/world_news-asiapacific
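A quick way to spot this in your own logs is to count requests both ways: per content ID and per exact URL. The sketch below uses a hypothetical list of request paths. Only the first two variants come from this post; the others are invented for illustration, and the regular expression assumes the `/id/<number>/` path layout shown above.

```python
import re
from collections import Counter

CONTENT_ID = re.compile(r"^/id/(\d+)")

# Hypothetical request paths pulled from a log; in practice you would parse
# these out of your IIS log files.
paths = [
    "/id/42324795/ns/world_news-asiapacific/",
    "/id/42324795/ns/world_news-asiapacific",
    "/id/42324795/ns/world_news-asiapacific/",
    "/id/42324795/",                            # invented variant
    "/id/42324795/ns/World_News-AsiaPacific/",  # invented variant
]

by_url = Counter(paths)  # one entry per exact URL, i.e. per cache key
by_content_id = Counter()
for path in paths:
    match = CONTENT_ID.match(path)
    if match:
        by_content_id[match.group(1)] += 1

print(len(by_content_id))  # 1 piece of content
print(len(by_url))         # 4 distinct URLs for it
```

If the CDN keys its cache on the exact URL, the upper bound from earlier applies to each of those URLs separately, not to the content ID as a whole.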

 

Problems like these have a two-part fix:

1. Issue a 301 redirect for any URL that isn't the canonical URL

2. Consolidate your URL generation code so that every link you emit already uses the canonical URL - this is challenging if you're working on a sprawling legacy codebase
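As a rough sketch of the first step, the function below decides whether a request needs a 301 and where it should go. It assumes the trailing-slash form is the canonical one, which this post does not actually specify, and the function names are hypothetical. A real rule set would also need to leave file-like paths (such as stylesheets or images) alone and handle the other variations you find in your logs.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url):
    """Return the canonical form of url under this sketch's assumed rule."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if not path.endswith("/"):
        path += "/"  # assumption: the trailing-slash form is canonical
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def redirect_if_needed(url):
    """Return (301, canonical_url) for a non-canonical URL, else None."""
    canonical = canonicalize(url)
    if canonical != url:
        return 301, canonical
    return None


print(redirect_if_needed("/id/42324795/ns/world_news-asiapacific"))
# (301, '/id/42324795/ns/world_news-asiapacific/')
print(redirect_if_needed("/id/42324795/ns/world_news-asiapacific/"))
# None
```

The same `canonicalize` function can be reused by your URL generation code, which is one way to keep the two parts of the fix from drifting apart.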

 

*I've simplified the explanation a bit. First, Akamai employs a pre-fetching mechanism, which can make your TTL seem much shorter. Also, I suspect Akamai actually serves stale content from the midgress and edge servers instead of making the end user's request wait while the stale content is replaced in cache.

Discuss this post

That just surprises me to no end that the problem of trailing slash was at fault! That was the last thing I expected you to suggest as a solution! The article was great, and really specific about how Akamai works, and this was the first time I ever heard that term "midgress". (Reminded me of Midgard from Norse mythology). But the trailing slash problem for sub-domains causes no end of trouble in the most unlikely places. I've read Google's support pages about properly formed URLs. Apparently it is not important (except for SEO) whether a domain has the trailing slash appended or not, although it should be. Yet for sub-domains, or pages on domains, it DOES make a difference and requires a 301 redirect.

This isn't Google's doing, correct? It seems more like internet architecture standards, something the W3C or Internet Society created. If so, I wonder why?

    Reply#1 - Mon Apr 11, 2011 6:09 PM EDT