  • RavenDB: Lessons Learned: Query Includes and Projections

    By default, RavenDB will only allow 30 requests per session. This is part of RavenDB's "Safe by default" behaviors, to prevent you from making a giant number of RavenDB HTTP requests, which would be a performance quagmire.

    Let's say you have an object graph that you are retrieving from RavenDB that contains referenced documents, and it looks like this:

    stories/123
    {
      "Headline": "New iPad is key to Apple's bottom line",
      "Author": "Jack Smith",
      "LastPublishedAtUtc": "2012-03-14T23:48:00.0000000+00:00",
      "PublishStatus": "Published",
      "StoryReferences":
      {
        "storyreference/456213":
        {
          "Headline": "New iPhone coming soon",
          "Author": "John Doe"
        },
        "storyreference/789654":
        {
          "Headline": "New iPad foils reviewers' attempts to find legitimate faults",
          "Author": "Jane Doe"
        },
        "storyreference/555111":
        {
          "Headline": "Now on Netflix: Search by TV network",
          "Author": "Jack Smith"
        },
        "storyreference/942342":
        {
          "Headline": "Apple stores to open at 8am for iPad launch",
          "Author": "John Doe"
        }
        ...
      }
    }
    

    And let's say you are interested in getting a small subset of data about the referenced stories for display with the base story. What you DON'T want to do is something like this:

    
    var story = session.Load<Story>("stories/123");
    
    foreach (var storyReference in story.StoryReferences)
    {
        // Each iteration costs one more HTTP request to the server.
        var otherStory = session.Load<Story>(storyReference.Id);
        // ... do something with otherStory ...
    }
    
    

    That will result in the following HTTP traffic back to Raven:

    1. Make a request for 'stories/123'
    2. Make a request for 'stories/456213'
    3. Make a request for 'stories/789654'
    4. Make a request for 'stories/555111'
    5. Make a request for 'stories/942342'
    6. ...etc...

    You'll consume unnecessary bandwidth and incur the cost of individual HTTP requests. What you really want to do is have the client make a single HTTP request. Fortunately RavenDB allows you to do that with Includes. A RavenDB include says "Hey server, go get this for me, but before you give it back to me, gather up these other things and return them with the request too so I can deal with them in a moment".
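
    To make the difference in round trips concrete, here is a minimal sketch using a toy in-memory session. ToySession, LoadWithIncludes and IncludeDemo are hypothetical names invented for this illustration; this is not the RavenDB client API, only a model of the request counting.

```csharp
using System;
using System.Collections.Generic;

// Toy stand-in for a document session. NOT the RavenDB client API.
public class ToySession
{
    private readonly Dictionary<string, string[]> server;          // id -> referenced ids
    private readonly HashSet<string> cache = new HashSet<string>(); // ids already fetched

    public int RequestCount { get; private set; }

    public ToySession(Dictionary<string, string[]> server)
    {
        this.server = server;
    }

    // Loads one document; costs a round trip unless it is already cached.
    public string[] Load(string id)
    {
        if (!cache.Contains(id))
        {
            RequestCount++;
            cache.Add(id);
        }
        return server[id];
    }

    // Loads one document and, in the same round trip, caches everything it references.
    public string[] LoadWithIncludes(string id)
    {
        RequestCount++;
        cache.Add(id);
        foreach (var referencedId in server[id])
        {
            cache.Add(referencedId);
        }
        return server[id];
    }
}

public static class IncludeDemo
{
    public static Dictionary<string, string[]> SampleData()
    {
        return new Dictionary<string, string[]>
        {
            { "stories/123", new[] { "stories/456213", "stories/789654", "stories/555111", "stories/942342" } },
            { "stories/456213", new string[0] },
            { "stories/789654", new string[0] },
            { "stories/555111", new string[0] },
            { "stories/942342", new string[0] }
        };
    }

    // Loads the base story and then every referenced story; returns the round trips used.
    public static int CountRequests(bool useIncludes)
    {
        var session = new ToySession(SampleData());
        var referenced = useIncludes
            ? session.LoadWithIncludes("stories/123")
            : session.Load("stories/123");

        foreach (var id in referenced)
        {
            session.Load(id);
        }
        return session.RequestCount;
    }
}
```

    With the four referenced stories in the sample data, IncludeDemo.CountRequests(false) makes 5 round trips, while IncludeDemo.CountRequests(true) makes 1.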

    A few weeks back we had some code that was hitting the 30 requests per session limit. At first we couldn't understand why, since we do a pretty good job of making sure we only make 1 or 2 requests via Includes. Upon further inspection, it turned out we had misunderstood something about the RavenDB client API.

    What's the problem?

    Say we have an index that produces projections: server-side anonymous entities containing a flattened "StoryReferenceIds" collection, like this (a contrived example):

    
    public class Stories_ByReferencedStories : AbstractIndexCreationTask<Story>
    {
        public class Result
        {
            public string Headline { get; set; }
            public DateTimeOffset? LastPublishedAtUtc { get; set; }
            public IEnumerable<string> StoryReferenceIds { get; set; }
        }
    
        public Stories_ByReferencedStories()
        {
            this.Map = stories => from story in stories
                                  select new
                                  {
                                      Headline = story.Headline,
                                      LastPublishedAtUtc = story.LastPublishedAtUtc,
                                      StoryReferenceIds = story.StoryReferences.Select(x => x.Id),
                                  };
        }
    }
    
    

    ... Then we had previously done something like the following on our Lucene queries against it:

    
    session.Advanced.LuceneQuery<Story, Stories_ByReferencedStories>()
      .WhereStartsWith("Headline", text)
      .OrderBy("-LastPublishedAtUtc")
      .Include("StoryReferenceIds")
    

    However, it turns out that the last Include line doesn't do anything at all. The Include() call operates on the documents identified by the index, NOT on the projection. In other words, the stories produced from the query are what the Include() call operates against, and those stories have no "StoryReferenceIds" property, so there is nothing to include.

    So, with that in mind, what we actually want is something like this:

    
    .Include("StoryReferences,Id");
    

    The syntax with the comma may look a little funny, but what it means is "For each entry in the StoryReferences collection, include the document identified by that entry's Id property". So if you had a story with 45 referenced stories in it, instead of making 46 requests back to Raven, you would make only 1 request. That's much better.
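
    As a rough illustration of how a "Collection,Property" path can be read, the sketch below resolves such a path against a document modeled as nested dictionaries and lists. IncludePathSketch, ResolveIds and SampleStory are hypothetical names for this example, and the modeling is an assumption; RavenDB's own resolution happens on the server and is not shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IncludePathSketch
{
    // Resolves "Property" or "Collection,Property" into the list of document ids it points at.
    public static List<string> ResolveIds(Dictionary<string, object> document, string includePath)
    {
        var parts = includePath.Split(',');

        if (parts.Length == 1)
        {
            // Plain path: the property itself holds a single id.
            return new List<string> { (string)document[parts[0]] };
        }

        // "Collection,Property": take Property from every item in Collection.
        var collection = (IEnumerable<Dictionary<string, object>>)document[parts[0]];
        return collection.Select(item => (string)item[parts[1]]).ToList();
    }

    // A cut-down story document with two references, modeled as a list of entries.
    public static Dictionary<string, object> SampleStory()
    {
        return new Dictionary<string, object>
        {
            { "Headline", "New iPad is key to Apple's bottom line" },
            {
                "StoryReferences",
                new List<Dictionary<string, object>>
                {
                    new Dictionary<string, object> { { "Id", "stories/456213" } },
                    new Dictionary<string, object> { { "Id", "stories/789654" } }
                }
            }
        };
    }
}
```

    Calling IncludePathSketch.ResolveIds(IncludePathSketch.SampleStory(), "StoryReferences,Id") yields the two referenced ids, which is the set of documents an include would bring back alongside the story.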

    Happy coding.

  • msnbc.com wins Compuware's "Best of Web" for site performance in 2011

    For the truly curious, you can download the full report here. The basic idea is that Compuware took a look at availability, responsiveness, and consistency to generate a composite score which ranks the top 25 online news sites. The Compuware folks were even more impressed when they learned that we maintained our number 1 position while doing multiple releases per day and while pushing experiences that are, on average, more complex than our competitors. 

    Although I believe pretty strongly in self-deprecation, I'll break with convention for a moment to give props to my team for their contributions to overall site availability and responsiveness. Congrats!

    For insight into what we've done lately in the war against page load times, check out the new experience on http://www.technolog.msnbc.msn.com/technology/technolog.

  • Customizing RavenDB: A simple RavenDB server bundle for replication conflict handling

    Last week in our webinar on RavenDB, we mentioned that we have at least two RavenDB instances in each of our data centers and that we have each of them configured to replicate to all the others.  One attendee asked how we handle replication conflicts.

    First, we try to limit replication conflicts, by having all writes to a given database directed at a single RavenDB instance.  We showed off our UI for managing that, and I have another post in the works describing the details.

    But there are inevitably cases where there is a conflict.  We have some automated processes that update content, so multiple saves in rapid succession are not unusual.  We also have integration challenges with other systems, resulting in our receiving the same document multiple times at more or less the exact same moment.  If our write master changes at an inopportune moment (either due to failover or an intentional change by an admin), we'll likely have a replication conflict.

    As I mentioned in my post on read-only vs. read-write databases, our front end web apps aren't allowed to write to most databases, so they can't resolve the conflict.  In any case, that would probably make things worse, given the number of front-end servers we have.

    Fortunately, there's an easy solution for us.  The nature of our data and our business means we always want the latest version of a document to win.  The Replication bundle provides a hook to allow custom handling of replication conflicts on the RavenDB server.  We simply compare the Last-Modified value of the existing document with that of the inbound replicating document.  Whichever has the later date wins.

    Here's the code for the LastInWinsReplicationConflictResolver: https://gist.github.com/2012016
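
    The gist has the real resolver. The comparison rule on its own can be sketched without any RavenDB types, as below. VersionStamp, LastInWinsSketch and PickWinner are hypothetical names for this illustration, and keeping the existing document on an exact tie is an assumption, not something taken from the gist.

```csharp
using System;

// Minimal stand-in for a document version: where it came from and its Last-Modified value.
public class VersionStamp
{
    public string Source { get; private set; }
    public DateTimeOffset LastModified { get; private set; }

    public VersionStamp(string source, DateTimeOffset lastModified)
    {
        Source = source;
        LastModified = lastModified;
    }
}

public static class LastInWinsSketch
{
    // Returns whichever version has the later Last-Modified value.
    // Assumption: on an exact tie the existing document is kept.
    public static VersionStamp PickWinner(VersionStamp existing, VersionStamp inbound)
    {
        return inbound.LastModified > existing.LastModified ? inbound : existing;
    }
}
```

    Given an existing version stamped 23:48:00 and an inbound version stamped 23:48:01, PickWinner returns the inbound one; swap the timestamps and the existing document stays.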

    Score another one for the simple extensibility provided by RavenDB.

  • Customizing RavenDB: Read-only and read-write document stores

    One of the simple customizations we've made to RavenDB is the ability to specify read-only vs. read-write document stores.  We have editorial and ingest applications that need to write into our RavenDB databases.  However, we don't want anyone -- even ourselves -- to be able to write to most of those databases from the public apps that render our various sites.

    Raven has a lot of extensibility points, and we used one of them to make some document stores read-only from some of our apps.

    The Raven.Client.Listeners namespace contains several interfaces that allow you to hook into any query, save or delete that happens on a given DocumentStore instance.  For this feature, we used IDocumentStoreListener and IDocumentDeleteListener. We implemented both in a ReadOnlyListener class that throws on any attempt to call Store() or Delete().  That happens on the client side, in our app, before any communication with the Raven server.

    Here's the complete code of ReadOnlyListener:

        public class ReadOnlyListener : IDocumentStoreListener, IDocumentDeleteListener
        {
            private const string ErrorMessage = 
                "The store is read-only. To enable writes, use StoreAccessMode.ReadWrite.";
    
            public void AfterStore(string key, object entityInstance, RavenJObject metadata)
            {
                // Do nothing.
            }
    
            public bool BeforeStore(string key, object entityInstance, RavenJObject metadata)
            {
                throw new InvalidOperationException(ErrorMessage);
            }
    
            public void BeforeDelete(string key, object entityInstance, RavenJObject metadata)
            {
                throw new InvalidOperationException(ErrorMessage);
            }
        }

     

    The Before methods are executed by the Raven client, um, before the call to the Raven server.

    We have a class called PlatformDocumentStore that enforces some conventions when creating a DocumentStore. This is a typical call that happens at app startup:

        PlatformDocumentStore.Register(container, StoreName.Content);
    

     

    That line of code creates an instance of DocumentStore pointed at the nearest read-only version of a RavenDB database named Content.  It also registers it in the container (an instance of UnityContainer).  The second parameter is just a string.  We don't want arbitrary stores being created; the StoreName class contains the names of the "allowed" stores.  When we register the store, we register it as IDocumentStore using the store name.  That allows us to inject it into our controllers with Unity's [Dependency] attribute and specify the name of the store.

    But back to read-only vs. read-write...  What you don't see is the optional third parameter.  An editorial app would have this line at app startup:

        PlatformDocumentStore.Register(container, StoreName.Content, StoreAccessMode.ReadWrite);
    

     

    StoreAccessMode defaults to ReadOnly.  That third parameter says it's okay for the application to write to the store.  When we new up the DocumentStore instance inside of PlatformDocumentStore, we have this code:

        if (accessMode == StoreAccessMode.ReadOnly)
        {
            documentStore.RegisterListener((IDocumentStoreListener)new ReadOnlyListener());
            documentStore.RegisterListener((IDocumentDeleteListener)new ReadOnlyListener());
        }
    
    

    With the listener attached for both Store() and Delete() calls, we have effectively prevented any modifications to the data from within this app.

    It's important to note that the database on the Raven server is perfectly happy to accept writes -- we've just hooked into all attempts to use the client-side API from within this one application.  Also, a developer could simply new up a DocumentStore without the listeners and call Store() or Delete() to their heart's content.  This is not a security measure -- it's just a simple way to prevent inadvertent writes where we don't want them.  Like setting the Read-only flag on a file in the file system.
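
    To show why a throwing listener means nothing ever reaches the server, here is a small self-contained model of the pattern. ToyDocumentStore, RegisterBeforeStore and ReadOnlySketch are hypothetical names for this sketch; it does not use the Raven client, and ServerCalls merely counts what would have been a network call.

```csharp
using System;
using System.Collections.Generic;

// Toy store that runs client-side listeners before "calling the server".
public class ToyDocumentStore
{
    private readonly List<Action<string>> beforeStoreListeners = new List<Action<string>>();

    // Number of writes that made it past the listeners.
    public int ServerCalls { get; private set; }

    public void RegisterBeforeStore(Action<string> listener)
    {
        beforeStoreListeners.Add(listener);
    }

    public void Store(string key)
    {
        // Listeners run first; if one throws, the server is never contacted.
        foreach (var listener in beforeStoreListeners)
        {
            listener(key);
        }
        ServerCalls++;
    }
}

public static class ReadOnlySketch
{
    // Mirrors the idea of a read-only default: attach a throwing listener unless writes are allowed.
    public static ToyDocumentStore CreateStore(bool readOnly)
    {
        var store = new ToyDocumentStore();
        if (readOnly)
        {
            store.RegisterBeforeStore(key =>
            {
                throw new InvalidOperationException("The store is read-only.");
            });
        }
        return store;
    }

    // Returns true if the write went through, false if a listener rejected it.
    public static bool TryStore(ToyDocumentStore store, string key)
    {
        try
        {
            store.Store(key);
            return true;
        }
        catch (InvalidOperationException)
        {
            return false;
        }
    }
}
```

    A store created with readOnly set to true rejects the write and leaves ServerCalls at 0. As with the real listener, nothing stops a developer from creating a store without the listener.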

  • How we got started with RavenDB

    Developers always want to use the newest, coolest tools.  Admins want everything to be 100% reliable and stable.   RavenDB was not just a new tool, it was an entirely new kind of tool for us.  Successfully introducing it to our environment required our Ops folks to be comfortable with monitoring and maintaining it.

    After our initial testing in various environments, we talked about it with Ops and deployed RavenDB to production -- without anything actually talking to it.  That was step 1 -- get it deployed.  It was on its own servers and nothing depended on it, but it was out there.  RavenDB was now part of our build and deployment process, so we could update it whenever we needed to.

    Step 2 was to deploy some fairly trivial code, accessible only to internal users, that used RavenDB.  So far so good.

    Step 3 was to turn on replication across the multiple instances of RavenDB.  Here we did find some issues: a few in our own code as we learned about RavenDB; one in RavenDB itself -- a bug involving the combination of replication and DTC-managed transactions.  After a few emails with Oren and Hibernating Rhinos support, a build with a fix was available for download within a couple of days.  We also asked for a new feature -- the ability to disable transitive replication.  That was available in another build within a week.

    Step 4 was to start using RavenDB behind a real feature.  We chose a non-mission-critical feature of our editorial system -- something that, if broken, would only affect our editors, not the public.  We shipped that and learned a bit more.  Bryan posted about his experience creating that feature.

    Step 5 was to ship a feature used by the public.  We did an A/B test of a new user experience for some of our content in late 2011.  That ran on RavenDB.  More recently, in Feb 2012, we fully released that feature:  we transitioned some of our blogs to a new look and feel, from our new RavenDB-backed CMS.  You can see those blogs at:

     

  • Why we looked at RavenDB

    In my previous post, I talked about our initial work in our new "SkyPad" CMS, and some of the tradeoffs we made around data storage and replication.

    As we continued to evolve SkyPad, we realized that the majority of our data isn't inherently relational.  It's almost entirely documents -- serialized object blobs.  That caused us to look at the available options for key-value stores, document stores, distributed hashtables and the like. 

    RavenDB looked like a good fit for a bunch of reasons:

    It is a document database. The basic paradigm is fairly close to our problem domain: saving, updating and publishing documents.  Our editors, developers and admins think and talk in terms of documents.

    It is schema-less.  To keep up with our competition and with what our users expect, we have to be adding and changing features constantly.  We need to add properties to our entities, create new entities, split entities, rename entities.  RavenDB stores documents as JSON.  It de/serializes our types effortlessly, while giving us the tools to take over and do our own thing when necessary.  There is no schema to keep in sync with our code changes.

    It is transactional.  We can't afford to lose a document.  If our database says "I saved it", we need that to be true no matter what.  Our service bus is built with WCF and MSMQ.  RavenDB works with System.Transactions as you would expect, so we didn't have to do anything special to use it during transactional message handling.

    It is .NET-focused and supports querying with LINQ.  We're a .NET shop.  Ramp-up speed for new developers and code readability are critical.  Anyone who has used C# and LINQ can understand the queries we're making against RavenDB and start creating their own very quickly.  Not to mention we can all read the RavenDB source itself more easily.

    It is extensible.  The list of provided bundles reads almost like our requirements list:  replication, versioning, document expiration, authorization.  Plus we can change RavenDB's behavior where we need to.  (Bundles on the server side, and listeners on the client-side.)  That makes our developers a lot more comfortable.

    It is both open source and commercial.  We can see all the code, submit pull requests when we want something to be different, or just make our own private changes to the codebase.  But we can also pay for it, count on support, and -- ahem -- have a throat to squeeze if something goes seriously awry…

    It is based on very mature storage technology.  RavenDB writes everything to disk, and the underlying storage is ESENT, which is also used by Exchange and Active Directory.  Those underpinnings made our Ops team more comfortable trying it out.  We have a humongous read-to-write ratio, so the performance characteristics of writing everything to disk are just fine for us.

  • The backstory on data storage in our CMS

    In 2009 we started migrating features from our legacy CMS to a new one, which we refer to as SkyPad.  In previous posts, I've talked about our SOA approach and our service bus.  Here I'll be talking just about data storage.

    We use SQL Server a lot.  It's been supporting most of our systems and running well for 15 years.  It was natural for us to continue using SQL Server as we started to create SkyPad. 

    As we started developing features in SkyPad, we found that almost all of our tables were identical.  We deal with documents  -- stories, videos, slideshows, images, etc. -- and the table structure we were creating over and over again consisted of a document id, an xml blob with the document contents, and some timestamps for auditing and concurrency control. 

    That common table structure led us to develop a simple CRUD repository.  Create, Update and Delete were trivial, as was retrieving by ID.  Querying was harder, since it required either storing the queryable properties separately, or digging into the XML blob.  All doable, but not as simple as we wanted.

    As a major news organization, it is, alas, a newsworthy event if one of our sites goes down even for a few minutes.  So our sites run in multiple data centers, and our ops team requires complete redundancy within each data center.  We have sets of servers called "pods".  Each is a small, but complete, production environment.  During normal maintenance and deployments (a couple of times a day), we can take a pod offline and still be running live out of all the data centers.  That requires our data to be redundant too: each pod has a copy of all the live data.

    SQL Server provides excellent replication and high availability features.  However, we were trying hard to build a system that could be run on commodity hardware and scale out, not up.  We wanted to avoid clustering, SANs and other solutions that require extensive setup, care and feeding by our Ops team.  We wanted teams to be able to quickly spin up a test environment that looked a lot like production.

    For those reasons, we decided to use our service bus to synchronize data across the various databases in multiple data centers.  When a pod is taken offline, data synchronization messages start queuing up.  When the pod is brought back online, the messages are handled and the pod's data is up to date before it starts receiving live requests.
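
    The queue-then-catch-up behavior can be modeled in a few lines. This is a toy sketch only: ToyPod and its members are hypothetical names, and an in-memory queue stands in for the real service bus (WCF and MSMQ), which is not shown.

```csharp
using System;
using System.Collections.Generic;

// Toy model of a pod that receives data synchronization messages.
public class ToyPod
{
    private readonly Queue<KeyValuePair<string, string>> pending =
        new Queue<KeyValuePair<string, string>>();
    private readonly Dictionary<string, string> data = new Dictionary<string, string>();

    public bool IsOnline { get; private set; }

    public int PendingCount
    {
        get { return pending.Count; }
    }

    public ToyPod()
    {
        IsOnline = true;
    }

    public void TakeOffline()
    {
        IsOnline = false;
    }

    // While online, apply the message immediately; while offline, queue it for later.
    public void Receive(string documentId, string content)
    {
        if (IsOnline)
        {
            data[documentId] = content;
        }
        else
        {
            pending.Enqueue(new KeyValuePair<string, string>(documentId, content));
        }
    }

    // Drain queued messages in arrival order, then start serving live requests.
    public void BringOnline()
    {
        while (pending.Count > 0)
        {
            var message = pending.Dequeue();
            data[message.Key] = message.Value;
        }
        IsOnline = true;
    }

    // Returns the stored content, or null if the pod has never seen the document.
    public string Read(string documentId)
    {
        string content;
        return data.TryGetValue(documentId, out content) ? content : null;
    }
}
```

    Messages received while the pod is offline sit in the queue, and BringOnline applies them in order before flipping IsOnline, so the pod's data is current by the time it serves requests.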

    That works fine, though it has some issues.  To reduce concurrency conflicts, at any given time, one pod is designated as the write-master for each application that is doing writes.  We deploy multiple times a day, and we need to change the write-masters each time.  We need a tool to re-sync pods when the inevitable problems occur or when we bring a new pod online.  These were the tradeoffs we made when deciding to roll our own replication.