Azure Cosmos DB Cross Partition Scan

The last project at work, I had a good opportunity to play Azure Cosmos DB closely, there is something I learned particularly about cross partition scan, and I think it is worth noting it down.

We all know cross partition scan is less efficient, and it is expensive, but what exactly happens under the hood, so to answer that question I proxied through my application using Fiddler.

I have to switch to Https mode which is less performance comparing to Direct TCP model, so that Fiddler can capture all network traffic, one of response headers indicates which physical partition request was operated against. After retried many times with different query on a collection that has 10 physical partitions (and for simplicity I left the MaxDegreeOfParallism not set), it shows that the SDK always starts its searching at partition 0, if the result doesn’t meet searching criteria then it sends out another one to partition 1, and so on..until it finds all or reach the end, this means it will never have an even distribution RU consumption cross partition because partition 0 will get hit the most, partition 1 gets hit second most, and so on. Depends on where the data is located if data is all located in partition 0 then I would be lucky (cost most effective although still uneven distribution), but if they are located at partition 9, the SDK eventually will have to send out 10 requests in total to find it, and that is roughly 20–30 RUs.

So what could go wrong with an uneven distribution? It is not costs effective, because the RU consumption allowance pre-configured for a Azure Cosmos DB is the total consumption, which means each partition gets an equal portion with a simple formula

RU consumption per partition = Total RU consumption / Number of partitions

Let’s say you pre-configured the above collection to have 50K RU consumption then each partition will be capped at 50K/10=5K, so even the partition 9 barely gets hit it still takes a big chunk of 5K allowance which completely goes wasted, and the most hit partition 0 will most likely be overloaded under peak load and as a result it will return 429 — Too Many Request error, however, it might not be bad as it looks like, as the SDK has a built-in retry mechanism that will retry the failed query internally, so your application consumer might not see any difference even your Azure Cosmos DB metric shows 429 errors.

If you have MaxDegreeOfParallism enabled (set to a positive number) the SDK will, instead of check-and-see, it then does pre-fetch by sending x (the value of the property) number of requests at a time to all partitions regardless where your data is located, and aggregates all responses internally, and this is done in a single call to ExecuteNextAsync() method which is a bit confusing so I raised an issue on GitHub for clarification.

Above explains why cross partition scan is expensive and why you should avoid it as much as possible, but can we avoid it? In my humble opinion it can be challenging, it would require carefully picking your partition key and design your application to work with it, if your application is a API it would likely rely on its consumers to pass on partition key to you, so it can be a cooperative design challenge.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store