Recently we were alerted that one of our apps hosted in Azure had become very slow, and one of the observations from the awesome Azure Availability and Performance tool was a high number of outbound TCP connections. The graph below shows the count rising and falling regularly during that period, and spiking especially high during app start-up.
Also, this chart shows the SNAT port usage by the app was crazy high too, almost at the edge of exhaustion.
What is SNAT?
According to this:
Source network address translation (SNAT) rewrites the IP address of the backend to the public IP address of the load balancer. It enables IP masquerading of the backend instance to prevent outside sources from having a direct address to the backend instances.
Each instance on Azure App Service is initially given a pre-allocated quota of 128 SNAT ports. The SNAT port limit affects opening connections to the same address-and-port combination.
When applications or functions rapidly open new connections, they can quickly exhaust their pre-allocated quota of 128 ports, and they are then blocked until a new SNAT port becomes available. SNAT port exhaustion can cause app performance issues once the cap is reached.
And thanks again to the Azure Availability and Performance tool, which showed the top five remote endpoints with the most outbound connections; that led my investigation to Azure Key Vault and our Azure SQL server.
Azure Key Vault
Azure Key Vault is a cloud service that provides a secure store for secrets. We use the Microsoft.Extensions.Configuration.AzureKeyVault package, whose API pulls down secrets at app start-up and maps them to appsettings.json, keeping sensitive information out of the configuration file. It is really nice and neat.
However, it loads all secrets in parallel (thanks to PR #944), which was originally requested by the community to improve performance; since then, other people have started running into the same experience.
After a deep dive into the implementation, I noticed it uses Task.WhenAll to retrieve all secrets on a secret page at once, so the underlying HttpClient doesn't have time to reuse ports; instead it rapidly opens new connections even though the remote IP-and-port combination is the same. The Key Vault account the app connects to holds over 180 secrets in total, which makes it even worse :)
I wrote a simple app to demo this, and the following snippet shows the kind of code that produces the impact:
var client = new HttpClient();
// Create an array of 200 requests to the same URL
var sameDomainUrls = Enumerable.Repeat("https://www.google.co.uk/", 200).ToArray();
var tasks = sameDomainUrls.Select(url =>
{
    var httpRequest = new HttpRequestMessage();
    httpRequest.Method = HttpMethod.Get;
    httpRequest.RequestUri = new Uri(url);
    return client.SendAsync(httpRequest);
});
await Task.WhenAll(tasks);
Azure SQL server
The every-30-minutes SNAT port exhaustion shown in the chart earlier was actually caused by a massive number of TCP connections established to the Azure SQL server. By design, our app refreshes its in-memory cache every 30 minutes by querying the remote database, in order to keep the cache up to date. On top of that, there is a separate N+1 problem: based on the result of the first query, the app fans out N subqueries in parallel using Task.WhenAll(tasks).
Although it is not using HttpClient, and I am not a SQL expert, I suspect the same principle applies to ADO.NET: because all tasks run in parallel, the SQL connection pool doesn't have time to reuse a pooled connection and instead rapidly establishes new SQL connections all at once.
Microsoft already has a fix for the SNAT port exhaustion issue caused by the obsolete Azure Key Vault package: it is suggested to switch to the new Azure.Extensions.AspNetCore.Configuration.Secrets package, which limits the parallelism of secret loading.
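For reference, wiring up the newer package is only a few lines. This is a minimal sketch, assuming the Azure.Extensions.AspNetCore.Configuration.Secrets and Azure.Identity packages; the vault URL is a placeholder, not our real vault.

```csharp
using System;
using Azure.Identity;
using Microsoft.Extensions.Configuration;

// Hypothetical vault URL for illustration only.
var config = new ConfigurationBuilder()
    .AddAzureKeyVault(
        new Uri("https://my-vault.vault.azure.net/"),
        new DefaultAzureCredential()) // uses managed identity, env vars, etc.
    .Build();

// Secrets are now available alongside appsettings.json values,
// e.g. config["MySecretName"].
```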
Although the N+1 selection problem probably indicates a separate underlying design flaw, to fix the high TCP connection issue I reviewed every place in the app where Task.WhenAll is used and checked whether parallelism was actually needed. Replacing it with a traditional, less fancy foreach loop that awaits each database query has really helped.
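The change looks roughly like this. A simulated query (Task.Delay) stands in for the real database call, and all names here are hypothetical, not the app's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class Demo
{
    // Pretend this runs one subquery against the database.
    static async Task<int> QueryAsync(int id)
    {
        await Task.Delay(10); // simulated network round trip
        return id * 2;        // simulated result
    }

    static async Task Main()
    {
        var ids = new[] { 1, 2, 3, 4, 5 };

        // Before: fan out all N subqueries at once, each needing a connection.
        // var results = await Task.WhenAll(ids.Select(QueryAsync));

        // After: one query at a time, so the pooled connection can be reused.
        var results = new List<int>();
        foreach (var id in ids)
        {
            results.Add(await QueryAsync(id));
        }

        Console.WriteLine(string.Join(",", results)); // prints 2,4,6,8,10
    }
}
```

The sequential version is slower in wall-clock time, but it stops the burst of simultaneous connections that was exhausting SNAT ports.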
The following chart shows the difference after the fix was applied.
I used to really love Task.WhenAll with a passion; whenever I saw an opportunity I would change the code to use it and commit it with "refactoring" as the message :). It is still a great API for parallelism, don't get me wrong, but now I understand the underlying impact a bit better and will use it with caution.