If you’re reading this, then you have no doubt encountered the following error while running your .NET Core app:
System.IO.IOException: The configured user limit (128) on the number of inotify instances has been reached, or the per-process limit on the number of open file descriptors has been reached.
This is a full application error that results in your host application terminating. Essentially this error is telling us that we are watching too many files… more than the host OS allows for any single user (or application). Some piece or library within our application is consuming more file watchers than we thought.
Turning to our trusty friend “google” for some help, it would seem there are a couple of workarounds being suggested, all depending on where you are encountering the error.
RazorViewEngineOptions.AllowRecompilingViewsOnFileChange = false;
This may work, I never tried it… don’t really use Razor or views in ASP.NET Core much these days with most of the UIs being built as static SPAs. In my case, this was not the issue.
By default, the .NET runtime can watch for file changes by receiving notifications asynchronously when they happen. An example would be the FileSystemWatcher class. You can adjust this behavior so that, instead of placing a file watcher, the application polls for file changes. This change supposedly incurs a small performance hit and a memory increase as a result of polling for the changes.
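As a minimal sketch of the polling option: the .NET file providers read the DOTNET_USE_POLLING_FILE_WATCHER environment variable, and setting it to true asks them to poll instead of registering watchers. Whether this covers the watchers in your particular app is an assumption you would need to verify, and the Demo.Web.dll name is only borrowed from the Dockerfile later in this post.

```shell
# Ask the .NET file providers to poll for changes instead of using inotify.
# Assumption: the watchers in question come from the physical file provider
# (for example, configuration reloads), which honors this variable.
export DOTNET_USE_POLLING_FILE_WATCHER=true

# Then start the app from the same shell so it inherits the variable, e.g.:
# dotnet Demo.Web.dll
```

In a container, the same variable could be set with an ENV line in the Dockerfile rather than an export.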
In either case, I have no custom code or file watching in place for most of my API driven applications.
What was unclear to me at first was that the error message is actually platform-specific and only occurs on Linux. That makes sense given that the “inotify” user limit is a Linux-specific construct. Additionally, in my case, I never experienced this issue in my local Windows development environment, nor should I expect to.
Local development issues have come up for those running Linux locally. Specifically, some people have needed a workaround because simply developing with VS Code and .NET Core consumes additional file watchers.
That being said, I was able to reproduce the issue locally on Windows when running the application inside a container. Whether locally on Linux or in a Linux container, you can work around the issue by increasing the number of watchers:
echo fs.inotify.max_user_watches=524288 | sudo tee -a /etc/sysctl.conf && sudo sysctl -p
While I can use the fix above to increase the file watchers inside my container execution, it didn’t seem to get at the root problem. It is a clear workaround.
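Before changing any limit, it can help to see what the kernel currently allows. The sketch below prints the inotify limits exposed under /proc/sys/fs/inotify. Note that the error message quotes the instance limit (128), which is the max_user_instances value, a separate setting from max_user_watches. The function name show_inotify_limits is made up for illustration.

```shell
# Print each inotify limit as "name=value".
# $1 is the directory to read; it defaults to the kernel's /proc/sys/fs/inotify.
show_inotify_limits() {
  limits_dir="${1:-/proc/sys/fs/inotify}"
  for name in max_user_instances max_user_watches max_queued_events; do
    # Skip entries that are missing or unreadable on this system.
    if [ -r "$limits_dir/$name" ]; then
      echo "$name=$(cat "$limits_dir/$name")"
    fi
  done
}

# Usage on a Linux host or inside a container:
# show_inotify_limits
```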
Another piece of the puzzle is that I experienced this issue after upgrading to .NET Core 2.2 and above (including .NET Core 3.1). Prior to .NET Core 2.2, we had very few application changes and did not experience these intermittent issues at all. Is this an SDK problem of some kind?
This particular issue plagued multiple applications after upgrading, specifically by causing some of the tests to fail during the build. Unit tests would consistently run without issue and integration tests would fail as a result quite often. For the integration test suites where this was observed, a pattern of booting up the entire application in memory using the “TestServer” feature was used. Two key observations can be made based on this behavior:
Given that this issue is now not just an intermittent test issue but rather a PROD stability concern, it is suddenly a much bigger priority to resolve.
Within a few days, we were able to spot an example deployed in our DEV environment where our AWS ECS Container Orchestrator had started one instance but was consistently failing to start a second instance for high availability. The container orchestrator kept trying to place the container on the same host without enough file watchers to allow it to start. The container would subsequently die and another one would try again. This is what we refer to as a “flopping” service stuck in a death loop. The difference is that because the available file watchers varied per host this is very difficult to track down. We likely have many instances where the containers have failed to start and the orchestrator respawns another and it works just fine.
We could begin to scan the logs of our .NET Core services for the “inotify instances has been reached” error shown at the top of this post.
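A rough sketch of such a scan, assuming the service logs are available as plain text files (how you export them from your log platform is up to you). The function name count_inotify_errors is hypothetical; the search string is taken from the error message at the top of this post.

```shell
# Count log lines that contain the inotify limit error.
# Arguments: one or more plain-text log files.
count_inotify_errors() {
  # -F treats the pattern as a fixed string; -c prints the number of matching lines.
  cat -- "$@" | grep -c -F "inotify instances has been reached"
}

# Usage (the path is only an example):
# count_inotify_errors /var/log/demo-web/*.log
```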
After looking for commonalities among the affected services beyond the .NET Core upgrade, there was only one other pattern these applications shared that is associated with file system watchers: the loading of the “appsettings.json” file on application start, via the “AddJsonFile” configuration method. It has an overload with a “reloadOnChange” parameter, which reloads the application settings should the appsettings.json file be modified.
configuration = new ConfigurationBuilder()
    .SetBasePath(Directory.GetCurrentDirectory())
    .AddJsonFile("appsettings.json", optional: true, reloadOnChange: false);
The documentation is not great at telling you the default. But looking at the code, if you don’t use that overload, the default for AddJsonFile is “false”. BUT be careful: if you’re relying on the default host configuration for your app (for example, CreateDefaultBuilder), it sets reloadOnChange to “true”. Depending on whether you expect your appsettings.json settings to pick up updated values on change, the behavior may differ from what you expect.
In the ASP.NET Core world running inside a container, there are next to zero reasons I can think of for which I would want to enforce reloading of that configuration file. Once my container is built and pushed to its container repository with an immutable version, it does not change (only environment variables on deploy modify its behavior). In this straightforward scenario, there is very little reason to turn on “reloadOnChange”. Turning it off led to slightly fewer occurrences of this problem.
It looks like the configuration code propagated internally through copypasta from one app to the next without consideration of that particular flag.
The problem was fully resolved after reviewing a common internal library, used by all the services, that incorrectly loaded the appsettings.json file multiple times with reloadOnChange enabled.
Essentially “reloadOnChange” in the application was being set to “true” at 3 different points in the same application to the same file. After disabling this the problem was resolved in its entirety and has not been observed since during integration tests or during deployed runtime.
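One quick way to audit a codebase for this is to search the C# sources for places that turn the flag on. This is only a sketch: it catches the named-argument form used in the snippet above and will miss positional arguments or defaults applied by a host builder. The function name find_reload_on_change is hypothetical.

```shell
# List every C# source line under a directory that sets reloadOnChange to true.
# $1 is the directory to search; it defaults to the current directory.
find_reload_on_change() {
  src_dir="${1:-.}"
  # -r: recurse, -n: show line numbers, -E: extended regex,
  # --include: only look at *.cs files.
  grep -rnE --include='*.cs' 'reloadOnChange:[[:space:]]*true' "$src_dir"
}

# Usage from the repository root:
# find_reload_on_change src
```

Running the same search against shared internal libraries is worthwhile too, since in this case that is where the duplicates were hiding.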
I still have more questions than answers. Let me know if you have any answers.
The following was added as an update when the issue returned in a dramatic way in mid-May 2020.
Just when you think you have it solved, the same issue strikes again. The problem seems to have been alleviated for a bit, likely due to the drastic reduction of the file watchers in the AddJsonFile method. However, we began launching more .NET Core services into our AWS ECS cluster, and very quickly the issue not only came back but seemed to have gotten much worse. Instead of sporadic once-in-a-while failures, it was entirely blocking and failing many different unrelated deploys of any .NET Core service. At this point, I decided it best to pull in some more folks from our SRE team to help guide me a bit. Here is a breakdown of where it’s at:
Profiling and looking at the “inotify” usage in a single container shows .NET Core 3.1 uses about 15 on startup (that is with a default web app, no customizations, no app settings file). That is really funny, and I don’t understand why it needs it. Looking at .NET Core 2.1 for comparison actually showed around 40 “inotify” references on start. This invalidates the original assumption this is getting worse for us as a result of migrating to .NET Core 3. Instead, this is getting worse strictly due to the volume of new services sharing a cluster and ultimately a single host.
# get the inotify related references
apt-get install lsof
lsof | grep inotify | wc -l
// simplest ASP.NET Core configuration (ditching IIS and other defaults) still shows 15 watchers in the container
In this case, the problem is not a ulimit. Rather than the number of actual watchers, it’s a problem with the number of “instances”. I’m a Linux newbie, so this doesn’t completely make sense to me, but it at least gives me a reason why ulimit modifications are not any help.
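To see instances rather than watches, one approach is to count the file descriptors that point at anon_inode:inotify, since each such descriptor is one inotify instance held by that process. The sketch below does this per process. The function name count_inotify_instances is made up, and you will typically need root to read other users’ /proc entries.

```shell
# Print "PID COUNT" for every process holding at least one inotify instance.
# $1 is the proc root; it defaults to /proc.
count_inotify_instances() {
  proc_root="${1:-/proc}"
  for fd_dir in "$proc_root"/[0-9]*/fd; do
    [ -d "$fd_dir" ] || continue
    # Each fd symlink whose target is anon_inode:inotify is one instance.
    count=$(find "$fd_dir" -maxdepth 1 -lname 'anon_inode:inotify' 2>/dev/null | wc -l | tr -d ' ')
    if [ "$count" -gt 0 ]; then
      pid=$(basename "$(dirname "$fd_dir")")
      echo "$pid $count"
    fi
  done
}

# Usage on the host:
# sudo sh -c '. ./inotify.sh; count_inotify_instances'
```

Summing the counts for all processes that run as the same user gives the number to compare against the per-user instance limit.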
The short answer is YES. I never previously reviewed the best practice for specifying a user in your container, but in general, there are a lot of reasons to do this from a security standpoint (i.e. so my container does not run as root). Additionally, though, it would appear that the “inotify” limit we are hitting is not namespaced or grouped by the containers themselves, and so when we specify a user in the container it maps to a different limit / per user outside the container on the host kernel. This exact suggestion was provided in this post here: https://github.com/dotnet/aspnetcore/issues/7531:
It seems like our usage of the same user across all our pods in the cluster was creating this issue. I rewrote our dockerfiles to have their own unique users and our pods appear to be behaving better (ajamrozek).
I had previously overlooked it given the K8 reference, but also given the implication that somehow the limit is spread across the cluster, which is not the case (I misread that). This issue is only a result of the number of containers that are bin-packed onto a single host in the cluster. Our multi-stage Dockerfiles with this update now look like this:
COPY . ./
RUN dotnet restore
RUN dotnet publish src/Demo.Web -c Release -o /app/out
COPY --from=build ["/app/out", "./"]
# we must use a port above 1024 when using a non-root user
# using a non-root user is a best practice for security related execution.
RUN useradd --uid $(shuf -i 2000-65000 -n 1) app
CMD ["dotnet", "Demo.Web.dll"]
The key changes are near the end, in the second (runtime) stage: creating a new user named “app”. In order for us not to run into the same problem again, we need to ensure each container is created with a somewhat random UID. If all our containers were to switch to the same user, we would run into the same limit and the same problem.
Be careful to review internal practices to ensure that the range of UIDs you intend to provide does not conflict with a real potential user on the host, as the container user will then inherit all of that user’s permissions.
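One way to guard against such a collision is to check each candidate UID against a list of known users before using it. The sketch below draws from the same 2000-65000 range as the Dockerfile and rejects any UID found in a passwd-format file. The function name pick_free_uid is hypothetical, and checking against a copy of the host’s passwd file is an assumption about how you track host users.

```shell
# Pick a random UID in 2000-65000 that does not appear in a passwd-format file.
# $1 is the passwd file to check against; it defaults to /etc/passwd.
pick_free_uid() {
  passwd_file="${1:-/etc/passwd}"
  attempts=0
  # Give up after 100 draws rather than looping forever.
  while [ "$attempts" -lt 100 ]; do
    uid=$(shuf -i 2000-65000 -n 1)
    # Field 3 of each passwd entry is the numeric UID.
    if ! cut -d: -f3 "$passwd_file" | grep -qx "$uid"; then
      echo "$uid"
      return 0
    fi
    attempts=$((attempts + 1))
  done
  return 1
}

# Usage (the file name is only an example):
# useradd --uid "$(pick_free_uid host-passwd.txt)" app
```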
Great, now we are back up and running, zero failures… phew. BUT… still so many questions…
While in general we should have been using USER within our Dockerfiles in the first place, I believe that at its heart this is still just a workaround for a more concerning problem. There are lots of questions and associated behaviors to be aware of that will require further investigation: