Docker issues with our gateway

Any docker/containerd experts in the community?

We previously had issues with Mucklet accounts due to a gateway continuously being restarted. I resolved that by recreating the docker container.

Now, the same issue has occurred on the gateway for our Test server (which runs under the same docker daemon). Since the test server is not critical, I thought I’d keep it in this broken state for now while trying to find the underlying cause.

What I know:

  • Docker tries to restart the container every minute.
  • The container exits with code 137 (SIGKILL).
  • OOMKilled is false, so it is not due to Out Of Memory (see the inspect command below the logs).
  • When running journalctl -f -u docker I get the following logs every time the container is started:
    Apr 26 07:36:48 mucklet.com dockerd[24915]: time="2025-04-26T07:36:48.420583841+01:00" level=info msg="ignoring event" container=15ffd18c6fa69790585509ce740f5591c6dd4c6608f5661ff179d98e7b2c9adc module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
    Apr 26 07:36:49 mucklet.com dockerd[24915]: time="2025-04-26T07:36:49.233116221+01:00" level=warning msg="failed to delete container from containerd" container=15ffd18c6fa69790585509ce740f5591c6dd4c6608f5661ff179d98e7b2c9adc error="task must be stopped before deletion: running: failed precondition"
    Apr 26 07:36:51 mucklet.com dockerd[24915]: time="2025-04-26T07:36:51.234370764+01:00" level=error msg="stream copy error: read /proc/self/fd/49: file already closed"
    Apr 26 07:36:51 mucklet.com dockerd[24915]: time="2025-04-26T07:36:51.234431294+01:00" level=error msg="stream copy error: read /proc/self/fd/135: file already closed"
    Apr 26 07:36:51 mucklet.com dockerd[24915]: time="2025-04-26T07:36:51.236881572+01:00" level=error msg="15ffd18c6fa69790585509ce740f5591c6dd4c6608f5661ff179d98e7b2c9adc cleanup: failed to delete container from containerd: cannot delete running task 15ffd18c6fa69790585509ce740f5591c6dd4c6608f5661ff179d98e7b2c9adc: failed precondition"
    
  • The container logs (docker logs <container>) show that it starts without issues:
    realm_test-resgate-1  | 2025/04/26 06:36:49.222363 [INF] Starting resgate version 1.8.0
    realm_test-resgate-1  | 2025/04/26 06:36:49.222576 [INF] Connecting to NATS at nats://nats:4222
    realm_test-resgate-1  | 2025/04/26 06:36:49.231129 [INF] Listening on http://0.0.0.0:8080
    realm_test-resgate-1  | 2025/04/26 06:36:49.231185 [INF] Server ready
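
For reference, those state fields come straight from the container state; something like this (using the container name from the logs above) prints the exit code, the OOM flag, and when it last stopped:

docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}} {{.State.FinishedAt}}' realm_test-resgate-1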
    

Any suggestions?


I am certainly not an ‘expert’, but I have been using Docker and Linux for a while…

Reading through some things, I see it’s recommended to run docker stats to monitor resource usage. It could be that instead of a memory issue, it is a high rate of IO that is getting it killed off. (Though for the test server, this is unlikely.) But the logs indicate that Docker is trying to delete the container and failing because some of the file descriptors it is using are already closed. Those closed descriptors could be related to what’s killing the container. Can you check back in journalctl to the few minutes prior for other messages, to see if those file descriptors show up?
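
For example (just a generic sketch, not specific to your setup), a one-shot snapshot of memory, network/block IO, and process counts per container would be something like:

docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}\t{{.BlockIO}}\t{{.PIDs}}"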

I am also seeing references to systems possibly limiting the number of processes, which can cause a 137 kill… The stats command should hopefully help you understand the utilization and either identify or eliminate this as a possibility.
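
And to rule the process-limit angle out directly, something like the following could work (a sketch; an unset per-container limit shows up as 0 or empty, and the TasksMax property assumes a reasonably recent systemd):

docker inspect --format '{{.HostConfig.PidsLimit}}' <container>
systemctl show docker -p TasksMax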

Thanks Vernon!

This, in combination with help from @farcaller, has taken me closer to the issue.

From journalctl I can see errors that seem to point to some issue with IPv6:

Apr 26 14:35:26 mucklet.com kernel: overlayfs: upperdir is in-use by another mount, accessing files from both mounts will result in undefined behavior.
Apr 26 14:35:26 mucklet.com kernel: overlayfs: workdir is in-use by another mount, accessing files from both mounts will result in undefined behavior.
Apr 26 14:35:26 mucklet.com kernel: br-13ffa2d6c79a: port 6(vethe85bfca) entered blocking state
Apr 26 14:35:26 mucklet.com kernel: br-13ffa2d6c79a: port 6(vethe85bfca) entered disabled state
Apr 26 14:35:26 mucklet.com kernel: device vethe85bfca entered promiscuous mode
Apr 26 14:35:26 mucklet.com kernel: IPv6: ADDRCONF(NETDEV_UP): vethe85bfca: link is not ready
Apr 26 14:35:26 mucklet.com systemd-udevd[28590]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Apr 26 14:35:26 mucklet.com systemd-udevd[28589]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Apr 26 14:35:26 mucklet.com systemd-udevd[28590]: Could not generate persistent MAC address for vethe85bfca: No such file or directory
Apr 26 14:35:26 mucklet.com systemd-udevd[28589]: Could not generate persistent MAC address for veth3aee2f7: No such file or directory
Apr 26 14:35:26 mucklet.com systemd-udevd[28589]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Apr 26 14:35:26 mucklet.com systemd-udevd[28589]: Could not generate persistent MAC address for veth3c6b4a2: No such file or directory
Apr 26 14:35:26 mucklet.com kernel: br-92386a63d38e: port 17(vethfc7046b) entered blocking state
Apr 26 14:35:26 mucklet.com kernel: br-92386a63d38e: port 17(vethfc7046b) entered disabled state
Apr 26 14:35:26 mucklet.com kernel: device vethfc7046b entered promiscuous mode
Apr 26 14:35:26 mucklet.com kernel: IPv6: ADDRCONF(NETDEV_UP): vethfc7046b: link is not ready
Apr 26 14:35:26 mucklet.com kernel: br-92386a63d38e: port 17(vethfc7046b) entered blocking state
Apr 26 14:35:26 mucklet.com kernel: br-92386a63d38e: port 17(vethfc7046b) entered forwarding state
Apr 26 14:35:26 mucklet.com systemd-udevd[28601]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Apr 26 14:35:26 mucklet.com systemd-udevd[28601]: Could not generate persistent MAC address for vethfc7046b: No such file or directory
Apr 26 14:35:26 mucklet.com containerd[16409]: time="2025-04-26T14:35:26.472533507+01:00" level=info msg="shim disconnected" id=15ffd18c6fa69790585509ce740f5591c6dd4c6608f5661ff179d98e7b2c9adc
Apr 26 14:35:26 mucklet.com containerd[16409]: time="2025-04-26T14:35:26.472668542+01:00" level=warning msg="cleaning up after shim disconnected" id=15ffd18c6fa69790585509ce740f5591c6dd4c6608f5661ff179d98e7b2c9adc namespace=moby
Apr 26 14:35:26 mucklet.com containerd[16409]: time="2025-04-26T14:35:26.472690982+01:00" level=info msg="cleaning up dead shim"
Apr 26 14:35:26 mucklet.com dockerd[24915]: time="2025-04-26T14:35:26.473577970+01:00" level=info msg="ignoring event" container=15ffd18c6fa69790585509ce740f5591c6dd4c6608f5661ff179d98e7b2c9adc module=libcontainerd namespace=moby topic=/
Apr 26 14:35:26 mucklet.com containerd[16409]: time="2025-04-26T14:35:26.504725708+01:00" level=warning msg="cleanup warnings time=\"2025-04-26T14:35:26+01:00\" level=info msg=\"starting signal loop\" namespace=moby pid=28625 runtime=io.
Apr 26 14:35:26 mucklet.com containerd[16409]: time="2025-04-26T14:35:26.548274634+01:00" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1
Apr 26 14:35:26 mucklet.com containerd[16409]: time="2025-04-26T14:35:26.548495476+01:00" level=info msg="loading plugin \"io.containerd.internal.v1.shutdown\"..." runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
Apr 26 14:35:26 mucklet.com containerd[16409]: time="2025-04-26T14:35:26.548542754+01:00" level=info msg="loading plugin \"io.containerd.ttrpc.v1.task\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
Apr 26 14:35:26 mucklet.com containerd[16409]: time="2025-04-26T14:35:26.549328937+01:00" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/15ffd18c6fa69790585509ce740f5591c6dd4c
Apr 26 14:35:26 mucklet.com kernel: eth0: renamed from veth3c6b4a2
Apr 26 14:35:26 mucklet.com kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethfc7046b: link becomes ready
Apr 26 14:35:27 mucklet.com kernel: eth1: renamed from veth3aee2f7
Apr 26 14:35:27 mucklet.com kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethe85bfca: link becomes ready
Apr 26 14:35:27 mucklet.com kernel: br-13ffa2d6c79a: port 6(vethe85bfca) entered blocking state
Apr 26 14:35:27 mucklet.com kernel: br-13ffa2d6c79a: port 6(vethe85bfca) entered forwarding state
Apr 26 14:35:27 mucklet.com dockerd[24915]: time="2025-04-26T14:35:27.185357897+01:00" level=warning msg="failed to delete container from containerd" container=15ffd18c6fa69790585509ce740f5591c6dd4c6608f5661ff179d98e7b2c9adc error="task
Apr 26 14:35:29 mucklet.com dockerd[24915]: time="2025-04-26T14:35:29.186507978+01:00" level=error msg="stream copy error: read /proc/self/fd/136: file already closed"
Apr 26 14:35:29 mucklet.com dockerd[24915]: time="2025-04-26T14:35:29.186576922+01:00" level=error msg="stream copy error: read /proc/self/fd/153: file already closed"
Apr 26 14:35:29 mucklet.com dockerd[24915]: time="2025-04-26T14:35:29.189276685+01:00" level=error msg="15ffd18c6fa69790585509ce740f5591c6dd4c6608f5661ff179d98e7b2c9adc cleanup: failed to delete container from containerd: cannot delete r
Apr 26 14:35:29 mucklet.com kernel: br-92386a63d38e: port 17(vethfc7046b) entered disabled state
Apr 26 14:35:29 mucklet.com kernel: veth3c6b4a2: renamed from eth0
Apr 26 14:35:29 mucklet.com systemd-udevd[28735]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Apr 26 14:35:29 mucklet.com kernel: br-92386a63d38e: port 17(vethfc7046b) entered disabled state
Apr 26 14:35:29 mucklet.com kernel: device vethfc7046b left promiscuous mode
Apr 26 14:35:29 mucklet.com kernel: br-92386a63d38e: port 17(vethfc7046b) entered disabled state
Apr 26 14:35:29 mucklet.com systemd-udevd[28735]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Apr 26 14:35:29 mucklet.com systemd-udevd[28735]: link_config: could not get ethtool features for veth3c6b4a2
Apr 26 14:35:29 mucklet.com systemd-udevd[28735]: Could not set offload features of veth3c6b4a2: No such device
Apr 26 14:35:29 mucklet.com kernel: br-13ffa2d6c79a: port 6(vethe85bfca) entered disabled state
Apr 26 14:35:29 mucklet.com kernel: veth3aee2f7: renamed from eth1
Apr 26 14:35:29 mucklet.com systemd-udevd[28749]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Apr 26 14:35:29 mucklet.com kernel: br-13ffa2d6c79a: port 6(vethe85bfca) entered disabled state
Apr 26 14:35:29 mucklet.com kernel: device vethe85bfca left promiscuous mode
Apr 26 14:35:29 mucklet.com kernel: br-13ffa2d6c79a: port 6(vethe85bfca) entered disabled state
Apr 26 14:35:29 mucklet.com systemd-udevd[28749]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Apr 26 14:35:29 mucklet.com systemd-udevd[28749]: link_config: could not get ethtool features for veth3aee2f7
Apr 26 14:35:29 mucklet.com systemd-udevd[28749]: Could not set offload features of veth3aee2f7: No such device

(hehe, promiscuous)

But in that case, it is most likely a server configuration thing (rather than my code), which might be resolved once we move to a dedicated server rather than a VPS.
So, perhaps I won’t even have to figure it out, only recreate the container when needed?

Looking at the log, I’m seeing the very end of the container shutdown and the beginning of the restart. It might be good to look back a few seconds or a minute, with the output filtered to containerd and dockerd, to see what might have caused it. You’re getting warnings and errors about the container cleanup, but nothing about the beginning of the container shutdown.
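
Something along these lines would narrow that down (a sketch; adjust the time window to just before the earliest failure you can find):

journalctl -u docker -u containerd --since "2025-04-26 14:30:00" --until "2025-04-26 14:40:00" --no-pager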

This kind of thing is much easier to troubleshoot with a shell (or on a call with a screenshare); would be happy to pair up and look closer if that’s useful.

My brain is fried right now, so apologies if the thoughts below are disorganized; I’d be happy to talk more about this when I’m fresher.

Are those logs from the first time the issue occurred (the first restart before the container started flapping)?

Some of this looks like it could be containerd in a weird state. Is there any sign that the process might have died or restarted before the symptoms started? (I’ve read some old issues where containerd dying or being killed resulted in screwiness.)
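
One quick way to sanity-check that (a sketch; the NRestarts property needs a reasonably recent systemd) is to ask systemd when the daemons last (re)started and how often:

systemctl show docker containerd -p NRestarts -p ActiveEnterTimestamp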

Could you share some information about the VPS environment, such as:

What OS the VM is running:
cat /etc/*release

Docker version:
docker version

It would be useful to identify the timestamp of when the container first started flapping, and then look for any other logs prior to that timestamp. (Even if the timestamp itself is not useful to us, knowing the first occurrence will help you with log-diving.)
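
Docker keeps a bit of this in the container state itself (a sketch; RestartCount only counts restart-policy restarts since the container was created, and the timestamps are for the most recent start/stop):

docker inspect --format 'restarts={{.RestartCount}} started={{.State.StartedAt}} finished={{.State.FinishedAt}}' <container>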

Also, was any maintenance performed prior to the symptoms? (Whether human interaction, or any crons or automated upgrades?)

If you have any monitoring or dashboards around the VPS, that may be helpful to review as well. (Or indication from your hosting provider that there were issues with your hypervisor, VM migrated to another node, etc.)

I haven’t looked at kernel logs (during container start/stop) recently enough to know whether the network-device-related messages are abnormal. The network devices referenced there are likely the ones brought up and down for the container’s namespace and the bridge to the host network.

May also be worth just looking at the state of the host’s network interfaces, like ifconfig -a or equivalent (depending on the environment).

Not sure if this could be an issue with the host’s networking, the network device management, or if that’s a red herring.

If you want, you could also enable debug logs on the docker daemon and see if you get anything more interesting.
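
For reference, the usual way to do that (a sketch; worth being careful since this daemon also runs the production containers) is to set the debug flag in /etc/docker/daemon.json and reload the daemon:

{
  "debug": true
}

sudo systemctl reload docker

(reload sends SIGHUP, which dockerd uses to re-read a few options including debug, so it should not touch running containers; a full restart would, unless live-restore is enabled.)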

Anyway… A really huge journalctl log, going back to before the symptoms started, would remove any doubt about whether we’re looking at incomplete snippets :wink:

Edit:

Now, the same issue has occurred on the gateway for our Test server (which runs under the same docker daemon)

Oh, are you saying that it’s the same VPS/infrastructure, and only the container for the test server is affected?


Thanks for the help, Toma! (and welcome to the forum :slight_smile: )

I will dig deeper into your suggestions and the logs later when I have more time.
But for now, just a quick answer to your last question:

Yes. Same. And it is just one container of the test server - the resgate API gateway - that is acting like this. The test server also consists of 15 other services/containers, handling separate things such as files, in-game mail, reports, etc.

Resgate, the API gateway (Docker hub: resgate), is a simple single-process service that runs a web server and bridges communication between services and clients.

And it seems to be killed when it tries to start up the http server. I am not 100% sure, but when it logs its last line:

[INF] Server ready

… then it is just about to start listening on the http port for requests. So, I think that is when it is killed.
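
To pin down the moment, I suppose I could have watched the docker event stream while it was flapping; the die event includes the exit code. Something like this (using the container name from the logs):

docker events --filter container=realm_test-resgate-1 --filter event=die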

However, when recreating the container:

docker compose up -d --force-recreate resgate

… then the issue is gone. It is now running as usual.
So, as @farcaller suspected, it seems to be a network issue.

Thanks for the added info, I agree that @farcaller’s thought (about network-related issues) might track.

I am assuming that you have a restart policy configured for this container in your docker-compose.yml; do you have a health check as well?
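
For context, I mean something along these lines (purely a hypothetical sketch, not your actual config; the image name/tag and port are guessed from the logs, and the healthcheck assumes the image ships wget, which minimal images often don’t):

services:
  resgate:
    image: resgate/resgate:1.8.0
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:8080/"]
      interval: 30s
      timeout: 5s
      retries: 3

(Worth noting that a failing healthcheck by itself only marks the container unhealthy; the restart policy reacts to the container exiting, so the two do slightly different jobs.)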

It’s probably safe to assume that the container restarting is a result of the restart policy (reacting to the container exiting and bringing it back up), and it seems that this type of restart does not resolve the underlying problem (hence the restart loop).

(Forgive the potentially obvious statements, am rubber-ducky-ing.)

You note that recreating the container with --force-recreate has resolved the issue in the past, so we could assume that (part of) the problem space lies somewhere in the contrast between:

  • when the container comes up automatically (via restart policy)
  • when --force-recreate is used to restart the container

--force-recreate deletes and re-creates more resources than the restart policy does. So if our problem is with the docker/container network, virtual network devices, iptables rules, etc. getting into a bad state, perhaps the forced recreate is replacing these, while the restart policy is re-using them.
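
Concretely, the contrast would be roughly the difference between these two (a sketch; docker restart stands in for what the restart policy does, reusing the same container, while the second deletes the container and creates a fresh one with a new network endpoint on the same compose networks):

docker restart realm_test-resgate-1
docker compose up -d --force-recreate resgate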

However, this doesn’t explain how the network resources (specific to only one container) would wind up in a broken state. Maybe this could stem from an issue with the VPS/host networking or config, but if the host manages both the Test and Production resgate containers and they are not both breaking at the same time, it’s gotta be a little more nuanced.

Other tidbits of info that might be helpful for understanding:

  • docker-compose.yml configuration (restart policy, network-related config)
  • docker compose version (or docker-compose --version, depending on which you use)

And info about the docker networks:

docker network ls
docker network inspect <network relevant to resgate>
(It may also be useful to compare with the production resgate’s network.)

And then info about the network interfaces (ifconfig -a, ip addr show, or ip link show)

It would be interesting to see what appears in the docker network details versus the host’s existing (virtual) network interfaces. If we see a mismatch there, that could support the theory that the network config ends up in a bad state (no longer reflecting the host’s interfaces), which the restart policy fails to recover from.
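
A compact way to put the two views side by side might be (a sketch; the template just lists each attached container and its IP, and -br gives brief per-interface output):

docker network inspect -f '{{.Name}}: {{range .Containers}}{{.Name}}={{.IPv4Address}} {{end}}' <network relevant to resgate>
ip -br link show type veth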

docker inspect <resgate container name> could be useful too (and contains other network-related output).

If your VPS is from a fairly standard cloud provider (and a VM with a similar image/OS/config could be stamped out), I’d be willing to spin something up (as similar as possible) and try to reproduce the issue (with just resgate, NATS, etc. in a similar docker-compose). Is it easy to stamp out something similar (like infra/config-as-code), or is this like a “pet” VM (lots of manual config changes over time)?

(Since Test is on shared/same infra as Prod, you’re more limited with how much you can do to deliberately try to break or change things for chasing down the root cause… am sure I’m preaching to the choir.)

Edit:

And yeah, I don’t think this is a problem with resgate’s code. The container logs look healthy to me. I think it’s fair (from a how-much-effort standpoint) to shrug and see if the issue is still a problem on a different infra, later… Especially if the issue rarely affects production and you’re okay with the break-fix. :slight_smile: