Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relax error on network annotation read failure #3771

Merged
merged 1 commit into from
Jan 27, 2025

Conversation

apostasie
Copy link
Contributor

Fix #3765

Killing or stopping a container that does not have a nerdctl network annotation is currently a hard fail.

Suggesting we change that to a soft error, warning the user that they are (very likely) doing something that will lead to trouble (eg: managing containers with different tools).

@mathias-ioki I you can test this patch and confirm it fixes your issue, that would be really helpful (I am reading tea-leaves here).

@apostasie
Copy link
Contributor Author

Windows failure likely unrelated.

@mathias-ioki
Copy link

Hey @apostasie,

thank you again for the really quick patch. You definitely found the right place - so I can confirm, that the patch is working and the signal is send to the container.

$ ./nerdctl kill -s HUP f87d93283532f
WARN[0000] Unable to read network annotation: this container was probably not started with nerdctl.No networking cleanup will be performed, which may likely result in a broken state for the other systems you used to manage these containers.Mixing completely different stacks to manage containers lifecycle is not recommended.  error="unexpected end of JSON input"
f87d93283532f6ba2eddf7def116166b75a49abd896a20c6a6704c04dd6f1b74

I get the point of the warning message, but from what I understand, I don't fully agree to it. In the README of this repository, debugging Kubernetes is a described use case of nerdctl. So I wouldn't expect a warning, if I actually do this. Sending a kill signal doesn't really change anything to the container deployment itself. Especially as this was working fine with releases <2.0.0.

@apostasie
Copy link
Contributor Author

apostasie commented Dec 17, 2024

Hey @apostasie,

thank you again for the really quick patch. You definitely found the right place - so I can confirm, that the patch is working and the signal is send to the container.

$ ./nerdctl kill -s HUP f87d93283532f
WARN[0000] Unable to read network annotation: this container was probably not started with nerdctl.No networking cleanup will be performed, which may likely result in a broken state for the other systems you used to manage these containers.Mixing completely different stacks to manage containers lifecycle is not recommended.  error="unexpected end of JSON input"
f87d93283532f6ba2eddf7def116166b75a49abd896a20c6a6704c04dd6f1b74

I get the point of the warning message, but from what I understand, I don't fully agree to it. In the README of this repository, debugging Kubernetes is a described use case of nerdctl. So I wouldn't expect a warning, if I actually do this. Sending a kill signal doesn't really change anything to the container deployment itself.
Especially as this was working fine with releases <2.0.0.

TBH, I do not quite care what we say here :-).
My only gripe is that I want to make sure newbies messing up with a mix of docker / cri / nerdctl get a heads-up that things may not end-up consistent (eg: in scenarios where containers may have been started with --rm for example, which means stopping the container would destroy it, possibly leaving behind un-collected networking conf).

I'll leave it to maintainers to decide what is the appropriate wording - if any.

Cheers.

@AkihiroSuda @djdongjin @fahedouch at your convenience.

@apostasie
Copy link
Contributor Author

apostasie commented Dec 17, 2024

@jandubois moving the nerdctl part of the conversation here.

When I run it from a terminal, the stop seems to work, but the warning is definitely concerning, especially the "unexpected end of JSON input" part:

$ nerdctl -n k8s.io stop 08f71719c7c6
time="2024-12-17T13:16:46-08:00" level=warning msg="Unable to read network annotation: this container was probably not started with nerdctl.No networking cleanup will be performed, which may likely result in a broken state for the other systems you used to manage these containers.Mixing completely different stacks to manage containers lifecycle is not recommended." error="unexpected end of JSON input"
08f71719c7c6

We can / should definitely come up with a better, less alarming message.

What about: INFO: nerdctl was unable to read networking information for this container (err was: XXXX), which is likely managed by another tool

@apostasie
Copy link
Contributor Author

@apostasie I think it would be best if nerdctl could take an extra option on stop, kill, rm etc commands that says: "I know you did not create these containers, but please clean them up for me anyways". That way you can get rid of the ugly warning as well, or turn them into real errors when the user didn't provide the --i-know-what-i-am-asking option.

I have no serious suggestions for the option name, unfortunately. --allow-foreign-container or something like that?

I am not thrilled about yet another flag.
Would massaging the text and downgrading it to an info make things better for you?

On a further note (and this should be discussed in the nerdctl repo), but should nerdctl ps list containers that is cannot manipulate? Or should they be filtered out too, unless you specify the --allow-foreign-containers option? Otherwise it seems a bit inconsistent.

Interesting question (scope would be much larger than this specific issue).
I am more in favor of making sure nerdctl can manipulate other containers - though we cannot provide guarantees that other tools would be happy about the result of destructive operations.

But once stopped, I still cannot remove the container:

$ nerdctl -n k8s.io rm 08f71719c7c6
FATA[0000] 1 errors:
failed to load container networking options from specs: unexpected end of JSON input
Error: exit status 1

$ nerdctl -n k8s.io rm -f 08f71719c7c6
ERRO[0000] 1 errors:
failed to load container networking options from specs: unexpected end of JSON input

Definitely the same "issue" - for another command (rm).

I'll look into it and augment this PR.

@apostasie apostasie marked this pull request as draft December 17, 2024 22:38
@apostasie
Copy link
Contributor Author

@jandubois latest push does:

  • soften the wording of the message
  • downgrade from warn to info
  • address the same class of issue for rm (note: rm was not working either in v1.7 - for windows)

I just tested it against a container from which I removed the networks label, and it worked, albeit with some additional warnings.

I am not 100% sure what would happen in the case of a container fully created outside of nerdctl, so, let me know if you still face (other) issues on remove.

@apostasie apostasie marked this pull request as ready for review December 18, 2024 03:05
@apostasie
Copy link
Contributor Author

@AkihiroSuda could you poke the CI?
Failures are the sempiternal "layer not found"

@jandubois
Copy link
Contributor

jandubois commented Dec 18, 2024

@apostasie Thank you for continuing to work on this. I would still argue that any message that includes the string error="unexpected end of JSON input" is too alarming, even if you prefix it with a nice friendly INFO:

$ nerdctl -n k8s.io stop ca97b953b4ea
INFO[0000] unable to retrieve networking information for that container, which is likely managed by another tool  error="unexpected end of JSON input"
ca97b953b4ea

I think the "unexpected end of JSON input" is an internal error that has been handled by generating a warning, so should not be included in the warning message at all.

As for the GUI displaying an error dialog, I guess it should check for the nerdctl exit code, and just discard the STDERR output if the exit code is still 0.

Deleting the container works, but is still too noisy:

$ nerdctl -n k8s.io rm ca97b953b4ea
INFO[0000] unable to retrieve networking information for that container, which is likely managed by another tool: "ca97b953b4ea607ced637d55bc590fbc7daebc8e17dffd6e03313f57b2c9ba43"  error="unexpected end of JSON input"
WARN[0000] failed to remove hosts file for container "ca97b953b4ea607ced637d55bc590fbc7daebc8e17dffd6e03313f57b2c9ba43"  error="hosts-store error\nnot found\nstat /var/lib/nerdctl/dbb19c5e/etchosts/k8s.io/ca97b953b4ea607ced637d55bc590fbc7daebc8e17dffd6e03313f57b2c9ba43: no such file or directory"
ca97b953b4ea

The INFO message has again the scary bit about the invalid JSON input, and I think the WARN has too much detail for the user. Once you have acknowledged that the container might be managed by a different tool, it is not really surprising that the hosts file isn't there.

Anyways, the exit code is still 0, so the GUI can avoid showing another ugly popup, and I guess for the CLI this is ok.

I do not want to see the error="unexpected end of JSON input" part though, as that sounds like a catastrophic failure, and not something that is somewhat expected, but just unusual.

@jandubois
Copy link
Contributor

Just a thought, and I'm not even sure if I support it myself, but you could add a special rule for the k8s.io namespace to suppress the additional info/warning messages because you pretty much know that the containers are managed by Kubernetes.

It feels a little bit too magical to have a special case hard-coded in nerdctl, but it does make sense from a usability point of view. Every user who specifies -n k8s.io should already be aware that the containers are managed by kubelet and were not created by nerdctl.

@jandubois
Copy link
Contributor

jandubois commented Dec 18, 2024

On a further note (…), but should nerdctl ps list containers that it cannot manipulate? Or should they be filtered out too, unless you specify the --allow-foreign-containers option? Otherwise it seems a bit inconsistent.

Just to clarify: I was not actually asking that nerdctl shouldn't list those containers. I was just making the argument that it is internally inconsistent to list containers with ps but refuse to work with them in the stop, kill, rm etc commands.

Also: all of this is a regression from nerdctl 1.x, which was dealing with Kubernetes containers just fine (stopping, deleting, etc.), and without any additional warnings too.

@AkihiroSuda
Copy link
Member

By default everything should be just shown, but there can be a flag to change the behavior

Similar discussion:

@apostasie
Copy link
Contributor Author

@apostasie Thank you for continuing to work on this. I would still argue that any message that includes the string error="unexpected end of JSON input" is too alarming, even if you prefix it with a nice friendly INFO:

$ nerdctl -n k8s.io stop ca97b953b4ea
INFO[0000] unable to retrieve networking information for that container, which is likely managed by another tool  error="unexpected end of JSON input"
ca97b953b4ea

I think the "unexpected end of JSON input" is an internal error that has been handled by generating a warning, so should not be included in the warning message at all.

I disagree.
It is important context information about the reason why network cleanup could not occur, and is clearly presented as such.

As for the GUI displaying an error dialog, I guess it should check for the nerdctl exit code, and just discard the STDERR output if the exit code is still 0.

Again not familiar with Rancher, but yeah.

Deleting the container works, but is still too noisy:

$ nerdctl -n k8s.io rm ca97b953b4ea
INFO[0000] unable to retrieve networking information for that container, which is likely managed by another tool: "ca97b953b4ea607ced637d55bc590fbc7daebc8e17dffd6e03313f57b2c9ba43"  error="unexpected end of JSON input"
WARN[0000] failed to remove hosts file for container "ca97b953b4ea607ced637d55bc590fbc7daebc8e17dffd6e03313f57b2c9ba43"  error="hosts-store error\nnot found\nstat /var/lib/nerdctl/dbb19c5e/etchosts/k8s.io/ca97b953b4ea607ced637d55bc590fbc7daebc8e17dffd6e03313f57b2c9ba43: no such file or directory"
ca97b953b4ea

The INFO message has again the scary bit about the invalid JSON input, and I think the WARN has too much detail for the user. Once you have acknowledged that the container might be managed by a different tool, it is not really surprising that the hosts file isn't there.

Anyways, the exit code is still 0, so the GUI can avoid showing another ugly popup, and I guess for the CLI this is ok.

I do not want to see the error="unexpected end of JSON input" part though, as that sounds like a catastrophic failure, and not something that is somewhat expected, but just unusual.

We could overall do a better job in many places at providing better, more useful, accurate feedback messages to the user.
This is certainly work in progress.

In that specific case though, I do disagree. This is a command-line tool, for nerds (it is in the name). Getting accurate debugging information of what the error condition is (in that case, a golang library error from the json decoder) is crucial, and I think censoring it because "it could scare some users" would be the wrong decision.

@apostasie
Copy link
Contributor Author

On a further note (…), but should nerdctl ps list containers that it cannot manipulate? Or should they be filtered out too, unless you specify the --allow-foreign-containers option? Otherwise it seems a bit inconsistent.

Just to clarify: I was not actually asking that nerdctl shouldn't list those containers. I was just making the argument that it is internally inconsistent to list containers with ps but refuse to work with them in the stop, kill, rm etc commands.

That was never the intention.

Also: all of this is a regression from nerdctl 1.x

Agreed, and this is what we should fix here.

without any additional warnings too.

That is certainly not true:
https://github.com/containerd/nerdctl/blob/release/1.7/pkg/cmd/container/remove.go#L195

And I don't think this is a reasonable expectation.

@jandubois
Copy link
Contributor

I disagree.

That's fine. To be honest, I don't care too much, as I think it is very much an edge case, and you should not touch containers managed by kubelet, except through the kube API. The only valid use-case is sending a signal to a specific container.

So getting stop and kill working, even with an error message, was the important bit, and you have already done that, so thank you!

I'll leave it to the nerdctl maintainers if/what you want to do about the rest, if anything.

@jandubois
Copy link
Contributor

without any additional warnings too.

That is certainly not true: https://github.com/containerd/nerdctl/blob/release/1.7/pkg/cmd/container/remove.go#L195

And I don't think this is a reasonable expectation.

I'm sorry, I'm not making this up:

$ nerdctl --version
nerdctl version 1.7.7

$ kubectl create deployment nginx-test --image=nginx:stable
deployment.apps/nginx-test created

$ kubectl create service clusterip nginx-test --tcp=5000:80
service/nginx-test created

$ nerdctl -n k8s.io ps -a | grep nginx
3ce45b96147a    docker.io/library/nginx:stable                         "/docker-entrypoint.…"    3 seconds ago     Up                  k8s://default/nginx-test-5c6c4f85c7-6srbc/nginx
867c6b30d5fa    docker.io/rancher/mirrored-pause:3.6                   "/pause"                  24 seconds ago    Up                  k8s://default/nginx-test-5c6c4f85c7-6srbc

$ nerdctl -n k8s.io stop 3ce45b96147a
3ce45b96147a

$ nerdctl -n k8s.io rm 3ce45b96147a
3ce45b96147a

So this may not be a reasonable expectation, but it is how it used to work.

@apostasie
Copy link
Contributor Author

without any additional warnings too.

That is certainly not true: https://github.com/containerd/nerdctl/blob/release/1.7/pkg/cmd/container/remove.go#L195
And I don't think this is a reasonable expectation.

I'm sorry, I'm not making this up:

$ nerdctl --version
nerdctl version 1.7.7

$ kubectl create deployment nginx-test --image=nginx:stable
deployment.apps/nginx-test created

$ kubectl create service clusterip nginx-test --tcp=5000:80
service/nginx-test created

$ nerdctl -n k8s.io ps -a | grep nginx
3ce45b96147a    docker.io/library/nginx:stable                         "/docker-entrypoint.…"    3 seconds ago     Up                  k8s://default/nginx-test-5c6c4f85c7-6srbc/nginx
867c6b30d5fa    docker.io/rancher/mirrored-pause:3.6                   "/pause"                  24 seconds ago    Up                  k8s://default/nginx-test-5c6c4f85c7-6srbc

$ nerdctl -n k8s.io stop 3ce45b96147a
3ce45b96147a

$ nerdctl -n k8s.io rm 3ce45b96147a
3ce45b96147a

So this may not be a reasonable expectation, but it is how it used to work.

Definitely not on windows

@apostasie
Copy link
Contributor Author

apostasie commented Dec 18, 2024

So this may not be a reasonable expectation, but it is how it used to work.

Definitely not on windows

Either way - like you - I don't care much about that part either tbh (with or without warning).

@jandubois
Copy link
Contributor

Definitely not on windows

That may be the case; I assume this is about managing Windows containers then, which seems even more of an edge case than dealing with Kubernetes containers. I've been exclusively talking about managing Linux containers.

Anyways, I just responded because you accused me of lying about this ("That is certainly not true"), and I did not. It worked correctly and cleanly before, and now it doesn't. Maybe it should have always displayed a warning... 🤷

But as I wrote before, while I disagree that the content of the error messages is appropriate, I'll leave this to the maintainers.

@apostasie
Copy link
Contributor Author

Definitely not on windows

That may be the case; I assume this is about managing Windows containers then, which seems even more of an edge case than dealing with Kubernetes containers. I've been exclusively talking about managing Linux containers.

Anyways, I just responded because you accused me of lying about this

Pointing out a mistake - not accusing anyone of anything.

Peace.

@jandubois
Copy link
Contributor

jandubois commented Dec 18, 2024

Getting accurate debugging information of what the error condition is (in that case, a golang library error from the json decoder) is crucial, and I think censoring it because "it could scare some users" would be the wrong decision.

Ok, I promised to shut up now, but I still wanted to explain why I think including error="unexpected end of JSON input" in the error message is not useful:

It is an internal detail that only a nerdctl developer will understand. It does not help anybody who doesn't know how nerdctl is implemented, and it doesn't help them to fix the command to avoid the error. Actually, there is nothing you can do to avoid it beyond redirecting all STDERR output to /dev/null.

The important part of the message is to let the user know they are manipulating a container that nerdctl has not created itself. All the internal errors are meaningless; with the same argument you could also dump a full stacktrace to STDERR at this time.

By including irrelevant information in a warning or error you just risk confusing the user about what is wrong, and what they can do to fix (how is the user supposed to fix the JSON input?).

Information that is only useful to the developer should be at the DEBUG level. Throwing it out at the INFO/WARN level just compromises the user experience for maintainer convenience.

It is not about "scaring" users, but "confusing" them.

Finally, you simply can't say it is just a warning and include error="unexpected end of JSON input" as part of the message, because it implies a full failure to anyone who is not familiar with the implementation.

Sorry for the rambling; just wanted to explain my point of view; not going to argue any further.

@apostasie
Copy link
Contributor Author

Hey @jandubois

Thank you.

Truly appreciate a different opinion here. This is what I like about open-source.
These are interesting and valid points.

However - to explain a bit better why I am generally not inclined to prioritize UX over debugability in the current state of affairs: nerdctl had/has serious issues with debugability overall that has significantly hurt quality in the past.

A lot - most? - of the reports we get from users (or from the CI) are absolutely cryptic, and require black-belt-tea-leaves-reading abilities to even start making sense of (#2676 is one of many examples)
Most users will not even re-engage when asked questions - so, asking them to re-run with debug is usually a fools errand.
Also, transient / flaky errors have been a plague here, and the users (these who would re-engage) might or might not be able to reproduce the actual issue with debug on.
That makes contributors and maintainers alike a lot more likely to just ignore and give up on reports (or CI failures) and let quality slide down.

Information that is only useful to the developer should be at the DEBUG level.

Yes, in a perfect world, with a more mature / robust software, absolutely.

Right now, yes, I would actually be keen to include stack traces on stderr :p. /jk

Enough general philosophy though, and back to the concrete case here:

  • I am interested in fixing the issue here (regression preventing nerdctl from manipulating external containers with kill / stop / rm)
  • I ll let it to maintainers to decide if we should include a message at all - specifically because of what other, actual error condition could trigger this (besides manipulating containers not created by nerdctl) <- this is key here - if the only way to produce it is truly just kube containers, fine, let's drop the message - but if there are other ways to end-up there, legitimate errors with nerdctl producing this, we need to keep it

Thanks again @jandubois
Appreciate the debate.

@apostasie
Copy link
Contributor Author

@AkihiroSuda this PR addresses a 2.0 regression when trying to stop / kill containers started with other tools (specifically k8s).

There is some debate on whether we should print a warning or not in that circumstance (I don't really care either way), but otherwise I believe we should consider merging this soon.

lmk your thoughts.

@AkihiroSuda AkihiroSuda added this to the v2.x.x (tentative) milestone Jan 22, 2025
@@ -143,7 +143,8 @@ func cleanupNetwork(ctx context.Context, container containerd.Container, globalO
networksJSON := spec.Annotations[labels.Networks]
var networks []string
if err := json.Unmarshal([]byte(networksJSON), &networks); err != nil {
return err
log.G(ctx).WithError(err).Infof("unable to retrieve networking information for that container, which is likely managed by another tool")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What tool uses the nerdctl/networks label?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None that I know of. So, this is saying that the container does not have the label, and is managed by another cli.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This warning should be printed on networksJSON == "", not on json.Unmarshal error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other place (in remove) does check on NetworkOptionsFromSpec, which returns an err onjson.Unmarshall - so, if we change it here, we should also change the logic inside NetworkOptionsFromSpec.

I would rather keep this PR minimal in terms of logic changes.
The current logic captures both cases when networksJSON == "" and when networksJSON is non-empty but does not parse.

lmk your thoughts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the label is non-empty, json.Unmarshal error is critical and not negligible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure - but then, these operations here are remove and kill (not create, run, etc).
Should we really prevent remove / kill from succeeding if the network label is mangled?

Copy link
Member

@AkihiroSuda AkihiroSuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@apostasie
Copy link
Contributor Author

CI failure is the usual Docker Hub 429 (too many requests).

@AkihiroSuda AkihiroSuda merged commit 01c15a7 into containerd:main Jan 27, 2025
30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nerdctl kill isn't working after update to 2.x (unable to cleanup network)
4 participants