[Samba] Bad SMB2 (sign_algo_id=1) signature for message?

Thu Oct 5 21:58:45 UTC 2023

Ah, wonderful! So I'm not crazy!

Good day, Michael, Jeremy, and other Samba list members. My name is Jeff Saxe, and I'm an IT staff member at Quantitative Investment Management in Charlottesville, Virginia, US. I have some more information to contribute on this issue. I hope that this email adds on to Michael's previous email from Feb. 8th 2023. I was actually reading your message through the list's Archives web site, not having previously been subscribed to this list, and I'm only subscribing now; so I cannot make my email client add that "In-Reply-To" header that might assist the automatic threading.

We, too, have been struggling with this random "Bad SMB2 signature for message" problem, except that in our case the affected clients are not Windows at all, but other Linux machines using the CIFS client in their kernels. The messages in our /var/log/samba/log.CLIENT.IP.ADDRESS.HERE are exactly the same as what Michael quoted: "Bad SMB2 (sign_algo_id=1) signature for message", mentioning that exact same line number 722 of smb2_signing.c. If I "tail -F" that log file, I can see these same messages repeating every 2 seconds, far more often than Michael sees from his Windows clients. So whatever the Linux client is doing, it is extremely persistent in retrying: it never succeeds, but it also never gives up (it keeps going for minutes or even hours without timing out).

This is all Ubuntu 20.04 LTS, both the Samba server and the clients. The file server currently has version 4.15 of Samba; "apt-cache policy" says it could get version 4.11.6 from focal/main, but this was overridden by 4.15.13 from focal-security/main or focal-updates/main. We have about 10 client machines, which don't have the full samba package installed (they don't need to be CIFS servers, only clients), so they have the cifs-utils package, version 6.9-1, whose userland CLI "mount.cifs -V" reports version 6.9. Their kernel, which I believe contains the actual protocol implementation, happens to be 5.4.0-153. These machines are used simultaneously by several end users, and they all have "autofs" mounts that can mount and unmount at any time, using Kerberos credentials so that each user ends up getting access to shared files on the server under his or her own security context. So far all of this works just great, and the vast majority of the time, the end users are very happy.

But occasionally, randomly (unfortunately I cannot recreate this issue on demand), one of the users has a persistent failing share mount, such that from their side, they see "Permission denied" and cannot list or change-directory into the spot where it's mounted. The problem doesn't seem to be with autofs itself, although I can't guarantee that; I think it would still happen if the mounts were manual (a human typing mount.cifs at a shell prompt) or were in /etc/fstab (mounting once every time the client machine boots). And it does not affect all the users on the machine, nor does it affect that same user mounting the same shares from other client machines! It seems to be an isolated random flake, maybe twice a week. At any rate, I can log into the client machine and, even if I sudo to root, I cannot list the directory either — and curiously, we see some question marks for the permissions, owners (user and group), and other metadata about the mounted directory. I will anonymize the user and folder names below. The first is an unaffected, perfectly fine set of shares for one user "bobby", and the second is a user "jack" who is currently experiencing broken shares on 2 out of his 3 mounts.

/mnt/bobby:
total 64
drwx------ 2 bobby root     0 Aug  3 15:40 share1
drwx------ 2 bobby root 65536 Sep 28 09:20 share2
drwx------ 2 bobby root     0 Sep 29 15:36 share3

/mnt/jack:
ls: cannot access '/mnt/jack/share1': Permission denied
ls: cannot access '/mnt/jack/share3': Permission denied
total 64
d????????? ? ?           ?        ?            ? share1
drwx------ 2 jack        root 65536 Sep 28 09:20 share2
d????????? ? ?           ?        ?            ? share3

So the two shares "share1" and "share3" are mounted from this Linux file server that is currently generating the message. The other share "share2" happens to be mounted from a Windows-OS CIFS server elsewhere in the same Active Directory domain. When the problem happens, only share1 and share3 are affected; the end user has no problem accessing files from share2, which consumes his exact same Kerberos credential cache file to do the mounting, so it's not a general problem with the user's domain account, like password locked or something.

Once this happens, the end user has no way to fix it himself. I can (as root) "umount /mnt/jack/share1"; if jack happens to have a process stuck with some I/O or a current directory within that share, then umount says it's busy, but I can add the --lazy option ("umount -l") and it goes ahead and unmounts. After I've unmounted both share1 and share3, jack can try again to go into the subdirectory, autofs will mount it successfully, and he's fixed and can get back to work.

However, I have also been trying to dissect if this whole thing is a client or server problem, and I noticed today that I see something unusual on the server end in the output of "smbstatus", specifically the top section "smbstatus --processes". The normal-looking output would be something like...

1152377 bobby        domain users 10.1.0.108 (ipv4:10.1.0.108:54124)        SMB3_11           -                    partial(AES-128-CMAC)
1152377 jack         domain users 10.1.0.108 (ipv4:10.1.0.108:54124)        SMB3_11           -                    partial(AES-128-CMAC)
1152377 jack         domain users 10.1.0.108 (ipv4:10.1.0.108:54124)        SMB3_11           -                    partial(AES-128-CMAC)
1152383 mallory      domain users 10.1.0.117 (ipv4:10.1.0.117:59624)        SMB3_11           -                    partial(AES-128-CMAC)
1152383 alice        domain users 10.1.0.117 (ipv4:10.1.0.117:59624)        SMB3_11           -                    partial(AES-128-CMAC)
1152383 richard      domain users 10.1.0.117 (ipv4:10.1.0.117:59624)        SMB3_11           -                    partial(AES-128-CMAC)
1152383 richard      domain users 10.1.0.117 (ipv4:10.1.0.117:59624)        SMB3_11           -                    partial(AES-128-CMAC)

...with each line showing what protocol and what signing algorithm is currently in effect between that client and this server. But when the problem is happening, perhaps the protocol is stuck while negotiating the SMB2 signing, because that last field just has a single hyphen in place of "partial(AES-128-CMAC)". So I have taken this as a useful indicator that the issue is happening, and in fact, if I grab the PID from the beginning of the line that is showing this lack of a signing algorithm, and I do a "kill 1152383" at the shell prompt (as root, no particular kill signal, so I guess it is SIGTERM), then the problem appears to be instantly cleared. The "tail -F" of the client-specific Samba log file stops scrolling, the question marks and "Permission denied" on the client go away, the client successfully mounts or remounts the share, and the end user is happy again.

So just this morning I made a hackish Python script, run as a cron job every few minutes, that greps the output of smbstatus for those suspicious lines and kills off the malfunctioning PID (after waiting for 10 seconds and repeating the smbstatus command, to make sure it's an actual case and not just a race condition with a brand-new connection). I have some hope that this ridiculous hack will work around the problem and keep it away from my end users while hopefully someone can track down the root cause. Is there anything I can do (config files, logging, tcpdump captures, etc.) to help you find the actual cause? The file server is quite busy with high-bandwidth legitimate traffic every day, so running a continuous tcpdump writing to disk all the time would be very painful, especially because I can't recreate the problem at will.
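
For anyone wanting to try the same workaround, here is a minimal sketch of what such a watchdog might look like. It is not the actual script described above, only an illustration under stated assumptions: that the last whitespace-separated column of each "smbstatus --processes" session line is the signing field (as in the output quoted earlier), that a bare "-" there marks a stuck session, and that SIGTERM is enough to clear it. The function names (suspect_pids, run_smbstatus, main) are made up for this example.

```python
import os
import signal
import subprocess
import time


def suspect_pids(smbstatus_text):
    """Return the PIDs of session lines whose last column is a bare '-'.

    Assumption: the last whitespace-separated column is the signing field,
    as in the smbstatus output quoted in this message.
    """
    pids = set()
    for line in smbstatus_text.splitlines():
        fields = line.split()
        # Session lines start with a numeric PID; this skips the version
        # banner, the column header and the dashed ruler.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        # Group names can contain spaces ("domain users"), so the signing
        # column is read from the right-hand end of the line.
        if fields[-1] == "-":
            pids.add(int(fields[0]))
    return pids


def run_smbstatus():
    """Run 'smbstatus --processes' and return its standard output."""
    result = subprocess.run(
        ["smbstatus", "--processes"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def main(recheck_delay=10):
    """Kill only the PIDs that look stuck in two checks taken a delay apart."""
    first = suspect_pids(run_smbstatus())
    if not first:
        return
    # Wait and look again, so a brand-new connection that has not finished
    # setting up its session is not mistaken for a stuck one.
    time.sleep(recheck_delay)
    confirmed = first & suspect_pids(run_smbstatus())
    for pid in sorted(confirmed):
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # the process exited on its own in the meantime

# Call main() from the script that cron runs as root.
```

One thing to keep in mind: in the smbstatus output above, a single PID appears on the lines of every user connected from the same client address, so terminating that PID probably drops all of those users' sessions, not just the broken one.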

Thanks very much, Jeremy and anyone else reading. I hope we can work together to find and smash this bug permanently. Let me know how I can help.

— Jeff Saxe

Jeff.Saxe at Quantitative.com