[Samba] samba AD problem after re-join domain

Mon Oct 12 15:11:34 UTC 2020

On 10/12/2020 10:36 AM, Jason Keltz wrote:
>
> On 10/12/2020 4:06 AM, Rowland penny via samba wrote:
>> On 12/10/2020 02:54, Jason Keltz via samba wrote:
>>> I've been working on a Samba AD setup with a bunch of test machines 
>>> - the one DC, and a bunch of clients. Last night, I ended up 
>>> switching the name of the test machines temporarily (except the DC), 
>>> and re-joining the domain (that's for another e-mail later). When 
>>> things didn't work the way I had planned,  I switched the hostnames 
>>> back, and re-joined the domain today on all the test machines.  I 
>>> was shocked to find that I am only able to login to the domain on 
>>> one of my hosts. It fails on all the other ones.  I ensured that I 
>>> deleted the machine entries from AD.  I haven't changed my Samba 
>>> config in months which Rowland had last verified was fine.  I 
>>> haven't changed my /etc/krb5.conf Kerberos config in months.  I even 
>>> did a complete rebuild of one of the machines since I automated the 
>>> installation process, and that rebuild had worked perfectly many 
>>> times, but now it fails. In the winbind log, every time I try to 
>>> log in I'm mostly seeing:
>>
>> Did you leave the domain before you changed the hostname ?
>>
>> Why did you change the hostnames ? In a case like this, I would have 
>> set up a new computer, joined this to the domain and then removed the 
>> old computer from the domain. 
>
> Hi Rowland,
>
> I did not leave the domain, but I did delete the entry using either 
> the Windows AD tool or "samba-tool computer delete".  I can't 
> remember which one at this point.  I think that clears up all the 
> bits.  Is that correct?  On the local host, I also deleted 
> /etc/krb5.keytab, and deleted all the samba bits so that the join was 
> fresh.
>
> Things are better today.  I discovered one issue which, though it 
> seemed (to me) unrelated to the errors, appears to have been the cause 
> of a lot of the trouble.  I was chasing errors in the winbind log, but 
> several of the test servers are NFS servers, and when I rejoined them 
> to the domain, I didn't replace the nfs/X entries in their keytabs.  
> As a result, the clients couldn't mount, and that definitely caused 
> some trouble, for which I didn't see the signs.  I'm still watching, 
> though. However, I can log in to all the hosts now.
>
> By the way, at one point, I rebooted the DC, and I noticed that all 
> the AD clients showed something like this:
>
> [2020/10/12 09:25:19.183616,  1, pid=36145, effective(0, 0), real(0, 
> 0)] 
> ../../source3/rpc_client/cli_pipe.c:422(cli_pipe_validate_current_pdu)
>   ../../source3/rpc_client/cli_pipe.c:422: Bind NACK received from 
> host dc1.ad.eecs.yorku.ca!
> [2020/10/12 09:44:11.598150,  1, pid=36145, effective(0, 0), real(0, 
> 0)] ../../source3/libads/ldap_utils.c:93(ads_do_search_retry_internal)
>   Reducing LDAP page size from 1000 to 500 due to IO_TIMEOUT
>
> (Which is strange, because this means that if you reboot the DC, then 
> the clients start talking slower to it when it comes back up?  I don't 
> think the number ever increases unless you restart winbind everywhere?)
>
> and since that reboot, I've seen a few of them do this:
>
> [2020/10/12 10:00:19.814381,  1, pid=36145, effective(0, 0), real(0, 
> 0)] ../../source3/libads/ldap_utils.c:93(ads_do_search_retry_internal)
>   Reducing LDAP page size from 500 to 250 due to IO_TIMEOUT
> [2020/10/12 10:16:19.557261,  1, pid=36145, effective(0, 0), real(0, 
> 0)] ../../source3/libads/ldap_utils.c:93(ads_do_search_retry_internal)
>   Reducing LDAP page size from 250 to 125 due to IO_TIMEOUT
>
> Two of them are VirtualBox VMs, so I figured maybe it's some kind of 
> VirtualBox thing, but one of them is an actual machine and still has 
> the same error.  The DC is very lightly loaded.  How would I debug 
> what is causing these IO_TIMEOUTs and page size reductions?
>
> I know that various errors in the Samba logs are not "issues" but this 
> one seems to be an issue.  I don't like seeing IO_TIMEOUTs.
>
> Another distracting error in the log included:
>
> [2020/10/11 22:43:29.843630,  1, pid=969, effective(0, 0), real(0, 0)] 
> ../../source3/libads/ldap.c:565(ads_find_dc)
>   ads_find_dc: name resolution for realm 'AD.EECS.YORKU.CA' (domain 
> 'EECSYORKUCA') failed: NT_STATUS_NO_LOGON_SERVERS
>
> ... after boot, which sounds serious, but it turns out that if I try 
> to authenticate before everything is up and running, that's what I get. 
> The error makes sense but there's no "follow up" to say: "Ok ok - I 
> found it now - Sorry to give you a heart attack.".  It's all a 
> learning experience.
>
> <snipped>
> Jason

I wanted to add one more thing...  It seems that I'm actually still 
getting this everywhere when a user logs in:

[2020/10/12 10:59:29.958617,  1, pid=23338, effective(1004, 0), 
real(1004, 0)] 
../../source3/librpc/crypto/gse_krb5.c:417(fill_mem_keytab_from_system_keytab)
   ../../source3/librpc/crypto/gse_krb5.c:417: krb5_kt_start_seq_get 
failed (Permission denied)

... but at least the user can still log in.

I wonder if this is a regular error and everyone is seeing it in their 
logs?  Just for fun, I tried changing the permissions of 
/etc/krb5.keytab temporarily to 644, and sure enough, the error goes 
away....  so somehow, when the user is logging in, it seems that winbind 
is trying to read the keytab as the user.  It's not clear why that would 
be, and while a Google search hasn't revealed the reason for this error, 
I do see it in a whole lot of log files. It's just that when I'm trying 
to ensure there are no problems with my setup, and trying to understand 
the errors that do show up, it can cause panic.  Whether it's a problem 
or not, I do not know.
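
For anyone who wants to check the keytab permissions without changing 
them, here is a minimal Python sketch. It only reads the file's mode 
bits and reports whether group or other users can read it. The function 
name keytab_mode_report is hypothetical (it is not part of Samba or 
Kerberos), and the default path /etc/krb5.keytab is simply the one from 
my setup. Since a keytab holds the machine's secret keys, I'd assume a 
root-only mode such as 600 is what is normally wanted, so the 644 above 
was only ever a temporary test.

```python
import os
import stat


def keytab_mode_report(path="/etc/krb5.keytab"):
    """Return the permission bits of a keytab file and who can read it.

    Hypothetical helper for illustration only; it inspects the file
    mode and never opens or modifies the keytab itself.
    """
    st = os.stat(path)
    mode = stat.S_IMODE(st.st_mode)  # keep only the permission bits
    return {
        "mode": oct(mode),
        "owner_uid": st.st_uid,
        "group_readable": bool(mode & stat.S_IRGRP),
        "world_readable": bool(mode & stat.S_IROTH),
    }
```

Calling keytab_mode_report() on a host and seeing world_readable set to 
False would be consistent with the "Permission denied" message appearing 
when something tries to read the keytab with a non-root effective uid.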

Jason.