[Samba] Network rebuild advice needed

Peter Pollock peter.pollock at kingschristian.org
Mon Aug 31 01:39:55 UTC 2020


Thank you, Andrew, for your patient and calming advice.

We're a small private school and I'm basically a volunteer here. When I set
this all up 4 years ago everything was peachy but problems started
appearing and now seem to be cascading. I've considered getting
professional help, but I'll pretty much have to ask my wife to do overtime
and give me the money as an early Christmas present to pay for it. We just
don't have the budget, especially right now in the middle of COVID.

The dbcheck came up with a couple of errors relating to an old (removed) DC,
so I ran the fix and they cleared, but the join still failed with that same
RID issue.

My solution was to switch the second server (Luke) off and reload server 1
(Genesis) from a backup. That came up fine and I could join the new DC with
no problems.

Zentyal collapses if I make any changes whatsoever: you have to save
changes in Zentyal, and it hangs saving the DNS or Samba module EVERY time.
I've not been able to find a way around it. Running updates also hangs on
updating the Zentyal modules, and dpkg --configure -a doesn't help. So I
want to wipe and reinstall the two Zentyal servers.

My thought was:

1) build a new server (I personally bought the parts to fix an old server
we were keeping for spares so I have another one to use).
2) Join the new server to the domain and ensure it replicates.
3) Wipe the two old servers and rebuild them, rejoining them to the new
server to replicate everything back across.
4) Be happy for a few minutes
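
For reference, a minimal sketch of how step 2 and the checks around it might
look as samba-tool invocations. It only builds and prints the command lines;
nothing is executed. The realm "example.lan" and the account name are
hypothetical placeholders, and the BIND9_DLZ backend is an assumption based
on bind being in use here.

```python
import shlex

# Hypothetical placeholders: substitute the real realm and admin account.
REALM = "example.lan"
ADMIN = "Administrator"


def join_plan(realm, admin, dns_backend="SAMBA_INTERNAL"):
    """Return samba-tool command lines, in order, for joining a new DC
    and checking that it replicates. Nothing is executed here."""
    return [
        # On the existing DC: confirm the database is clean before joining.
        ["samba-tool", "dbcheck", "--cross-ncs"],
        # On the new server: join as an additional domain controller.
        ["samba-tool", "domain", "join", realm, "DC",
         f"--dns-backend={dns_backend}", "-U", admin],
        # On the new server, after the join: show replication status.
        ["samba-tool", "drs", "showrepl"],
        # Confirm which DC holds the FSMO roles before wiping anything.
        ["samba-tool", "fsmo", "show"],
    ]


if __name__ == "__main__":
    for argv in join_plan(REALM, ADMIN, dns_backend="BIND9_DLZ"):
        print(shlex.join(argv))
```

Keeping the commands as data makes it easy to paste each one into the notes
file alongside what it printed before moving to the next step.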

Genesis is our primary DNS and holds all the FSMO roles. It is also our
gateway and DHCP server and holds the Windows roaming profiles.

My plan for this weekend was to use an old PC to build a separate DHCP
server/gateway so that every time Genesis went down we didn't lose all
internet access. That server worked, which meant I thought my weekend had
been a success at first, but of course Genesis failed when I tried to
switch off its DHCP module so things came crashing to a halt. I was
planning on rebuilding the servers NEXT weekend but when Genesis wouldn't
let me switch off its DHCP server, I figured I needed to bring my other
plan forward.

That's why I was building a new server yesterday and once I got past the
RID issue, it seemed to be OK.... except I could never make it actually
operate as a DNS server and even though I switched Genesis and Luke off,
turned on my new DHCP server, pointed everything at the new DC and
successfully started serving DHCP addresses with all the right details to
the clients, they refused to use the new DC to authenticate logons (can't
find a logon server) and the routing went up and down like a yo-yo even
when I made no changes whatsoever. It was awful, one second I'm browsing
the net, reading every article I can find on configuring bind correctly and
the next second, poof, no internet and the new server is suddenly saying it
has a temporary failure in name resolution - yet I couldn't find a single
setting that had changed.
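
For reference, "can't find a logon server" usually means the clients could
not resolve the SRV records an AD domain controller registers in DNS. The
sketch below lists those record names and prints a lookup command for each.
The realm "example.lan" and the server address 192.0.2.10 are hypothetical
placeholders, and this only checks name resolution, not whether the DC
answers on those ports.

```python
def dc_locator_records(realm):
    """SRV record names that clients query to locate an AD DC."""
    realm = realm.strip(".").lower()
    return [
        f"_ldap._tcp.{realm}",
        f"_ldap._tcp.dc._msdcs.{realm}",
        f"_kerberos._tcp.{realm}",
        f"_kerberos._udp.{realm}",
        f"_kpasswd._tcp.{realm}",
        f"_gc._tcp.{realm}",
    ]


def lookup_commands(realm, dns_server):
    """One 'host' command per record, asking dns_server directly."""
    return [f"host -t SRV {name} {dns_server}"
            for name in dc_locator_records(realm)]


if __name__ == "__main__":
    # Hypothetical values: the real realm and the new DC's IP go here.
    for cmd in lookup_commands("example.lan", "192.0.2.10"):
        print(cmd)
```

If the new DC does not return itself for these names, clients pointed at it
for DNS have no way to find it for logons, whatever DHCP hands out.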

I've now switched the new DC back off, switched Genesis back on, reloaded
from a backup from June, before all the database problems started, removed
all references to dead DCs from it, run dbcheck and everything is good...
except that random PCs occasionally decide they have no trust
relationship with the server any more... and I can't run any updates (last
update was in February, I think).

Now I need to switch Luke back on, because he's a DC and also our
fileserver, but when I do, I know that he and Genesis are going to try to
replicate and he has a newer version of everything so they're going to talk
for a minute then have another falling out and things will be fairly
screwed again.

I wanted today to at least wipe Luke so he doesn't cause problems, but I
don't know that that is even possible. Genesis may not be kind enough to
allow him to rejoin when I rebuild him.

Teachers will be back on site in 12 hours so I have some decisions to make.
It looks like the best choice right now is to switch Luke back on so
everything is back how it was (broken but limping along) on Friday
afternoon when I kicked the teachers off the network and call it a wasted
weekend.

Oh, and I've been trying the backup. It looks like the directories private
and sysvol are in the /var/lib/samba directory but I cannot find anywhere
where there is a /samba/etc directory and I don't know exactly what is
supposed to be in it so it's hard to search for it.
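
For reference, one hedged way to locate that directory: "smbd -b" prints the
paths Samba was built with, including CONFIGFILE (the smb.conf location).
The sketch below assumes the "etc" directory in the backup instructions
means the directory holding smb.conf, which on packaged installs is commonly
/etc/samba; that reading is an assumption, not something confirmed in this
thread.

```python
import subprocess


def parse_build_paths(text):
    """Collect 'KEY: value' lines (upper-case keys) into a dict."""
    paths = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        # Keep only entries such as CONFIGFILE or PRIVATE_DIR.
        if sep and key.isupper() and key.replace("_", "").isalnum():
            paths[key] = value.strip()
    return paths


def samba_build_paths():
    """Run 'smbd -b' and parse its output; empty dict if unavailable."""
    try:
        result = subprocess.run(["smbd", "-b"], capture_output=True,
                                text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return {}
    return parse_build_paths(result.stdout)


if __name__ == "__main__":
    paths = samba_build_paths()
    for key in ("CONFIGFILE", "PRIVATE_DIR", "STATEDIR"):
        print(key, "=", paths.get(key, "(not reported)"))
```

Whatever directory CONFIGFILE sits in would be the one to include alongside
private and sysvol.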

Peter

On Sun, Aug 30, 2020 at 2:06 PM Andrew Bartlett <abartlet at samba.org> wrote:

> On Sun, 2020-08-30 at 02:57 -0700, Peter Pollock wrote:
> > Yeah, I already tried that, to no avail.
>
> What exactly was the error or symptom?
>
> Did you try the backup and restore I suggested?
>
> If the only issue is the RID, there are ways to get past that - changing
> the nextRid for example, but you might wish to engage professional Samba
> support.
>
> I get the impression you are pretty stressed.  One of the things I've seen
> with stressed administrators over the years is big jumps without careful
> notes.  Take each step slowly, get good advice (here or professionally) and
> note down each time what works, what doesn't etc.  Do this before you move
> to the next step.
>
> If you can do a trial run in a (truly) disconnected LAB while leaving
> production untouched then it can be less stressful.  The backup/restore
> tools change the major keys, which can avoid the networks talking to each
> other, so please don't ignore them!
>
> Thanks,
>
> Andrew Bartlett
>
> --
> Andrew Bartlett                       http://samba.org/~abartlet/
> Authentication Developer, Samba Team  http://samba.org
> Samba Developer, Catalyst IT
> http://catalyst.net.nz/services/samba
>
>
>

