Reproducable failure with rsync, iptables and RHEL4

Sun Feb 4 15:45:28 GMT 2007

Hello!

I have a consistent, reproducable failure performing an rsync of an 
RHEL4 system running rsync in daemon mode with iptables enabled.  With 
iptables disabled, or with a rule that explicitly allows all traffic, 
the rsync completes.  However, with iptalbes enabled, the rsync starts, 
but will not finish.  It fails after copying a seemingly random amount 
of data.

I have been able to reproduce this problem using a large variety of 
different computers, network switches, network cables, Linux 
distributions, kernel versions and rsync versions, installed in a 
variety of different physical locations.  In fact, I am confident I have 
changed out every other component in the process:  the only item that is 
consistently present when it fails is any version of RHEL (or CentOS) 4 
is on the daemon (source) end, up to and including a fully-updated 
version.  This problem did not happen with RHEL3 or previous versions. 
I have not found another Linux distribution that has this problem, 
either;  however, *most* of our Linux systems that we would back up this 
way are running some version of Red Hat Linux, so this is far from 
conclusive.

NOTE:  the rsync daemon system is 172.28.16.36.  The rsync client system 
is 172.28.16.35.

Normally, the path would be /, but for testing I have changed it to 
/test.  /test contains eight copies of /usr.  This eliminates special 
files, as well as gives me enough data to copy:  4.1GB worth.  The error 
occurs at some seemingly random point during the copy.  I have had it 
happen anywhere between 200MB and 2GB into the copy.

Here is the iptables ruleset for the source system:

# cat iptables
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:RH-Firewall-1-INPUT - [0:0]
-A INPUT -j RH-Firewall-1-INPUT
-A FORWARD -j RH-Firewall-1-INPUT
-A RH-Firewall-1-INPUT -i lo -j ACCEPT
-A RH-Firewall-1-INPUT -p icmp --icmp-type any -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j 
ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 873 -j 
ACCEPT
-A RH-Firewall-1-INPUT -s 172.28.16.35 -d 172.28.16.36 -p tcp --dport 
873 -j LOG
-A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited
COMMIT
#

As you can see, I have pared this down to the bare minimum.  You can 
also see that I have added a LOG rule (the second to last one).  As far 
as I can tell, this rule should never be used:  either the packet will 
be a new request and will be handled by the line above the LOG, or it 
will be a packet that's part of the established connection, and will 
therefore be handled by the ESTABLISHED rule.  However, packets *are* 
logged by this rule.

With the above iptables rules enabled on the rsync daemon side, I get 
the following error on the destination system:

# rsync --numeric-ids --perms --owner --group -D --links --hard-links 
--times --block-size=2048 --recursive --one-file-system 
backupuser at 172.28.16.36::ROOT/* .
Password:
rsync: read error: No route to host
rsync error: error in rsync protocol data stream (code 12) at io.c(177)
rsync: connection unexpectedly closed (5682976 bytes read so far)
rsync error: error in rsync protocol data stream (code 12) at io.c(165)
#

If you watch the network between these two systems, you will see high 
traffic between them.  A top of either system shows that rsync is at the 
top.  However, suddenly the network traffic will stop.  A top of the 
system will no longer show rsync:  it will have been moved to the 
bottom.  About 10 minutes later, it will evenually time out with the 
error messages above.

Here is the rsync daemon's log for this time:
# tail /var/log/rsyncd.log
2007/02/04 00:27:41 [4386] name lookup failed for 172.28.16.35: Name or 
service not known
2007/02/04 00:27:43 [4386] rsync on ROOT/* from backupuser at UNKNOWN 
(172.28.16.35)
2007/02/04 05:45:50 [4386] rsync: writefd_unbuffered failed to write 
4096 bytes: phase "unknown" [sender]: Connection timed out (110)
2007/02/04 05:45:50 [4386] rsync error: error in rsync protocol data 
stream (code 12) at io.c(909)
#

Here is the iptables status during the rsync *before* the error occurs 
(edited to avoid the spam filter):

# iptables -L -v
Chain RH-Firewall-1-INPUT (2 references)
  pkts bytes target     prot opt in     out     source 
destination
  3477  181K ACCEPT     all  --  any    any     anywhere 
anywhere            state RELATED,ESTABLISHED
     1    60 ACCEPT     tcp  --  any    any     anywhere 
anywhere            state NEW tcp dpt:rsync
     0     0 LOG        tcp  --  any    any     172.28.16.35 
172.28.16.36        tcp dpt:rsync LOG level warning
#

As you can see, it looks exactly like what you would expect:  one packet 
starting the rsync request, and a bunch of established packets. 
However, notice the iptables status *after* the error has occurred:

# iptables -L -v
Chain RH-Firewall-1-INPUT (2 references)
  pkts bytes target     prot opt in     out     source 
destination
  256K   14M ACCEPT     all  --  any    any     anywhere 
anywhere            state RELATED,ESTABLISHED
     1    60 ACCEPT     tcp  --  any    any     anywhere 
anywhere            state NEW tcp dpt:rsync
    38  2732 LOG        tcp  --  any    any     172.28.16.35 
172.28.16.36        tcp dpt:rsync LOG level warning
#

There are 38 packets that were sent by the rsync client, but that 
iptables did not consider part of the established connection.  Here are 
the first four logged packets.  All 38 entries are at the bottom of this 
message.  (Actually, no:  the spam filter won't let me include them.)

# dmesg
ip_conntrack version 2.1 (4092 buckets, 32736 max) - 356 bytes per conntrack
IN=eth0 OUT= MAC=00:02:55:3b:23:51:00:02:55:5b:75:13:08:00 
SRC=172.28.16.35 DST=172.28.16.36 LEN=52 TOS=0x00 PREC=0x00 TTL=64 
ID=20551 DF PROTO=TCP SPT=32790 DPT=873 WINDOW=0 RES=0x00 ACK URGP=0
IN=eth0 OUT= MAC=00:02:55:3b:23:51:00:02:55:5b:75:13:08:00 
SRC=172.28.16.35 DST=172.28.16.36 LEN=52 TOS=0x00 PREC=0x00 TTL=64 
ID=20552 DF PROTO=TCP SPT=32790 DPT=873 WINDOW=2896 RES=0x00 ACK URGP=0
IN=eth0 OUT= MAC=00:02:55:3b:23:51:00:02:55:5b:75:13:08:00 
SRC=172.28.16.35 DST=172.28.16.36 LEN=52 TOS=0x00 PREC=0x00 TTL=64 
ID=20553 DF PROTO=TCP SPT=32790 DPT=873 WINDOW=2896 RES=0x00 ACK URGP=0
IN=eth0 OUT= MAC=00:02:55:3b:23:51:00:02:55:5b:75:13:08:00 
SRC=172.28.16.35 DST=172.28.16.36 LEN=52 TOS=0x00 PREC=0x00 TTL=64 
ID=20554 DF PROTO=TCP SPT=32790 DPT=873 WINDOW=5792 RES=0x00 ACK URGP=0
#

I do not know why iptables is filtering these packets.  However, this is 
what is killing my rsyncs.  To prove this, all you have to do is change 
  the LOG iptables rule to ACCEPT (like this:  -A RH-Firewall-1-INPUT -s 
172.28.16.35 -d 172.28.16.36 -p tcp --dport 873 -j ACCEPT) and the rsync 
will complete successfully.  Notice the iptables status after the rsync 
has completed:

# iptables -L -v
Chain RH-Firewall-1-INPUT (2 references)
  pkts bytes target     prot opt in     out     source 
destination
  983K   53M ACCEPT     all  --  any    any     anywhere 
anywhere            state RELATED,ESTABLISHED
     1    60 ACCEPT     tcp  --  any    any     anywhere 
anywhere            state NEW tcp dpt:rsync
  173K 9375K ACCEPT     tcp  --  any    any     172.28.16.35 
172.28.16.36        tcp dpt:rsync
#

As you can see, the client is sending *lots* of packets that are not 
being matched by the normal iptables rules, and need that extra second 
to last rule to prevent it from failing.  In fact, once a single packet 
matches that rule, it does not seem that any more packets are matched by 
the ESTABLISHED rule:  only that second to last rule's counts increase. 
  It seems like iptables is somehow forgetting about the connection.

Why is the rsync client's connection all of a sudden not being 
recognized?  Is the client sending the packets differently, or is 
iptables recognizing them differently?  Is there something I can do to 
get rsync to *not* send these packets, or to get iptables to filter them 
the way its supposed to?  What changed that is causing RHEL4 to handle 
this differently than RHEL3?

I would greatly appreciate any information you might be able to give me. 
  Believe it or not, this is the *abridged* version:  I've been 
struggling with this for several weeks now, trying scores of different 
permutations to try to resolve this.  On top of that, the list's spam 
filter has forced me to pare down or eliminate a number of items I had 
originally tried to include.  If you have any questions, or would like 
any additional information, please do not hesitate to ask.  Thank you 
very much for any help or insight you might be able to provide.

Timothy J. Massey