Transfer hangs, both sides waiting on receive

Apollon Oikonomopoulos apoikos at debian.org
Wed Apr 5 11:49:30 UTC 2017


Hi,

We're encountering a rather weird, sporadic hang during a large-file 
transfer. A bit of background on the setup:

 - Host A rsyncs a MySQL database from Host B. Both sides use rsync 
   3.1.1 on Debian Jessie.

 - Host B serves the files using rsync --daemon from an ext4 filesystem 
   residing on a thin LVM snapshot.

 - Host A uses the following command to rsync the files to a btrfs subvolume:
   rsync -og -avPH --log-file=/var/log/mysql-rsync.log --debug=ALL --chown=user:user --delete --inplace rsync://hostb.example.com/mysql/ /media/btrfs/db-snapshot-$$/mysql
   (I know parts of the command (-og + -a + --chown) don't make much 
   sense :)

 - Both hosts sit on the same LAN, connected to a Juniper EX4200 stack 
   with active-backup interfaces. MTU is standard 1500 and there are no 
	 known network issues. Host A has Intel I350 (igb) NICs, while Host B has
   BCM5709 (bnx2) NICs. Both hosts run Linux 3.16.36.

Once every few (~3-5) transfers, the rsync process hangs with both sides 
waiting on select() on the TCP socket. From the transfer logs I can see 
that the hang always happens when about to receive the same 32GB file, 
already a couple of GBs into the total transfer.

Host A reports:
....
04:41:03 [6281] [receiver] send_msg_int(100, 1701)
04:41:03 [6281] recv_files(mysql/innodb_table_stats.ibd)
04:41:03 [6281] sender finished //mysql/innodb_table_stats.ibd
04:41:03 [6281] send_files(1749, //mysql/user.MYD)
04:41:03 [6281] sender finished //mysql/user.MYD
04:41:03 [6281] send_files(1750, //mysql/user.MYI)
04:41:03 [6281] sender finished //mysql/user.MYI
04:41:03 [6281] send_files(1806, //some_dbase)
04:41:03 [6281] send_files(1808, //some_dbase/actions.ibd)
04:41:03 [6281] >f.st...... mysql/innodb_table_stats.ibd
04:41:03 [6281] finishing mysql/innodb_table_stats.ibd
04:41:03 [6281] set modtime of mysql/innodb_table_stats.ibd to (1491343251) Wed Apr  5 01:00:51 2017
04:41:03 [6281] [receiver] send_msg_int(100, 1703)
04:41:03 [6281] recv_files(mysql/user.MYD)
04:41:03 [6281] >f.st...... mysql/user.MYD
04:41:03 [6281] finishing mysql/user.MYD
04:41:03 [6281] set modtime of mysql/user.MYD to (1487686529) Tue Feb 21 16:15:29 2017
04:41:03 [6281] [receiver] send_msg_int(100, 1749)
04:41:03 [6281] recv_files(mysql/user.MYI)
04:41:03 [6281] >f..t...... mysql/user.MYI
04:41:03 [6281] finishing mysql/user.MYI
04:41:03 [6281] set modtime of mysql/user.MYI to (1487689840) Tue Feb 21 17:10:40 2017
04:41:03 [6281] [receiver] send_msg_int(100, 1750)
04:41:03 [6281] recv_files(some_dbase)
04:41:03 [6281] .d..t...... some_dbase/
<silence>

while Host B reports:
....
01:41:03 [3726] send_files(1749, //mysql/user.MYD)
01:41:03 [3726] <f.st...... mysql/user.MYD
01:41:03 [3726] sender finished //mysql/user.MYD
01:41:03 [3726] send_files(1750, //mysql/user.MYI)
01:41:03 [3726] <f..t...... mysql/user.MYI
01:41:03 [3726] sender finished //mysql/user.MYI
01:41:03 [3726] send_files(1806, //some_dbase)
01:41:03 [3726] .d..t...... some_dbase/
01:41:03 [3726] send_files(1808, //some_dbase/actions.ibd)
<silence>

The backtraces of the hung processes are:

Host A (receiver):
 #0  0x00007f87d7da3873 in select () from /lib/x86_64-linux-gnu/libc.so.6
 #1  0x00007f87d88dd7a2 in perform_io (needed=needed at entry=4, flags=flags at entry=17) at io.c:742
 #2  0x00007f87d88deef7 in raw_read_int () at io.c:908
 #3  0x00007f87d88df683 in read_a_msg () at io.c:1423
 #4  0x00007f87d88e0676 in read_buf (f=f at entry=4, buf=buf at entry=0x7ffd814c0df0 "\300\037L\201\375\177", len=len at entry=1) at io.c:1835
 #5  0x00007f87d88e157b in read_ndx (f=f at entry=4) at io.c:2222
 #6  0x00007f87d88be377 in read_ndx_and_attrs (f_in=4, f_out=6, iflag_ptr=0x7ffd814c0f28, type_ptr=0x7ffd814c0f27 "\200\b", 
     buf=0x7ffd814c1fc0 "", len_ptr=0x7ffd814c0f2c) at rsync.c:320
 #7  0x00007f87d88c5b5e in recv_files (f_in=5, f_out=-2125728480, local_name=0x7ffd814c0fc0 "some_dbase") at receiver.c:548
 #8  0x00007f87d88d0787 in do_recv (f_in=4, f_out=6, local_name=0x0) at main.c:906
 #9  0x00007f87d88d1218 in client_run (f_in=4, f_out=4, pid=-1, argc=<optimized out>, argv=<optimized out>) at main.c:1213
 #10 0x00007f87d88b4330 in start_client (argv=<optimized out>, argc=<optimized out>) at main.c:1382
 #11 main (argc=2, argv=0x7f87d9000240) at main.c:1651

Host B (sender):
 #0  0x00007f9a7130a873 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
 #1  0x00007f9a71e447a2 in perform_io (needed=needed at entry=4, flags=flags at entry=17) at io.c:742
 #2  0x00007f9a71e45e2e in raw_read_buf (buf=0x7ffe7c1aa380 "", len=4) at io.c:895
 #3  0x00007f9a71e47643 in read_buf (f=<optimized out>, buf=buf at entry=0x7ffe7c1aa380 "", len=len at entry=4) at io.c:1840
 #4  0x00007f9a71e47851 in read_int (f=<optimized out>) at io.c:1711
 #5  0x00007f9a71e2f652 in receive_sums (f=<optimized out>) at sender.c:96
 #6  send_files (f_in=4, f_out=1930300096) at sender.c:332
 #7  0x00007f9a71e37f6c in do_server_sender (argv=<optimized out>, argc=<optimized out>, f_out=3, f_in=3) at main.c:840
 #8  start_server (f_in=3, f_out=3, argc=2, argv=0x7f9a7302f240) at main.c:1107
 #9  0x00007f9a71e5a34e in rsync_module (f_in=3, f_out=3, i=0, addr=0x7f9a72098518 <numeric_ids> "\377\377\377\377", 
     host=0x7f9a72098c80 <name_buf> "host-a.example.com") at clientserver.c:1009
 #10 0x00007f9a71e5b260 in start_daemon (f_in=3, f_out=3) at clientserver.c:1100
 #11 0x00007f9a71e4f5d1 in start_accept_loop (port=4, fn=0x7ffe7c1aa160, fn at entry=0x7f9a71e5b120 <start_daemon>) at socket.c:618
 #12 0x00007f9a71e5bb46 in daemon_main () at clientserver.c:1202
 #13 0x00007f9a71e1aeb1 in main (argc=0, argv=0x0) at main.c:1626
 
The worrysome part is the Send-Q on Host A, which last time contained 263392
unACK'd bytes. The rest of the queues involved were empty. tcpdump 
reveals no retransmission attempts, only 2 window updates and a stream 
of TCP keepalives after the point where the transfer halts:

 Bytes  Time (s)       SRC           DST        Proto    Details
 12995  1.637946       hostb        hosta       RSYNC    Server MOTD
 86     1.637963       hosta       hostb        TCP      39115→rsync(873) [ACK] Seq=1 Ack=37686543 Win=23993 Len=0 TSval=2255190023 TSecr=100923227
 86     1.638406       hostb        hosta       TCP      [TCP Window Update] rsync(873)→39115 [ACK] Seq=37686543 Ack=1 Win=1688 Len=0 TSval=100923227 TSecr=2255190022
 1514   1.641767       hostb        hosta       RSYNC    Server MOTD
 86     1.641802       hosta       hostb        TCP      39115→rsync(873) [ACK] Seq=1 Ack=37687971 Win=23985 Len=0 TSval=2255190024 TSecr=100923228
 1381   1.641938       hostb        hosta       RSYNC    Server MOTD
 86     1.641955       hosta       hostb        TCP      39115→rsync(873) [ACK] Seq=1 Ack=37689266 Win=23977 Len=0 TSval=2255190024 TSecr=100923228
 86     1.647888       hostb        hosta       TCP      [TCP Window Update] rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=6840 Len=0 TSval=100923230 TSecr=2255190024
 86     121.931963     hosta       hostb        TCP      [TCP Keep-Alive] 39115→rsync(873) [ACK] Seq=0 Ack=37689266 Win=24576 Len=0 TSval=2255220097 TSecr=100923230
 86     121.932062     hostb        hosta       TCP      [TCP Window Update] rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=8467 Len=0 TSval=100953302 TSecr=2255190024
 86     242.248005     hosta       hostb        TCP      [TCP Keep-Alive] 39115→rsync(873) [ACK] Seq=0 Ack=37689266 Win=24576 Len=0 TSval=2255250176 TSecr=100953302
 86     242.248090     hostb        hosta       TCP      [TCP Keep-Alive ACK] rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=8467 Len=0 TSval=100983381 TSecr=2255190024
 86     362.567974     hosta       hostb        TCP      [TCP Keep-Alive] 39115→rsync(873) [ACK] Seq=0 Ack=37689266 Win=24576 Len=0 TSval=2255280256 TSecr=100983381
 86     362.568085     hostb        hosta       TCP      [TCP Keep-Alive ACK] rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=8467 Len=0 TSval=101013461 TSecr=2255190024
 86     482.891958     hosta       hostb        TCP      [TCP Keep-Alive] 39115→rsync(873) [ACK] Seq=0 Ack=37689266 Win=24576 Len=0 TSval=2255310336 TSecr=101013461
 86     482.892026     hostb        hosta       TCP      [TCP Keep-Alive ACK] rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=8467 Len=0 TSval=101043542 TSecr=2255190024

>From the backtraces it appears as though the sender is waiting for the
checksums the client has sent, while the client is waiting for the next file id
to come in. I can only assume the deadlock is because of the client's checksums
sitting in its Send-Q. The only potentially related part I can find in the
traffic captures are a couple of Zero Window responses by hostb, which however
are followed by window updates with non-zero sizes.

Any ideas or hints what else to look for?

Regards,
Apollon



More information about the rsync mailing list