Transfer hangs, both sides waiting on receive
Apollon Oikonomopoulos
apoikos at debian.org
Wed Apr 5 11:49:30 UTC 2017
Hi,
We're encountering a rather weird, sporadic hang during a large-file
transfer. A bit of background on the setup:
- Host A rsyncs a MySQL database from Host B. Both sides use rsync
3.1.1 on Debian Jessie.
- Host B serves the files using rsync --daemon from an ext4 filesystem
residing on a thin LVM snapshot.
- Host A uses the following command to rsync the files to a btrfs subvolume:
rsync -og -avPH --log-file=/var/log/mysql-rsync.log --debug=ALL --chown=user:user --delete --inplace rsync://hostb.example.com/mysql/ /media/btrfs/db-snapshot-$$/mysql
(I know parts of the command (-og + -a + --chown) don't make much
sense :)
- Both hosts sit on the same LAN, connected to a Juniper EX4200 stack
with active-backup interfaces. MTU is standard 1500 and there are no
known network issues. Host A has Intel I350 (igb) NICs, while Host B has
BCM5709 (bnx2) NICs. Both hosts run Linux 3.16.36.
Once every few (~3-5) transfers, the rsync process hangs with both sides
waiting on select() on the TCP socket. From the transfer logs I can see
that the hang always happens when about to receive the same 32GB file,
already a couple of GBs into the total transfer.
Host A reports:
....
04:41:03 [6281] [receiver] send_msg_int(100, 1701)
04:41:03 [6281] recv_files(mysql/innodb_table_stats.ibd)
04:41:03 [6281] sender finished //mysql/innodb_table_stats.ibd
04:41:03 [6281] send_files(1749, //mysql/user.MYD)
04:41:03 [6281] sender finished //mysql/user.MYD
04:41:03 [6281] send_files(1750, //mysql/user.MYI)
04:41:03 [6281] sender finished //mysql/user.MYI
04:41:03 [6281] send_files(1806, //some_dbase)
04:41:03 [6281] send_files(1808, //some_dbase/actions.ibd)
04:41:03 [6281] >f.st...... mysql/innodb_table_stats.ibd
04:41:03 [6281] finishing mysql/innodb_table_stats.ibd
04:41:03 [6281] set modtime of mysql/innodb_table_stats.ibd to (1491343251) Wed Apr 5 01:00:51 2017
04:41:03 [6281] [receiver] send_msg_int(100, 1703)
04:41:03 [6281] recv_files(mysql/user.MYD)
04:41:03 [6281] >f.st...... mysql/user.MYD
04:41:03 [6281] finishing mysql/user.MYD
04:41:03 [6281] set modtime of mysql/user.MYD to (1487686529) Tue Feb 21 16:15:29 2017
04:41:03 [6281] [receiver] send_msg_int(100, 1749)
04:41:03 [6281] recv_files(mysql/user.MYI)
04:41:03 [6281] >f..t...... mysql/user.MYI
04:41:03 [6281] finishing mysql/user.MYI
04:41:03 [6281] set modtime of mysql/user.MYI to (1487689840) Tue Feb 21 17:10:40 2017
04:41:03 [6281] [receiver] send_msg_int(100, 1750)
04:41:03 [6281] recv_files(some_dbase)
04:41:03 [6281] .d..t...... some_dbase/
<silence>
while Host B reports:
....
01:41:03 [3726] send_files(1749, //mysql/user.MYD)
01:41:03 [3726] <f.st...... mysql/user.MYD
01:41:03 [3726] sender finished //mysql/user.MYD
01:41:03 [3726] send_files(1750, //mysql/user.MYI)
01:41:03 [3726] <f..t...... mysql/user.MYI
01:41:03 [3726] sender finished //mysql/user.MYI
01:41:03 [3726] send_files(1806, //some_dbase)
01:41:03 [3726] .d..t...... some_dbase/
01:41:03 [3726] send_files(1808, //some_dbase/actions.ibd)
<silence>
The backtraces of the hung processes are:
Host A (receiver):
#0 0x00007f87d7da3873 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f87d88dd7a2 in perform_io (needed=needed at entry=4, flags=flags at entry=17) at io.c:742
#2 0x00007f87d88deef7 in raw_read_int () at io.c:908
#3 0x00007f87d88df683 in read_a_msg () at io.c:1423
#4 0x00007f87d88e0676 in read_buf (f=f at entry=4, buf=buf at entry=0x7ffd814c0df0 "\300\037L\201\375\177", len=len at entry=1) at io.c:1835
#5 0x00007f87d88e157b in read_ndx (f=f at entry=4) at io.c:2222
#6 0x00007f87d88be377 in read_ndx_and_attrs (f_in=4, f_out=6, iflag_ptr=0x7ffd814c0f28, type_ptr=0x7ffd814c0f27 "\200\b",
buf=0x7ffd814c1fc0 "", len_ptr=0x7ffd814c0f2c) at rsync.c:320
#7 0x00007f87d88c5b5e in recv_files (f_in=5, f_out=-2125728480, local_name=0x7ffd814c0fc0 "some_dbase") at receiver.c:548
#8 0x00007f87d88d0787 in do_recv (f_in=4, f_out=6, local_name=0x0) at main.c:906
#9 0x00007f87d88d1218 in client_run (f_in=4, f_out=4, pid=-1, argc=<optimized out>, argv=<optimized out>) at main.c:1213
#10 0x00007f87d88b4330 in start_client (argv=<optimized out>, argc=<optimized out>) at main.c:1382
#11 main (argc=2, argv=0x7f87d9000240) at main.c:1651
Host B (sender):
#0 0x00007f9a7130a873 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007f9a71e447a2 in perform_io (needed=needed at entry=4, flags=flags at entry=17) at io.c:742
#2 0x00007f9a71e45e2e in raw_read_buf (buf=0x7ffe7c1aa380 "", len=4) at io.c:895
#3 0x00007f9a71e47643 in read_buf (f=<optimized out>, buf=buf at entry=0x7ffe7c1aa380 "", len=len at entry=4) at io.c:1840
#4 0x00007f9a71e47851 in read_int (f=<optimized out>) at io.c:1711
#5 0x00007f9a71e2f652 in receive_sums (f=<optimized out>) at sender.c:96
#6 send_files (f_in=4, f_out=1930300096) at sender.c:332
#7 0x00007f9a71e37f6c in do_server_sender (argv=<optimized out>, argc=<optimized out>, f_out=3, f_in=3) at main.c:840
#8 start_server (f_in=3, f_out=3, argc=2, argv=0x7f9a7302f240) at main.c:1107
#9 0x00007f9a71e5a34e in rsync_module (f_in=3, f_out=3, i=0, addr=0x7f9a72098518 <numeric_ids> "\377\377\377\377",
host=0x7f9a72098c80 <name_buf> "host-a.example.com") at clientserver.c:1009
#10 0x00007f9a71e5b260 in start_daemon (f_in=3, f_out=3) at clientserver.c:1100
#11 0x00007f9a71e4f5d1 in start_accept_loop (port=4, fn=0x7ffe7c1aa160, fn at entry=0x7f9a71e5b120 <start_daemon>) at socket.c:618
#12 0x00007f9a71e5bb46 in daemon_main () at clientserver.c:1202
#13 0x00007f9a71e1aeb1 in main (argc=0, argv=0x0) at main.c:1626
The worrysome part is the Send-Q on Host A, which last time contained 263392
unACK'd bytes. The rest of the queues involved were empty. tcpdump
reveals no retransmission attempts, only 2 window updates and a stream
of TCP keepalives after the point where the transfer halts:
Bytes Time (s) SRC DST Proto Details
12995 1.637946 hostb hosta RSYNC Server MOTD
86 1.637963 hosta hostb TCP 39115→rsync(873) [ACK] Seq=1 Ack=37686543 Win=23993 Len=0 TSval=2255190023 TSecr=100923227
86 1.638406 hostb hosta TCP [TCP Window Update] rsync(873)→39115 [ACK] Seq=37686543 Ack=1 Win=1688 Len=0 TSval=100923227 TSecr=2255190022
1514 1.641767 hostb hosta RSYNC Server MOTD
86 1.641802 hosta hostb TCP 39115→rsync(873) [ACK] Seq=1 Ack=37687971 Win=23985 Len=0 TSval=2255190024 TSecr=100923228
1381 1.641938 hostb hosta RSYNC Server MOTD
86 1.641955 hosta hostb TCP 39115→rsync(873) [ACK] Seq=1 Ack=37689266 Win=23977 Len=0 TSval=2255190024 TSecr=100923228
86 1.647888 hostb hosta TCP [TCP Window Update] rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=6840 Len=0 TSval=100923230 TSecr=2255190024
86 121.931963 hosta hostb TCP [TCP Keep-Alive] 39115→rsync(873) [ACK] Seq=0 Ack=37689266 Win=24576 Len=0 TSval=2255220097 TSecr=100923230
86 121.932062 hostb hosta TCP [TCP Window Update] rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=8467 Len=0 TSval=100953302 TSecr=2255190024
86 242.248005 hosta hostb TCP [TCP Keep-Alive] 39115→rsync(873) [ACK] Seq=0 Ack=37689266 Win=24576 Len=0 TSval=2255250176 TSecr=100953302
86 242.248090 hostb hosta TCP [TCP Keep-Alive ACK] rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=8467 Len=0 TSval=100983381 TSecr=2255190024
86 362.567974 hosta hostb TCP [TCP Keep-Alive] 39115→rsync(873) [ACK] Seq=0 Ack=37689266 Win=24576 Len=0 TSval=2255280256 TSecr=100983381
86 362.568085 hostb hosta TCP [TCP Keep-Alive ACK] rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=8467 Len=0 TSval=101013461 TSecr=2255190024
86 482.891958 hosta hostb TCP [TCP Keep-Alive] 39115→rsync(873) [ACK] Seq=0 Ack=37689266 Win=24576 Len=0 TSval=2255310336 TSecr=101013461
86 482.892026 hostb hosta TCP [TCP Keep-Alive ACK] rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=8467 Len=0 TSval=101043542 TSecr=2255190024
>From the backtraces it appears as though the sender is waiting for the
checksums the client has sent, while the client is waiting for the next file id
to come in. I can only assume the deadlock is because of the client's checksums
sitting in its Send-Q. The only potentially related part I can find in the
traffic captures are a couple of Zero Window responses by hostb, which however
are followed by window updates with non-zero sizes.
Any ideas or hints what else to look for?
Regards,
Apollon
More information about the rsync
mailing list