Fwd: backing up to a Drobo CIFS VFS: Error -11 sending data on socket to server, Bufferbloat?

Fri May 12 14:11:05 UTC 2017

On Fri, May 12, 2017 at 6:45 AM, Robert Kudyba <rkudyba at fordham.edu> wrote:
>> Given your kernel log message I would expect seeing the reconnecting happen.
>
> I scheduled a tshark -c over night and was able to get it to run
> within minutes of the reconnect attempts.It seems pretty consistent,
> about 2 hours after the start of the cron the Error -11's start to
> happen. Last night it also happened about 30 minutes into the job:
>
> May 11 23:47:51 localworkstation kernel: CIFS VFS: Server drobo-rh has
> not responded in 120 seconds. Reconnecting...
> May 11 23:49:59 localworkstation kernel: CIFS VFS: Server drobo-rh has
> not responded in 120 seconds. Reconnecting...
> May 11 23:52:57 localworkstation kernel: CIFS VFS: Server drobo-rh has
> not responded in 120 seconds. Reconnecting...
>
> May 12 01:19:44 localworkstation cupsd: REQUEST localhost - - "POST /
> HTTP/1.1" 200 181 Renew-Subscription successful-ok
> May 12 01:21:20 localworkstation kernel: CIFS VFS: sends on sock
> ffff9f2817bede00 stuck for 15 seconds
> May 12 01:21:20 localworkstation kernel: CIFS VFS: Error -11 sending
> data on socket to server
>
> Here's the tshark -r tshark.pcap -Y 'smb||smb2' results from 2 files:
>
>   593 0.273687063 drobo → localserver SMB 117 [TCP ACKed unseen
> segment] Write AndX Response, 65536 bytes
>   666 0.282264835 drobo → localserver SMB 117 Write AndX Response, 65536 bytes
>     8 0.136979937 drobo → localserver SMB 117 [TCP ACKed unseen
> segment] [TCP Previous segment not captured] Write AndX Response,
> 65536 bytes
>    36 0.142752037 drobo → localserver SMB 117 Write AndX Response, 65536 bytes
>    63 0.149017707 drobo → localserver SMB 117 Write AndX Response, 65536 bytes
>
>> I would wrap your backup script like so to get the right capture:
>>
>>     capfile="$(date -Ihours)"
>>     tcpdump -w "$capfile" port 445 &
>>     cappid=$!
>>     sleep 2
>>     .....<your script>....
>>     sleep 2
>>     kill -6 $cappid
>
> OK just did that and started running the backup manually so far I've
> seen variations/multiples of these (cut for easier reading):
> 38735  18.521926 localserver → drobo SMB 10202 Write AndX Request,
> FID: 0x1fca, 65536 bytes at offset 87949312[TCP segment of a
> reassembled PDU]
> 38760  18.527189 localserver → drobo SMB 7306 Write AndX Request, FID:
> 0x1fca, 65536 bytes at offset 88014848[TCP segment of a reassembled
> PDU]
> 38771  18.529817 drobo → localserver SMB 117 Write AndX Response, FID:
> 0x1fca, 65536 bytes
> 44079  19.615349 drobo → localserver SMB 117 [TCP ACKed unseen
> segment] Write AndX Response, 65536 bytes
> 44102  19.626642 drobo → localserver SMB 117 [TCP ACKed unseen
> segment] Write AndX Response, 65536 bytes
> --
> To unsubscribe from this list: send the line "unsubscribe linux-cifs" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

You are seeing some packet drops which might make it hard to figure
out the issue, but those are simply the writes.

What I would look for is places where the writes stop.

You will likely see it better in wireshark by using Statistics->I/O
Graph as it will show you where the throughput rate drops to zero.

Then you can click on the graph at the point where it drops to zero to
get the main window to go to that packet and then you can relate it to
the tshark output.

The earliest message you posted suggests that the client (Linux) was
complaining of no activity for 120 seconds, which suggests that the
Drobo stopped responding at some point. Thus I would expect to see one
or more writes that did not get responses, followed by a connection
drop and reconnect.

-- 
Regards,
Richard Sharpe
(何以解憂？唯有杜康。--曹操)