Problems with SACK

I have been working on a problem where backup jobs fail with NetBackup Error 24. These tend to be intermittent failures, with some hosts failing more often than others. Recently I took a second look at a host that was failing frequently, and this time I got lucky.

Using the long-running capture techniques described below, I was able to track down the connection reset. Studying the trace, I found that SACK (Selective Acknowledgement) was being used to signal which data needed to be retransmitted, and that recovery did not work. I don't know SACK well enough to say which side failed to respond correctly, but it looks like the Win03 host (the sender) should have resent the missing data (which should still be in its buffer, I think) to the receiver (Solaris). In any case, we disabled SACK on the Win03 host and haven't had a failure since.
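To make the idea concrete, here is a rough Python sketch of how a sender could work out which bytes are missing from the cumulative ACK plus the SACK blocks the receiver reports. This is only an illustration of the concept, not what either TCP stack actually does: the function name `missing_ranges` and the sequence numbers are made up, and the sketch ignores sequence-number wraparound.

```python
def missing_ranges(cumulative_ack, sack_blocks):
    """Return the gaps (start, end) the receiver has not acknowledged.

    cumulative_ack: next byte the receiver expects (the normal ACK number).
    sack_blocks: list of (left_edge, right_edge) ranges received out of order.
    Assumption: sequence numbers do not wrap around in this sketch.
    """
    holes = []
    expected = cumulative_ack
    for left, right in sorted(sack_blocks):
        if left > expected:
            # Bytes between what was expected and this block never arrived.
            holes.append((expected, left))
        expected = max(expected, right)
    return holes


# Hypothetical numbers: receiver ACKs 1000 and SACKs two later blocks.
holes = missing_ranges(1000, [(2460, 3920), (5380, 6840)])
print(holes)  # [(1000, 2460), (3920, 5380)]
```

In a healthy exchange the sender would retransmit exactly those gaps from its send buffer. In my trace, that is the step that appears not to have happened.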

Here is the trace with my comments.

[Image: packet trace with comments]
