Problems with SACK

I have been working on a problem where backup jobs fail with NetBackup Error 24. The failures are intermittent, and some hosts fail more often than others. Recently I took a second look at a host that was failing frequently, and this time I got lucky.

Using the long-running capture techniques described below, I was able to track down the connection reset. What I found after studying the trace was that SACK (Selective Acknowledgement) was being used to communicate which data needed to be retransmitted, and it didn't work. I don't know SACK well enough to tell which side failed to respond correctly, but it looks like the Win03 host (the sender) should have resent the missing data (which should still have been in its buffer, I think) to the receiver (Solaris). In any case, we disabled SACK on the Win03 host and haven't had a failure since.
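For anyone as unfamiliar with SACK as I was, here is a rough sketch of what the sender is expected to work out from the receiver's ACKs. It is not taken from the trace: the function name `sack_holes` and the sequence numbers are made up for illustration, and it ignores sequence-number wraparound. Given the cumulative ACK and the SACK blocks the receiver reports, the gaps between them are the byte ranges the sender should retransmit.

```python
def sack_holes(cum_ack, sack_blocks):
    """Return the byte ranges [start, end) the receiver is still missing.

    cum_ack     -- cumulative ACK number (next byte the receiver expects)
    sack_blocks -- list of (left_edge, right_edge) pairs from the SACK option
    Hypothetical helper for illustration; ignores sequence wraparound.
    """
    holes = []
    next_expected = cum_ack
    for left, right in sorted(sack_blocks):
        if right <= next_expected:
            continue  # block lies entirely below data already accounted for
        if left > next_expected:
            # Gap between what has arrived so far and this SACKed block.
            holes.append((next_expected, left))
        next_expected = max(next_expected, right)
    return holes


if __name__ == "__main__":
    # Receiver has everything up to 1000, plus two out-of-order blocks.
    print(sack_holes(1000, [(2460, 3920), (5380, 6840)]))
    # prints [(1000, 2460), (3920, 5380)]
```

In this made-up example the sender would be expected to resend bytes 1000 to 2459 and 3920 to 5379, which is the kind of retransmission that appeared not to happen correctly in my case.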

Here is the trace with my comments.


