I have been working on a problem where a backup jobs fail with NetBackup Error 24. These tend to be intermittent failures with some hosts failing more than others. Recently I took a second look at a host that was frequently failing and this time I got lucky.
Using the long running capture techniques described below I was able to trace down the connection reset. What I found after studying the trace was that SACK (Selective Acknowledgement) was handling the communication about what data needed to be retransmitted and it didn’t work. I don’t know SACK well enough to tell which side failed to respond correctly but it looks like the Win03 (sender) should have provided the missing data (which should still be in its buffer, I think) to the receiver (Solaris). In any case we disabled SACK on the Win03 host and haven’t had a failure since.
Here is the trace with my comments.