Problems with SACK

I have been working on a problem where backup jobs fail with NetBackup Error 24. These tend to be intermittent failures, with some hosts failing more than others. Recently I took a second look at a host that was failing frequently, and this time I got lucky.

Using the long running capture techniques described below I was able to track down the connection reset. What I found after studying the trace was that SACK (Selective Acknowledgement) was handling the communication about what data needed to be retransmitted, and it didn't work. I don't know SACK well enough to tell which side failed to respond correctly, but it looks like the Win03 host (the sender) should have retransmitted the missing data (which should still be in its send buffer, I think) to the receiver (Solaris). In any case we disabled SACK on the Win03 host and haven't had a failure since.
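
For reference, on Windows Server 2003 SACK is controlled by the SackOpts registry value; a minimal sketch of disabling it (a reboot is typically required for the change to take effect):

reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v SackOpts /t REG_DWORD /d 0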

Here is the trace with my comments.

Annotated trace screenshot

Finding timed out sessions in the firewall log

I have used this FW log filter to find tcp sessions that have been dropped by the FW due to inactivity: filter for Drops where the Info column contains tcp_flags: PUSH-ACK. Once you find one you can use the ephemeral (source) port number, which is effectively unique, to filter for the session setup. It's the time between setup and drop you want to check; you need to make sure it's greater than the inactivity timer on the FW.

I have seen some sessions torn down by the FW 60 minutes after setup because the application created a number of sessions on startup but doesn't always use them. These all need tcp keepalives to stay up in the FW.

Check SO_KEEPALIVE is being enabled

If you think tcp keepalives are being enabled but want to confirm it, this is relatively easy on Win08 / Win7.

To do this we create an ETW trace using a netsh trace scenario and then view the resulting .etl file in Event Viewer.

You can list the netsh trace scenarios like this:

PS C:\tmp> netsh trace show scenarios

Available scenarios (18):
-------------------------------------------------------------------
AddressAcquisition : Troubleshoot address acquisition-related issues
DirectAccess : Troubleshoot DirectAccess related issues
FileSharing : Troubleshoot common file and printer sharing problems
InternetClient : Diagnose web connectivity issues
InternetServer : Troubleshoot server-side web connectivity issues
L2SEC : Troubleshoot layer 2 authentication related issues
LAN : Troubleshoot wired LAN related issues
Layer2 : Troubleshoot layer 2 connectivity related issues
MBN : Troubleshoot mobile broadband related issues
NDIS : Troubleshoot network adapter related issues
NetConnection : Troubleshoot issues with network connections

Let's start the trace (the trace file grows fast, so don't let it run too long):

PS C:\tmp> netsh trace start scenario=NetConnection

Trace configuration:
-------------------------------------------------------------------
Status: Running
Trace File: C:\Users\neverending\AppData\Local\Temp\NetTraces\NetTrace.etl
Append: Off
Circular: On
Max Size: 250 MB
Report: Off

Now we start our app with keepalives enabled; what we are looking for here is the socket option being set.
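
If you don't have the app handy and just want to generate the event, here is a minimal PowerShell/.NET sketch of a client enabling SO_KEEPALIVE (the host name and port are placeholders):

# create a TCP client and enable SO_KEEPALIVE before connecting
$client = New-Object System.Net.Sockets.TcpClient
$client.Client.SetSocketOption([System.Net.Sockets.SocketOptionLevel]::Socket, [System.Net.Sockets.SocketOptionName]::KeepAlive, $true)
$client.Connect('appserver', 5000)   # placeholder host and port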

Stop the trace (the warning about the NT Kernel Logger is because I have ProcExp running):

PS C:\tmp> netsh trace stop
Correlating traces ... done
Warning: An instance of the 'NT Kernel Logger' is already running.
System information will not be added to the trace file.
Generating data collection ... done
The trace file and additional troubleshooting information have been compiled as "C:\Users\neverending\AppData\Local\Temp\NetTraces\NetTrace.cab".
File location = C:\Users\neverending\AppData\Local\Temp\NetTraces\NetTrace.etl
Tracing session was successfully stopped.

Open the trace in Event Viewer and look for event 1105.

Socket option being enabled
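
If you prefer the command line, Get-WinEvent can read the .etl file directly; a sketch (the -Oldest switch is required when reading .etl files):

PS C:\tmp> Get-WinEvent -Path .\NetTrace.etl -Oldest | Where-Object { $_.Id -eq 1105 } | Format-List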

You could also do a netmon packet capture. With the color troubleshooting filters (these come with the latest parsers) the keepalives are highlighted and are usually easy to find given their regular time interval.

Keepalives and the Firewall

Recently I helped move a customer to a new datacenter. New for the application support guys was that in the DC their app, and its inter-connecting systems, were separated by a real FW. I say "real" FW because the app guys will say: we have always been traversing firewalls, so why doesn't my app work now? The difference in this case was that the DC FW did stateful inspection, but in the old environment router ACLs were used.

The problem that arose was intermittent connectivity failures, which started showing up in various app logs and user tickets. With some packet captures and FW log analysis it was clear that this was a case of long running idle (no packets) tcp sessions getting dropped by the FW after 1 hour of inactivity. The failures were intermittent from the app guys' perspective because they depended on how much user activity there was. I should also add that when we started looking we found that not all connection drops by the FW were detected; some apps just auto-created new connections and were using more and more resources in the process. Check your FW log.

This scenario, connections dropped by the FW, is a common problem. There are a couple of possible fixes. One is to increase the connection timeout parameter on the FW from 1 hour (the default on most) to something like 4 hours, or whatever your app and its activity require. This effectively keeps the tcp connection nailed up for 4 hours after the last packet seen by the FW. This isn't a good solution since it means more memory usage on the FW (the table of known connections isn't trimmed as often), and there is also a security issue (DoS, I think) with increasing the timeout. I keep the FW connection timeout adjustment as a last resort. A better fix is to enable keepalives on the connection.

A keepalive is an empty packet sent at a regular interval when there is no app activity. The empty packet resets the tcp session idle timer on the FW and the session stays alive (not dropped) even though there is no actual application traffic. There are two ways of generating a keepalive packet: the app can do it in its own code, or the app can ask the OS TCP stack to send the keepalive packets on the app's behalf using the keepalive socket option – SO_KEEPALIVE. This is easy to see in the putty app as it supports both methods.

PuTTY keepalive settings (screenshot)

But wait, there is more. When you configure your app to use SO_KEEPALIVE, the keepalive interval comes from the OS config, and this is typically 2 hours by default. So after two hours the keepalive packet is sent, but the FW has already dropped the connection. You need to change the interval to something like every 15 minutes. The 2 hour default originates in some ancient RFC, back in a time when bandwidth was very limited and these keepalive packets were considered unnecessary traffic.
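
On Windows the 2 hour default lives in the KeepAliveTime registry value (in milliseconds); a minimal sketch of setting it to 15 minutes (a reboot is typically required for the change to take effect):

reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v KeepAliveTime /t REG_DWORD /d 900000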

Bottom line: tell the app guy to enable keepalives. Also tell them to search the app vendor's KB articles for keepalive and firewall before coming back with the "it's not supported" answer. This is a common issue and most vendors have customers that experience it.

Fast filtering netmon captures

Last week we talked about setting up a long running capture. In my case I had thousands of files in a few days. Fortunately I had an extra 500GB disk to put them on. But now came the challenge of finding what I was looking for in those files.

I had the time of the error, so I would open a few captures around that time and look for an error. By error I mean a tcp reset. In netmon that is easy to find: you can pick each tcp session from the tree on the left and go to the last frame; if you see a tcp reset with no FIN, then bingo. BTW, be sure to have the troubleshooting color filters on, as this makes finding problem packets easy. I think the latest parser package comes with some good color filters.

In my case it was pretty straightforward: there was a series of retransmits and then a connection reset, which is what you expect after five unanswered retransmits. The problem now was how to find and re-assemble the entire tcp session so I could see what was negotiated in the tcp setup. This is fast and easy with netmon blob filters.

Using nmcap with blob filters, the capture files can be searched in a couple of seconds. A blob filter is a hex pattern of a given length at a certain offset. So, let's assume that the ephemeral port number in the tcp session that was reset is 53487, or in hex 0xD0EF. My filter then would be Blob(FrameData, 34, 2) == 0xD0EF OR Blob(FrameData, 36, 2) == 0xD0EF, where 34 is the offset of the tcp source port in a plain Ethernet + IPv4 frame (14 byte Ethernet header + 20 byte IP header) and 36 is the offset of the destination port, so the two clauses catch each direction of the session. The other number, 2, is the length. The filter is easy to build up using the Frame Details and Hex Details panels in netmon.

Once you have your blob filter (tested in the gui) you are ready to run it. Let's assume you are interested in looking through files capfile(2735) to capfile(2765), 31 files. Then with the help of a batch for loop you would use something like this:

for /l %i in (2735,1,2765) do nmcap /UseProfile 2 /inputcapture "e:\cap\capfile(%i).cap" /disableconversations /capture "Blob(FrameData, 34, 2) == 0xD0EF OR Blob(FrameData, 36, 2) == 0xD0EF" /file c:\tmp\error24\Oct25\Oct25_1132pm_port53487_%i.cap:2G

Now I have a series of small files with just the traffic to/from tcp port 53487. With this cmd I can stitch the files together to get one cap file that I can open in the netmon gui.

nmcap /UseProfile 2 /disableconversations /inputcapture Oct25_1132pm_port53487_(2762).cap Oct25_1132pm_port53487_(2763).cap Oct25_1132pm_port53487_(2764).cap Oct25_1132pm_port53487_(2765).cap /capture /file allPort53487_Oct25_1132pm.cap:2G

If I have a lot of files to stitch together then I usually put the dir /b output of the files into Word and do a find on ^p and replace with ^s to remove all the hard returns. That puts them on a single line, ready for pasting into my cmd.
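
If PowerShell is handy, a one-liner can do the same join (the path is a placeholder; clip puts the result on the clipboard):

PS C:\tmp> (Get-ChildItem c:\tmp\error24\Oct25\*.cap | ForEach-Object Name) -join ' ' | clip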

You do need a parser enabled for the Blob filter to work. I have used the High Performance one (profile 2 above); the Pure Capture profile won't work.

Blob filters are great. I also use them for wildcarding by changing the length parameter. For example, matching an IP address with a length of 3 instead of 4 will give you all the traffic from that /24 subnet.
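
For instance, something like this should match all traffic to or from 10.1.5.0/24 (a made-up subnet), assuming plain Ethernet + IPv4 framing where the source address starts at frame offset 26 and the destination address at 30:

Blob(FrameData, 26, 3) == 0x0A0105 OR Blob(FrameData, 30, 3) == 0x0A0105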

Most of this I learned from the Netmon blog.

http://blogs.technet.com/b/netmon/archive/2010/08/05/using-high-performance-filtering.aspx

http://blogs.technet.com/b/netmon/archive/2009/09/09/chained-captures-and-stitching-them-back-together.aspx

 

Long running captures with netmon

Occasionally you need to set up your packet capture to run for days to catch that elusive intermittent problem. I recently worked on a backup job failure problem that appeared to be related to a network error. The error occurred about once a week; fortunately it's quite easy to use netmon to make long running captures on Windows.

The best tool for this is nmcap, the cmd line version of netmon – it's installed with netmon. You will of course need the appropriate amount of disk space too (make some trial runs to get an estimate). Basically I usually set up nmcap to create a series of 64MB files, start it with a scheduled task, and keep either the entire frame or only the headers depending on available disk and what I am looking for.

For example something like this:

nmcap /network 1 /useprofile 1 /mindiskquota 200M /DisableLocalOnly /DisableConversations /MaxFrameLength 68 /capture  /file e:\cap\captureFile.chn:64M

You will need to specify your network (nmcap /displaynetwork) and I always use the pure capture profile (nmcap /displayprofiles). I prefer to capture all the traffic (no capture filter) and then filter it later: this has less overhead while capturing, is less likely to drop packets, and if you need some related traffic you didn't think you would need (DNS for example), you have it. This last point is important because you have to wait another week if you don't have what you need. The switches in this example are pretty self-explanatory: DisableLocalOnly puts the NIC in promiscuous mode (capture all the traffic it sees, not just broadcast and traffic addressed to the NIC), the .chn:64M file spec creates a series of 64MB files, and MaxFrameLength 68 captures only the first 68 bytes of each frame.

A couple of tips. Make sure the time on your netmon host is synch'd with the system that reports the error, so if NetBackup writes an error at 1:11:35AM you have an idea where to find it. Also be aware that if you are using a Scheduled Task to run nmcap, stopping the task (capture) will corrupt the current cap file. Basically that means don't stop the Task until you are sure the error has been written out to a completed capture file.
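
For reference, a sketch of creating such a task with schtasks, reusing the nmcap line from above (the task name is a placeholder; getting the quoting of the nmcap command line right is the fiddly part):

schtasks /create /tn NetmonCapture /tr "nmcap /network 1 /useprofile 1 /mindiskquota 200M /DisableLocalOnly /DisableConversations /MaxFrameLength 68 /capture /file e:\cap\captureFile.chn:64M" /sc onstart /ru SYSTEM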

The Tools

Here are my favorite tools.

  • MS netmon 3.4 – This is a great tool because it makes it so easy to view TCP sessions. It is also easy to filter with and to do long running captures.
  • Wireshark – I typically use Wireshark for converting tcpdump files into netmon format. It is also good for identifying lower level errors – IP or ARP for example. And it has parsers for Oracle TNS, so when working on Oracle DB problems load your capture file here for hints. Stumped? Load your cap file in both Wireshark and netmon; they both bring value.
  • MS Research TCP Analyzer – this tool is an add-on to netmon and is great for viewing the flow control of a tcp session. Good for perf analysis and getting an understanding of the flow rate; this is helpful for telling whether the sender is throttling because it detected network congestion, the app doesn't have any data to send, or the receiver can't eat the data fast enough.
  • MS TCP Analyzer – yes, there are two tools from MS called TCP Analyzer; this one comes in the Windows SDK. This is a handy standalone, xcopy-install tool that uses the estats mib data in Vista and above. It's easy – no net capture required and you get all the tcp flow control info. Great for repeatable tests. This is where you start when you are trying to copy a large file across the WAN and aren't getting the throughput you expect.
  • Netstat – yes, the one built into the OS. This is handy when the app guy calls and suspects he has a connectivity problem. No tool to install, app guys can run it, no admin priv required. I tell them to run this cmd – netstat -an 3 | findstr "SYN" – then try their connection. If they see the TCP SYN then it's blocked and can't move to the established state. You also get to see the IP used in the connection, something not always obvious in a NAT and multi-interface env.
  • MS Resource Monitor aka resmon – The network tab is pretty handy and provides a lot of info, more than you can get with perfmon. It shows Process – TCP session – throughput. Plus you can use the Disk tab to see where the process is writing the data.