Wireshark + Rockwell

WillM

I have been having ongoing network issues for a few weeks now where we lose comms from an L71 Processor to the following...

1734-AENTR Remote IO Rack
1756-ENBT + L61 Processor

Comms recover immediately and all that's left is a trail of alarms on the SCADA highlighting the above errors. The alarms are triggered via the fault status values of each card.

After checking all the relevant web pages and the network we have rectified some problems but we are still getting this comms issue. Rockwell have suggested we use Wireshark to analyse the data.

It's all fairly new to me but I was able to capture the traffic between the L71 and L61 today. One of the first things that jumps out is "Forward Close", "Forward Opens" and "Connection Failures" around the time of the issue.

Can anyone advise on what the next step in analysing these packets would be?

Thanks in advance (y)
Will.
 

You most certainly need "I/O OK" and no warning triangles in your I/O Configuration.

That will rule out I/O module connection failures, and bad Produce/Consume tag configurations.

Are you using messaging? Are they all succeeding (.DN)? Are you re-triggering them immediately?

Also check the RPIs of your I/O. Setting these too fast can cause communications issues.
 
Also use the diagnostics pages of the ENBT/EN2T web pages (just type the relevant IP address into your web browser).

Check that you don't have excessive "connections".
 
One of the first things that jumps out is "Forward Close", "Forward Opens" and "Connection Failures" around the time of the issue.

Can anyone advise on what the next step in analysing these packets would be?
What precedes the Forward Close? If you filter to that IP address, is there a long delay (about 20 seconds or so) before a successful packet is transmitted?
 
One of the first things that jumps out is "Forward Close", "Forward Opens" and "Connection Failures" around the time of the issue.

What you're seeing are probably symptoms, not necessarily the underlying problem. When ControlLogix modules see a timeout condition they try to close the failed connection and open a new one.

So as Archie suggests, you need to look upstream of those captured messages for evidence of the problem.

One technique I like is to filter down to just the I/O traffic being produced by an I/O device and change the time display to "time between packets". That helps you visualize delays and inconsistencies in network performance.

You can also look at the transaction ID value for CIP I/O connections, and look for skipped transaction ID numbers. That takes some good eyes (or some script automation) but can also help you find missing packets.
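
If you want to try the script-automation route, here's a minimal sketch (Python 3 calling tshark, which must be on the PATH; "capture.pcapng" is a placeholder file name, and the enip.cpf.sai.* field names are the Wireshark fields for the Sequenced Address Item):

import subprocess

# Sketch: dump the connection ID and sequence number of every sequenced CIP I/O
# packet, then report jumps in the numbering per connection. Assumes tshark is
# on the PATH, one sequenced item per frame, and "capture.pcapng" as a
# placeholder file name.
out = subprocess.run(
    ["tshark", "-r", "capture.pcapng", "-Y", "enip.cpf.sai.seq",
     "-T", "fields", "-e", "frame.number",
     "-e", "enip.cpf.sai.connid", "-e", "enip.cpf.sai.seq"],
    capture_output=True, text=True, check=True).stdout

last_seq = {}                                  # connection ID -> last sequence number
for line in out.splitlines():
    frame, connid, seq = line.split("\t")
    seq = int(seq, 0)                          # decimal or 0x-prefixed hex
    if connid in last_seq and seq > last_seq[connid] + 1:
        print(f"connection {connid}: {seq - last_seq[connid] - 1} packet(s) "
              f"missing before frame {frame}")
    last_seq[connid] = seq

Any connection ID that reports missing packets is then worth filtering on by hand in Wireshark.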
 
I have had similar problems and it turned out to be too many CIP connections. Check the embedded webpage as others have suggested. I found out through much help and reading that some devices consume more than one connection (a PV+ is 2, I think). We were right at 32 connections, and every time we started a temporary process or one of us went online we would fill our WW error log with lost-connection faults.
 
Apologies for the late response to this, thanks for the replies.

The fault is still ongoing and I'm looking into it again. We've had multiple contractors look into it and have tried some solutions with no joy.

Firstly, everything on the web pages appears to be OK. Media counters, CPU usage and connection limits are fine. The only obvious sign of a problem is where the connection re-established itself X hours ago.

So I'm back to looking over the Wireshark sample. What I'm noticing is that packets are transmitting fine until I get a Forward Close (packet 180218); shortly after there is a "[TCP Retransmission] Forward Close", and this is where I start to see significant delays.
It's a full 10 seconds before communication re-establishes itself.

I've attached a screenshot to help show it better.
- 10.0.40.200 is the Master PLC which is alarming
- 10.0.105.0 is the ENBT it's trying to communicate with

Also, logging on the historian shows that the fault code generated is the common 16#0203 (connection timed out).

Cheers,
Will.
 
What about the "Bridged Connections" tab on the ENBT's web page? Look for "Missed Packets", these should be zero. I've had to troubleshoot a lot of these similar issues recently and from my experience, any missed packets have almost always been a result of mismatched baud rate settings. In rare instances, a faulty cable has been an issue but there were other symptoms as well (FSC errors).

Knowing the network architecture can be helpful.

I'll give an example of something I ran into recently: 12 PowerFlex drives all connected to an unmanaged switch, with one cable connecting this unmanaged switch to a managed switch. The port it was connected to on the managed switch was set to forced 100 Mb Full Duplex. All the drives but two were set the same; the two that were different were set to Autobaud. Lots of missed packets were the result, along with periodic drop-outs. Setting everything to autobaud cleared it all up. No more missed packets and no more dropped connections.
 
The screenshot is interesting, but the actual Wireshark data is required to do real investigation. If you can save twenty seconds of all the data before and after the failure, that would be ideal.

The fact that there are TCP retransmits (black with red text) suggests that there is a physical loss of packets going on.

With these IP filters on, we don't see the failure of the connection between the 1756-L71 and the 1734-AENT, so we don't know what device failed first.

Are there Produced/Consumed Tags between the L61 and the L73? Is any of the I/O in the remote chassis with the 1756-L61 and 1756-ENBT module shared or owned by the 1756-L73?

It's interesting that 10.0.105.0, the remote 1756-ENBT with the 1756-L61 in the chassis, is the one that sends the Forward_Close command. Generally the connection initiator is the one that opens and closes the connections, and that's generally the "Scanner" for I/O or the "Consumer" for Produced/Consumed Tags.
 
What about the "Bridged Connections" tab on the ENBT's web page? Look for "Missed Packets", these should be zero. I've had to troubleshoot a lot of these similar issues recently and from my experience, any missed packets have almost always been a result of mismatched baud rate settings. In rare instances, a faulty cable has been an issue but there were other symptoms as well (FSC errors).

Knowing the network architecture can be helpful.

I'll give an example of something I ran into recently. 12 Powerflex drives all connected to an unmanaged switch. One cable connecting this unmanaged switch to a managed switch. The port it was connected to on the managed switch was set to forced 100MB Full Duplex. All the drives but 2 were set the same. The two that were different were set to Autobaud. Lots of missed packets was the result and periodic drop-outs. Setting everything to autobaud cleared it all up. No more missed packets and no more dropped connections.

Some interesting thoughts there that I'll investigate when I'm back on site in the morning.

I have run into that before with 5/05 Ser B's, which can only be set to 10 Mbps. We configure everything to Auto Negotiate, but something may be incorrect on this network.

The architecture is a linear network with 3 managed Cisco switches, spanning to unmanaged Hirschmann switches. 22 processors with light communication via Produce & Consume or messaging. I'll get some condensed drawings up tomorrow highlighting where the problems lie.

Thanks.
 
The screenshot is interesting, but the actual Wireshark data is required to do real investigation. If you can save twenty seconds of all the data before and after the failure, that would be ideal.

It's interesting that 10.0.105.0, the remote 1756-ENBT with the 1756-L61 in the chassis, is the one that sends the Forward_Close command. Generally the connection initiator is the one that opens and closes the connections, and that's generally the "Scanner" for I/O or the "Consumer" for Produced/Consumed Tags.

I'll PM you a link to the full capture if you're willing to give it a look over :)

The 1734-AENTR is on another port of the managed switch, so I'm not sure how useful the trace will be for determining info on that.

Between the L71 and L61 there is only Produce and Consume data both ways and no I/O ownership. Could that also be why the L61 closes the connection, as it's two-way?

Thanks.
 
Fun Wireshark / EIP trick of the night: Sequenced IO Objects


When two devices in the RA world have a cyclic I/O connection, like a Logix/ENBT and an AENT adapter, you can imagine the data exchange like a water balloon fight: each side is throwing data at the other one at a rate determined by the RPI.

When it's a unicast connection, the water balloons are aimed carefully; the destination address is the IP of the target, e.g. 10.0.40.200.

When it's a multicast connection, the water balloons are aimed in the general direction of the target, with a Multicast address, e.g. 239.192.33.0.

Each one of these water balloons has a number on it, called a Sequenced Address Item number.

When the water balloon fight commences, everybody starts with Balloon 0, and increments by 1 every time they throw a balloon.

You can estimate how long a connection has been running by multiplying the SAI number by the RPI. In the screenshot posted previously we see some Sequence Numbers around 32966, at what looks like an RPI of 200 milliseconds, meaning the connection has been running for about 1 hour 49 minutes.

Because we know that every cyclic I/O connection begins with a packet with sequence number 0, we can look for newborn connections with this Wireshark filter:

enip.cpf.sai.seq == 0

In this capture, we see 17 different connections break and re-make between 4 different devices over a 9-second period starting at 241 seconds into the capture.
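
As a rough way to tally those newborn connections without paging through the capture by hand, something like this works (Python 3 calling tshark, which must be on the PATH; "capture.pcapng" is a placeholder file name):

import subprocess
from collections import Counter

# Sketch: list every "newborn" I/O connection (sequence number 0) and count
# them per sending device.
out = subprocess.run(
    ["tshark", "-r", "capture.pcapng", "-Y", "enip.cpf.sai.seq == 0",
     "-T", "fields", "-e", "frame.time_relative", "-e", "ip.src"],
    capture_output=True, text=True, check=True).stdout

births = Counter()
for line in out.splitlines():
    when, src = line.split("\t")
    births[src] += 1
    print(f"{float(when):8.3f}s  new connection from {src}")

print("re-opens per device:", dict(births))

Any device that keeps appearing in that list is repeatedly re-opening its connections.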
 
Okay, Will, check out these Wireshark filters.

They're going to show you what I think are Produced Tags from 10.0.40.200 (your 1756-L73) which are consumed by 10.0.105.0 (your 1756-L61).

enip.cpf.sai.connid == 0x007dba03
enip.cpf.sai.connid == 0x007db802
enip.cpf.sai.connid == 0x007dbe07
enip.cpf.sai.connid == 0x007dc013

The size and RPI of those, respectively, are:

62 bytes, 0.500 seconds
110 bytes, 0.200 seconds
122 bytes, 0.100 seconds
170 bytes, 0.100 seconds

When you look at the first one, set the Wireshark Time display to "time since last displayed packet".

And whammo... packet number 182936, a full 5.5 seconds after the previously seen packet. The proverbial smoking gun. 4x the RPI with no data is a timeout.

What that's like is when you suddenly stop getting hammered with water balloons in your water balloon fort. You expected a water balloon every 0.500 seconds, but it's been more than four times that long. So you stand up and yell "CIP Connection Forward Close !" and get pasted in the face with a water balloon.

Why does this happen? I don't know. But it's getting more interesting.

When you use this Wireshark filter:

frame.len == 62 && ip.src == 10.0.40.200

You'll see, at packet 182936 where the long delay is, that there's a gap in the captured Sequence numbers. They jump from Sequence ID 13196 to 13207, meaning that ten packets in between were sent by the L73 and never seen on the network.

What's interesting is that the CIP Forward Close is sent at packet number 180218, so it's after the timeout has occurred from the perspective of the L61, but before those last two straggler packets arrive from the L73.
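
If you want to repeat that "time since last packet" check across every connection at once rather than one filter at a time, here's a rough sketch (Python 3 calling tshark; "capture.pcapng" is a placeholder, and the RPI of each connection is estimated from its typical packet spacing rather than read from the controller):

import statistics
import subprocess
from collections import defaultdict

# Sketch: for every CIP I/O connection in the capture, estimate its RPI from
# the typical packet spacing, then flag any gap longer than 4x that spacing
# (the timeout rule of thumb discussed above).
out = subprocess.run(
    ["tshark", "-r", "capture.pcapng", "-Y", "enip.cpf.sai.connid",
     "-T", "fields", "-e", "frame.number", "-e", "frame.time_epoch",
     "-e", "enip.cpf.sai.connid"],
    capture_output=True, text=True, check=True).stdout

times = defaultdict(list)          # connection ID -> [(frame, timestamp), ...]
for line in out.splitlines():
    frame, when, connid = line.split("\t")
    times[connid].append((frame, float(when)))

for connid, samples in times.items():
    gaps = [(f2, t2 - t1) for (_, t1), (f2, t2) in zip(samples, samples[1:])]
    if not gaps:
        continue
    rpi = statistics.median(g for _, g in gaps)   # rough estimate of the RPI
    for frame, gap in gaps:
        if gap > 4 * rpi:
            print(f"frame {frame}: connection {connid} silent for {gap:.3f}s "
                  f"(typical spacing {rpi:.3f}s)")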
 
Just a thought, are the two processors connected to each other via wireless and/or a managed switch? Is it possible that there is a device in the network that is losing packets?
 
Just a thought.....

You are using 10.0.105.0 IP for the L61

I would check that there aren't any devices on the network with a 255.255.255.0 SubNet Mask.

With 255 in the third octet of the mask, a 0 in the fourth octet is the network address of the 10.0.105.xxx subnet, so many devices won't treat it as an ordinary host address.
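
A quick way to see that for yourself, purely as an illustration of the addressing (standard Python, nothing queried from the actual network):

import ipaddress

# Illustration only: with a 255.255.255.0 mask, 10.0.105.0 is the network
# address of the 10.0.105.x subnet rather than an ordinary host address.
net = ipaddress.ip_network("10.0.105.0/24")
print(net.network_address)                          # 10.0.105.0
print(net.broadcast_address)                        # 10.0.105.255
print(list(net.hosts())[0], list(net.hosts())[-1])  # 10.0.105.1 10.0.105.254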
 
