MSG timeout - wireshark shows constant retransmissions

ASF

Hi all,


Got a customer with a control system architecture like this:
[network architecture diagram]


The L71 uses explicit messaging to read and write to and from the ML1400. It's been in place and working fine for about 4 years. The L71 also talks to one other ML1400 with more or less identical architecture, and about 5-6 CompactLogix/ControlLogix PLC's using produced/consumed tags, again with similar physical infrastructure. The ML1400 doesn't talk to anything else.

About 4 months ago, they started getting sporadic comms dropouts between the L71 and the ML1400. There seemed to be the occasional dropout to the other ML1400 (not shown) as well, but none to the PLC's on produced/consumed, and the ML1400 in the picture was definitely the main culprit. It got worse and worse until I eventually had them disconnect the uplink between the Stratix 5700 and the unmanaged switch in the ML1400 cabinet, and string a patch cable directly between the two unmanaged switches, just to get them out of immediate trouble. That worked, and they've been running like that for about a month.

I finally got an opportunity to go out there and do some proper diagnostics. I re-patched the network as drawn above, and everything worked (as I mentioned, it was an intermittent problem). I connected my laptop to the stratix as shown, and set it to port mirror the port going to the ML1400, and then the port going to the L71. I took a wireshark capture of each port.

The port going to the ML1400 seemed normal. I'm very inexperienced with wireshark, so I can't be certain, but everything looked OK.

The port going to the L71 looked OK until I filtered it for traffic to/from the problem ML1400 only. When I did that, every single entry was followed by a retransmission. Every single one. This whole time, the comms was working just fine, but retransmissions abound!

Here's what I don't get. If we had a cabling problem from the stratix to the micrologix, I'd expect to see that sort of symptom - try to transmit, receive no response, try again. But I'd expect to see that symptom on the port going to the micrologix, not the port going to the L71.

If we had a cabling problem on the port going to the L71, I'd expect to see that sort of symptom - but I'd expect to see it on all devices, or at least, more than one. Filtering for all other devices shows no such thing.
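In case it helps anyone check my working: the display filter I used was basically "ip.addr == 192.168.96.70 && tcp.analysis.retransmission", and then the same thing negated to check the other devices. Something like the rough pyshark sketch below does the equivalent per-peer count from a saved capture (the filename is just a placeholder and this isn't literally what I ran; pyshark needs tshark installed):

```
# Rough sketch: count Wireshark-flagged TCP retransmissions per peer of the L71.
# Assumes the mirror-port capture was saved as l71_port.pcapng (placeholder name)
# and that tshark is on the PATH (pyshark drives tshark under the hood).
import collections
import pyshark

L71 = "192.168.96.11"

cap = pyshark.FileCapture(
    "l71_port.pcapng",
    display_filter="tcp.analysis.retransmission",
)

per_peer = collections.Counter()
for pkt in cap:
    if not hasattr(pkt, "ip"):
        continue
    src, dst = str(pkt.ip.src), str(pkt.ip.dst)
    # Count the retransmission against whichever end isn't the L71
    per_peer[dst if src == L71 else src] += 1
cap.close()

for peer, count in per_peer.most_common():
    print(f"{peer}: {count} retransmissions")
```

In my trace, 192.168.96.70 (the ML1400) is the only address that shows up at all.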

Is there anyone more knowledgeable than I able to shed some light on what I'm seeing?

The L71 is 192.168.96.11, and the ML1400 is 192.168.96.70. Here's the capture from the port going to the ML1400:
[screenshot: Wireshark capture from the ML1400 port]


Here's the capture from the port going to the L71:
[screenshot: Wireshark capture from the L71 port]


 
What that looks like to me is that the MicroLogix port has been set to have its traffic mirrored on the ControlLogix port, and then the ControlLogix port has been set to mirror its traffic to your monitoring port.

Or, one of the ports has been misconfigured as a VLAN trunk, or something else that would end up with it trying to carry both its own traffic and the traffic of another port.

My reading of the capture is that it can't be an actual re-transmission by the ControlLogix, as the IP source addresses for the packets are both the ControlLogix and the MicroLogix.
 
EtherNet Media Counters...

What I believe we are looking at here is the traffic through the switch ports. So we see packet data both ways...

Note: Unknown Service (0x4b) = Execute PCCC Service (SLC Typed Read/Write)

ML1400 Switch Port Trace...

L71 > ======== < ML1400

Protocol¦Length¦Service > = < Length¦Reply

CIP¦117¦PCCC > = < 189¦"Success" Reply
CIP¦197¦PCCC > = < 109¦"Success" Reply

TCP¦60¦Keep Alive > One Way

CIP¦117¦PCCC > = < 113¦"Success" Reply
CIP¦121¦PCCC > = < 109¦"Success" Reply

TCP¦60¦Keep Alive > One Way

CIP¦117¦PCCC > = < 189¦"Success" Reply
CIP¦197¦PCCC > = < 109¦"Success" Reply

TCP¦60¦Keep Alive > One Way

CIP¦117¦PCCC > = < 113¦"Success" Reply
CIP¦121¦PCCC > = < 109¦"Success" Reply

...and so on...

This suggests the PCCC command requests are exchanging OK between the Stratix 5700 and the ML1400, as you have understood to be the case.

===================================

L71 Switch Port Trace...

L71 > ======== < ML1400

Protocol¦Length¦Service > = < Length¦Reply

CIP¦117¦PCCC > = Failed?...
CIP¦117¦PCCC > = Retransmission...Success
CIP¦117¦PCCC > = < 113¦"Success" Reply - Failed?...
CIP¦117¦PCCC > = < 113¦"Success" Reply - after Retransmission
CIP¦121¦PCCC > = Failed?...
CIP¦121¦PCCC > = Retransmission...Success
CIP¦121¦PCCC > = < 109¦"Success" Reply - Failed?...
CIP¦121¦PCCC > = < 109¦"Success" Reply - after Retransmission

TCP¦60¦Keep Alive > One Way

...and so on...

This suggests that the PCCC command requests are not exchanging OK between the L71 and the Stratix 5700. There are retransmissions in both directions.

I would look to the 1756-ENBT and Stratix 5700 EtherNet Media Counters, which you can access via their built-in web browser. These counters are another good software "port" of call when suspecting EtherNet/IP communications issues. Look specifically for Alignment, FCS or Collision errors, but anything that sticks out really.

A prime suspect on the list here would be a Baud Rate/Duplex mismatch (10 Mbps Half <>100 Mbps Full, etc.). Also, once you have unmanaged switches in the mix you should have all ports configured to Auto Negotiate. The 1756-ENBT and Stratix 5700 are default Auto Negotiate out-of-the-box. But perhaps one of them was or has been forced?

Regards,
George
 
Thanks for the responses.


Just checked, and all ports on the switch, and the ENBT, are set to auto negotiate and are running at 100Mbps Full Duplex. I've checked the port settings for anything that looks like what Ken suggested might be happening, but can't see anything obviously out of the ordinary. Error counters on both the switch and the ENBT show no errors at all. It's worth noting that I flashed the firmware on the ENBT earlier that day, from 6.004 to 6.006 - but the wireshark traces showing all the retransmissions were taken after that.


It's bugging me that the issue appears to be between the ENBT and the Stratix - and yet, this issue is only presenting on one PLC out of 8 or 10 PLC's that this L71 talks to. Many of the other PLC's talk via produced/consumed tags, which should flag a timeout much, much faster than the 30 second timeout of a MSG instruction. But they haven't missed a beat this whole time. I've also filtered the wireshark trace by retransmissions to/from any address other than the ML1400, and it comes up blank. There are zero retransmissions to any other PLC.


Is it possible that something to do with the way I trigger my MSG instructions could cause this sort of response? Triggering too fast, or a mis-configuration somewhere in the PLC?


Otherwise, I guess my next option is to swap ports on the Stratix with one of the other ML1400's and see if the problem follows the PLC or the port. That's just difficult because I don't get many opportunities to take this line down to do tests like that.
 
If my guess about a mis-configured port is correct, then putting the MicroLogix 1400 on a different port should change the behavior.

Unless there's a group of ports and this is a VLAN trunk misconfiguration... troubleshooting while a system is operating is a challenge!
 
What's the 5700 actually doing? Also, I would go back to that first step and check the MSG code. Are there any mechanisms in place to re-issue the instructions automatically? Are the MSG instructions only being triggered once?
 
What's the 5700 actually doing? Also, I would go back to that first step and check the MSG code. Are there any mechanisms in place to re-issue the instructions automatically? Are the MSG instructions only being triggered once?

That's a good point. Maybe it's a PLC timing issue instead of a networking issue.
 
Frames identified by Wireshark as TCP retransmissions are literally the same frame as another recent one, byte-for-byte.

There is no chance that these are re-tries of a message by the 1756-L71/EN2T. There's no timeout or error reply involved, and any subsequent message would at the very least have a new transaction sequence ID number.

I'll bet you every dollar in my pocket that those packets are being shown to Wireshark by the switch, not by one of the devices involved.

If I were troubleshooting this, I would put a sniffer on the link between the switch and the ControlLogix; I have a Frontline Test Equipment intercept box for exactly this purpose. It would be instructive to see if those TCP retransmissions actually appear on the wire or if they are an artifact of the mirror function inside the switch.
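If you want to confirm the byte-for-byte part without any extra hardware, a quick pass over the saved capture will show it. A rough scapy sketch (illustrative only; the filename is a placeholder, and this isn't a tool I actually use) that flags frames which are exact copies of one seen a few frames earlier:

```
# Rough sketch: flag frames that are exact, byte-for-byte copies of a frame
# seen shortly before them in the capture.
from scapy.all import rdpcap  # pip install scapy

packets = rdpcap("l71_port.pcapng")  # placeholder filename

recent = {}   # raw frame bytes -> index where last seen
WINDOW = 10   # only call it a copy if the original was this close by

for i, pkt in enumerate(packets):
    frame = bytes(pkt)            # the entire frame, headers and all
    seen_at = recent.get(frame)
    if seen_at is not None and i - seen_at <= WINDOW:
        print(f"frame {i} is an exact copy of frame {seen_at}")
    recent[frame] = i
```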
 
Frames identified by Wireshark as TCP retransmissions are literally the same frame as another recent one, byte-for-byte.

There is no chance that these are re-tries of a message by the 1756-L71/EN2T. There's no timeout or error reply involved, and any subsequent message would at the very least have a new transaction sequence ID number.

I'll bet you every dollar in my pocket that those packets are being shown to Wireshark by the switch, not by one of the devices involved.

If I were troubleshooting this, I would put a sniffer on the link between the switch and the ControlLogix; I have a Frontline Test Equipment intercept box for exactly this purpose. It would be instructive to see if those TCP retransmissions actually appear on the wire or if they are an artifact of the mirror function inside the switch.

I've never run Wireshark in this position. I always run it on the Windows server that the actual Windows service is running on. I see what you mean.
 
This sure looks like a span or mirror port setup issue. One way to tell if the suspected retransmissions are real (i.e. the sending host has timed out and retransmitted the data) or if the network is replicating the frames is to look at the ip.id field:

[screenshot: the IP identification (ip.id) field in a packet capture]

This typically varies with every IP packet sent by a host; it is a 99% test - most stacks increment this with every frame sent, but technically it's not required, so some don't. It's easy to test, though - just look, and if the field appears to be incrementing when communications are healthy, then you know the host's behaviour. Anyway, in this case, a true TCP retransmission from a host will have an incremented ip.id field; the transport layer (TCP) sits above the network layer (IP), so as far as IP is concerned, it's a new packet. If incrementing is in use and this field is NOT incremented, we can conclude that the network is replicating the packet. If it is incremented, you would know that the host really sent the data again. This also implies that, in the case of a true retransmission, the whole frame is not a byte-for-byte copy - the id increments, and so the IP header checksum changes as well.
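If you would rather script that check than eyeball it, here is a rough scapy sketch (illustrative only; the filename is a placeholder) that finds back-to-back copies of the same TCP segment and compares their ip.id values:

```
# Rough sketch: for consecutive packets carrying the same TCP segment, compare
# the IP identification field. Same ip.id -> the network replicated the frame;
# a different (typically incremented) ip.id -> the host really resent the data.
from scapy.all import rdpcap, IP, TCP  # pip install scapy

packets = [p for p in rdpcap("l71_port.pcapng") if IP in p and TCP in p]

for prev, cur in zip(packets, packets[1:]):
    same_segment = (
        prev[IP].src == cur[IP].src
        and prev[IP].dst == cur[IP].dst
        and prev[TCP].seq == cur[TCP].seq
        and bytes(prev[TCP].payload) == bytes(cur[TCP].payload)
    )
    if same_segment:
        verdict = "network replication" if prev[IP].id == cur[IP].id else "true retransmission"
        print(f"{cur[IP].src} -> {cur[IP].dst} seq={cur[TCP].seq}: "
              f"ip.id {prev[IP].id:#06x} / {cur[IP].id:#06x} -> {verdict}")
```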

In this case there are some clues:

1. Every packet, in both directions, is replicated
2. The time between the original and the displayed replicate is VERY short - about 1 microsecond.
3. Verify the ip.id field (not shown in your picture)

In any case, even if it is switch misconfiguration, if a host is presented with this (i.e. not the capture host, but the actual device in the communication path) then the replicated bytes would be dropped and the application would never see them. It's still harder to analyze the trace when configured this way, but it would still work.
 
There is no chance that these are re-tries of a message by the 1756-L71/EN2T. There's no timeout or error reply involved, and any subsequent message would at the very least have a new transaction sequence ID number.

I'll bet you every dollar in my pocket that those packets are being shown to Wireshark by the switch, not by one of the devices involved.
This sure looks like a span or mirror port setup issue. One way to tell if the suspected retransmissions are real (i.e. the sending host has timed out and retransmitted the data) or if the network is replicating the frames is to look at the ip.id field...In this case there are some clues:

1. Every packet, in both directions, is replicated
2. The time between displayed replicate is VERY short - 1 microsec.
3. Verify the ip.id field (not shown in your picture)
The IP identification field is identical between the initial transmission and the retransmission, and increments between each "pair" of transmissions. e.g.:
TCP: 0xaab1
TCP (Retransmission): 0xaab1
CIP: 0xaab2
CIP (Retransmission): 0xaab2
CIP: 0xaab3
CIP (Retransmission): 0xaab3

Then there's a gap (as the L71/ENBT then moves on to communicate with several other PLC's before returning to this one). It's not always that "neat" - sometimes the ID's aren't in the exact "increment by one each time" sequence, but ultimately I guess the critical point here is that the original transmission and the retransmission always have the same ID. So as the both of you suggest, it's got to be the switch doing the retransmissions, and not the devices.

Perhaps the following other tidbit of information is relevant...
Ordinarily, the way I would approach this test is as follows. Let's assume the ML1400 is in port 1, and the L71/ENBT is in port 2.
1. Connect laptop to port 3, configure port 4 to mirror port 1
2. Connect laptop to port 4, run wireshark trace
3. Connect laptop to port 3, configure port 4 to mirror port 2
4. Connect laptop to port 4, run wireshark trace
In between each test, I'm disconnecting from the mirror port. But, in this case, I had a colleague onsite with me, so to save all the re-patching, I had him connect to port 3 and make the switch configuration changes, while I remained connected to port 4 the whole time. So. Using the web interface for the switch, I don't know of any way to set a port up to mirror two ports at once. I don't think it's possible. But perhaps when I switched the mirror source port, because my laptop was never disconnected from the mirror destination port, it never properly "cleared" the first configuration, and so ended up mirroring data from both ports? If that's the case, I could prove it by running the test again, but mirroring the ports in the reverse order.

In any case, even if it is switch misconfiguration, if a host is presented with this (i.e. not the capture host, but the actual device in the communication path) then the replicated bytes would be dropped and the application would never see them. It's still harder to analyze the trace when configured this way, but it would still work.
To make sure I understand you correctly - you're saying that if the MicroLogix 1400 was actually receiving these duplicate transmissions, it would just ignore the duplicates and carry on normally? Is there any possibility that dealing with the constant barrage of duplicate packets could cause it to intermittently overload, run out of buffer space, etc. and drop communications until power cycled? Based on the above, I think it's more likely, as Ken has suggested, that it was purely a mirror port configuration issue and none of the retransmission packets were actually getting "down the wire", but on the off chance they are, is it worth considering that as a possible cause?

If I were troubleshooting this, I would put a sniffer on the link between the switch and the ControlLogix; I have a Frontline Test Equipment intercept box for exactly this purpose. It would be instructive to see if those TCP retransmissions actually appear on the wire or if they are an artifact of the mirror function inside the switch.
Is this the device you use? It sounds like a useful tool to have, but the price is a little hard to swallow when I don't generally do a lot of this sort of troubleshooting. That said, maybe if I had the proper tools I'd end up doing a lot more!
 
Gotta love it when the act of observing a problem actually causes it. Quantum effects writ large.

Thanks for the link to the monitoring switch. I used to have an old HP hub for that but it eventually croaked it.
 
So. Using the web interface for the switch, I don't know of any way to set a port up to mirror two ports at once. I don't think it's possible.

I don't know the Stratix series, but I do know the usual Cisco IOS platforms, as well as others, and this is actually a common issue - having two source ports set to a single destination like this. From the Stratix user manual:

You can configure port mirroring on only one port via Device Manager.
However, you can configure multiple ports via the CLI.

Anyway, Wireshark ships with a tool to handle this case (as I said, it is common). If you are interested, the editcap command will remove duplicates, but I never like doing this - I always worry about throwing out the baby with the bath water. I prefer to fix the capture system, not guess that I have duplicates and then remove them ex post facto.
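If you did want to go that route, it's a one-liner; a quick sketch of driving it from Python (filenames are placeholders) - editcap -d drops packets that are byte-identical to one seen within the previous few frames, and -D widens that comparison window:

```
# Sketch only: strip duplicate frames from a capture with Wireshark's editcap.
import subprocess

subprocess.run(
    ["editcap", "-d", "l71_port.pcapng", "l71_port_dedup.pcapng"],
    check=True,
)
```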

if the Micrologix 1400 was actually receiving these duplicate transmissions, it would just ignore the duplicates and carry on normally?

Correct; TCP provides a reliable, in-order byte stream. Duplicate bytes (every data byte in a TCP stream is sequenced, so it is known whether that sequence has already been observed) would be dropped by the TCP stack and never sent to the application. Note this is what TCP provides; if you are using UDP, it would not work this way.
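To illustrate the idea, here is a toy sketch (nothing like a real stack, which also handles reordering, windows, SACK and so on) of why an exact duplicate segment never reaches the application:

```
# Toy illustration: the receiver tracks the next byte it expects and simply
# ignores data whose sequence numbers it has already delivered.
def deliver(segments):
    """segments: iterable of (seq, payload) tuples, assumed in order."""
    expected = None
    app_data = bytearray()
    for seq, payload in segments:
        if expected is None:
            expected = seq
        if seq + len(payload) <= expected:
            continue                    # whole segment already delivered -> drop
        new = payload[expected - seq:]  # keep only the bytes we haven't seen
        app_data += new
        expected += len(new)
    return bytes(app_data)

# An original segment followed by an exact duplicate: the application sees it once.
print(deliver([(1000, b"N7:0 read"), (1000, b"N7:0 read"), (1009, b" reply")]))
```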

Is there any possibility that dealing with the constant barrage of duplicate packets could cause it to intermittently overload/run out of buffer space/etc and drop communications until power cycled?

This would be a 'packet storm'; I don't recall the limits of the device off hand, but I doubt what you show is enough to be considered a storm. The concept of a storm is relative - too much traffic can cause problems, and what counts as too much depends on the device; a modern PC can handle a lot more than a small embedded device. I would move this possible root cause down in priority and focus elsewhere first; but who knows - products have defects...

For general capture use, I started buying these for my team:

https://www.tp-link.com/us/products/details/cat-5711_TL-SG105E.html

You can set a mirror port so it acts just like a tap; and on Amazon, it's less than $30 US. But it needs a power supply / wall wart. Some of the other ones are USB powered, so they can run off a laptop, which can surely be convenient sometimes.

For serious use, if I can't get a span port out of the infrastructure, I've stopped using the FTE device in favour of this - I found it far superior and very easy to use:

https://www.profitap.com/profishark-1g/

But it is in the professional category - it's like $3000 or so.
 
