Network Issue with Wire Fault

Maxkling

We have a very basic but large network. The network consists of a CPX L32e and Magelis connected to an unmanaged switch. Then a wire goes to an N-Tron managed switch. This is duplicated around 30 more times, with the N-Trons set up as a ring. The wire that goes between the N-Tron and the unmanaged switch is very susceptible to pinching and breaking. The network is just supervisory, and only the Magelis and CPX need a connection to operate.

The problem arises when the wire becomes pinched and shorts. Under the right conditions the wire can short and bring the whole network down. All CPXs and HMIs lose their CIP connection. I have been able to bench test and duplicate the issue and snoop it with Wireshark.

When the wire shorts, it essentially loops the Tx and Rx pairs (10/100) and creates a ghost device. Snooping between the PLC and HMI, I see the CIP connection fail when the PLC sends an ARP request asking who has “0.0.0.0”. Once this happens, the TCP connection will not reestablish until the faulted wire is disconnected. I even tried just the unmanaged switch by itself with the PLC and HMI, and the same thing happens. I will upload the captures tomorrow when I get in the office; I'm just looking to see if anyone has seen this issue.

What I believe is happening is that the ARP table is getting corrupted by the wire fault, but I’m not certain of this, nor do I know how to prove it. I do know that no matter where in the network I create a wire fault, eventually all CPXs will send an ARP request for 0.0.0.0, and at that point they stop communicating until the wire is removed. Under normal operation this never occurs. Also, packet count and traffic stay low and normal with the wire faulted; no storm or loopback is occurring.

Tomorrow I’ll have some diagrams and captures to help out.
 
Also, the wire that is susceptible to damage is poorly routed in a festoon-style wireway. Most wires are tough, large SOOW-style cables, and then there is this little dinky Cat5 cable that gets wrapped up and damaged. We are looking at other options for routing the cable but are stuck with this setup for now. Either way, we are looking to make the network more stable for when issues arise.

I know that if I take my faulted wire and plug it into the internet I can’t take the whole world down, so there have to be ways of isolating and managing the issue.
 
Because your unmanaged switch is unmanaged, you're probably only seeing the broadcast packets.

"ARP for address 0.0.0.0" is a duplicate address check packet called an ARP Probe, which is totally normal when a device is first connected to a network.

Are you saying that when you get a short circuit on one of your "droplines", that all of the ControlLogix and HMIs connected to other unmanaged switches on the network have their connections fail, not just the one whose cable was damaged ?

It does sound like a Layer 2 loop to me. Hmm.
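
For reference, the ARP Probe mentioned above can be told apart from an ordinary ARP request by its sender IP field, which is all zeros. Below is a minimal Python sketch of that check. The MAC and IP addresses are hypothetical placeholders, the helper names are made up for illustration, and VLAN-tagged frames are not handled.

```python
import socket
import struct
from typing import NamedTuple, Optional

ETHERTYPE_ARP = 0x0806
ARP_REQUEST = 1
BROADCAST_MAC = b"\xff" * 6


class ArpPacket(NamedTuple):
    oper: int
    sender_mac: bytes
    sender_ip: str
    target_mac: bytes
    target_ip: str


def build_arp_request(src_mac: bytes, sender_ip: str, target_ip: str) -> bytes:
    """Build an untagged Ethernet frame carrying a broadcast ARP request."""
    eth = BROADCAST_MAC + src_mac + struct.pack("!H", ETHERTYPE_ARP)
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1, 0x0800, 6, 4, ARP_REQUEST,          # Ethernet/IPv4 ARP request
        src_mac, socket.inet_aton(sender_ip),   # sender hardware / protocol address
        b"\x00" * 6, socket.inet_aton(target_ip),  # target hardware / protocol address
    )
    return eth + arp


def parse_arp(frame: bytes) -> Optional[ArpPacket]:
    """Return the ARP fields of an untagged Ethernet/IPv4 ARP frame, else None."""
    if len(frame) < 42:  # 14-byte Ethernet header + 28-byte ARP payload
        return None
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_ARP:
        return None
    htype, ptype, hlen, plen, oper, sha, spa, tha, tpa = struct.unpack(
        "!HHBBH6s4s6s4s", frame[14:42]
    )
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return None
    return ArpPacket(oper, sha, socket.inet_ntoa(spa), tha, socket.inet_ntoa(tpa))


def is_arp_probe(pkt: ArpPacket) -> bool:
    """An ARP Probe is a request whose sender IP is 0.0.0.0."""
    return pkt.oper == ARP_REQUEST and pkt.sender_ip == "0.0.0.0"


# Hypothetical device MAC and documentation-range IP, for illustration only.
DEVICE_MAC = bytes.fromhex("020000000001")
probe = build_arp_request(DEVICE_MAC, "0.0.0.0", "192.0.2.10")
print(is_arp_probe(parse_arp(probe)))
```

In Wireshark such a frame is shown as "Who has <target IP>? Tell 0.0.0.0".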
 
This is what I would work on first.

I was waiting for that comment. We are, but at the moment we are stuck with the set up. Either way a network should not fail because of one bad pinched wire, especially when it’s not part of the backbone and just a connection point for a device.
 
Here is a basic layout of the network as I am bench testing it. The production network is identical, just with 30+ nodes.

I'm going to switch out my Netgear unmanaged switches with what we are actually using, which are Harting Ha-VIS eCon2050B-A 5 port unmanaged switches, just for the sake of trying to create a true identical network.

I'll work on getting some Wireshark captures.

Thanks for the comments so far.

Ken, the weird part about the ARP request is that it doesn't happen when the shorted wire is first plugged in. Sometimes it takes a few minutes, and it is repeated every few minutes. Also, the source of the ARP is the CPX.

network_layout.jpg
 
I suspect what is happening is that the PLC is seeing an ARP request for 0.0.0.0 mirrored back to itself (since the Rx/Tx wires are shorted, which probably confuses the unmanaged switch), using its own MAC.
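
One way to reason about that suspicion is a toy model of a MAC-learning switch with one port that reflects every frame back in. The Python sketch below is only an illustration under stated assumptions: `ToySwitch`, the port numbers, and the device names are all hypothetical, and it says nothing about how the Harting or N-Tron hardware actually behaves on a shorted pair.

```python
from typing import Dict, List, Set

BROADCAST = "ff:ff:ff:ff:ff:ff"


class ToySwitch:
    """Learning switch; ports in `looped` reflect every frame sent out of them."""

    def __init__(self, ports: List[int], looped: Set[int]) -> None:
        self.ports = ports
        self.looped = looped
        self.mac_table: Dict[str, int] = {}

    def receive(self, in_port: int, src: str, dst: str, max_hops: int = 16) -> List[int]:
        """Process one frame and return the healthy ports it is delivered on."""
        delivered: List[int] = []
        queue = [in_port]
        hops = 0
        while queue and hops < max_hops:
            hops += 1
            port = queue.pop(0)
            self.mac_table[src] = port           # learn (or re-learn) the source
            known = self.mac_table.get(dst)
            if dst == BROADCAST or known is None:
                out_ports = [p for p in self.ports if p != port]  # flood
            elif known == port:
                out_ports = []                   # destination is "behind" ingress: drop
            else:
                out_ports = [known]
            for p in out_ports:
                if p in self.looped:
                    queue.append(p)              # frame comes straight back in
                else:
                    delivered.append(p)
        return delivered


# Port 1 = PLC, port 2 = HMI, port 3 = the shorted drop cable (hypothetical).
sw = ToySwitch(ports=[1, 2, 3], looped={3})
sw.receive(1, "plc", BROADCAST)        # PLC broadcast is reflected on port 3
print(sw.mac_table)                    # PLC's MAC is now learned on port 3
print(sw.receive(2, "hmi", "plc"))     # HMI unicast to PLC is sent to port 3 and lost
```

In this model a single reflected broadcast is enough to move the PLC's MAC entry to the faulted port, after which unicast traffic to the PLC is dropped without any increase in packet count. Whether a real shorted cable reflects valid frames in this way is an assumption that would need to be confirmed on the bench.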
 
Here is a Dropbox for the Wireshark captures.

https://www.dropbox.com/s/2k66htbsk1o8dgc/wireshark_network_test_5.9.18.zip?dl=0

I updated the network drawing to show one more N-Tron.

The two normal captures are for comparison to show that there is NO network traffic increase, so no storm or loopback is occurring.

The "fault_at_unman_plc_snoop" capture is set up as follows:
I am tapped between the PLC and the unmanaged switch, monitoring traffic with my tap (PLC Snoop on the diagram). I then plug my faulted wire into the unmanaged switch. If you follow the capture to the 59.91 second mark, you will see that PLC "Rockwell_62:1c:78" broadcasts an ARP message of "Who has 10.5.32.12? Tell 0.0.0.0". This seems to disrupt the TCP session, which will not restart until the wire is unplugged around the 142 second mark.

The "fault_at_ntron34_plc_snoop" capture is set up as follows:
I am tapped between the PLC and the unmanaged switch, monitoring traffic with my tap (PLC Snoop on the diagram). I then plug my faulted wire into N-Tron 34, where "Wire Fault 2" is located on the diagram. You can see the TCP session drop around the 204 second mark.
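
To find these ARP messages without scrolling through each capture by hand, something like the following Python sketch could list every frame of the form "Who has X? Tell 0.0.0.0" with its time offset. It assumes the capture has been saved in the classic pcap format with microsecond timestamps (Wireshark's default is pcapng, which this reader does not handle), that the link type is Ethernet, and that frames are untagged. The function names are made up for illustration, and `bytes.hex(":")` needs Python 3.8 or later.

```python
import socket
import struct
from typing import BinaryIO, Iterator, List, Tuple

LINKTYPE_ETHERNET = 1


def read_pcap(stream: BinaryIO) -> Iterator[Tuple[float, bytes]]:
    """Yield (timestamp, frame bytes) from a classic microsecond pcap stream."""
    header = stream.read(24)
    if len(header) < 24:
        raise ValueError("truncated pcap global header")
    if header[:4] == b"\xd4\xc3\xb2\xa1":
        endian = "<"                      # file written little-endian
    elif header[:4] == b"\xa1\xb2\xc3\xd4":
        endian = ">"                      # file written big-endian
    else:
        raise ValueError("not a classic microsecond pcap file")
    (linktype,) = struct.unpack(endian + "I", header[20:24])
    if linktype != LINKTYPE_ETHERNET:
        raise ValueError("only Ethernet captures are handled")
    while True:
        record = stream.read(16)
        if len(record) < 16:
            return
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(endian + "IIII", record)
        frame = stream.read(incl_len)
        if len(frame) < incl_len:
            return
        yield ts_sec + ts_usec / 1_000_000, frame


def arp_probe_times(stream: BinaryIO) -> List[Tuple[float, str, str]]:
    """Return (seconds since first frame, sender MAC, target IP) for each ARP probe."""
    probes: List[Tuple[float, str, str]] = []
    first = None
    for ts, frame in read_pcap(stream):
        if first is None:
            first = ts
        if len(frame) < 42 or frame[12:14] != b"\x08\x06":
            continue                      # not an untagged ARP frame
        is_request = frame[20:22] == b"\x00\x01"
        sender_is_zero = frame[28:32] == b"\x00\x00\x00\x00"
        if is_request and sender_is_zero:
            probes.append(
                (round(ts - first, 3), frame[22:28].hex(":"), socket.inet_ntoa(frame[38:42]))
            )
    return probes
```

It would be called with a file opened in binary mode, for example `arp_probe_times(open(path, "rb"))`, where `path` points to a capture exported as classic pcap. Comparing the probe times with the point where the TCP session stalls shows whether the probe comes before or after the drop.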

It seems that it's definitely an ARP / MAC address problem. What I'm going to do next is remove the unmanaged switches from the network and run the PLC and HMI straight to the N-Trons.

network_layout.jpg
 
Offhand, I don't think Spanning Tree will alleviate this totally. On a Cisco device, I'd enable loopguard and/or udld on the copper ports as a workaround, but that won't help w/ the unmanaged devices.

Edit: Wrong command..
 
With the unmanaged switches bypassed (PLC and HMIs plugged into the N-Trons), it still causes issues.

I have a Stratix and NAT router that I'm going to test this on.
 
Thanks for those Wireshark captures and diagrams !

I'm not convinced that the ARP 0.0.0.0 is a cause of a connection drop, rather than a symptom.

You can see the low-level Ethernet packets for the N-Tron ring protocol (they're labeled Red Lion) still flowing rapidly after the interruption.

I wonder what kind of diagnostics the 708TX is logging when this event occurs.

What kind of tap are you using between the PLC and the switch ? I see all the Magelis HMI requests to the PLC, but none of the PLC replies to those requests, so I don't think we're seeing everything in Wireshark.
 
Ken, I'll poke around and see what they have for fault logs.

I threw a Stratix switch in and plugged in a PLC and HMI with a faulted wire, and the thing is still chugging along. I'm running out of time today, so I'll do more testing tomorrow.
 
