ControlNet Communications Timeout Problems

mlkramer2 · Jun 21, 2008

Any help on this would be GREATLY appreciated!

I have an AB ControlNet network with ten racks. I am occasionally getting failures, currently about once every two days, where a rack or racks will simply stop communicating. The only way to bring them back is to reset the ControlNet bridge module (pull it out and put it back in). This network is controlling a reheat furnace which can be up to 2500 degrees, so these comm issues can be VERY unpleasant.

I sometimes have multiple racks at various points along the network fail simultaneously, other times it will be just one at any point along the network. I set up some error trapping logic today and had a rack fail this evening and it gave me fault codes of 515 and 516, which I believe are communications timeouts.

I used to have continuous errors of all kinds on the network, until we went through and replaced every connector (sloppy installation work) and now I don't get any until I have a complete failure of a rack.

Does anybody have any fixes for this or common experiences? Is there a way for me to increase the timeout time?

Thanks in advance....

Ken Roach · Jun 21, 2008

If it were my system, I'd go after it like this:

1. Eliminate possible bad backplane bridge chips. This is an RA Product Service Advisory from May 2007 that every RA technician and engineer is familiar with.

2. Check for more noise or bad packets that may remain after your media repairs. Use the ControlNet NetChecker and a pad of paper to note the signal conditions as the system operates, and take notes if the signal changes when specific machinery is operating. Do a wiggle-test walkthrough of the system while the NetChecker is attached. Attach an oscilloscope to the NetChecker and examine the actual waveform of the ControlNet.

3. Use RSLinx as well as the ControlLogix Task Monitor to monitor the network statistics of all your 1756-CNB's.

4. Use your controller to monitor the network statistics of all your 1756-CNB's. I have a collection of obsolete 1756-L1's that make great local backplane statistics monitors.

mlkramer2 · Jun 23, 2008

Well, after I read your post about the backplane chips and I found the technote on it, I thought that was going to be it. I had two "possibly affected" modules, but when I pulled the cards to inspect them, they had a different chip type. So, I will continue with your other steps.

As another example of the behavior I am getting... I currently have two cards with 516 fault codes that seem to be operating OK. When I look at the card physically I have solid green lights and no errors on the display. But when I bring up the module properties from RSL5000, it says under Module Fault, (Code 16#0204) Connection Request Error: Connection request timed out.

If it has that fault up, how is it still running? The Backplane status says OK, and pressing the Clear Fault button there does nothing...

I did find two CNB cards that were not seated properly, not sure if this could have been part of the problem.

Oakley · Jun 23, 2008

Have you reviewed the number of connections through each of the CNB modules? If you have exceeded 64 connections, you would get connection time out errors.

mlkramer2 · Jun 23, 2008

Just checked, and I appear to be well under the 64 connection limit on each module.

Ken Roach · Jun 23, 2008

Explain exactly what your fault trap logic is doing and where you're getting these "fault codes of 515 and 516". I am not familiar with those codes.

It certainly doesn't make sense to me that your controller would be detecting an ordinary timeout (the 0204 error code in the Module Connection tab) but the 1756-CNB adapter is sitting pretty with no fault codes or flashing green LEDs. That sounds like a "lockup" of the 1756-CNB adapter to me.

Your system might benefit from the ministrations of an experienced RA engineer; the Detroit office has a substantial staff of guys who know ControlNet and have tools and analyzers that will get to the root of the problem a lot faster than our Internet correspondence.

Also: what "Clear Fault" button ?

Additional question: When the remote chassis are failing to respond to the controller, do they show only the word "OK" on the display, or do they alternate between "OK" and "A#nn" where nn is their node number ?

The display on a healthy 1756-CNB should *always* be alternating between OK and the "A#nn", and -CNBs with Series D or Series E firmware will also iterate through information on the Keeper and the CPU Utilization of the module.

mlkramer2 · Jun 23, 2008

Thanks for taking the time to think about this...

Ken, I will attempt to reply from top to bottom:

I am getting the fault codes by using a GSV instruction with Module as the class name, FaultCode as the attribute name, and the instance name is the rack that it is looking at. Then I am just using a FIFO load which loads the current fault code value anytime it changes from last scan. Some web-surfing yielded me explanations for 515 (Connection Timed Out) and 516 (Unconnected Request Timed Out). Not sure what the difference is. I put this in to trap the last ten fault codes when I was wondering if I was getting any faults even when communications wasn't actually failing.
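For anyone reading along who wants to see the idea outside of ladder logic, here is a rough Python sketch of that trap. It only mimics the change-detection and the ten-deep history; the class and method names are made up for illustration, and the list of codes fed in is invented, not data from my system. Note that this sketch drops the oldest entry once the history is full, which may not match how a real FIFO load behaves when the FIFO is full.

```python
from collections import deque


class FaultTrap:
    """Keep the most recent fault codes, recording only when the value changes."""

    def __init__(self, depth=10):
        # deque with maxlen discards the oldest entry once 'depth' is reached
        self.history = deque(maxlen=depth)
        self.last_code = 0  # value seen on the previous scan

    def scan(self, fault_code):
        """Call once per scan with the value read from the module."""
        if fault_code != self.last_code:
            self.history.append(fault_code)
            self.last_code = fault_code


trap = FaultTrap()
# Invented sequence of per-scan readings, for illustration only
for code in [0, 0, 515, 515, 516, 0, 516]:
    trap.scan(code)

print(list(trap.history))  # [515, 516, 0, 516]
```

Repeated readings of the same code are ignored, so the history shows transitions rather than how long each fault lasted.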

I don't understand how it can detect the timeout without any noticeable problems either. I am still getting the fault description I listed above on the Connection tab of the Module Properties window, yet in the Module Info tab it shows "None" for Major or Minor faults.

The "Clear Fault" button I was talking about is the one on the "Backplane" tab under Backplane Status. Though it says "ok" in there anyway.

One of my fellow engineers has suggested that this might be a power supply issue, because we only seem to have failures when we are running, not when the plant is down for maintenance (even though the ControlNet is still up and running then). Are there any documented power sensitivities with ControlNet?

Edit: Just saw your last question. No, when the racks are not communicating I am getting red lights and error codes on the display. I have re-created 515 and 516 codes by removing the connection to the CNB card and plugging it back in, but then the card re-establishes a connection by itself. When they fail, nothing brings it back other than a power cycle. I plan on writing down the error codes on the display on the next failure that I am here for...of course they have been on off-shifts the last couple times.

Oakley · Jun 23, 2008

Could it be noise?

Could it be a bad end on the cable (like a frayed shield)?
How about terminating resistors?

Are these CNB or CNB/R modules?

mlkramer2 · Jun 23, 2008

I am not getting any significant noise detections on my module statistics. I used to get a lot but a while ago we replaced every connector on the network, which cleared up about 99% of it.

These are CNB modules. Terminating resistors are in place.

Ken Roach · Jun 23, 2008

Thanks for the detailed explanations of the steps you've taken and the symptoms you've experienced.

Let's pause right here to say this sounds like a physical disconnection or noise problem to me. The network diagnostics are being a little confusing because we're overthinking the problem.

Fault Codes: Think of a ControlNet connection like a phone call. The 515 (Connection Timed Out) error code shows up when the phone call gets disconnected and you suddenly don't hear the person on the other end of the line. The 516 (Unconnected Request Timed Out) error code appears when you call back and get a busy signal or the "your call cannot be completed at this time...." message.

Good thinking, by the way, with the FIFO stack to record connection failures.

Connection Tab versus Module Info Tab: The Connection tab on the Module Properties window in RSLogix 5000 for the ControlNet remote chassis 1756-CNB adapter module has the important information when you have a network problem.

You'll see error codes like 0x203 and 0x204 for timeouts, then 0x317 for Connection Not Scheduled, then the module will go into a mode where there is no error code, just the description "Waiting", while the controller and 1756-CNB's attempt to reconnect.
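As a side note, the decimal values 515 and 516 coming out of the GSV trap work out to the same numbers as the hex codes 0x203 and 0x204 shown on the Connection tab. A small Python sketch makes the conversion explicit. The lookup table holds only the three codes named in this thread, and the function name is just an illustration, not part of any Rockwell tool.

```python
# Only the codes mentioned in this thread; not a complete list
KNOWN_CODES = {
    0x0203: "Connection Timed Out",
    0x0204: "Unconnected Request Timed Out",
    0x0317: "Connection Not Scheduled",
}


def describe_fault(code):
    """Show a decimal fault code in the 16#xxxx form with its description."""
    label = KNOWN_CODES.get(code, "not mentioned in this thread")
    return f"{code} = 16#{code:04X}: {label}"


for value in (515, 516):
    print(describe_fault(value))
# 515 = 16#0203: Connection Timed Out
# 516 = 16#0204: Unconnected Request Timed Out
```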

The Module Info tab is only informative when you have a connection to the module; when you click on Module Info during a disconnection there should be a popup message that says "Failed to perform operation because of module state".

When RSLogix 5000 cannot communicate with a module, it shows you the information it had on the module state before the communication failure.

For the same reason, the "Reset Module" button on the Module Info tab and the "Clear Faults" button on the Backplane tab have no effect when RSLogix 5000 can't communicate with the remote 1756-CNB.

A diagnostic test I recommend is to browse the ControlNet using RSLinx Classic while the system is running, and right-click and select Module Statistics. Select the Port Diagnostics tab and take a screenshot of it, then go back a few minutes later and take another screenshot.

That will allow you to compare the statistical counters that might reveal noise or physical connection problems in the ControlNet media.
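If you would rather type the counters into a script than eyeball two screenshots, the comparison can be sketched in a few lines of Python. This assumes you copy the values by hand into two dictionaries; the counter names and numbers below are placeholders for illustration, not readings from a real module.

```python
def counter_deltas(before, after):
    """Return only the counters whose value changed between two snapshots."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] != before.get(name, 0)
    }


# Placeholder values typed in from two hypothetical screenshots
first = {"Out of Step Events": 0, "Received Bad": 0, "Counter C": 2}
second = {"Out of Step Events": 3, "Received Bad": 1, "Counter C": 2}

print(counter_deltas(first, second))
# {'Out of Step Events': 3, 'Received Bad': 1}
```

Counters that did not move are left out, so whatever remains is the short list worth chasing.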

mlkramer2 · Jun 23, 2008

I actually just reset all my statistical counters about 20 minutes ago, and I will take a look at them in the morning to see where they are. This network used to be so noisy that I could just open that window and watch many of the counters count up, but that has largely gone away with the connector changes. Now it seems I only see any errors when I lose a rack; then I will have one or two "out of step events" or something like that. I will update with my findings tomorrow...

Thanks again.

Oakley · Jun 23, 2008

How about noise on the incoming power? Any issues with the power supplies?

Have there been any recent changes to the system?

Does your RSNetworx and IO tree match the physical network? How about firmware revisions? Do they match what is in RSNetworx and the IO tree?

mlkramer2 · Jun 24, 2008

Here is what I had on my module statistics this morning, about 12 hours after I zeroed everything. One rack locked up last night, one of the ones that was showing an error all day. (The other is still showing the error but running fine)

On the rack that locked up there were three Out of Step Events.
On one other rack, there were three Out of Step Events, and one "Received Bad", and it indicated the bad frame was received from the main processor rack.
All other racks had no errors at all.

These two racks are in separate cabinets. Each cabinet has three racks and these were the middle racks in each cabinet.

What is the cause of Out of Step Events? Never been able to get a clear explanation of that...

mlkramer2 · Jun 24, 2008

Oakley said:
How about noise on the incoming power? Any issues with the power supplies?

Have there been any recent changes to the system?

Does your RSNetworx and IO tree match the physical network? How about firmware revisions? Do they match what is in RSNetworx and the IO tree?

No recent changes (and I have another identical system that is running fine). I need to check the incoming power. Any idea as to how much noise on the power a ControlNet system can handle?
