S7 - Error checking problems with multi-instance FB

RMA · Nov 15, 2005

I'm still having problems with my error checking for the motor-driven switches. Today I had one error pop up several minutes after there had been any activity on the system, which suggests to me that the problem really is a program problem.

"Interrupt" - while writing this, I've just had another error fully 20 minutes after the last time anything was done on the system!

The error calls are as in the following screen dump. There is a similar network for each of the 21 modules with the network for any unselected modules being skipped (at the moment we are only ever working with one module at a time). In addition there are the checks for the 20-odd general purpose switches handling the connections to the various Labs.

I should perhaps mention that at the present time we are not using the Crowbar resistors (or rather, they're shorted out), so there is never any activity with the Crowbar switch (LTS_Crowbar).

The fault is always occurring on the LTS_NEG switch, despite the fact that we are testing approximately 50-50 with both positive and negative charging.

I'm wondering if the problem has something to do with the fact that I'm calling the Error_Check DB three times immediately after one another in the same cycle, although I wouldn't have thought that should be a problem for different Instances. The funny thing is that there has never been a fault from one of the Lab routing switches although there are more than 20 of them all called immediately after one another in the same network.

By the way, although I've posted the listing for Module 12, the fault occurs on all the modules which we've used so far.

Anybody got any ideas?

Ken M · Nov 15, 2005

Hi Roy

I know this is a terrible question to ask a 'serious' code-hound like yourself, but are you using TEMPs in your FBs, and if so, are you absolutely sure that you write to them before reading them?

When you get problems like this associated with multiple calls to an identical code structure it can be because of data values left in the local stack from the previous call. The subsequent calls will inherit the previous data values when the operating system allocates space from the local stack - S7 doesn't zero it or anything between allocations. Standard IN, OUT or IN_OUT parameters are always refreshed by the system on each call, and for FBs, the STATs are unique to each instance. But you can have the FB do a calc into a TEMP, pass it on to an OUT or elsewhere, forget about it and not realise that as soon as you call the same FB immediately after, the same TEMP name is going to have that same value already in it. And of course any subsequent calculations may just eventually accumulate sufficient unwanted digits to give trouble.
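To make the stale-TEMP effect concrete, here is a minimal toy model in Python rather than STL. It is only a sketch: the shared list stands in for the local stack, and the names `L_STACK`, `check_buggy` and `check_fixed` are invented for illustration, not taken from any Siemens block or from Roy's program.

```python
# Toy model of a local stack: one scratch area reused by every call.
# Hypothetical names throughout; this is not Siemens code.

L_STACK = [False] * 4  # scratch cells, never cleared between calls


def check_buggy(fault_input):
    """TEMP cell 0 is written only when a fault is present, but always read."""
    if fault_input:
        L_STACK[0] = True
    # If fault_input is False, cell 0 still holds whatever the last call left.
    return L_STACK[0]


def check_fixed(fault_input):
    """Same logic, but the TEMP is initialised before it is read."""
    L_STACK[0] = False
    if fault_input:
        L_STACK[0] = True
    return L_STACK[0]


if __name__ == "__main__":
    print("buggy:", check_buggy(True), check_buggy(False))
    L_STACK[0] = False
    print("fixed:", check_fixed(True), check_fixed(False))
```

In the buggy version the second call reports a fault that belongs to the first call, which is the kind of "error out of nowhere" described above. On a real CPU the stale value depends on whichever block last used that part of the stack, so the symptom is far less predictable than in this model.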

Just a thought, and not, I hope, too much of an insult (Grannies, eggs, etc)

Regards

Ken

RMA · Nov 16, 2005

Good morning Ken, couldn't you get to sleep yesterday?

I'm well aware of the Temp problem, having been bitten a few times in the past. Curiously enough, I actually read a Siemens FAQ on the problem before meeting it for the first time. However, I must admit I haven't explicitly checked it this time. Thanks for reminding me, I'll go and check it out now!

Edit: That didn't take long to check - Main FB one Temp - "Fault_Exists", cleared in the first Network. Actual error-check (called) FB - no Temps - so that's not the problem - but I hadn't checked it till now!

RMA · Nov 16, 2005

The actual error checking FB

Just for completeness, here's the actual error-checking FB. Previously I also had a counter in here to count the number of operations for maintenance purposes, but it occasionally ran wild (probably not unrelated to my current problem!), so I threw it out. I'll worry about counting operations once I get the basic error checking functioning properly!

RMA · Nov 17, 2005

I think I've got it!

I haven't had a chance to check it yet, but I think I've found the source of my problem - and if I'm right, then Ken was right as well with his comment about the TEMP variable problem.

I had a bit of luck while digging through the ProTool archives: I found that a Bit which controls the visibility of one of the buttons in a setup screen had been set on each of the last three occasions when the fault occurred - including the one which occurred while I was typing my original post, with nobody else in the room, never mind sitting at one of the OSs. Going back to the program which handles all of the special bits for the screen updates, I discovered that this Bit was set by a massive OR-gate including 15 TEMPs, none of which had been initialised anywhere!

Since this program was one of the earliest to be finished (creation date 03.11.2003 - as you know, this is a big project!) and at that point I didn't know a fraction of what I now know about S7, I'm not really too surprised!

I guess the next job is a thorough check of all the programs to find out where else there are uninitialised TEMPs lurking.

Ken M · Nov 17, 2005

Roy

I hate this kind of problem. I don't know whether you've got any definite proof of the cause yet but, as I suspect we all know, when you're dealing with an intermittent issue the best first step is often to make it worse. You have to find out how to make it permanent or predictable. Then once you've applied the fix you can test for whether it's resolved the problem or not. If you try to fix an intermittent item, you never know whether it is cured, or whether it's just taking longer than usual to re-appear.

But you have my admiration for your detective work so far! Tracking it down to a single bit where the code was written over 2 years ago takes some serious investigation. Well done indeed.

Regards

Ken

RMA · Nov 17, 2005

Ken,

I haven't had a chance to check things out so far, but I hope we'll be able to get back to do some testing later on this afternoon.

In the meantime, I've been through every program block and made sure that all the TEMPs are initialised - quite a few of them, too many, weren't!

The only thing that worries me is that I've a feeling that we may have more than one problem. The three faults which occurred when I know (or at least in two of the cases, I'm pretty sure) no one was using the OS are only part of the problem. We also get the same fault fairly regularly during the runs, although I'm not sure whether or not there's a pattern to it. It's also unclear at present whether it's always a false fault, or whether some of them may be real (or apparent - we've had some earthing problems on a couple of the modules) discrepancy errors.

What also makes things more difficult is that the communications load for the logging is so high that even 1 sec pulses are sometimes being missed. I've upped the communications load to 50% and dropped ProTool's basic scan rate to 100ms to try and cover this problem.

RMA · Nov 22, 2005

Solved one problem - got another instead!

After a bit more intensive testing it turned out that in fact (or at least, now) the fault occurs with each module switch, whether Crowbar, NEG or POS, whenever it is used. The apparently intermittent nature of the problem initially may, I suspect, have had to do with the un-initialised TEMPs.

Somewhere along the line it occurred to me that we have never had a fault on the Kollektor - Lab routing switches, which are checked continuously, so I tried commenting out the jump to the next module if the module was not selected - end of problem!

At least, end of that problem. Instead I've now got another one! At the start of each run I do a reset of all the motor driven switches, to be sure they are open. For safety reasons I would prefer to keep this check (as it is at present) so that all switches are reset regardless of whether the module is being used or not. Net result, I now get a load of discrepancy errors (quite correctly) from those modules which are not switched on. Since I've got a feedback contact on the main On/Off switch, I could always just test the feedback if the module's switched on - but I'm 99% certain that would take me back to square one!

Apart from anything else, I don't like things I don't understand, so has anybody any suggestions why the check of the module's state in the first two lines of the network should cause a problem?

Just for information, the error checking does not run continuously, but it is started when the operator logs on to start an experimental run, so by the time he gets to the screen where he can select a module, with a cycle time of typically 6 - 7 ms, the error checking will definitely be running before any modules are selected. The modules are only deselected during the reset preceding the next run, i.e. long after the fault condition has occurred.

RMA · Dec 6, 2005

New problem, same FBs - counting switching operations.

I decided to wake this Thread up again, rather than starting another one, because all the relevant screen dumps are already here.

Now that everything seems to be working pretty well and the High-Voltage guys are grinding through the testing and commissioning of each module, I'm sitting here twiddling my thumbs. So I decided to go back and have a look at my counter for the number of operations of the motor-driven switches. The actual counter is an ADD_I command which adds one to the contents of the DB location defined by the IN-OUT parameter "LTS_TRANSIT_CNTR". I've added it immediately under the #CSB_AUF and #CSB_ZU Reset commands in NWs 2 & 3 resp. in the actual error checking FB.

The problem of runaway counters mentioned in Post 4 has disappeared, presumably along with the un-initialised TEMPs. However, although only one switch is being operated (at the moment the "NEG" switch), all three switches are counting. Since each switch is called as an instance in the calling FB, to say I'm a bit puzzled is a serious understatement!

The other funny thing is that it looks as though some switching events may be being missed, although I'm not certain about when or why, but an odd count after a complete run shouldn't happen when I'm counting OPEN and CLOSE events!

If anybody has any suggestions, I'd be grateful.

RMA · Dec 7, 2005

Oops, sorry, hope I haven't wasted too much of anybody's time!

My counters are counting just fine. Unfortunately, I had forgotten that for safety reasons (because these switches can also be driven manually) when I send the signal to close one of the switches I simultaneously send the signal to open the other switch - which of course 99,99% of the time is already open! Instead of monitoring the outputs I need to monitor the feedback contacts to determine when the switch has actually moved. The fact that the Crowbar switch also had the same count as the NEG and POS switches turned out to be coincidence on the module I happened to look at, and the occasional missed counts were down to a situation I hadn't thought of, where the monitoring could get shut down before the switches were activated.
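For anyone following along, the difference between counting commands and counting actual movements can be sketched roughly as below. This is a Python illustration, not the STL in the FB; the function names and the sample signal traces are made up, and it assumes a single "closed" feedback contact per switch sampled once per scan, with the switch starting open.

```python
def count_commands(commands):
    """Naive count: one per rising edge of a command output."""
    count, last = 0, False
    for cmd in commands:
        if cmd and not last:
            count += 1
        last = cmd
    return count


def count_movements(feedback):
    """Count every change of the 'closed' feedback contact (open and close events)."""
    count, last = 0, False  # assumes the switch starts in the open position
    for fb in feedback:
        if fb != last:
            count += 1
        last = fb
    return count


# Hypothetical traces for a negative-charging run, one sample per scan.
# POS gets an 'open' command whenever NEG is told to close, but never moves.
pos_open_cmd = [False, True, False, True, False]
pos_feedback = [False, False, False, False, False]
neg_feedback = [False, True, True, False, False]

if __name__ == "__main__":
    print("POS by commands:", count_commands(pos_open_cmd))
    print("POS by feedback:", count_movements(pos_feedback))
    print("NEG by feedback:", count_movements(neg_feedback))
```

Counting on the command output credits the POS switch with operations it never performed, whereas counting on the feedback contact only registers a switch that has actually changed position.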

SimonGoldsworthy · Dec 7, 2005

Hadn't had a chance to look at this one in detail - so no sweat!
