Trace Data Lost messages while running EZ-Tracer
Posted by Ron Colverson on 02/02/17 @ 12:40 PM


If Trace Data Lost messages (DSNW133I) are seen in the DB2 master job output, then data is being lost and the captured workload will not be an accurate record. The message:

Broken IFCID Chain detected - Probable Data Loss

will also be seen in the XOPLOG output for the trace.

After EZ-Tracer has started the DB2 trace, DB2 collects and buffers the data as it occurs. Once the amount of data in the buffers reaches a pre-determined threshold, DB2 posts EZ-Tracer to issue a READA request and move the buffered data into the EZ-Tracer address space for processing. The buffer space is then released and available to DB2 again. Meanwhile DB2 continues to write to the rest of the buffer.

Trouble starts when the DB2 buffer fills before EZ-Tracer has finished processing the data from the previous READA. EZ-Tracer is not yet ready to obtain the next collection of data and DB2 cannot wait for the buffers to be cleared, so it has to throw the data away. It issues message DSNW133I and informs EZ-Tracer on the next READA request that data has been lost. When EZ-Tracer eventually issues the next READA and clears the buffer, DB2 is able to resume writing data.

This scenario is not unique to EZ-Tracer; any similar product which uses the DB2 Instrumentation Facility can suffer from the same issue. IBM has two recommendations for this situation:

  • Increase the size of the OPn buffer.
  • Issue the IFI READA request more frequently so that the OPn buffer is read and its content is cleared before buffer storage is exhausted.

The rest of this article discusses how to achieve those in the context of EZ-Tracer. It is important to note that this process is entirely separate and independent from the interaction between the main EZ-Tracer job and the periodic consolidation/reporting 'B' jobs which process the LG01, LG02, etc. files. 

It is essential that EZ-Tracer processes and empties the buffers faster than DB2 fills them. As long as that is the case, even if it is only slightly faster, everything works without problems. Once the balance tips the other way, DB2 will eventually run out of available buffer space and data will be lost. This is why we recommend that EZ-Tracer runs at a higher priority than DB2. It’s not that it requires large amounts of CPU; it just cannot afford to wait.
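The balance described above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not a model of the real DB2 or EZ-Tracer internals: time advances in abstract ticks, DB2 writes a fixed amount per tick, and the reader takes the whole buffer on each READA and then processes it at a fixed rate. The function name, its parameters and the rates used are all hypothetical.

```python
def simulate_trace(buffer_kb, threshold_kb, fill_kb_per_tick,
                   process_kb_per_tick, ticks):
    """Return the amount of trace data (in K) lost over `ticks` ticks.

    Toy model only: all names and rates are illustrative assumptions.
    """
    buffered = 0   # data sitting in the DB2 trace buffer
    backlog = 0    # data the reader has taken but not yet processed
    lost = 0       # data DB2 had to throw away

    for _ in range(ticks):
        # DB2 writes into whatever buffer space is still free.
        written = min(fill_kb_per_tick, buffer_kb - buffered)
        buffered += written
        lost += fill_kb_per_tick - written

        # The reader works through data from the previous READA.
        backlog = max(0, backlog - process_kb_per_tick)

        # Reader is idle and the threshold is reached: READA empties the buffer.
        if backlog == 0 and buffered >= threshold_kb:
            backlog = buffered
            buffered = 0

    return lost


# Reader slightly faster than DB2: the buffer never fills.
lost_when_faster = simulate_trace(65536, 16384, 1000, 1100, 2000)
# Reader slightly slower than DB2: each cycle leaves more data behind.
lost_when_slower = simulate_trace(65536, 16384, 1000, 900, 2000)
print(lost_when_faster, lost_when_slower)
```

In this model the slower reader does not lose data immediately. Each READA picks up a little more than the last, so the loss only appears once the buffer has filled, which matches the "eventually" in the description above.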

The first thing to check is that the dispatching priority of the main EZ-Tracer job is higher than that of DB2. In our experience this is almost always what resolves the problem.

The next two things to check are the buffer size and the threshold at which EZ-Tracer is notified that there is data to process. These parameters are set on the main Start Full SQL Trace panel:

Trace Buffer.....: DB2 Trace Buffer Size...........: 65536 (K) Thresh 1024_ (K)

We recommend you specify the maximum allowed Buffer Size, which is 65536 (K) for V9 and later. As for Threshold, when tracing individual systems, start at 25% of the buffer size. When tracing DB2 groups, specify Buffer Size divided by the number of members in the group, or 25% of the buffer size, whichever is lower. It may be useful to experiment with the threshold parameter for best results.
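The starting threshold rule above can be written as a small calculation. This is a sketch only: the function name is hypothetical, values are in K as on the panel, and rounding down to a whole number is an assumption the article does not specify.

```python
def starting_threshold_k(buffer_size_k, group_members=1):
    """Suggest a starting Thresh value (in K) from the rule in this article.

    Hypothetical helper; integer (floor) division is an assumption.
    """
    if buffer_size_k <= 0 or group_members < 1:
        raise ValueError("buffer size and member count must be positive")

    quarter = buffer_size_k // 4          # 25% of the buffer size
    if group_members == 1:
        return quarter                    # single subsystem
    # Data sharing group: the lower of buffer/members and 25% of the buffer.
    return min(buffer_size_k // group_members, quarter)


print(starting_threshold_k(65536))        # single subsystem: 16384
print(starting_threshold_k(65536, 8))     # eight-member group: 8192
```

For groups of up to four members the 25% figure is the lower of the two, so the member count only changes the result for larger groups.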

As a last resort, if you can’t get EZ-Tracer to offload the data fast enough, you have to reduce the amount of data being collected, which will speed up the processing of the buffer contents. This can be done by one or more of the following:

  • Trace single subsystems rather than all the members in a data sharing group
  • Specify trace filters
  • Specify trace sampling
  • Specify trace thresholds to only trace expensive SQL
  • Reduce the number of IFCIDs collected by not collecting fetches, etc.

The last three are specified on EZ-Tracer Trace Filter Parameters - Screen 2. The action you choose depends on what you are trying to achieve with the trace. Consider whether the volumes of data collected are too high to be effectively analysed. It may be that a more selective approach is more appropriate. Also consider whether EZ-Cache, which inherently handles much smaller amounts of data, would be a better tool.

This all applies to return code RC=08 on the DSNW133I message. If RC=04 is seen, then this means that “No OPn buffer is assigned to an application to collect data.” This is a different situation altogether and is usually unrelated to EZ-Tracer.
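When scanning job output for these messages, the two return codes can be separated with a short script. The sketch below assumes only that the line contains the message ID DSNW133I followed somewhere by RC=nn. The function name is hypothetical, and the sample line is a simplified, made-up illustration, not actual DB2 output.

```python
import re

# Assumes the message ID appears before the return code on the same line.
_DSNW133I_RC = re.compile(r"DSNW133I.*?RC\s*=\s*(\d+)")

_MEANINGS = {
    8: "buffer not emptied fast enough; the tuning in this article applies",
    4: "no OPn buffer assigned; usually unrelated to EZ-Tracer",
}


def classify_dsnw133i(line):
    """Return (rc, meaning) for a DSNW133I line, or None if it is not one."""
    match = _DSNW133I_RC.search(line)
    if match is None:
        return None
    rc = int(match.group(1))
    return rc, _MEANINGS.get(rc, "return code not covered in this article")


# Hypothetical, simplified sample line for illustration only.
sample = "DSNW133I ... TRACE DATA LOST ... RC=08"
print(classify_dsnw133i(sample))
```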