RootCause User Guide

Terminology and Concepts


RootCause® is an extension of the Aprobe® product, a powerful general-purpose patching tool that has been in use for years. As such, much of the terminology, organization and documentation of RootCause refer to those of Aprobe.

Here we describe general terminology and concepts that apply to RootCause and Aprobe, focusing on the RootCause product. A minimal amount of Aprobe documentation is supplied here, just enough to support the RootCause definitions. For additional information, see the Aprobe user's guide ($APROBE/Aprobe.pdf).

The RootCause Product

We use the terms application and program interchangeably throughout the RootCause product. An application or program is represented by an executable object module.

We use the term probe when describing what RootCause does: RootCause "probes" a running application. The probes created by RootCause do things like tracing, timing, data collection and more. Note that these probes are added only to the in-memory copy of the running application; RootCause does not modify the disk-resident application at all.

Each application that is probed by RootCause is assigned a workspace. A workspace is a directory where RootCause can put all of its important files (including the data collection files) at runtime.

Each workspace is created and initialized only once, when RootCause is first invoked on an application. Thereafter, RootCause automatically manipulates the workspace contents, so users can ignore the workspace during normal use. For each probed application, there is one workspace; and for each workspace, there is one application.

We use the term log as a verb to describe Aprobe's low-overhead data recording mechanism. RootCause logs its data into files in the workspace. For best performance, workspaces should be on a local disk (not remotely mounted).

The RootCause product is invoked by the command rootcause (see Chapter 9, "RootCause Command Reference").

The RootCause product consists of two major components: the RootCause Console and the RootCause Agent. The RootCause Console component allows you to create probes and examine the trace data generated by the probes. The RootCause Agent is the component that performs the actual runtime tracing and generates the trace data.

You can choose to install only the RootCause Agent on a remote computer and then use the RootCause Console's deploy operation to create a workspace for that remote computer. The deployed workspace can then be transferred to the remote computer for use by the RootCause Agent there, and the trace data can be transferred back to the RootCause Console for examination.

The RootCause Registry

The RootCause Agent is enabled on a per-process-group basis via an environment variable. When rootcause is "on" in your environment, RootCause identifies every new process and subprocess created in that shell, or in subshells that inherit from it, and optionally records the creation of each one in the RootCause log. If the executable for the process is listed in the RootCause registry, then RootCause inserts probes into that process to collect data the next time it is launched (when you registered the program, you specified the workspace associated with the executable, and that workspace contains the probes). If the executable does not appear in the registry, then RootCause allows the process to continue undisturbed.

The RootCause Log

The RootCause log is the central reporting place for RootCause. By default, RootCause records every process that is started while rootcause_on is in effect. RootCause also writes error messages to this file. The default behavior when starting the RootCause Console GUI (with the rootcause open command) is to view this log, from which you may read error messages, open workspaces, and view trace results. The log is a fixed size, and wraps around to avoid growing too large. The rootcause log command manages attributes of this file.

Aprobe Product

Since RootCause is an extension of the Aprobe product, the setup scripts define an environment variable, APROBE, which identifies the RootCause installation directory. $APROBE is used throughout this manual to refer to the installation directory of the Aprobe and RootCause products. This section introduces some Aprobe terminology.

A tracing probe created with the RootCause Console is defined in an APC (AProbe C) file, which contains ANSI C code interspersed with Aprobe preprocessor directives. The APC file is compiled and linked using your native C compiler, and the resulting object code is written to a UAL (User Action Library) file. It is the UAL file that is used at runtime to dynamically apply your tracing probes.

As your program executes, tracing data is logged (i.e. written) to an APD (AProbe Data) file. Almost always, more than one APD file is allocated, and these files are used in a round-robin fashion (the oldest APD file is always overwritten). This set of APD files is referred to as an APD ring. There is a separate control file that is used to manage all the files in the APD ring; this control file is named the APD ring file.
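The round-robin reuse of files in an APD ring can be pictured with a small model. This is only an illustrative sketch in Python, not Aprobe code: the class name ApdRing, the method start_new_file and the "chunk-N" labels are all hypothetical, and each list slot simply stands in for one APD file.

```python
class ApdRing:
    """Toy model of an APD ring: a fixed number of slots reused in round-robin order."""

    def __init__(self, num_files):
        if num_files < 1:
            raise ValueError("a ring needs at least one file")
        self.slots = [None] * num_files   # each slot stands in for one APD file
        self.next_slot = 0                # slot that will be written (or overwritten) next

    def start_new_file(self, label):
        """Begin a new file in the next slot; return the label it overwrote, if any."""
        overwritten = self.slots[self.next_slot]
        self.slots[self.next_slot] = label
        self.next_slot = (self.next_slot + 1) % len(self.slots)
        return overwritten


ring = ApdRing(num_files=3)
overwritten = [ring.start_new_file(f"chunk-{i}") for i in range(5)]
print(overwritten)   # [None, None, None, 'chunk-0', 'chunk-1']
print(ring.slots)    # ['chunk-3', 'chunk-4', 'chunk-2']
```

Once all three slots are in use, each new file replaces the oldest one, so the ring always holds the most recent data.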

APD files are written in a proprietary binary format. The apformat command reads APD, UAL and program files and generates readable text.

RootCause Data Management

Tracing an application raises a number of questions about managing the data that is recorded.

How can trace data be recorded quickly?
How can the least amount of data be recorded?
How can data recorded by multiple instances of the same program be preserved and organized?
How can the total amount of data recorded be bounded, to keep from filling up the disk?
How can users "snapshot" important data to be kept, while still bounding the total data collected?
How can users find what they're looking for in the data that is collected?

To address these issues, RootCause provides an interface to the Aprobe "log" mechanism, which offers powerful and flexible data-recording and formatting capabilities. Here is how they work.

Recording Data Quickly

Aprobe logging uses memory-mapped files to record the data. Each process has its own set of data files mapped to distinct memory regions, which avoids bottlenecks and locking problems when several processes are logging data simultaneously. Thread-safety is managed primarily using a lock-free compare-and-swap mechanism, though some locking is still required when switching data files.

However, even though the files are memory mapped, the contents must eventually be written to disk, and this is limited by I/O speeds to the disk device. For this reason, it is very important that the workspace directory (where RootCause writes the data files) be "local" (directly connected to the machine where the traced program is running) and not accessed across the network (e.g., using NFS). If you are collecting data from a program running on multiple machines using the same workspace, or have other special requirements, contact OC Systems.
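To make the idea of recording through a memory-mapped file concrete, here is a minimal sketch using Python's standard mmap and struct modules. It is not how Aprobe is implemented: the record layout (a 64-bit timestamp plus a 32-bit event id), the file size and the file name demo.apd are assumptions chosen for illustration, and the real APD format is proprietary. The lock-free compare-and-swap step is not shown, because it has no direct equivalent in pure Python.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<QI")   # hypothetical record: 64-bit timestamp, 32-bit event id
FILE_SIZE = 4096                # hypothetical data file size in bytes


def open_mapped_log(path, size=FILE_SIZE):
    """Create a fixed-size file and map it into memory for writing."""
    with open(path, "wb") as f:
        f.truncate(size)
    with open(path, "r+b") as f:
        # The mapping remains valid after the file object is closed.
        return mmap.mmap(f.fileno(), size)


def append_record(mapped, offset, timestamp, event_id):
    """Store one record at offset with a plain memory write; return the next offset."""
    RECORD.pack_into(mapped, offset, timestamp, event_id)
    return offset + RECORD.size


with tempfile.TemporaryDirectory() as workspace:
    path = os.path.join(workspace, "demo.apd")
    log = open_mapped_log(path)
    offset = 0
    for event_id in (1, 2, 3):
        offset = append_record(log, offset, 1_000_000 + event_id, event_id)
    log.flush()   # the data reaches the disk here, or whenever the OS writes the pages back
    log.close()
    with open(path, "rb") as f:
        data = f.read(offset)
    records = [RECORD.unpack_from(data, i) for i in range(0, offset, RECORD.size)]

print(records)   # [(1000001, 1), (1000002, 2), (1000003, 3)]
```

Each append is an ordinary memory write. The operating system pays the disk I/O cost later, which is why the location of the workspace directory matters.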

Recording Less Data

The Aprobe "log" mechanism automatically separates the recording of data from its display and formatting. For example, timestamps are recorded simply as 64-bit values at run-time, then formatted as desired later. String literals and labels are also added as part of formatting. This has the added benefit of being able to format the same data in multiple ways without rerunning the application.
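The separation between recording and formatting can be sketched as follows. This is an assumption-laden illustration, not the Aprobe log format: the record layout, the microsecond unit, the LABELS table and the function names are all hypothetical. The point is only that the run-time side stores small fixed-size values, while labels and the choice of timestamp style are applied afterwards.

```python
import struct
from datetime import datetime, timedelta, timezone

EVENT = struct.Struct("<QH")   # hypothetical record: 64-bit microseconds since the epoch, 16-bit label id
LABELS = {1: "enter main", 2: "exit main"}   # hypothetical labels, needed only by the formatter
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def record(timestamp_us, label_id):
    """Run-time side: store raw fixed-size values only, no strings."""
    return EVENT.pack(timestamp_us, label_id)


def format_absolute(raw):
    """Formatting side, style 1: absolute wall-clock time plus label."""
    timestamp_us, label_id = EVENT.unpack(raw)
    when = EPOCH + timedelta(microseconds=timestamp_us)
    return f"{when.isoformat()} {LABELS[label_id]}"


def format_elapsed(raw, start_us):
    """Formatting side, style 2: time elapsed since start_us plus label."""
    timestamp_us, label_id = EVENT.unpack(raw)
    return f"+{(timestamp_us - start_us) / 1000:.3f} ms {LABELS[label_id]}"


raw_events = [record(1_000_000, 1), record(1_002_500, 2)]
absolute = [format_absolute(raw) for raw in raw_events]
elapsed = [format_elapsed(raw, 1_000_000) for raw in raw_events]
print(absolute)   # ['1970-01-01T00:00:01+00:00 enter main', '1970-01-01T00:00:01.002500+00:00 exit main']
print(elapsed)    # ['+0.000 ms enter main', '+2.500 ms exit main']
```

The same ten-byte records yield both listings, so changing the display style does not require running the application again.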

Data for Multiple Processes

The data associated with each process is saved in a "Process Data Set", or "APD ring", a directory identified by the PID of that process. The user specifies how many of these should be saved, and when that limit is reached, the oldest Process Data Set is deleted when tracing of a new process is started.
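The retention rule for Process Data Sets can be modelled in a few lines. This sketch assumes that the limit counts every data set in the workspace, including the one for the newly started process. The source does not state this detail, and the Workspace class and its method names are hypothetical.

```python
from collections import OrderedDict


class Workspace:
    """Toy model of a workspace that keeps a bounded number of Process Data Sets."""

    def __init__(self, limit):
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self.data_sets = OrderedDict()   # PID -> data set placeholder, oldest first

    def start_tracing(self, pid):
        """Make room for a new process, deleting the oldest data sets if needed."""
        deleted = []
        while len(self.data_sets) >= self.limit:
            oldest_pid, _ = self.data_sets.popitem(last=False)
            deleted.append(oldest_pid)
        self.data_sets[pid] = []
        return deleted


workspace = Workspace(limit=2)
deletions = [workspace.start_tracing(pid) for pid in (101, 102, 103, 104)]
print(deletions)                   # [[], [], [101], [102]]
print(list(workspace.data_sets))   # [103, 104]
```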

Bounding Total Data

As mentioned above, the user specifies how many processes' data are to be saved. In addition, the amount of data saved for each process is highly configurable. The data for each process is treated as a multi-file circular buffer, or "APD ring". Each file is called an "APD file" because of its suffix, ".apd". At the Aprobe level the user may specify the size of each file, the number of files in each ring, and the number of snapshot files saved. RootCause makes this a bit easier by offering the following parameters in the RootCause Options Dialog:

Keep logged data for N previous processes: This specifies the number of Process Data Sets to keep, as described above.
Data File Size (bytes): This specifies the size of each "APD file".
Wraparound data logging wraps at N (bytes): This specifies the total "wraparound buffer" size, which corresponds to the number of individual data files that are kept for each process before the oldest starts being deleted.
Total logged data limit per process (bytes): Files may be preserved even when they might otherwise be deleted, using a snapshot mechanism described below. This parameter allows the user to set a hard upper bound on the total data per process, even when many snapshots are taken.
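The relationship between these byte-valued options and the number of files on disk can be worked through with simple arithmetic. The sketch below assumes integer division with at least one file in the ring. The source does not specify the exact rounding, and the function name and the sizes in the example are hypothetical values, not product defaults.

```python
def ring_layout(data_file_size, wrap_size, total_limit):
    """Derive file counts from the three byte-valued options (assumed arithmetic)."""
    if data_file_size <= 0:
        raise ValueError("data file size must be positive")
    ring_files = max(1, wrap_size // data_file_size)             # files kept before wraparound
    max_files = max(ring_files, total_limit // data_file_size)   # hard bound, snapshots included
    return {
        "ring_files": ring_files,
        "max_files": max_files,
        "snapshot_files": max_files - ring_files,                # room left for preserved files
    }


MIB = 1024 * 1024
layout = ring_layout(data_file_size=1 * MIB, wrap_size=8 * MIB, total_limit=32 * MIB)
print(layout)   # {'ring_files': 8, 'max_files': 32, 'snapshot_files': 24}
```

With these example values, each process keeps its eight most recent files, and snapshots can preserve up to 24 more before the per-process limit is reached.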

Data Snapshots

RootCause provides a mechanism for a "snapshot" to be taken programmatically. This does not copy the data, but rather marks it as "preserved" so it is not deleted by the normal wraparound mechanism described above. RootCause allows a user to identify points during program execution at which a snapshot of the data is to be taken. At the Aprobe level, users may specify arbitrarily complex conditions under which a snapshot is taken. This mechanism is used by the java_exceptions predefined UAL, which causes a snapshot to be taken when selected Java runtime exceptions occur.
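The effect of marking data as preserved, rather than copying it, can be shown with a toy model. Everything here is a sketch under stated assumptions: the class and method names are hypothetical, and the policy used when the hard limit is reached (delete the oldest file, preserved or not) is an assumption, since the text only says that the total is bounded.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    name: str
    preserved: bool = False


class SnapshotRing:
    """Toy model: wraparound deletes old files unless a snapshot has preserved them."""

    def __init__(self, ring_files, max_total_files):
        self.ring_files = ring_files             # unpreserved files kept before wraparound
        self.max_total_files = max_total_files   # hard bound, preserved files included
        self.files = []                          # oldest first

    def snapshot(self):
        """Mark every current file as preserved; nothing is copied."""
        for f in self.files:
            f.preserved = True

    def add_file(self, name):
        """Add a new data file and return the names of any files deleted as a result."""
        self.files.append(DataFile(name))
        deleted = []
        # Normal wraparound: only unpreserved files are candidates for deletion.
        unpreserved = [f for f in self.files if not f.preserved]
        excess = max(0, len(unpreserved) - self.ring_files)
        for victim in unpreserved[:excess]:
            self.files.remove(victim)
            deleted.append(victim.name)
        # Hard bound (assumed policy): drop the oldest files regardless of status.
        while len(self.files) > self.max_total_files:
            deleted.append(self.files.pop(0).name)
        return deleted


ring = SnapshotRing(ring_files=2, max_total_files=4)
history = [ring.add_file(n) for n in ("a", "b")]
ring.snapshot()                                   # preserves a and b
history += [ring.add_file(n) for n in ("c", "d", "e", "f")]
ring.snapshot()                                   # preserves e and f as well
history.append(ring.add_file("g"))
print(history)                         # [[], [], [], [], ['c'], ['d'], ['a']]
print([f.name for f in ring.files])    # ['b', 'e', 'f', 'g']
```

Files a and b survive the wraparound that removes c and d, and a is deleted only when the total would otherwise exceed the hard limit.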

Data Indexing

RootCause provides four levels of control in accessing the data:

the Process Data Set;
individual Data Files;
special events in the Trace Index Dialog; and
individual events in the Trace Display.

Data is generally selected via the Trace Index Dialog, by double-clicking on a Process Data Set in the Workspace Tree, or by clicking the Index button. Index entries are shown for the "Last Data Recorded", and for any snapshots taken. In addition, any exceptions detected by the exceptions and java_exceptions predefined UALs may be shown in the Index. (We anticipate additional kinds of events being available through the Trace Index in future versions.) One or more events may be selected in the Index, and a Trace Display opened on the files surrounding that event. You can control the size of the context around the event via the RootCause Options Dialog.

From the Trace Index Dialog you can specify what Data Files the Index is to be generated from, and you can add data files from additional processes, and even additional workspaces. Using the Examine button in the Workspace Browser you can directly specify which Data Files are to be viewed, without going through the Index.

Once you have selected which data files to view, you can view all the data collected in the Trace Display. This shows all the events ordered by thread or by time, and organized as a call tree within each thread. You can do textual searches through this display for specific events. Data may be added or removed from a Trace Display at the Data File level, and the RootCause Log may be merged with it as well to show the interaction of multiple processes.

RootCause Overhead Management

After a program has started, the overhead that a RootCause trace adds is proportional to the number of traced function calls made by the program as it's running. Often it's the case that the most-frequently-called functions are of little interest in the trace, and yet are introducing the most overhead.

RootCause provides several mechanisms to control tracing overhead and to focus tracing on the period when it will provide the most information.

Load Shedding

RootCause manages tracing overhead by load shedding, a process by which tracing is disabled based on its estimated tracing overhead. This is done automatically by default, based on a heuristic analysis of CPU time used by traced functions. You can disable load shedding, or adjust the heuristics, with the Global Trace Options Dialog opened by clicking the Options button at the bottom of the Trace Setup Dialog.

When viewing the data, if there are any functions that were load shed, a LOAD_SHED node will appear in the Event Trace Tree, from which you can open a LOAD_SHED Table to see exactly which functions were disabled and when.

Using this table you may change the disposition of some or all of the functions during the next run. Usually you will simply want to select all the functions listed and change them to "Don't Trace" so they aren't traced at all in subsequent runs. However, occasionally a function will be disabled that is important to trace, and in this case you may mark that function as "Don't Shed" to force it to be traced regardless of the overhead.

Traced functions designated as "Don't Shed" are marked with a red dot in the Trace Setup Dialog. You can enable or disable load shedding on a specific function using the Trace Setup Popup Menu.
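The general shape of a load-shedding decision can be sketched as follows. The actual RootCause heuristics are not described in this guide, so the per-call cost, the threshold and the function names below are purely hypothetical. The sketch only shows how an overhead estimate and a "Don't Shed" list could combine.

```python
PER_CALL_OVERHEAD_US = 2.0   # hypothetical cost of tracing one call, in microseconds
SHED_THRESHOLD = 0.05        # hypothetical: shed when tracing a function costs over 5% of CPU time


def functions_to_shed(call_counts, cpu_time_us, dont_shed=frozenset()):
    """Return the traced functions whose estimated tracing overhead is too high."""
    shed = []
    for name, calls in call_counts.items():
        overhead_fraction = calls * PER_CALL_OVERHEAD_US / cpu_time_us
        if name not in dont_shed and overhead_fraction > SHED_THRESHOLD:
            shed.append(name)
    return sorted(shed)


# Hypothetical call counts observed over 10 seconds of CPU time.
call_counts = {"parse_token": 500_000, "log_error": 40_000, "main": 1}
shed_default = functions_to_shed(call_counts, cpu_time_us=10_000_000)
shed_forced = functions_to_shed(call_counts, cpu_time_us=10_000_000,
                                dont_shed={"parse_token"})
print(shed_default)   # ['parse_token']
print(shed_forced)    # []
```

The frequently called function is shed by default, while marking it "Don't Shed" keeps it traced regardless of the estimated overhead.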

Enable/Disable Tracing

RootCause provides a mechanism to disable tracing at the start of the program (or at any other point) and enable it upon entry to (or exit from) a function executed later on. This is conceptually a global switch that can be turned on and off during program execution. So, for example, if your program does a lot of processing during initialization, but you're not interested in tracing that, you can:

Select the Program node in the Trace Setup Dialog, and using the Probes Pane configure a Probe On Program Entry to Disable Tracing.

Then you can select a function that is called at the start of the processing you want to trace and create a Probe On Entry to Enable Tracing.

As with most features available through the RootCause console, you can get even more control and power with a custom probe which directly calls the Aprobe runtime functions ap_RootCauseTraceEnable() and ap_RootCauseTraceDisable() to enable and disable tracing only if certain conditions or data values are detected in the program. Contact support@ocsystems.com for more information on writing custom probes.
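The global-switch behavior described in this section can be imitated in plain Python. This is a conceptual sketch only: it does not use the Aprobe runtime, and trace_enable, trace_disable, traced and probe_on_entry are hypothetical stand-ins for the switch and for probes configured in the Trace Setup Dialog.

```python
import functools

state = {"enabled": True}   # the conceptual global switch
trace = []                  # stands in for the logged trace data


def trace_enable():
    state["enabled"] = True


def trace_disable():
    state["enabled"] = False


def traced(func):
    """Log entry to and exit from func, but only while the switch is on."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if state["enabled"]:
            trace.append(f"enter {func.__name__}")
        result = func(*args, **kwargs)
        if state["enabled"]:
            trace.append(f"exit {func.__name__}")
        return result
    return wrapper


def probe_on_entry(action):
    """Run action just before the decorated function is entered."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            action()
            return func(*args, **kwargs)
        return wrapper
    return decorate


@traced
def helper():
    pass


@traced
def initialize():
    helper()


@probe_on_entry(trace_enable)
@traced
def process_requests():
    helper()


def main():
    trace_disable()      # like a Probe On Program Entry set to Disable Tracing
    initialize()         # runs untraced
    process_requests()   # its entry probe turns tracing back on


main()
print(trace)
# ['enter process_requests', 'enter helper', 'exit helper', 'exit process_requests']
```

The calls made during initialization leave no trace records, while everything from the entry of process_requests onward is recorded.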

Glossary

Discussion of RootCause and Aprobe requires the use of terms that are either specific to the products or assigned a special meaning in the context of the product. Many of these terms are also defined above and elsewhere in this guide, but are listed here for easy reference.

ADI file

Aprobe Debug Information file, which contains symbol and line information extracted from a module for use by deployed probes on applications which have been stripped of debug and symbol information.

APC

"AProbe C" language, a superset of the C programming language, used to define probes. An APC file is a text file containing APC source code. APC files are compiled into a UAL file using the apc command.

APD file

"AProbe Data" file, containing information written in a compressed format by the log command. These files are formatted (generally converted to text) by the apformat command.

APD ring

A set of APD files corresponding to a single execution of an application. There is always one persistent file, "name.apd", and one or more ring files "name-1.apd", "name-2.apd", etc., grouped together in a directory having the same name as the persistent file, but with a suffix corresponding to the PID of the traced process, e.g., "name.apd.12345".

actions

Operations, generally gathering or counting data, that are applied at a certain point in a program. Actions, combined with the points where they are applied, make up probes.

agent

The part of the RootCause product which actually applies and enables the probes, also known as the Aprobe runtime.

apc

The Aprobe command which compiles an APC file into a UAL.

apformat

The Aprobe command which formats (generates text from) APD files.

application

An executable or JRE together with all the classes or shared libraries that it loads, also known as a program.

aprobe

The Aprobe command which actually applies probes to a program.

collect

To identify the APD files from one or more workspaces and compress them, along with other necessary information, into a single file with the suffix .clct, usually in preparation for moving from a remote machine back to a local Console installation for analysis.

Console

The RootCause GUI and supporting Aprobe tools (e.g., apc, apformat), through which all development and analysis of traces is performed.

Data File

A file containing RootCause data logged when a process is run under rootcause; another name for an APD file.

decollect

To expand a .clct file back into a directory with suffix .dclct for analysis by the RootCause Console.

decollection

The .dclct directory tree created by the decollect operation.

dynamic module

A shared library that is explicitly loaded by a program after execution has begun. See "Add Dynamic Module" and "Dynamically Loaded Libraries".

deploy

To compress the information in a workspace into a file with a .dply suffix, and transfer that file to a remote computer for tracing an application there.

event

Any of a number of specially-tagged data items logged by RootCause and shown in the Trace Index Dialog or Event Trace Tree, or printed by the rootcause format command.

executable

A binary object file containing the entry point of an application which may be run directly; this is distinct from a shared library, which must be loaded in the context of a running executable.

format

To process the contents of APD files into text, or other meaningful output, using the Aprobe apformat command. Data collected by RootCause is generally formatted into XML before being read into the RootCause Console.

GUI

The Graphical User Interface portion of the RootCause Console. See Chapter 8, "RootCause GUI Reference".

JRE

"Java Runtime Environment", the environment which directly executes Java applications. The RootCause GUI is implemented in Java and so ships with a JRE. RootCause for Java allows you to define Java traces for supported JREs.

JVM

"Java Virtual Machine", the portion of an application (e.g., java, Netscape) which loads and executes Java class files and applets. This is generally implemented as a single library within the JRE.

load shedding

The process of dynamically disabling tracing of functions or methods based on the tracing overhead they introduce into the program. This mechanism prevents tracing from slowing down a program too much, and automatically creates a list of methods to be eliminated from subsequent traces.

local

Referring to the machine and execution environment in which the RootCause Console is installed, where traces, and perhaps the traced application itself, are developed; the opposite of remote, where the agent is installed.

log

verb: the recording of data by RootCause into an APD file.

log

noun: the RootCause Log, in which information about processes is recorded.

module

A loadable object module; an executable or shared library. In RootCause for Java, a class and all its supporting classes are managed as a module as well.

PID

Process ID, the number assigned to each process on the system, and used to uniquely identify each APD ring generated by tracing that process.

Process Data Set

The group of Data Files associated with a single process (PID); another name for an APD ring.

predefined UAL

A ready-to-use UAL which may be applied to any application to perform a specific function. Some are provided with RootCause, additional ones with Aprobe, and more can be developed by users.

probes

Actions to be performed at specific points in a program. These actions are applied at the probe points in memory, without modifying the program files on disk.

program

An executable or JRE together with all the classes or shared libraries that it loads, also known as an application.

register

To associate a program with a workspace, so that tracing will occur when the program is run with rootcause on.

registry

The database mapping programs to workspaces, and recording other information that must be checked when programs are run with rootcause on. This is implemented as a text file named by the environment variable APROBE_REGISTRY.

remote

Refers to a machine or execution environment separate from that in which an application is developed; the opposite of local. In a remote environment, the modules that make up a program may be fully or partially stripped, and the workspace in which the probes were developed is not accessible, so the workspace must be deployed.

run with rootcause on

To execute a program in an environment where RootCause is intercepting processes. This is generally done by running the rootcause_on command, then running the application. (On AIX, use the rootcause run command; see "Enabling RootCause for an AIX Application".) Some simple applications may be run directly with the RootCause GUI Run button.

shared library

A linked object file which may be shared by many programs, but cannot be run by itself.

shadow header file

A C header file which provides debug information for the system library of a similar name. For example, $APROBE/shadow/libc.so.h is a shadow header for libc.so on Solaris.

snapshot

Data preserved at the point of a notable event; the data is marked as preserved rather than copied. In the context of RootCause, SNAPSHOT probes may be inserted which ensure that the associated data is preserved.

stripped

An application which was built with debug and symbol information, but from which that information has subsequently been removed (such as by running the strip(1) command), is referred to as a "stripped" application.

timestamp

A string indicating the "wall-clock time" at which an event occurred.

traceback

A display of the call stack starting with the function/method in which the traceback was generated, followed by its caller, then its caller's caller, etc.

traces

A subset of probes which quickly record the entry and exit of identified functions or methods. These are indicated in the GUI Trace Setup Dialog by black dots next to the entities containing traces, as distinct from probes.

tracing

The process of applying the traces and probes in a RootCause workspace to an application. We use this term in general to refer to the data-gathering that takes place when a registered application is running with rootcause on.

trigger

The point at which an action takes place. In particular, when defining probes within the Probes Pane, it may be the entry or exit of a program, thread, or function.

UAL

"User Action Library", a shared library defining "user actions" or probes, and the program points to which they are applied. A UAL is compiled from one or more APC files.

workspace

A directory with suffix .aws created and managed by the RootCause GUI and rootcause command, which contains the traces and probes on an application, the APD rings generated from those, and scripts for formatting that data.

XML

"eXtensible Markup Language", a text language for expressing hierarchical information. RootCause formats the APD files collected by tracing into an informal XML syntax which is then consumed by the GUI.
