by Caryl Takvorian
(April, 2003)
We want to hear from you! Please send us your FEEDBACK.
The following article may contain actual software programs in source code form. This source code is made available for developers to use as needed, pursuant to the terms and conditions of this license.
The Performance Tools included in the Forte Developer 7 IDE introduced support for profiling Java applications, but data collection only included machine representation of the Java Virtual Machine (JVM). Now with the release of Sun ONE Studio 8 Compiler Collection, developers have access to both machine and Java progamming language representations while analyzing their experiments.
This article aims to present the new Java programming language features of the Performance Tools in Sun ONE Studio 8 Compiler Collection, (but not to explain how to analyze the results).
To run the examples in this article, you must have the following software installed on your system:
Sun ONE Studio 8 Compiler Collection which includes the Performance Tools
Java 2 Platform, Standard Edition, version 1.4.2 (J2SE)
The jsynprog example program is
provided with the Sun ONE Studio 8 software release. The source code
and makefile can be found in the Performance Analyzer example
directory: install-directory/examples/analyzer/jsynprog.
To
compile the example with the default options, simply change to the
jsynprog directory and run the make
command:
$ cd
/opt/SUNWspro/examples/analyzer/jsynprog
$ make
The examples assume that the path of each software is defined in your
PATH environment variable.
Java version 1.4.2 is recommended
for all uses, both for Java data collection, and for running the
Analyzer, whether on Java codes or codes recorded on targets written
in the traditional languages, C, C++, and Fortran.
Java version 1.4.1 is acceptable for running the Analyzer, but is
not recommended for data collection, as all JVM housekeeping is shown
as the JVM functions themselves, and a significant amount of time
will be shown in the "<Unknown>
function."
Java version 1.4.0 can be used to do performance measurements for Java programs, although no Java mode profiling is supported. It also requires various OS patches.
An attempt to collect data with an earlier version of the Java platform will report an error, and exit. An attempt to bring up the Analyzer with an earlier version will report an error, but will allow you to continue.
To the typical Java platform developer, a Java application runs
just as any other program runs. The program begins at a main entry
point, typically named class.main. That method may then
call other methods, which may call other methods, just as a C or C++
application does.
To the operating system, a Java application, either written in
pure Java code or Java code mixed with C/C++, runs as a process
instantiating the Java Virtual Machine (JVM). The JVM is compiled
from C++ sources and starts execution at _start, which
calls main, and so forth. It reads a Java bytecode
program from .class and/or .jar files, and
performs the operations specified in that program. Among the
operations that can be specified are the dynamic loading of a shared
object, and calls into various functions or methods contained in that
object.
During execution of a Java application, most Java methods are interpreted by the JVM; these methods are referred to below as interpreted methods. Other Java methods may be dynamically compiled using the Java HotSpot virtual machine ("HotSpot"), and are referred to as compiled methods. Dynamically compiled methods are loaded into the data space of the application, and may be unloaded at a later stage. For any particular Java method, there will be an interpreted version, and there may also be one or more compiled versions. Java code may also call directly into compiled code, either C, C++, or native-compiled ("NJC") Java code; the targets of such calls are referred to as native methods.
The JVM does a number of things that are typically not done by applications written in traditional languages. When the JVM starts, it creates a number of regions of dynamically-generated code in its data space. One of these is the actual interpreter code used to process the application's Java bytecode methods.
During the interpretive execution, the Java HotSpot JVM monitors performance, and may decide to take one or more methods that it has been interpreting, generate machine code for it (or them), and execute the more efficient machine code version, rather than interpret the original version. That generated machine code is also in the data space of the process. In addition, other code is generated in the data space to execute the transitions between interpreted and compiled code.
A Java program is inherently multithreaded, and has one JVM thread
for each thread in the user's Java program. It also has several
housekeeping threads used for signal handling, memory management, and
HotSpot compilation. Depending on the version of libthread.so
used, there may be a one-to-one correspondence between threads and
lightweight processes (LWPs), or a more complex relationship. But for
any version of the library, at any instant a thread may be
unscheduled, or scheduled onto an LWP. Data for a thread is not
collected while that thread is not scheduled onto an LWP.
The Sun ONE Studio Performance Tools collect their data by recording events in the life of each Light-weight process of the process, along with the callstack at the time of the event. At any point in the execution of any application, a Java application or otherwise, the callstack represents where the program is in its execution, and how it got there. One important way that mixed-model Java applications differ from traditional C, C++, and Fortran applications is that at any instant during the run of the target there are two callstacks that are meaningful: a Java Virtual Machine callstack, and a machine callstack.
Clock and Hardware-counter profiling for Java programs work just as for C, C++, and Fortran programs, except that both JVM and machine callstacks are collected.
Synchronization tracing for Java programs is based on events generated when a thread attempts to acquire a Java Monitor. Both machine and JVM callstacks are collected for these events, but no synchronization tracing data is collected for internal locks used within the JVM.
Heap tracing data records object allocation events, generated by the user code, and object-deallocation events, generated by the garbage collector. In addition, any use ofmalloc,free, etc. also generates events that are recorded. Those events may come from native code, or from the JVM itself.
Java application profiling uses the Java Virtual Machine Profiling Interface (JVMPI) interface, which can cause some distortion and dilation of the run. For clock and hardware counter profiling, the data collection process makes various calls into the JVM, and handles profiling events in signal handlers. The overhead of these routines, and the cost of writing the experiments to disk will dilate the runtime of the Java program. For the most part, we estimate that dilation at less than 10%.
In addition, although the default garbage collector supports JVMPI, there are other garbage collectors that do not. Any data-collection run specifying such a garbage collector will get a fatal error.
For Heap profiling, the data collection process uses JVMPI events describing memory allocation and garbage collection, and that can cause significant dilation in runtime. In most Java applications there are a great many of these events, which will lead to large experiments and scalability problems processing the data. Furthermore, if these events are requested, the garbage collector disables some inlined allocations, costing additional CPU time for the longer allocation path.
For synchronization tracing, data collection uses other JVMPI events, which will cause dilation in proportion to the amount of monitor-contention in the application.
collect Command on the JVMThecollectcommand recognizes that the target supplied is a Java class file orjarfile, and automatically invokes the JVM with appropriate flags for data collection. Alternatively, the user may explicitly use a JVM as the target, and supply a -j on flag, so that the collect command can interpose the necessary arguments for data capture to the JVM argument list. For 64-bit JVM runs, the 64-bit JVM must explicitly be listed as the target, and the -j on flag supplied.
Basic options include:
-p value - Turn
on clock profiling
-h counter, value
- Hardware counter profiling
-s value -
Synchronization tracing
-H on - Heap
tracing
Unlike with C, C++ or Fortran, it
is not yet possible to collect data directly from dbx,
either when launching a process or when attaching to a running
process.
$/opt/SUNWspro/bin/collect -j on /usr/java/bin/java jsynprog Creating experiment database testcase.er ... Directing output to acct file... libpath:/Work/JavaProfiling/jsynprog jvm Java HotSpot(TM) Client VM Loading 32-bit library.... Running Routine.memalloc Running Routine.add_int Running Routine.add_double Running Sub_Routine.add_int Running Routine.has_inner_class Running Routine.recurse Running Routine.bounce Running Routine.array_op Running Routine.vector_op Running Routine.sys_op Running jsynprog.jni_JavaJavaC Running jsynprog.JavaCC Running jsynprog.JavaCJava
The data collected can then be analyzed
using either the Analyzer GUI program, or the command-line utility
er_print.
The Analyzer is a Java-based GUI program that can read one or more experiments, and allow you to navigate through the recorded data. It can show function, caller-callee, source, and disassembly data, as well as a Timeline showing the profiling events in time-sequence. It also has additional panes to show statistics data, a leak- and allocation-list, and information about the experiments recorded.
Further information about the Analyzer may found on theanalyzer(1)manpage.
er_print Commander_print
is a command-line program that can read one or more experiments, and
produce reports showing the data. The reports match the printed
output available from the Analyzer GUI.
Further information abouter_printmay found on theer_print(1)manpage.
The following examples are based on the
er_print utility and are limited to 10 entries to
improve readability, although some of the operations described, such
as Timelines, can only be viewed when using the Analyzer application.
There are three modes of displaying the data
from Java programs: the Java Representation, the Expert-Java
Representation, and the Machine Representation. By default, where the
data supports it, the Java representation is shown. The Analyzer has
a radio button in the Data Presentation Dialog for turning Java mode
to on, expert, or off.
Similarly, with er_print the command is , javamode
onjavamode expert,
or .
javamode off
The Java representation shows compiled and interpreted Java methods by name, and shows native methods in their natural form. During execution, there may be many instances of a particular Java method executed, the interpreted version, and, perhaps, one or more compiled versions. In the Java representation they are all shown aggregated as a single method.
$er_print testcase.er (er_print) limit 20 (er_print) functions Functions sorted by metric: Exclusive User CPU Time Excl. Incl. Name User CPU User CPU sec. sec. 46.372 46.372 <Total> 7.105 7.105 Routine.add_double 6.324 6.324 java.util.Vector.remove 5.474 5.474 Routine$1JInner.buildlist 4.393 4.393 Routine.memalloc 3.733 3.733 Java_jsynprog_JavaJavaC 3.713 3.713 cfunc(int) 1.881 1.881 Routine.add_int 1.531 1.531 <no Java callstack recorded> 1.431 1.431 Routine.bounce 1.431 1.431 Routine.recurse 1.331 1.331 Sub_Routine.addcall 0.921 0.921 jsynprog.javafunc 0.600 1.931 Sub_Routine.add_int 0.120 0.120 Routine.sys_op 0.080 0.080 java.util.Vector.add 0.070 0.070 Routine.allocate_array 0.020 0.110 Routine.allocate_vector 0.020 6.455 Routine.vector_op 0.010 0.010 AbstractInterpreter::method_kind(methodHandle) (er_print)
A PC for a Java method in the Java representation corresponds to the method id and a bytecode index into that method; a PC for a native function corresponds to a machine PC. The callstack for a JVM thread may have a mixture of Java Virtual Machine PCs and machine PCs. It will not have any frames corresponding to Java housekeeping code, which does not have a Java representation.
(er_print) pcs Objects sorted by metric: Exclusive User CPU Time Excl. Incl. Name User CPU User CPU sec. sec. 46.372 46.372 <Total> 7.095 7.095 Routine.add_double <HotSpot-compiled leaf instructions> 6.324 6.324 java.util.Vector.remove <HotSpot-compiled leaf instructions> 5.474 5.474 Routine$1JInner.buildlist <HotSpot-compiled leaf instructions> 4.393 4.393 Routine.memalloc + 0x00000011 1.881 1.881 Routine.add_int <HotSpot-compiled leaf instructions> 1.851 1.851 cfunc(int) + 0x00000038 1.551 1.551 Java_jsynprog_JavaJavaC + 0x00000040 1.511 1.511 <no Java callstack recorded> + 0x00000000 1.421 1.421 Routine.bounce <HotSpot-compiled leaf instructions> 1.421 1.421 Routine.recurse <HotSpot-compiled leaf instructions> 1.331 1.331 Sub_Routine.addcall <HotSpot-compiled leaf instructions> 0.911 0.911 jsynprog.javafunc <HotSpot-compiled leaf instructions> 0.590 0.590 Java_jsynprog_JavaJavaC + 0x00000050 0.590 0.590 Java_jsynprog_JavaJavaC + 0x0000005C 0.580 0.580 Sub_Routine.add_int <HotSpot-compiled leaf instructions> 0.540 0.540 cfunc(int) + 0x00000048 0.390 0.390 cfunc(int) + 0x00000054 0.330 0.330 Java_jsynprog_JavaJavaC + 0x00000058 0.260 0.260 cfunc(int) + 0x00000044 (er_print)
The function list in the Java representation shows metrics against the Java methods, and any native methods called. The callers-callees panel shows the calling relationships in the Java representation. For the housekeeping threads, only a machine callstack is obtained.
Source for a Java method
corresponds to the source code in the .java file from
which it was compiled, with metrics on each source line. The
disassembly of any Java method shows the bytecode generated for it,
with metrics against each bytecode, and interleaved Java source,
where available.
The Timeline in the Java representation shows all threads, both Java threads, and housekeeping threads. Each Java thread's event callstack is shown with its Java methods.
Java programs allocate memory for instantiating classes and storing data, but, unlike C and C++ applications, they do not explicitly deallocate the memory. Rather, memory is managed by the garbage collector. That code, a part of the JVM, periodically scans memory to find allocated areas that are no longer pointed to elsewhere in the program, and reclaims the memory, deallocating it and making it available for other uses.
Heap tracing in the Java representation is based on the Java memory management, and JVMPI events; data from normal Heap tracing is shown in the Java representation, as well.
All Java programs may have explicit synchronization, usually performed by calling the monitor-enter routine. Synchronization-delay tracing in the Java representation is based on the JVMPI synchronization events. Data from the normal synchronization tracing is not shown in the Java representation.
The Expert-Java representation is similar to the Java representation, except that HotSpot-compiled methods are shown independently of the interpreted version of the method, and some of the details of the JVM internals that are suppressed in the Java representation are exposed in the Expert-Java representation.
$er_print testcase.er (er_print) limit 20 (er_print) javamode expert (er_print) functions Functions sorted by metric: Exclusive User CPU Time Excl. Incl. Name User CPU User CPU sec. sec. 46.372 46.372 <Total> 7.105 7.105 Routine.add_double 6.324 6.324 java.util.Vector.remove 5.474 5.474 Routine$1JInner.buildlist 4.393 4.393 Routine.memalloc 3.733 3.733 Java_jsynprog_JavaJavaC 3.713 3.713 cfunc(int) 1.881 1.881 Routine.add_int 1.431 1.431 Routine.bounce 1.431 1.431 Routine.recurse 1.331 1.331 Sub_Routine.addcall 0.921 0.921 jsynprog.javafunc 0.801 0.911 constantPoolKlass::oop_follow_contents(oopDesc*) 0.700 0.700 gettimeofday 0.600 1.931 Sub_Routine.add_int 0.560 0.560 constantPoolKlass::oop_adjust_pointers(oopDesc*) 0.520 0.520 __div64 0.360 0.580 ContiguousSpace::prepare_for_compaction(CompactPoint*) 0.280 0.280 SymbolTable::unlink() 0.270 0.270 MarkSweep::AdjustPointerClosure::do_oop(oopDesc**) (er_print)
(er_print) pcs Objects sorted by metric: Exclusive User CPU Time Excl. Incl. Name User CPU User CPU sec. sec. 46.372 46.372 <Total> 7.095 7.095 Routine.add_double <HotSpot-compiled leaf instructions> 6.324 6.324 java.util.Vector.remove <HotSpot-compiled leaf instructions> 5.474 5.474 Routine$1JInner.buildlist <HotSpot-compiled leaf instructions> 4.393 4.393 Routine.memalloc + 0x00000011 1.881 1.881 Routine.add_int <HotSpot-compiled leaf instructions> 1.851 1.851 cfunc(int) + 0x00000038 1.551 1.551 Java_jsynprog_JavaJavaC + 0x00000040 1.421 1.421 Routine.bounce <HotSpot-compiled leaf instructions> 1.421 1.421 Routine.recurse <HotSpot-compiled leaf instructions> 1.331 1.331 Sub_Routine.addcall <HotSpot-compiled leaf instructions> 0.911 0.911 jsynprog.javafunc <HotSpot-compiled leaf instructions> 0.590 0.590 Java_jsynprog_JavaJavaC + 0x00000050 0.590 0.590 Java_jsynprog_JavaJavaC + 0x0000005C 0.580 0.580 Sub_Routine.add_int <HotSpot-compiled leaf instructions> 0.550 0.550 gettimeofday + 0x00000010 0.540 0.540 cfunc(int) + 0x00000048 0.460 0.460 __div64 + 0x00000018 0.390 0.390 cfunc(int) + 0x00000054 0.330 0.330 Java_jsynprog_JavaJavaC + 0x00000058 (er_print)
The Machine Representation shows functions from the JVM itself, rather than from the Java application being interpreted by the JVM, along with any compiled methods and native methods. The machine representation looks the same as that for any other application in a traditional language. The callstack shows JVM frames, native frames, and compiled-method frames. Some of the JVM frames represent transition code between interpreted Java code, compiled Java code, and native code.
In the machine representation, multiple HotSpot compilations of a given method will be shown as completely independent functions, although the functions will all have the same name. All functions from the JVM will be shown as such.
Source from compiled methods will be shown against the Java source code; the data will represent the specific instance of the compiled-method selected. Disassembly for compiled methods will show the generated machine assembler code, not the Java bytecode. Callers-callees relationships will show all overhead frames, and all frames representing the transitions between interpreted, compiled, and native methods.
The Timeline in the machine representation will show bars for all threads LWPs or CPUs, and the callstack in each will be the machine representation callstack.
In the machine representation, memory is allocated and deallocated by the JVM, typically in very large chunks. Memory allocation from the Java code is handled entirely by the JVM and its garbage collector. Heap tracing will not show JVM allocations, since they are done by mapping memory, rather than calling the normal heap routines, and will not show any information about the Java code memory allocation and garbage collection.
In the machine representation,
thread synchronization devolves into calls to _lwp_mutex_lock,
and no synchronization data is shown, since these calls are not
traced.
$er_print testcase.er (er_print) limit 20 (er_print) javamode off (er_print) functions Functions sorted by metric: Exclusive User CPU Time Excl. Incl. Name User CPU User CPU sec. sec. 46.372 46.372 <Total> 7.095 7.095 Routine.add_double 6.324 6.324 java.util.Vector.remove 5.474 5.474 Routine$1JInner.buildlist 4.373 4.403 instanceKlass::allocate_objArray(int,int,Thread*) 3.733 3.733 Java_jsynprog_JavaJavaC 3.713 3.713 cfunc(int) 1.881 1.881 Routine.add_int 1.421 1.421 Routine.bounce 1.421 1.421 Routine.recurse 1.331 1.331 Sub_Routine.addcall 0.911 0.911 jsynprog.javafunc 0.801 0.911 constantPoolKlass::oop_follow_contents(oopDesc*) 0.710 0.710 gettimeofday 0.580 0.580 Sub_Routine.add_int 0.560 0.560 constantPoolKlass::oop_adjust_pointers(oopDesc*) 0.530 0.530 __div64 0.360 0.580 ContiguousSpace::prepare_for_compaction(CompactPoint*) 0.280 0.280 SymbolTable::unlink() 0.270 0.270 MarkSweep::AdjustPointerClosure::do_oop(oopDesc**) (er_print)
(er_print) pcs Objects sorted by metric: Exclusive User CPU Time Excl. Incl. Name User CPU User CPU sec. sec. 46.372 46.372 <Total> 5.034 5.034 java.util.Vector.remove + 0x000001D8 3.933 3.933 instanceKlass::allocate_objArray(int,int,Thread*) + 0x00000304 3.673 3.673 Routine.add_double + 0x00000088 1.851 1.851 cfunc(int) + 0x00000038 1.551 1.551 Java_jsynprog_JavaJavaC + 0x00000040 1.431 1.431 Routine$1JInner.buildlist + 0x00000078 1.411 1.411 Routine$1JInner.buildlist + 0x00000090 1.321 1.321 Routine$1JInner.buildlist + 0x00000084 1.311 1.311 Routine$1JInner.buildlist + 0x00000074 1.241 1.241 java.util.Vector.remove + 0x000001D0 0.841 0.841 Routine.bounce + 0x00000088 0.690 0.690 Routine.recurse + 0x00000088 0.590 0.590 Java_jsynprog_JavaJavaC + 0x00000050 0.590 0.590 Java_jsynprog_JavaJavaC + 0x0000005C 0.570 0.570 Routine.add_double + 0x00000078 0.560 0.560 Routine.add_double + 0x00000084 0.560 0.560 Routine.add_double + 0x00000090 0.560 0.560 gettimeofday + 0x00000010 0.540 0.540 cfunc(int) + 0x00000048 (er_print)
If you want to programatically control data
collection, the Sun ONE Studio 8 Compiler Collection Performance
Tools include some API functions which can be used through the
CollectorAPI class. To use these APIs, simply import the
CollectorAPI package into your Java program:
import
com.sun.forte.st.collector.CollectorAPI;
The application must then be invoked with a
classpath pointing to <install_dir>/lib/libcollector.jar
where <install_dir> is the directory into which
the Sun ONE Studio 8 Compiler Collection was installed:
$javac -classpath
/opt/SUNWspro/lib/collector.jar:. jsynprog.java
The Java class CollectorAPI
defines the following methods:
CollectorAPI.sample(String name)
CollectorAPI.pause()
CollectorAPI.resume()
CollectorAPI.threadPause(Thread thread)
CollectorAPI.threadResume(Thread thread)
CollectorAPI.terminate()
In order to collect performance data you must run your application using the Collector. Inserting the API methods does not enable data collection by itself.
Apart from CollectorAPI.ThreadPause()
and CollectorAPI.ThreadResume(), the API methods apply
to the application and not to individual threads. Therefore, in a
multi-threaded application you should make sure that only one thread
makes calls to the CollectorAPI methods.
DOC ID# 1840