Collecting diagnostic information for BPM

From time to time, you may experience some kind of issue in your BPM environment. Issues could be caused by a wide variety of reasons – changes to the environment, the pattern of load on the environment, product defects, bad process design, insufficient resources allocated to the environment, network instability – just to name a few!

When something goes wrong, it is important to know how to collect the diagnostic information that will be needed to analyse the problem, work out the root cause, and come up with a resolution. In some cases, you may be able to do this analysis yourself. In other cases you may need to involve specialists like network engineers, directory administrators, or Oracle Support, for example.

Let’s take a look at the kinds of diagnostic information that may be needed. Of course, it may not be necessary to collect all of these for any given issue. If you are unsure, then it is a good idea to collect them anyway, just in case you need them.

Note: The purpose of this article is to tell you how to collect the data, not how to analyse it. Sometimes that analysis requires specialist skills and experience, but even then, those specialists rely on having access to the data.

BPM server/cluster configuration files

The first thing that you will want to collect is the configuration files for your environment. There are many different types of configurations that are possible, and these files contain the information necessary for someone to understand exactly how your particular environment is configured.

These files are located inside your WebLogic domain’s home directory, in the config directory. You will see files and directories like this:

config
|-- config.xml
|-- configCache
|-- deployments
|-- diagnostics
|-- fmwconfig
|-- jdbc
|-- jms
|-- nodemanager
|-- security
`-- startup

You can just zip up this whole directory to collect the files. You might use a command like this for example:

tar czvf bpm_config.tar.gz /home/oracle/fmwhome/user_projects/domains/base_domain/config

Note: All of the examples in this post show the Oracle Middleware home as /home/oracle/fmwhome and the WebLogic domain name as base_domain. You will need to adjust these to suit your own environment.
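
If you collect these files often, it may help to wrap the command in a small function that stamps the archive with the date and time. The following is only a sketch: collect_config is a made-up helper name, not part of any Oracle tooling, and the paths are the ones assumed throughout this post.

```shell
# Sketch: archive a WebLogic domain's config directory under a timestamped name.
# collect_config is a hypothetical helper, not part of any Oracle tooling.
collect_config() {
    domain_home=$1                      # e.g. .../user_projects/domains/base_domain
    out_dir=${2:-.}                     # where to write the archive
    if [ ! -d "$domain_home/config" ]; then
        echo "no config directory under $domain_home" >&2
        return 1
    fi
    archive="$out_dir/bpm_config_$(date +%Y%m%d_%H%M%S).tar.gz"
    # -c creates an archive; -C keeps the paths inside it relative to the domain home
    tar -czf "$archive" -C "$domain_home" config && echo "$archive"
}

# Example (adjust the paths to suit your own environment):
# collect_config /home/oracle/fmwhome/user_projects/domains/base_domain /tmp
```

The function prints the name of the archive it wrote, so you can pass that straight on to whatever you use to upload or copy the file.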

The next useful piece of information to capture is a list of which patches (if any) you have installed in your environment. The best way to collect this information is to capture the output of the opatch lsinventory command. You should run this twice, first with ORACLE_HOME set to the Oracle_SOA1 directory under your install directory, and second with it set to the oracle_common directory under your install directory.

The example below shows running the opatch lsinventory command for ORACLE_HOME=/home/oracle/fmwhome/Oracle_SOA1 and the output, which in this case shows that no patches have been installed. In this example, you would also run it again with ORACLE_HOME=/home/oracle/fmwhome/oracle_common.

[oracle@ps5 Oracle_SOA1]$ export ORACLE_HOME=/home/oracle/fmwhome/Oracle_SOA1
[oracle@ps5 Oracle_SOA1]$ export PATH=$ORACLE_HOME/OPatch:$PATH
[oracle@ps5 Oracle_SOA1]$ opatch lsinventory
Oracle Interim Patch Installer version 11.1.0.9.0
Copyright (c) 2011, Oracle Corporation.  All rights reserved.

Oracle Home       : /home/oracle/fmwhome/Oracle_SOA1
Central Inventory : /home/oracle/oraInventory
   from           : /home/oracle/fmwhome/Oracle_SOA1/oraInst.loc
OPatch version    : 11.1.0.9.0
OUI version       : 11.1.0.9.0
OUI location      : /home/oracle/fmwhome/Oracle_SOA1/oui
Log file location : /home/oracle/fmwhome/Oracle_SOA1/cfgtoollogs/opatch/opatch2012-12-20_11-18-33AM_1.log

Patch history file: /home/oracle/fmwhome/Oracle_SOA1/cfgtoollogs/opatch/opatch_history.txt

OPatch detects the Middleware Home as "/home/oracle/fmwhome"

Lsinventory Output file location : /home/oracle/fmwhome/Oracle_SOA1/cfgtoollogs/opatch/lsinv/lsinventory2012-12-20_11-18-33AM.txt

--------------------------------------------------------------------------------
Installed Top-level Products (1): 

Oracle SOA Suite 11g                                                 11.1.1.6.0
There are 1 products installed in this Oracle Home.

There are no Interim patches installed in this Oracle Home.

--------------------------------------------------------------------------------

OPatch succeeded.
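
To avoid forgetting the second run, you could loop over both Oracle homes and save each report to a file. This is a sketch under the assumptions used in this post: the two home names are Oracle_SOA1 and oracle_common, each has its own OPatch directory, and collect_patch_inventory is a hypothetical helper name.

```shell
# Sketch: capture `opatch lsinventory` for each Oracle home under a Middleware home.
# collect_patch_inventory is a hypothetical helper; the home names are the ones
# used in this post and may differ in your install.
collect_patch_inventory() {
    mw_home=$1
    out_dir=${2:-.}
    for home in Oracle_SOA1 oracle_common; do
        if [ ! -x "$mw_home/$home/OPatch/opatch" ]; then
            echo "skipping $home: no OPatch found" >&2
            continue
        fi
        # Set ORACLE_HOME only for this one invocation
        ORACLE_HOME="$mw_home/$home" "$mw_home/$home/OPatch/opatch" lsinventory \
            > "$out_dir/lsinventory_$home.txt" 2>&1
    done
}

# Example:
# collect_patch_inventory /home/oracle/fmwhome /tmp
```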

BPM log files

The information we have collected already is generic in nature and is used to ensure the domain configuration is correct and there are no obvious problems. From this point on, we are looking at information that is used to analyse a specific problem.

The server log and ‘out’ files are often the very first place we will look when there is a problem. These files will usually contain error messages that will give some information about the cause of the problem.

You can use a command like this to collect the logs. Remember to collect the logs from your AdminServer and each of your managed servers.

tar czvf soa_server1_logs.tar.gz /home/oracle/fmwhome/user_projects/domains/base_domain/servers/soa_server1/logs

This will also collect the diagnostic_images if there are any available. These provide additional information about certain problems.

It is important to understand that a problem may occur only on one server, or on a number of servers. This is why it is important to collect the logs from all of the servers. Sometimes it is necessary to analyse data from several sources in order to understand what was happening in the environment.
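
Since you need the logs from every server, a loop over the servers directory saves some typing. The sketch below assumes the standard domain layout shown above, with one logs directory per server; collect_server_logs is a hypothetical helper name. It only sees the servers that have directories on the machine where you run it, so in a multi-machine cluster you would run it on each machine.

```shell
# Sketch: archive the logs directory of every server found in a domain.
# collect_server_logs is a hypothetical helper name.
collect_server_logs() {
    domain_home=$1
    out_dir=${2:-.}
    for logs in "$domain_home"/servers/*/logs; do
        [ -d "$logs" ] || continue          # glob matched nothing, or not a directory
        server=$(basename "$(dirname "$logs")")
        # one archive per server, e.g. soa_server1_logs.tar.gz
        tar -czf "$out_dir/${server}_logs.tar.gz" -C "$domain_home/servers/$server" logs
    done
}

# Example:
# collect_server_logs /home/oracle/fmwhome/user_projects/domains/base_domain /tmp
```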

Sometimes, during the analysis of a problem, you may be asked to turn on some debug/trace settings and attempt to recreate the problem. If this happens, the output from those traces almost always ends up in these logs.

Incident logs

WebLogic collects some data by default when various ‘incidents’ occur, for example when a ‘stuck thread’ is encountered. The data collected depends on the incident, but it usually contains things like thread dumps, logs, and error messages.

These data are stored inside the server directories in your domain directory. To collect them, you could use a command like the example below. Remember to collect the incident logs for your AdminServer and each of your managed servers.

tar czvf incident_logs.tar.gz /home/oracle/fmwhome/user_projects/domains/base_domain/servers/soa_server1/adr/diag/ofm/base_domain/soa_server1/incident

Thread dumps

A thread dump is a snapshot of what is happening in the server at a particular point in time. It allows us to see what each thread in the server process is doing. This information is helpful to understand how the server is behaving and what it is doing.

You can take a thread dump in a variety of ways, and how you do it depends on your operating system, how you started the server, e.g. whether you started it from a command line or the node manager, and whether the server has become unresponsive.

Here are some of the common ways to take a thread dump:

  • Pressing Ctrl-Break on Windows, or Ctrl-\ on Linux/Solaris/etc. in the window running the WebLogic process (in the foreground),
  • Sending signal 3 (SIGQUIT) to the process (kill -3 PID),
  • Connecting to the process with a utility like jvisualvm and pressing the Thread Dump button,
  • Requesting a thread dump in the WebLogic Server console by navigating to the server, then the Monitoring tab and the Threads sub-tab and pressing the Dump Thread Stacks button,
  • Running jstack PID (or jrcmd PID print_threads for JRockit).

Most of the time, more than one thread dump will be required. A series of thread dumps over some time period is needed in order to understand how the server is behaving over time. For example, a thread dump might show that a particular thread is ‘stuck’. Another (later) thread dump will be needed to see if that thread becomes unstuck by itself later on (as commonly happens) or not. Thus the two thread dumps together would be necessary to determine if the stuck thread was a problem or not.

It is also important to take thread dumps on all of the servers that are (or could possibly be) affected by or contributing to the problem. If in doubt, take thread dumps on all of the servers.

As a general rule of thumb, you should take five thread dumps over a period of time. How do you work out a suitable period of time? If you have a specific problem, for example you see some error message and then a minute later all of your servers become unresponsive, then the time period is that minute. Take a thread dump when you first see the error message appear, then one every 20 seconds (or so). If you don’t have any way to guess the suitable time period, just take them a minute apart.
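
Taking a series of dumps by hand is fiddly when you are in the middle of a problem, so you may prefer to script it. Here is a minimal sketch using jstack, defaulting to five dumps twenty seconds apart as suggested above. take_thread_dumps is a hypothetical helper name, and the sketch assumes jstack is on your PATH and that you run it as the same operating system user that owns the server process.

```shell
# Sketch: take a series of thread dumps from one JVM at a fixed interval.
# take_thread_dumps is a hypothetical helper name.
take_thread_dumps() {
    pid=$1
    count=${2:-5}                       # how many dumps to take
    interval=${3:-20}                   # seconds between dumps
    out_dir=${4:-.}
    i=1
    while [ "$i" -le "$count" ]; do
        # each dump goes to its own timestamped, numbered file
        jstack "$pid" > "$out_dir/threaddump_${pid}_$(date +%Y%m%d_%H%M%S)_$i.txt" 2>&1
        # no need to sleep after the final dump
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}

# Example: five dumps, one minute apart, for the JVM with process ID 12345
# take_thread_dumps 12345 5 60 /tmp
```

If jstack cannot attach because the server is unresponsive, kill -3 PID in the same kind of loop is an alternative. In that case the dumps are written to the server's standard output, so you would look for them in the ‘out’ file rather than in separate files.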

The example below shows what the output from the thread dump looks like. Note that many lines have been removed from this output.

2012-12-31 10:26:12
Full thread dump Java HotSpot(TM) 64-Bit Server VM (20.10-b01 mixed mode):

"JMX server connection timeout 48" daemon prio=10 tid=0x00007fabf8006800 nid=0x232f in Object.wait() [0x00007fac330b4000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00000000f61ad8c0> (a [I)
	at com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(ServerCommunicatorAdmin.java:150)
	- locked <0x00000000f61ad8c0> (a [I)
	at java.lang.Thread.run(Thread.java:662)

(many lines deleted)

"main" prio=10 tid=0x00007facc4008800 nid=0x21db in Object.wait() [0x00007facc9f38000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00000000e0bb22a0> (a weblogic.t3.srvr.T3Srvr)
	at java.lang.Object.wait(Object.java:485)
	at weblogic.t3.srvr.T3Srvr.waitForDeath(T3Srvr.java:981)
	- locked <0x00000000e0bb22a0> (a weblogic.t3.srvr.T3Srvr)
	at weblogic.t3.srvr.T3Srvr.run(T3Srvr.java:490)
	at weblogic.Server.main(Server.java:71)

"VM Thread" prio=10 tid=0x00007facc406e000 nid=0x21e4 runnable 

"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007facc401b800 nid=0x21dc runnable 

"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007facc401d800 nid=0x21dd runnable 

"VM Periodic Task Thread" prio=10 tid=0x00007facc40ad000 nid=0x21eb waiting on condition 

JNI global references: 1601

Heap
 PSYoungGen      total 111744K, used 71412K [0x00000000f5560000, 0x00000000fdaa0000, 0x0000000100000000)
  eden space 89856K, 76% used [0x00000000f5560000,0x00000000f985f160,0x00000000fad20000)
  from space 21888K, 12% used [0x00000000fc540000,0x00000000fc7fe198,0x00000000fdaa0000)
  to   space 23296K, 0% used [0x00000000fad20000,0x00000000fad20000,0x00000000fc3e0000)
 PSOldGen        total 174784K, used 65947K [0x00000000e0000000, 0x00000000eaab0000, 0x00000000f5560000)
  object space 174784K, 37% used [0x00000000e0000000,0x00000000e4066c60,0x00000000eaab0000)
 PSPermGen       total 131072K, used 125060K [0x00000000d0000000, 0x00000000d8000000, 0x00000000e0000000)
  object space 131072K, 95% used [0x00000000d0000000,0x00000000d7a211b8,0x00000000d8000000)

Heap dumps

Another kind of dump that may be required for some problems is a heap dump. A heap dump is essentially a copy of everything that the JVM has in memory (in the heap) at a particular point in time. These are usually going to be pretty big files – they will be at least as big as the amount of used heap. So if you are running your BPM managed server with an 8GB heap, and it is 75% in use when you take the heap dump, then the heap dump is going to be about 6GB in size.

Heap dumps are used to look at the contents of the JVM’s memory in detail. They allow us to look at every object in the JVM and see the state of those objects.

Heap dumps are often used to diagnose a class of problems called ‘memory leaks’. While a single heap dump can lead us to suspect a memory leak, two heap dumps (from the same JVM at different times) are needed to confirm that a memory leak actually exists.

Heap dumps are also useful for other kinds of problems, where we need to look at the contents of various objects to understand what the server is doing.

It is a good practice to collect heap dumps when problems occur, but you should not send them to Oracle unless they are requested. Since they are so large, you may also wish to compress them and delete them after the problem they relate to has been resolved.

You can generate a heap dump from a tool like jvisualvm by pressing the Heap Dump button.

You can also collect a heap dump with jmap, using a command like the one below (replace pid with the process ID of the server):

jmap -dump:format=b,file=heap_dump_1.bin pid

If the problem is suspected to be a memory leak, you may be asked to carry out the following steps:

  • allow the server to come to a steady state after startup,
  • perform six full garbage collections (by pressing the Perform GC button, next to the Heap Dump button, six times),
  • take a heap dump,
  • attempt to reproduce the issue, i.e. do whatever it is you do to make the problem occur,
  • take another heap dump.
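
If you do not have a graphical tool connected to the server, the same sequence can be approximated from the command line. Treat the following as a sketch rather than a recommended procedure: it swaps the Perform GC button for jmap -histo:live, which on HotSpot is generally understood to trigger a full collection before it counts live objects (an assumption you may want to confirm against your GC log). force_full_gcs and take_heap_dump are hypothetical helper names.

```shell
# Sketch: force N full collections. This relies on the assumption that
# `jmap -histo:live` triggers a full GC on HotSpot before counting live objects.
force_full_gcs() {
    pid=$1
    n=${2:-6}
    i=0
    while [ "$i" -lt "$n" ]; do
        jmap -histo:live "$pid" > /dev/null     # histogram itself is discarded
        i=$((i + 1))
    done
}

# Write a binary heap dump, using the same jmap form shown earlier.
take_heap_dump() {
    pid=$1
    file=$2
    jmap -dump:format=b,file="$file" "$pid"
}

# Example sequence for a suspected leak (PID and file names are placeholders):
# force_full_gcs 12345 6
# take_heap_dump 12345 heap_dump_1.bin
# ... reproduce the problem ...
# take_heap_dump 12345 heap_dump_2.bin
```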

Another good practice is to ensure that you have configured WebLogic to automatically take a heap dump if it runs out of memory. This is done by adding the following parameter to the JVM:

-XX:+HeapDumpOnOutOfMemoryError

This setting often saves a lot of pain – if your server crashes because it ran out of memory, then this setting is pretty likely to capture the information needed to work out what went wrong. If you do not have this setting, you would need to add it, and wait for the problem to happen again. It is safe to have this setting on all of your production servers. Note that it takes some time to take a heap dump (how long depends on the size of the heap and the speed of your disks) so there is a trade-off here – collecting the information needed to fix the problem will mean that your server restart will take a bit longer, as you will have to wait for the heap dump to finish before you restart the server(s).

Garbage Collection logs

Garbage collection logs are very useful for analysing memory related issues. The JVM will not produce these logs by default; you need to tell it to produce them.

These three settings will cause the JVM to print out more detailed information about garbage collection and to produce a log (called gc.log in this example) that contains garbage collection statistics and information that is very useful when trying to do some JVM tuning:

    -XX:+PrintGCTimeStamps
    -XX:+PrintGCDetails
    -Xloggc:gc.log

And, as mentioned in the previous section, it is also a good idea to turn on this setting:

    -XX:+HeapDumpOnOutOfMemoryError

These settings are safe to leave on all the time in your production environment.
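
One way to apply all four settings consistently is to build them in one place and append them to the JVM arguments before the server starts. The sketch below assumes you start your servers from scripts that honour the JAVA_OPTIONS variable (for example by editing the domain's setDomainEnv.sh); servers started through the node manager may need the same arguments in their server start configuration instead. diag_jvm_opts is a hypothetical helper name, and -XX:HeapDumpPath is an additional HotSpot flag, not mentioned above, that controls where the heap dump is written.

```shell
# Sketch: build the diagnostic JVM flags discussed above for one server.
# diag_jvm_opts is a hypothetical helper; -XX:HeapDumpPath is an extra HotSpot
# flag (not from the text above) that chooses where the heap dump is written.
diag_jvm_opts() {
    log_dir=$1
    printf '%s ' \
        "-XX:+PrintGCTimeStamps" \
        "-XX:+PrintGCDetails" \
        "-Xloggc:$log_dir/gc.log" \
        "-XX:+HeapDumpOnOutOfMemoryError"
    printf '%s\n' "-XX:HeapDumpPath=$log_dir"
}

# Example: append to JAVA_OPTIONS before the server is started.
# JAVA_OPTIONS="$JAVA_OPTIONS $(diag_jvm_opts /home/oracle/fmwhome/user_projects/domains/base_domain/servers/soa_server1/logs)"
# export JAVA_OPTIONS
```

Pointing both the GC log and the heap dump at the server's logs directory means they are picked up along with the other logs, but make sure that file system has room for a dump as large as your heap.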

Database information – AWR reports

Many performance related issues may have to do with the underlying database. For this reason, it is important to capture some information about the database performance as well. You should collect the AWR reports for the same period during which you observed the problem in BPM. To be on the safe side, start a little earlier and end a little later. For example, if the problem occurred from 10am until noon, you might collect AWR reports from 9am to 1pm.

You can find more information about what AWR reports are and how to collect them in this post.

HTTP Server logs

For some kinds of problems, it is useful to see the logs from the HTTP Server (if any) which is in front of your BPM server or cluster. These are often useful if you are getting refused connections for example.

You should gather the following logs:

  access.log  
  error.log

If you are using Oracle Web Tier (or Oracle HTTP Server), these logs will be located in the following directory, assuming your Oracle Web Tier Home is /home/oracle/httphome and you used the default names for the instance:

/home/oracle/httphome/Oracle_WT1/instances/instance1/diagnostics/logs/OHS/ohs1

Debug logs for the WebLogic plugin may also be useful if you are seeing nodes being evicted from the cluster or if you suspect that the cluster is unbalanced – e.g. you can see a different number of sessions on each node in the cluster.

To obtain these, you need to set DEBUG=ALL in the httpd-vhosts.conf file. This will produce a log called wlproxy.log.

Operating system level information

Sometimes performance information from the operating system level can be helpful as well. You might want to consider using tools like top or prstat (with thread/’lightweight process’ support), sar, vmstat, mpstat, iostat, and netstat. If you have a possibly network-related issue, for example loss of communications between cluster members, then tcpdump may also capture useful information.

Remember, if you are running a cluster, you would need to collect these on all nodes in the cluster at the same time.
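
As a rough sketch of what this could look like on one node, the function below runs a few of these samplers side by side for a fixed number of samples and writes each to its own file. It assumes vmstat, iostat and mpstat accept an interval and a count as their two arguments, which is the usual form on Linux and Solaris; collect_os_stats is a hypothetical helper name, and any tool that is not installed is skipped.

```shell
# Sketch: sample OS-level statistics from several tools over the same window.
# collect_os_stats is a hypothetical helper name.
collect_os_stats() {
    out_dir=$1
    interval=${2:-5}                    # seconds between samples
    count=${3:-12}                      # number of samples per tool
    for tool in vmstat iostat mpstat; do
        if command -v "$tool" > /dev/null 2>&1; then
            # run the samplers in parallel so they cover the same time window
            "$tool" "$interval" "$count" > "$out_dir/${tool}_$(uname -n).txt" 2>&1 &
        else
            echo "$tool not found, skipping" >&2
        fi
    done
    wait                                # block until every sampler has finished
}

# Example: one sample every 5 seconds for a minute, written to /tmp
# collect_os_stats /tmp 5 12
```

The host name is included in each file name so that output gathered from several cluster nodes can be copied into one directory without the files overwriting each other.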

Java information

There are also several Java tools that can help you to collect additional information. If you are not familiar with these, it might be a good idea to explore what they can do for you. I would suggest looking at jps, jstat, jinfo, jstack, jmap, and jtop.
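
To give a flavour of three of them, here is a small sketch that records which JVMs are running (jps), the flags of one JVM (jinfo), and a short series of heap and garbage collection utilisation samples (jstat). snapshot_jvm is a hypothetical helper name, and the sketch assumes the JDK tools are on your PATH and that you run them as the user that owns the server process.

```shell
# Sketch: capture a quick picture of one JVM with the standard JDK tools.
# snapshot_jvm is a hypothetical helper name.
snapshot_jvm() {
    pid=$1
    out_dir=${2:-.}
    interval_ms=${3:-5000}              # jstat sampling interval in milliseconds
    samples=${4:-12}                    # number of jstat samples
    # all local JVMs with their main class and JVM arguments
    jps -lv > "$out_dir/jps.txt" 2>&1
    # flags in effect for the target JVM
    jinfo -flags "$pid" > "$out_dir/jinfo_$pid.txt" 2>&1
    # heap occupancy percentages and GC counts/times, sampled over time
    jstat -gcutil "$pid" "$interval_ms" "$samples" > "$out_dir/jstat_$pid.txt" 2>&1
}

# Example: one minute of jstat samples for the JVM with process ID 12345
# snapshot_jvm 12345 /tmp 5000 12
```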

How to send information to Oracle Support

If you need help with the problem, you should contact Oracle Support and open a Service Request (SR). The Oracle Support system will allow you to upload attachments to the SR so that you can provide information you have collected. If the files are large, like a heap dump for example, then you should upload them to Oracle Support’s FTP server instead. Support will give you instructions on how to access the ftp server and where to put your files.


BPMN process editor problems in 11.1.1.6 (update)

I wrote some time ago (in this post) about a patch for some issues with the layout in the BPMN process editor in 11.1.1.6.  I know that a lot of folks have contacted Support to ask for the patch that I mentioned in that post, and I know that some of you were told by Support that there was no patch available.

We have worked with Support to fix this problem, and I am happy to say that the patch is available to download from Oracle Support now.  I hope you did not have too much inconvenience.

The Patch number is 13088538: NPE IN O.BPM.UI.LAYOUT.MIGLAYOUT:114.


A review of Oracle SOA Suite 11g Administrator’s Handbook

Highly recommended, a tour de force.

Packt’s new Oracle SOA Suite 11g Administrator’s Handbook by Ahmed Aboulnaga and Arun Pareek is packed full of essential information for the Oracle SOA administrator. In fact, I would go so far as to say that it should be required reading for administrators who are new to the Oracle SOA Suite platform.  I think that reading it would greatly shorten the learning curve and help new administrators avoid many common problems or points of confusion.

More so than any other single piece of content that I have seen on the topic, it provides the information that a SOA administrator needs to know in order to successfully configure, manage, monitor, troubleshoot and backup an Oracle SOA environment.

It is clear and to the point, it presents just the information that you need, and the information is easy to find.  It is not cluttered up with a whole bunch of extra information you don’t need.  It is detailed and technical – providing information that you can use.  I think the book is not only a great introduction for a new administrator who needs to get a feel for Oracle SOA Suite, but it is also a great reference volume to keep on hand, even for experienced administrators.

It is obvious when reading the book that the authors have extensive experience and that they know what is important to their audience.  I have been working with Oracle SOA Suite for several years now, since 10g days, and I am one of the authors of the official Oracle SOA Suite Certification question base, and even I learned things from this book that I did not know.

The book covers topics like managing the SOA infrastructure, managing composite applications, monitoring SOA Suite, tuning, configuration and administration, troubleshooting, security policies, managing MDS and the dehydration store, and backup and recovery.

The bonus online chapter covers important issues like patching, upgrading from 10g, cluster configuration and silent (scripted) installation.

I for one will be keeping this book on my book shelf and I highly recommend it to anyone interested in or working with Oracle SOA Suite, in an administration capacity, or who just wants to know more about the product in general.

Packt Publishing provides reviewers with a free copy of the e-book.

New ADF Mobile released

Oracle has just released the new Oracle ADF Mobile which allows you to build native applications that will install and run on both iOS and Android devices from the same ADF source code.

Development is done with JDeveloper and ADF and leverages Java and HTML 5 technologies, while keeping the same visual and declarative approach ADF is known for.

You can read more about the Oracle ADF Mobile release here and learn more on its OTN page here.


Oracle releases ADF Essentials

In case you missed it, Oracle has released a new free version of ADF called ADF Essentials.  You can find more information in the press release or the online demo.


Reading Oracle SOA Suite 11g Administrator’s Handbook

I am reading the new Oracle SOA Suite 11g Administrator’s Handbook by Ahmed Aboulnaga and Arun Pareek.  I am half way through it and I have to say – it is just great!  Will post some detailed comments soon!


Packt celebrating 1,000th title

As many of you will know from my last post, we are very happy to have released our first book with Packt – quite an achievement for us.  But Packt is celebrating an achievement of their own – they are just about to publish their 1,000th title!

To celebrate, Packt are inviting anyone already registered to www.packtpub.com, or who registers before 30th September 2012, to download any one of their eBooks for free. Packt is also opening its online library for a week for free to members, offering customers an easy way to research their choice of free eBook.

Further details of the event can be found in the Press Release.

If you think a free eBook sounds like a good idea, head over to this link: http://www.packtpub.com/login.


New Book: Oracle BPM Suite 11g: Advanced BPMN Topics

In all the fanfare of the iPhone 5 launch this week, you may have missed a much lower key announcement of another new product which is also now available to pre-order.

Two of your humble RedStack bloggers are proud to announce our very first book – Oracle BPM Suite 11g: Advanced BPMN Topics with Packt Publishing.

This is a concise presentation of both theory and practical examples of the areas of BPMN where we have encountered the most widespread confusion and misunderstanding.  So we hope this book will help in some small way.

Here is a quick overview of the book:

Chapter 1, Inter-process Communication introduces us to the theory of how processes
can communicate with each other and with other components. A number of topics
are covered such as: conversations—what they are, the default and advanced
conversations. We discuss correlation—automatic and message based, correlation
sets and keys, and correlation inside loops and when there are multiple calls. Throw
and catch events, send and receive tasks, and when to use each are examined. We
compare messages, signals, and errors. Sub-processes are explored—embedded,
multi-instance, and reusable, and when to use each.

Chapter 2, Inter-process Communication in Practice presents a series of practical
exercises to help you to explore the theory presented in Chapter 1, Inter-process
Communication. The examples include communicating between processes using
messages and correlation, using correlation inside loops, communication between
processes using signals, and reusable sub-processes.

Chapter 3, Working with Arrays presents both theory and several practical exercises on
handling arrays in BPM. Topics include data association, creating an empty array,
creating an array with empty elements, creating an initialized array, getting an element
from an array, setting an element in an array, appending elements to an array, joining
arrays, removing elements from an array, and iterating over arrays—cardinality and
collections, sequential and parallel, completion conditions, and scope.

Chapter 4, Handling Exceptions discusses the theory behind handling exceptions in
BPM. Topics include business and system exceptions, boundary events, event
sub-processes, exception propagation with embedded sub-processes, call, throw and
send, and how BPM exceptions affect the SCA composite.

Chapter 5, Handling Exceptions in Practice will guide us through a number of
practical examples that help to reinforce the theory in Chapter 4, Handling Exceptions.
The examples include implementing a timeout use case with boundary events,
implementing a “cancel message” use case, using event sub-processes, and exploring
exception propagation in peer processes.

We hope you enjoy it!


The curious case of SOA Human Tasks’ auto-completion

Hello,

I have elaborated on the above subject here.

Hope this helps.

Kavitha

ADF is getting mobile…

There’s a new ADF Mobile on the way – one that will allow you to build native iOS and Android applications using ADF – not just browser-based applications.

You can get a preview from these slides and video on OTN.
