Monday, August 27, 2007

Java for SysAdmins


I'm a SysAdmin, taking care of all our UNIX installations, maintenance, software installation, security, performance problems and all the other bits, making sure all our services are up and running nicely.

This year was a very busy one with a lot of changes in our datacenter: we moved to Solaris 10, started to use zones and began to learn DTrace. On top of these things we needed to ensure all our services were up and running no matter what. And here comes the fun. Most of our services are built as J2EE applications running on top of an application server. Lately we started to see a lot more alerts than we used to have...
And it is not really fun when it is 3AM and your site does not work. I had the painful job of debugging these services and trying to find out what was wrong with them. Administering an application server is not quite the same as looking after a sendmail server or a busy database server. Things are a bit different and you will need the proper lenses to see what is going on.

Be prepared, because more and more Java applications will affect your SysAdmin life :)

Major problems:
  • A lot of abstractions. You have the OS, the JVM, the app server, the application and dozens of other third-party components used by the application. Debugging this whole stack is a pain unless you put on the proper glasses to see where the problem comes from. Prepare yourself with a lot of patience.
  • JVM Tuning: Most of our J2EE applications were simply deployed to the PROD environment without a proper planning and tuning phase. I discovered that the JVM layer was simply ignored and was running on default options. It was also somewhat unclear who should be in charge of tuning the JVM: the Java development team or our support unit. We ended up with the two teams, development and support, cooperating on it.
  • JVM core dumps: sometimes you will see what you would never expect: the entire application goes on holiday and your JVM simply dies with a core dump. What are you gonna do next?!
  • Outsourcing: in 2007 it is far more complicated to find out who wrote that component or module, who integrated it into the core product, or who ran a proper regression test against the new version of the product. As teams are spread across continents, documentation matters far more, yet for the majority of projects it is almost nonexistent.
Fixing the problems. Getting all these things fixed is a long and painful process, with a lot of hours spent on issues that are not really system administration. Some issues are process oriented; others are simply standard technical procedures that needed to be put in place and properly documented. Here are some of them:
  • JDK versions: it is not evil to keep your JDK versions up to date. Some folks have the feeling that they shouldn't change the JDK version delivered by the app server vendor at all. So let's say you use Java 5, for example jdk 1.5.0_06: make sure you move to the latest update, _12 for instance. Set up a process to keep the JDK up to date. The same applies to jdk 1.4.2 and jdk 6.
  • JVM tuning: I started to set up a simple procedure to be able to sleep and have a peaceful night. Check it out.
  • Debug tools - We needed smart tools in operations: pstack, jstack, prstat, dtrace and plockstat will save your day if you know how to use them against a JVM. Remember, to the OS your application looks like one big, fat 32-bit or 64-bit process. You gotta look inside and see where the problem comes from. Checking the app server log is another way to track down the problem, but how useful that is depends on how much trash the log contains and how abstract it is.
  • Java 5 or 6? Simply moving from Java 5 to 6 seems a bit complicated. Your app server vendor might not like it, so check before trying. We are stuck on Java 5 and Java 1.4.2.
  • Thread dumps - A normal operational procedure: generate a thread dump by running 'kill -3 appserver_jvm_pid' and examine it closely. When your application looks dead, this is most likely the right time for it.
  • prstat -mL -p appserver_jvm_pid - will set you free. You can easily see each LWP with its CPU consumption, LAT (time spent waiting for a CPU) and LCK (time spent waiting on user locks). A good way to see what your LWPs are doing. prstat is more capable than top.
  • plockstat - a new guy in town. Based on DTrace, it reports user-level lock statistics. Very useful when your application seems to be doing nothing. We are currently developing some scripts using it.
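Here is a minimal sketch of how the prstat and thread dump steps above can be tied together. It assumes that the nid=0x... value HotSpot prints for each thread in a thread dump is the LWP id in hex, which is my understanding of how it works on Solaris 10. The function names lwp_to_nid and find_thread are made up for this example, and the dump file is assumed to have been saved beforehand (from the app server log after a kill -3, or from jstack output).

```shell
#!/bin/sh
# Sketch only: lwp_to_nid and find_thread are hypothetical helper names,
# not part of Solaris or the JDK.

# Convert a decimal LWP id (the LWPID column of `prstat -mL`) into the
# "nid=0x..." form printed for each thread in a HotSpot thread dump.
# Assumption: the nid value is the LWP id in hexadecimal.
lwp_to_nid() {
    printf 'nid=0x%x\n' "$1"
}

# Print the thread dump header line(s) for a given LWP.
#   $1 = LWP id in decimal, as shown by prstat
#   $2 = file containing a previously saved thread dump
# The trailing space in the pattern keeps nid=0x2a from matching nid=0x2ab.
find_thread() {
    grep "$(lwp_to_nid "$1") " "$2"
}

# Example session (the pid and the /tmp path are placeholders):
#   prstat -mL -p appserver_jvm_pid 1 1       # note the busy or blocked LWPID
#   jstack appserver_jvm_pid > /tmp/dump.txt  # or cut the dump out of the log
#   find_thread 42 /tmp/dump.txt              # show the matching Java thread
```

Once you know which Java thread sits behind the LWP that is burning CPU or stuck in LCK, the stack trace under that header line in the dump tells you what the code was doing.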
As many of you out there are starting to administer more and more J2EE applications, you gotta set up a plan and work smart in order to be able to sleep. Try the JVM tuning procedure inside your project and let me know how it went. I bet many of you are already using some sort of JVM tuning procedure; I would be glad to hear your opinions and comments.

Peaceful summer,
Stefan