Java is the predominant language of Big Data technologies. HBase, Lucene, elasticsearch, Cassandra – all are written in Java and, of course, run inside a Java Virtual Machine (JVM). There are some other important Big Data technologies, while not written in Java, also run inside a JVM.
Examples include Apache Storm, which is written in Clojure, and Apache Kafka, which is written in Scala. This makes basic knowledge of the JVM quite important when it comes to deploying and operating Big Data technologies.
There is an enormous amount of information about working with JVMs available on the web, but there are only two main concerns when it comes to working with JVMs. They are Heap size and Garbage Collection (GC).
The heap is the main pool from which the JVM allocates memory. When a JVM program requests memory, the JVM allocates that memory from the heap. If there is no memory available the Java program suffers an OutOMemoryException, and usually exits. The size of the available heap is set by two flags passed to the JVM at start up. These flags are -Xms and -Xmx. The former sets the initial heap size. The latter sets the maximum heap size.
Just because a JVM suffers an OutOfMemoryException does not mean something is wrong with the code or the product. The JVM may be simply be misconfigured, and its launch parameters may require modification.
Why simply increasing the heap size is often the right solution
The default heap size of JVM may not be correct for an application or your use of that application, and increasing the heap size of the JVM often solves the problem of OOM exceptions, assuming that the JVM program is not suffering from a serious memory leak.
Databases and search engines may consume enormous amounts of memory while sorting a large result-set, for example. Combine this with many simultaneous queries, and you can see why heap sizes of multiple GBs are not uncommon when running these technologies.
Why increasing the heap size can also cause problems
Increasing the heap size can result in long periods when the JVM application appears to be completely frozen, sometimes known as stop-the-world pauses or a full GC. At the extremes a full GC could cause a JVM application to stall for a minute or more. This may or may not be fatal, but in the case of clustered systems, it can become quite serious.
Imagine you have a cluster, whose nodes must send out a heartbeat to a master node every 5 seconds. If a node suffers a 30-second stop-the-world pause, the master may believe that node is down. The cluster may then move into a recovery mode, which can involve many significant operations such as data replication. In an extreme case the cluster may suffer a partition and disintegrate.
So systems with large heap allocations must be carefully monitored. GCs are explained further later in this post.
The right way to set the heap
When setting the heap size of a JVM, the best approach is to set -Xmx to a value that avoids OOMs, is still sensible, and then set -Xms to the same value. By doing this the JVM allocates the full heap up-front from the OS. This avoids long start-up cycles, whereby the JVM application regularly re-allocates from OS-level RAM until it reaches the level set by -Xmx. Do yourself a favour and set both to the same value.
How to monitor heap usage
jmap comes with the Sun JDK, and can be used to check heap usage and generate a heap dump. A heap dump is mostly useful to the developers of a given open-source technology, but monitoring heap usage is very important for all of us.
There are other graphical tools available, but jmap is always available.
Analyzing heap usage
Check out heap usage as follows.
jmap -heap [process ID of the JVM application]
A healthy JVM means heap usage under 80%. Heap usage at 90% or greater may mean an OOM condition is imminent.
To generate a heap dump issue the following command
jmap -heap:format=b [process-id]
Garbage Collection is the process by which previously allocated memory, which is no longer needed, is returned to the heap for future allocation. Oracle have written an excellent paper on this topic and I recommend it to anyone who wants to understand this topic.
How to monitor Garbage Collection
The easiest way to monitor GC is to enable verbose GC logging when launching the JVM. The options to pass to the JVM at launch time are as follows:
Then, by simply tailing the GC log file, you can see the minor and major (full) GC cycles taking place. Furthermore this file can be fed into tools like GCviewer, and GC usage plotted. Other tools such a jvisualvm also come with the Sun JDK.
Signs of excessive Garbage Collection
A healthy JVM should be performing full GCs on the order of every few minutes. If a JVM is performing full GCs every few seconds then the system is almost certainly in trouble. The CPU usage of the host machine is almost certainly quite high too, as full GCs can be computationally intensive. Most likely the heap is too small and needs to be increased.