Monday, October 31, 2011

Understanding Hibernate - Part I


A simple POJO class representing a customer:

package org.training.hibernate;

import java.io.Serializable;
import java.util.List;
import java.util.Set;

public class Customer implements Serializable {

    private int customerID;
    private String customerName;
    private List<String> customerPhoneNumbers;
    private Set<String> customerAddress;

    public Customer() {
        // TODO Auto-generated constructor stub
    }

    public Customer(int customerID, String customerName,
            List<String> customerPhoneNumbers, Set<String> customerAddress) {
        super();
        this.customerID = customerID;
        this.customerName = customerName;
        this.customerPhoneNumbers = customerPhoneNumbers;
        this.customerAddress = customerAddress;
    }

    public int getCustomerID() {
        return customerID;
    }

    public void setCustomerID(int customerID) {
        this.customerID = customerID;
    }

    public String getCustomerName() {
        return customerName;
    }

    public void setCustomerName(String customerName) {
        this.customerName = customerName;
    }

    public List<String> getCustomerPhoneNumbers() {
        return customerPhoneNumbers;
    }

    public void setCustomerPhoneNumbers(List<String> customerPhoneNumbers) {
        this.customerPhoneNumbers = customerPhoneNumbers;
    }

    public Set<String> getCustomerAddress() {
        return customerAddress;
    }

    public void setCustomerAddress(Set<String> customerAddress) {
        this.customerAddress = customerAddress;
    }
}
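
For context, a Hibernate mapping for this class might look roughly like the sketch below. This is only an illustration, not part of the original post: the table and column names are invented, and both collections are mapped as simple collections of string values in separate collection tables.

<?xml version="1.0"?>
<!DOCTYPE hibernate-mapping PUBLIC
    "-//Hibernate/Hibernate Mapping DTD 3.0//EN"
    "http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd">
<hibernate-mapping package="org.training.hibernate">
  <class name="Customer" table="CUSTOMER">
    <id name="customerID" column="CUSTOMER_ID">
      <generator class="native"/>
    </id>
    <property name="customerName" column="CUSTOMER_NAME"/>
    <!-- List<String> of phone numbers kept in its own collection table -->
    <list name="customerPhoneNumbers" table="CUSTOMER_PHONE">
      <key column="CUSTOMER_ID"/>
      <list-index column="PHONE_INDEX"/>
      <element type="string" column="PHONE_NUMBER"/>
    </list>
    <!-- Set<String> of addresses kept in its own collection table -->
    <set name="customerAddress" table="CUSTOMER_ADDRESS">
      <key column="CUSTOMER_ID"/>
      <element type="string" column="ADDRESS"/>
    </set>
  </class>
</hibernate-mapping>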

Friday, October 21, 2011

Apache Server Performance Optimization


Squeezing the most performance out of your Apache server can make a difference in how your Web site functions and the impression it makes. Even fractions of a second matter, especially on dynamic sites. This article looks primarily at configuration and installation, two areas where you have the most control.

Measuring and Improving Apache Server Performance:
Apache was designed to be as fast as possible. It's easy, with a fairly low-powered machine, to completely saturate a low-end Internet link with little effort. However, as sites become more complex and the bandwidth needs of different connection types increase, getting the best performance out of an Apache installation and Web sites becomes more important.

Enhancing performance means nothing if the changes achieved are only minor gains. Spending hours or even days finely tuning a server for just a few percentage points is a waste of time. The first step, therefore, is to determine how fast the server is running and its general performance level so you can work out how to improve performance and measure the changes.
This is not the first time we've discussed Apache testing (see Staying Out of Deep Water: Performance Testing Using HTTPD-Test's Flood). As was noted previously, determining which parts of your Web application are causing the problem — particularly identifying whether it's Apache or the application environment you are using with dynamic sites — can be difficult. Identifying problems in dynamic applications is beyond the scope of this article, but we will look at ways to generally improve the speed of Apache and how it interacts with other components to support a Web site.

Apache Server Host Hardware
The machine and operating system environment on which Apache is running have the most effect. Obviously, an old 386-based PC will not have the same performance as a new P4 or dual-processor model, but you can make other improvements. Avoiding, for the moment, hardware changes, the biggest thing you can do is ensure Apache is running on a dedicated server. Coexistence with other applications will affect Web server performance.
In most situations, but particularly with static sites, the amount of RAM is a critical factor because it affects how much information Apache can cache. The more information that can be cached, the less Apache has to rely on the comparatively slow process of opening and reading a file from disk. If the site relies mostly on static files, consider using mod_disk_cache; if plenty of RAM is available, consider mod_mem_cache.
The former caches information to disk, which makes a significant difference if the site relies on mod_include to build up a page, as it caches the final, assembled version. With mod_mem_cache, the information is stored in a memory heap shared by all the Apache processes.
Using a fast disk or, better still, a Redundant Array of Inexpensive Disks (RAID) solution in one of the striping modes (e.g., RAID 0, 0+1, 5, 10, or 50) will improve the overall speed of access to files served.
Note, however, that if you do go down any of these routes, a hardware rather than a software RAID solution is the better option.
Finally, in terms of hardware, CPU power can have an impact on dynamic sites with the additional overhead of executing an application for each page accessed. Heavily dynamic pages have a higher CPU requirement. 
Apache Server Host Environment
Regardless of operating system, the following optimization principles apply:
•               Keep other background applications to a minimum. If you are really serious about performance, this should even include background processes that some would consider vital. For example, in Unix, switch off NFS, any printing services, and even sendmail if it's not needed. Under Windows, use the System control panel to optimize the system for applications and the system cache. Make sure, of course, that any required applications or services, like MySQL, are still running.
•               Avoid using the system. If you start compiling applications, editing files, or otherwise employing the machine, you'll reduce its Web serving performance. If you must edit components or install software, build or edit the components on another machine and copy them over.
•               Keep your system up to date. Although a good idea just from a security point of view, software patches and updates can make significant improvements to network and I/O performance.
The Apache Server Application
Then, of course, there is the Apache application itself.
First, ensure it is built correctly with only the modules and extensions required for your Web sites. This means, for example, you can ignore the rewriting module if it's not required. The main benefit of this is a reduction in memory overhead, but a very good side benefit is that you can't accidentally enable these options and therefore reduce server performance. 
Static vs. Dynamic
Flexibility is the primary concern of most Apache administrators, but flexibility has a cost. Dynamically loaded modules are a convenience, but they come with a performance hit, as the code is loaded only when the module is required; their advantage is that they help keep memory requirements down, since only the modules actually needed are loaded.
To build in static mode, use the configure script and specify the modules you want, but don't mark them as shared (e.g., use --enable-rewrite, not --enable-rewrite=shared, which relies on the shared-module support enabled by --enable-so).
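
As a rough illustration only (the install prefix and module choices here are assumptions, not recommendations), a static build might be configured like this:

./configure --prefix=/usr/local/apache2 \
            --enable-rewrite \
            --enable-deflate
make
make install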
Module Configuration
If you are using a static configuration of Apache, choose the modules you wish to incorporate with care. Static mode comes at a price: the more modules you build in, the more memory each Apache process uses, and with a forked multi-processing module that per-process cost is multiplied across every child, so it can have a significant effect on the machine's memory requirements.
Note that some modules are included automatically, so you'll need to explicitly disable the ones you don't need as well as enable the ones you do. Also remember to include any third-party modules the Web service requires (e.g., authentication, PHP, or mod_perl). Use configure --help to get a list of the available options.
Apache Server Configuration
Once your environment is set up and your Apache application optimized, it's time to start looking at the configuration file for further optimization tricks. A good way to start is simply to clean up the file: removing the comments alone cuts it down to a few hundred lines of actual directives. Beyond this, it becomes a case of removing unnecessary elements or those that fail to provide any appreciable benefit.
Simplifying the Configuration File
The first step to optimization should be the simplification of the configuration file. It will not have any direct improvement on performance, but it will make the configuration file easier to use and therefore make you less likely to miss a directive or component that needs modifying.

If you are doing any kind of optimization, start with one of the default-supplied configuration files. They are usually available in the Apache configuration directory as httpd.conf.orig or httpd-std.conf. Don't be tempted to use the highperformance-std.conf file; in the long term it's not really as useful as you would think once you start adding vast quantities of additional configuration information. On the other hand, if a very fast static Web server is the goal, it is probably the easiest way to get things up and running.
If you know your Apache configuration directives, or are willing to look at the documentation, the quickest and most effective step is to remove all comments from the configuration file, as they often detract from the actual directives. You can also remove references to MPM systems not in use on the chosen platform. 
Disabling Components and Systems
Now that we've got a trimmed-back and simplified configuration file, we can start removing the configuration elements for the systems not in use. In particular:
•               HostnameLookups adds overhead to each request by performing a DNS lookup on the client: first a reverse lookup to find the name from the IP address, and then a forward lookup to make sure the information is not spoofed. In most cases you can simply disable this; if you regularly process your logs, work out the host names during post-processing instead. To disable lookups, include the directive HostnameLookups Off.
•               Symbolic links, when enabled, force Apache to check every request to see whether a symbolic link is involved, with one lstat() system call for each directory in the request's path. Unless you need symbolic links, switch the option off using: Options -FollowSymLinks
•               Server status and info, although very useful when testing and monitoring your server, create additional overhead for the Web server. Disable them by commenting out any SetHandler server-status (and server-info) directives and, if possible, leave the corresponding modules out of Apache when you configure the build.
•               Wildcards and flexible options should generally be avoided if you can be more explicit. For example, with the DirectoryIndex directive, explicitly specify the list of index files to check, always listing the most likely choice first.
•               Unless you have good reason not to, put all CGI files into a single directory and configure only that directory for CGI execution. This prevents Apache from having to work out whether a request is for a CGI component or a static file. (The httpd.conf fragment below pulls these suggestions together.)
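
Pulling the suggestions above together, an illustrative httpd.conf fragment might look like this (the directory paths and file names are examples only):

HostnameLookups Off

<Directory "/var/www/html">
    Options -FollowSymLinks
    DirectoryIndex index.html
</Directory>

# Keep all CGI scripts in a single, clearly marked directory
ScriptAlias /cgi-bin/ "/usr/local/apache2/cgi-bin/"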
Disable Logs
Writing log information is a time-consuming process. Although Apache keeps the log files open so that recording an entry is just a matter of writing the information, it still takes up valuable time. If storing log information is not required, you can save a few processor cycles by disabling logging. To do this, simply comment out the log lines in the configuration file.
If you do decide to keep your logs, disable HostnameLookups (see above) and make sure you copy the log information on to another machine to parse the file for analysis. 
Simplify Directory-level Configurations
The .htaccess files are an incredibly useful way of extending the configurable parameters of your Apache server without having to edit the main configuration file each time you want to change something. The problem is that the use of .htaccess files also slows down the server. 
First, it has to look to see if a .htaccess file exists, then it has to parse and process the elements before finally applying the configuration to the directory in question. Worse still, Apache must determine this information not only for the current directory, but also for any parent directories and it then must make the changes based on the contents of all these files.
If you want maximum performance, however, you should disable the use of .htaccess files altogether. Any directory-specific configuration can then go in the main configuration file, where it is parsed once by Apache when the server starts.
To disable .htaccess processing, add the directive AllowOverride None to the relevant <Directory> sections.
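
For example (the directory path is illustrative):

<Directory "/var/www/html">
    AllowOverride None
</Directory>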

MPM Configuration
The Multi-Processing Module (MPM) is what enables a specific platform to handle multiple concurrent connections. MPM modules are platform specific. Solutions are available to work specifically with Unix, Windows, BeOS, and NetWare. For some platforms more than one alternative is available. For most users, the default configuration for a particular environment works fine, especially when getting the exact parameters correct can be a time-consuming task in and of itself. By comparison, many of the techniques already described may yield better performance, but when you want to squeeze the maximum performance out of your server, you must adjust the configuration. 
Under most platforms only one MPM is available; under Unix there are two main options, prefork and worker. The prefork MPM forks off a number of identical Apache processes, while worker creates multiple threads. In general, prefork is better on systems with one or two processors, where the operating system is geared toward time slicing between multiple processes; on a system with a higher number of CPUs, the threading model will probably be more effective.
In nearly all cases, the MaxClients directive is the most effective for increasing server performance, as it controls the maximum number of simultaneous connections Apache can handle.
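
As a sketch, the prefork MPM section of an Apache 2.2 configuration typically looks something like the following. The numbers shown are common defaults rather than recommendations; MaxClients in particular must be sized against the memory each child process consumes.

<IfModule mpm_prefork_module>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    MaxClients          150
    MaxRequestsPerChild   0
</IfModule>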
Optimizing Static Components
If your Web site uses a lot of static components, or if you've split the static and dynamic elements across two or more Web servers, then your main goal should be to improve the response time for Apache sending back the information requested. The easiest way to do this is to use the mod_cache module, combined with mod_disk_cache or mod_mem_cache to provide disk-based or memory-based caches of the static files.
Check out the Apache documentation on the mod_cache module for more information.
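
As a hedged sketch (Apache 2.2 directive names; the cache root and sizes are assumptions you would adapt), the two approaches look roughly like this:

# Disk-backed cache (mod_cache + mod_disk_cache)
<IfModule mod_disk_cache.c>
    CacheEnable disk /
    CacheRoot "/var/cache/apache2"
    CacheDirLevels 2
    CacheDirLength 1
</IfModule>

# In-memory cache (mod_cache + mod_mem_cache)
<IfModule mod_mem_cache.c>
    CacheEnable mem /
    MCacheSize 4096
    MCacheMaxObjectCount 1000
</IfModule>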
Optimizing Dynamic Components
Dynamic components are probably the most time-sapping part of any Web server. Dynamic content, especially if you are using plain CGI, can add seconds to the response time just to load and execute a simple application. More efficient options include mod_perl, PHP, Python (mod_python), and the Jakarta interface for Java.
The main advantage of the script-based solutions is that they embed the interpreter into the Apache executable, which removes the initial loading overhead of dynamic scripts. Some will even cache the parsed script so that the next time it's requested it only needs to be executed.
Configuration can be complex, and getting the exact setup correct can be time-consuming. Some solutions also don't work quite as one would expect with virtual hosts, and you will need to change certain scripts to take full advantage of the speed enhancements on offer.
The improvements, however, can be significant, with as much as 70 percent of the execution time being knocked off of a Perl script simply by using mod_perl in place of CGI. With even more work, these solutions also allow you to keep persistent connections open to databases or to cache information between requests. This is great for e-commerce sites and also for reducing the overhead of otherwise loading information between requests.
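
As one concrete example (a sketch only; the alias and filesystem path are assumptions), a minimal mod_perl 2 registry configuration that runs existing CGI-style Perl scripts inside the embedded interpreter might look like this:

Alias /perl/ /var/www/perl/
<Location /perl/>
    SetHandler perl-script
    PerlResponseHandler ModPerl::Registry
    PerlOptions +ParseHeaders
    Options +ExecCGI
</Location>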
Summary
Although Apache is highly configurable and a relatively complex application, it's interesting to note that standard installations of Apache actually achieve very high levels of performance, and it is rare that you can easily and significantly improve performance simply by tuning parameters. Unfortunately, the components you often have least control over within Apache — dynamic elements and CGI scripts, for example — are the ones that have the biggest impact on performance. Monitor a typical Apache server and you'll see that the time taken for Apache to answer a connection and send data back is in the range of milliseconds — but waiting for the source of that data can take seconds.
This is not to say the optimizations we've highlighted are pointless, however. During the course of a day these saved milliseconds add up. More significant though is that cleaning up and simplifying your Apache configuration will do more to reduce the administration overhead than any time you might save when serving information.

Tuesday, October 4, 2011

Common Maven Error


Most of the time when I work with Maven, I encounter this error:

Reason: Error getting POM for 'org.apache.maven.plugins:maven-eclipse-plugin' from the repository: Failed to resolve artifact, possibly due to a repository list that is not appropriately equipped for this artifact's metadata.
org.apache.maven.plugins:maven-eclipse-plugin:pom:2.9-SNAPSHOT

As happens with other Maven dependencies, the plugin's pom.xml file somehow gets modified and the plugin no longer works as expected.

The solution here is to navigate to your local repository, delete the dependency, and try to download it again. If it still fails (so the problem is clearly not a corrupted pom.xml), then you need to edit the metadata file. In my case:

/Users/sridhar/.m2/repository/org/apache/maven/plugins/maven-eclipse-plugin/maven-metadata-central.xml

You will notice the "latest" node, for example:
<?xml version="1.0" encoding="UTF-8"?>
<metadata>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-eclipse-plugin</artifactId>
  <versioning>
    <latest>2.9-SNAPSHOT</latest>
    <release>2.8</release>
    <versions>
      <version>2.0-beta-1</version>
      <version>2.0-beta-2</version>
      <version>2.0</version>
      <version>2.1</version>
      <version>2.2</version>
      <version>2.3</version>
      <version>2.4</version>
      <version>2.5</version>
      <version>2.5.1</version>
      <version>2.5.2-SNAPSHOT</version>
      <version>2.6-SNAPSHOT</version>
      <version>2.6</version>
      <version>2.6.1-SNAPSHOT</version>
      <version>2.7-SNAPSHOT</version>
      <version>2.7</version>
      <version>2.8-SNAPSHOT</version>
      <version>2.8</version>
      <version>2.9-SNAPSHOT</version>
    </versions>
    <lastUpdated>20101028062425</lastUpdated>
  </versioning>
</metadata>
Just remove the "latest" node, or edit it to reflect the version you want to use, and the problem is solved.
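
Another way to avoid the problem entirely (a suggestion, not from the original post) is to pin the plugin to a released version in your pom.xml, for example 2.8, the latest release listed in the metadata above, so Maven never chases the SNAPSHOT; running Maven with -U will also force it to refresh stale metadata.

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-eclipse-plugin</artifactId>
      <version>2.8</version>
    </plugin>
  </plugins>
</build>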

Monday, October 3, 2011

SQL or NoSQL?


A key reason to consider NoSQL is scalability, the classical non-functional architectural concern. In a classical OLTP architecture, when load increases and your JVM is under pressure, you need to scale. You have two choices:
·                     vertical scaling - adding more CPU power to your JVM
·                     horizontal scaling - adding more JVMs (usually on more boxes)

It's generally not a problem to scale the business tier horizontally. Follow the J2EE / JEE specs and, unless you've done something crazy, your business tier will scale: just add more JVMs and load balance between them. However, while the business tier may be straightforward, the persistence tier isn't so easy. If you are using a classical relational database (such as MySQL, SQLServer, DB2 or Oracle) for your persistence, you can't just add database machines the way you can add JVMs. Why not? Imagine trying to do SQL joins when the tables are on different machines rather than all on one. Imagine trying to maintain ACID characteristics for your transactions when your database is split across multiple machines. Now imagine trying to do all that on 5 machines, then 50, 500, or 5,000. The more machines, the harder it gets.

The leading relational databases will scale horizontally, but only so far. To get around this, an architect will usually consider:
·                     Scaling vertically - putting the database on the best hardware that can be afforded
·                     Partitioning out legacy data, thus reducing things like the size of index tables. This boosts performance and puts less pressure on the need to scale
·                     Reducing pressure on the database by caching more in the business tier
·                     Paying a DBA a lot of money!

But what if you run out of all possible database optimization options and you have to scale horizontally, not just to a few machines but to a few hundred, if not a few thousand? This is where NoSQL architectures become relevant.

With a NoSQL database there is no strict schema. Everything is effectively collapsed into one very fat table - a bit like an old-school flat file, but where each row stores a huge amount of data. So, instead of having a table for Users and a table for Activities (representing users' activities), you put all the user information together in one fat row. This means there are no joins across tables. It also means there is a lot of data redundancy, which means more storage space is required, and more computational power is needed for writes. But because data that is used together is located in the very same place - within the same row - there are no complex joins, and hence it is easier to scale. The computational requirement for reads is also lower, so reads can go faster.
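
As a purely illustrative sketch (the field names are invented), here is the same customer data modelled both ways:

Relational (two tables, joined on user_id):
  USER(user_id, name)
  ACTIVITY(activity_id, user_id, action, occurred_at)

NoSQL "fat row" (one record per user, activities embedded):
  { "user_id": 42,
    "name": "Alice",
    "activities": [
      { "action": "login",    "occurred_at": "2011-10-03" },
      { "action": "purchase", "occurred_at": "2011-10-03" } ] }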

Another advantage of NoSQL databases is derived from the freedom that comes with not being tied to a strict schema. You know that headache where a change to a data model can cause big problems? Well, since there is no strict schema with NoSQL, this problem does not exist. This makes the architecture more flexible and more extensible.

Right now, it's fair to say NoSQL is only relevant in the minority of architectures. But could this be another case of technical innovation driving business innovation as we have seen with smart phones? There wasn't a need for smart phones but the technical innovation provided business opportunities. I think the same could happen with NoSQL Architectures.

Take a step back from Computer Science and just think Science. Science used to be hypothesis centric, now it is becoming more and more data centric. CERN, genome sequencing, climate change analysis - all involve tonnes and tonnes of data. Surely NoSQL architectures allied with searching technologies such as MapReduce / Hadoop will open up new ways to do Science?

So, any disadvantages with NoSQL architectures? Well, it's still an immature technology. Indexing and security models are just not as sophisticated as they are in classical relational databases.

Getting the size of an Object


Overview

Java was designed with the principle that you shouldn't need to know the size of an object. There are times when you really would like to know and want to avoid the guesswork.

Measuring how much memory an object uses

There are three factors which make measuring how much memory an object uses difficult.
  • The TLAB allocates blocks of memory to a thread. This means small amounts of memory don't appear to reduce the free memory; only when you allocate repeatedly do you see a whole block of free memory being used. The way around this is to turn off the TLAB with -XX:-UseTLAB.
  • A GC can occur while you are creating your object. This will result in more free memory at the end than when you started. I ignore any negative sizes in this test ;)
  • Other threads in the system could use memory at the same time. I perform multiple tests and take the median, which removes any outliers.
 Size of objects in a 32-bit JVM

Running this SizeofTest on a 32-bit Sun/Oracle Java 6 update 26 JVM with -XX:-UseTLAB, I get:
The average size of an int is 4.0 bytes
The average size of an Object is 8.0 bytes
The average size of an Integer is 16.0 bytes
The average size of a Long is 16.0 bytes
The average size of an AtomicReference is 16.0 bytes
The average size of an SimpleEntry(Map.Entry) is 16.0 bytes
The average size of a DateTime is 24.0 bytes
The average size of a Calendar is 424.0 bytes
The average size of an Exception is 400.0 bytes
The average size of a bit in a BitSet is 0.125 bytes

Looking at the size of a Long confirms that the size of the header/Object is 8 bytes.

Size of objects with 32-bit references

Running this SizeofTest with 32-bit references on a 64-bit Sun/Oracle Java 6 update 26 JVM with -XX:+UseCompressedOops -XX:-UseTLAB, I get:
The average size of an int is 4.0 bytes
The average size of an Object is 16.0 bytes
The average size of an Integer is 16.0 bytes
The average size of a Long is 24.0 bytes
The average size of an AtomicReference is 16.0 bytes
The average size of an SimpleEntry(Map.Entry) is 24.0 bytes
The average size of a DateTime is 24.0 bytes
The average size of a Calendar is 448.0 bytes
The average size of an Exception is 440.0 bytes
The average size of a bit in a BitSet is 0.125 bytes

Objects are 8-byte aligned on this JVM, and you could conclude from the size of an Integer that the header is 12 bytes in size.

Size of objects with 64-bit references

Running the same test with 64-bit references, i.e. -XX:-UseCompressedOops -XX:-UseTLAB, I get:
The average size of an int is 4.0 bytes
The average size of an Object is 16.0 bytes
The average size of an Integer is 24.0 bytes
The average size of a Long is 24.0 bytes
The average size of an AtomicReference is 24.0 bytes
The average size of an SimpleEntry(Map.Entry) is 32.0 bytes
The average size of a DateTime is 32.0 bytes
The average size of a Calendar is 544.0 bytes
The average size of an Exception is 648.0 bytes
The average size of a bit in a BitSet is 0.125 bytes
Looking at the size of a Long confirms that the header/Object is 16 bytes in length.


// SizeofUtil.java

package com.google.code.java.core.sizeof;

import java.util.Arrays;

public abstract class SizeofUtil {
  public double averageBytes() {
    int runs = runs();
    double[] sizes = new double[runs];
    int retries = runs / 2;
    final Runtime runtime = Runtime.getRuntime();
    for (int i = 0; i < runs; i++) {
      Thread.yield();
      long used1 = memoryUsed(runtime);
      int number = create();
      long used2 = memoryUsed(runtime);
      double avgSize = (double) (used2 - used1) / number;
//            System.out.println(avgSize);
      if (avgSize < 0) {
        // GC was performed.
        i--;
        if (retries-- < 0)
          throw new RuntimeException("The eden space is not large enough to hold all the objects.");
      } else if (avgSize == 0) {
        throw new RuntimeException("Object is not large enough to register, try turning off the TLAB with -XX:-UseTLAB");
      } else {
        sizes[i] = avgSize;
      }
    }
    Arrays.sort(sizes);
    return sizes[runs / 2];
  }

  protected long memoryUsed(Runtime runtime) {
    return runtime.totalMemory() - runtime.freeMemory();
  }

  protected int runs() {
    return 11;
  }

  protected abstract int create();
}
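
The SizeofTest class referred to above isn't reproduced in the post; as a hedged sketch (the class name, array size, and use of an anonymous subclass are my own), SizeofUtil could be driven like this to reproduce one of the figures:

// IntegerSizeTest.java - illustrative only, not the original SizeofTest
package com.google.code.java.core.sizeof;

public class IntegerSizeTest {
  public static void main(String... args) {
    // Hold strong references so the objects stay reachable while measuring.
    final Integer[] integers = new Integer[256 * 1024];
    double size = new SizeofUtil() {
      @Override
      protected int create() {
        for (int i = 0; i < integers.length; i++)
          integers[i] = new Integer(i); // always a fresh object, never the cache
        return integers.length;
      }
    }.averageBytes();
    System.out.println("The average size of an Integer is " + size + " bytes");
  }
}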