August 04, 2008

Automating UniCluster Express Compute Node Installation

Although installing UniCluster Express (UCE) is in principle a fairly straightforward task, the thought of running the UCE installer script manually on a large number of compute nodes probably does not sound very exciting to most people. Recently I found myself in that situation, and decided that time spent automating the compute node installation procedure would be time well spent. Fortunately, this was not a very difficult task, and I got things working in a couple of hours using expect to drive the installation process. My installation script is shown below. It assumes that configuring advanced UCE options is not needed, and that the root account has passwordless access to the compute nodes. After customizing a few variables at the top of the script (and possibly the login part of the script), its usage should be fairly simple, as it requires only the compute node name to be provided as an argument.
$ cat install_uce_compute.exp 
#!/usr/bin/env expect

# Usage:
#   $0 <node name>

# Target node and root login settings.
set node [lindex $argv 0]
set root_password ""
set root_prompt "]# "

# Install script location and answers to the installer prompts.
set install_script "/root/unicluster/install_unicluster.sh"
set owner "ucluster"
set install_path "/usr/local/unicluster"
set bootstrap_server "hp2node2"
set bootstrap_client "ucluster"
set bootstrap_password "123456"

# Check arguments.
if {$argc < 1} {
  send_user "Node name not provided.\n"
  exit 1
}

# Login.
spawn ssh root@$node 
#expect password:
#send "$root_password\r"

expect "$root_prompt"
send "$install_script\r"

# License agreement.
expect "yes|no]: "
send "yes\r"

# Node type.
expect "workstation]:"
send "compute\r"

# Owner.
expect "ucluster]: "
send "$owner\r"

# Installation path.
expect "/usr/local/unicluster]: "
send "$install_path\r"

# Bootstrap server.
expect "bootstrap server: "
send "$bootstrap_server\r"

# Bootstrap account.
expect "ucluster]: "
send "$bootstrap_client\r"

# Bootstrap password.
expect "password: "
send "$bootstrap_password\r"
expect "verification: "
send "$bootstrap_password\r"

# Advanced options.
expect "y|N]:"
send "\r"
expect "to continue."
send "\r"

# Timeout.
set timeout 600

# Done.
expect "Installation completed successfully"
exit
Once your master node is up and running and you have the expect RPM installed on your machine, installing compute nodes using the above script should be fairly straightforward, and should involve a simple shell command similar to the following:
$ for node in hp2node2 hp2node3 ; do ./install_uce_compute.exp $node; done
Just to be complete, the corresponding uninstall script (which should work for any UCE node type) is given below:
$ cat uninstall_uce.exp
#!/usr/bin/env expect

# Usage:
#   $0 <node name>

# Target node and root login settings.
set node [lindex $argv 0]
set root_password ""
set root_prompt "]# "

# Uninstall script location and installation path.
set uninstall_script "/root/unicluster/uninstall_unicluster.sh"
set install_path "/usr/local/unicluster"

# Check arguments.
if {$argc < 1} {
  send_user "Node name not provided.\n"
  exit 1
}

# Login.
spawn ssh root@$node 
#expect password:
#send "$root_password\r"

expect "$root_prompt"
send "$uninstall_script\r"

# Install path.
expect "path: "
send "$install_path\r"

# Backup.
expect "path: "
send "\r"

# Timeout.
set timeout 600

# Done.
expect "Uninstallation completed successfully"
exit
Hopefully, if you are reading this blog and considering installing UCE on your cluster, these scripts will save you some time and typing.

July 30, 2008

UniCluster Express and Ganglia Web Frontend

Cluster administrators who have installed UniCluster Express (UCE) probably know that even though UCE bundles Ganglia, it does not enable Ganglia’s well-known web frontend by default. Instead, cluster monitoring is done using the UCE Monitoring Console. One of the biggest advantages of the UCE monitoring tool is that it integrates cluster and job monitoring data from several different sources: the Ganglia monitoring daemon, the Grid Engine qstat command, and the ARCO database. This gives users and administrators a better and more comprehensive insight into the state of the cluster than the views provided by the Ganglia web frontend. Nevertheless, for some administrators the convenience of web access to cluster monitoring information may be a significant incentive to get the Ganglia web frontend working with UCE. Although this is not difficult, there are a few things that one has to watch out for, so it is worth writing down the procedure. Before editing the various configuration files, make sure that you have a web server installed on your main UCE node, together with several PHP packages. On my development machine I have the following installed:
$ rpm -qa | grep http
httpd-2.2.3-11.el5.centos
$ rpm -qa | grep php
php-gd-5.1.6-15.el5
php-common-5.1.6-15.el5
php-5.1.6-15.el5
php-cli-5.1.6-15.el5
Ganglia configuration files can be found in $GLOBUS_LOCATION/etc, where $GLOBUS_LOCATION points to the UCE installation directory (/usr/local/unicluster by default). Edit both gmetad.conf and gmond.conf and change the default cluster name, as its special characters (such as parentheses) will likely confuse various Ganglia scripts.
$ cd /usr/local/unicluster/etc/
$ cp gmetad.conf gmetad.conf.orig
$ vi gmetad.conf 
$ diff gmetad.conf gmetad.conf.orig 
38,39c38
< #data_source "UniCluster Express (petruchio.psvm.univa.com)" localhost:8649
< data_source "UniCluster" localhost:8649
---
> data_source "UniCluster Express (petruchio.psvm.univa.com)" localhost:8649
$ cp gmond.conf gmond.conf.orig
$ vi gmond.conf 
$ diff gmond.conf gmond.conf.orig 
20,21c20
<   #name = "UniCluster Express (petruchio.psvm.univa.com)"
<   name = "UniCluster"
---
>   name = "UniCluster Express (petruchio.psvm.univa.com)"
After editing those files on the main node, propagate the changes to all of your compute nodes and restart the Ganglia daemons everywhere (both gmetad and gmond on the main node, and gmond on all compute nodes). The final step involves running the UCE install-ganglia-gui.sh script, which will unpack the Ganglia web frontend files into a subdirectory under /var/www/html:
 $ cd /usr/local/unicluster/setup/globus/
 $ ./install-ganglia-gui.sh 
 $ cd /var/www/html/ganglia
 $ mv web/* .
 $ rmdir web
The last few commands above are not strictly necessary; they simply move the Ganglia web frontend into the “standard” Ganglia location. After this is done, start your web server and your monitoring information will be accessible via the usual “http://<main node name>/ganglia/” link.

July 22, 2008

I have a Theory

It was with great curiosity that I read Chris Anderson's article on the end of theory. To summarize his position, the "hypothesize, model, and test" approach to science has become obsolete now that there are petabytes of information and countless numbers of computers capable of processing that data. Further, this data-tsunami has made the search for models of real-world phenomena pointless because, "correlation is enough."

The first thing that struck me as ironic about this argument is that statistical correlation is itself a model including all of its associated simplified and assumptive baggage. Just how do I assign a measure of similarity between a set of objects without having a mathematical representation (i.e. a model) of those things? How might I handle strong negative-correlation in this analysis? What about the null hypothesis? While not interesting, per se, it is useful information. Will a particular measurement be allowed to correlate with more than a single result-cluster?

Additionally, we must decide how to relate these petabytes of measurements into correlated-clusters. As before, the statistics that are used to calculate correlation are also models. Are we considering Gaussian distributions, scale-invariant power-laws, or perhaps a state-driven sense of probability? Are we talking about events that have a given likelihood such as the toss of a coin or, more likely, subjective plausibility? You need to be very cautious when choosing your statistical model. For example, using a bell-curve to describe unbounded-data destroys any real sense of correlation.

Regardless of how you statistically model your measurements, you must understand your data, or your correlations may not make sense. For example, imagine that I have two acoustic time-series. How do I measure the correlation of these two recordings to determine how well they are related? The standard approach is to simply cross-correlate the two signals and look for a value that indicates “significant correlation”, whatever your model for that turns out to be. Yet this doesn't mean much unless I understand my data. Was each of these time-series recorded at the same sampling rate? For example, 20 samples of a 10Hz sine-wave recorded at 100 samples per second will appear exactly the same as 20 samples of a 5Hz sine-wave recorded at 50 samples per second. If I naively plot the samples against each other, they will correlate perfectly. Basically, if I don't understand my data, I can easily and erroneously report that the correlation of the two signals is perfect when in fact they have zero correlation.
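The sampling-rate trap described above can be checked numerically. The following is only a minimal sketch: the class and method names (SamplingAmbiguity, sampleSine, pearson) are made up for this illustration, and the Pearson coefficient is used as one possible stand-in for "correlation".

```java
// Illustration only: class and method names are hypothetical.
class SamplingAmbiguity {

    /** Returns count samples of a sine wave of freqHz, sampled at rateHz samples per second. */
    static double[] sampleSine(double freqHz, double rateHz, int count) {
        double[] samples = new double[count];
        for (int n = 0; n < count; n++) {
            samples[n] = Math.sin(2.0 * Math.PI * freqHz * n / rateHz);
        }
        return samples;
    }

    /** Pearson correlation coefficient of two equal-length series. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        double[] tenHz = sampleSine(10.0, 100.0, 20);          // 10 Hz at 100 samples/s
        double[] fiveHzSlow = sampleSine(5.0, 50.0, 20);       // 5 Hz at 50 samples/s
        double[] fiveHzSameRate = sampleSine(5.0, 100.0, 20);  // 5 Hz on the same time axis

        // Ignoring the sampling rate: compared index by index, the series look identical.
        System.out.println("naive comparison:      " + pearson(tenHz, fiveHzSlow));
        // Respecting the sampling rate: the two tones are uncorrelated over this window.
        System.out.println("same-rate comparison:  " + pearson(tenHz, fiveHzSameRate));
    }
}
```

The first value comes out at 1 and the second at essentially 0 (up to floating-point rounding), which is exactly the gap between the naive report and the real answer.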

Finally, what I find most intriguing is the presumption that the successful correlation of petabytes of data culled from web pages, and the associated viewing-habits data, somehow generalizes into a method for science in general. Unlike the “as-seen on TV” products I see in infomercials, statistical inference is not the only tool that I will ever need. Restricting ourselves to correlation removes one of the most powerful tools we have: prediction. Without it, scientific discovery would be hobbled.

Consider the correlation of all of the observed information regarding plate-boundary movement (through some model of the earth) along a fault such as the San Andreas. Keep in mind that enormous amounts of data are collected in this region. Quiet areas along the fault would imply either that a particular piece of the fault was no longer seismically active or, using anti-correlation, that the “slip deficit” made a much larger earthquake more likely to occur in that zone in the future (these areas are referred to as seismic gaps). Moreover, the Parkfield segment of the San Andreas fault has large earthquakes approximately every twenty years. A correlative model would suggest that the entire plate-boundary should behave similarly, which is simply not true, as shown by the Anza Seismic Gap. Furthermore, correlation would also have implied that another large event should have occurred along the Parkfield Gap in the late 80s. If science were only concerned with correlation, one instrument in this zone would have been sufficient. However, the diverse set of predictions made by researchers demanded a wide variety of experiments. Consequently, this zone became the most heavily instrumented area in the world in an effort to extensively study the expected large event. Researchers had to wait for over fifteen years for it to happen. Then there are events that few would have predicted (Black Swans), such as “slow” earthquakes, which require special instrumentation to capture. Until recently, these phenomena could not be correlated with anything and thus, as far as correlation is concerned, would never have existed. In fact, one of the first observations of these events was attributed to instrument error.

Clearly correlation is but one approach to modeling processes amongst many. I have a theory that we in the grid community can expect to help scientists solve many different types of theoretical problems for a good long time. Now to test...

July 08, 2008

Using DRMAA with UniCluster Express

Distributed Resource Management Application API (DRMAA) is a high-level API that allows Grid applications to submit jobs to one or more DRM systems, and to monitor and control them. Grid Engine comes with support for C/C++ and Java, and one can also download bindings for Ruby and Python. There is also a nice collection of HowTos that should provide a great start for anyone looking to write DRMAA applications. The latest version of UniCluster Express (UCE) bundles Grid Engine 6.1u3, which is installed under $GLOBUS_LOCATION/sge. Here $GLOBUS_LOCATION refers to the UCE installation directory (/usr/local/unicluster by default), and all of the DRMAA libraries and Java files are located in the $GLOBUS_LOCATION/sge/lib directory. In order to run DRMAA applications, one has to set $LD_LIBRARY_PATH to point to the appropriate (architecture-dependent) directory. For my development (64-bit Linux) cluster with the default UCE installation I used the following setup:
$ source /usr/local/unicluster/unicluster-user-env.sh
$ export LD_LIBRARY_PATH=/usr/local/unicluster/sge/lib/lx24-amd64
$ export JAVA_HOME=/opt/jdk
$ export PATH=$JAVA_HOME/bin:$PATH
A very simple example of a Java DRMAA application that submits a job to Grid Engine is shown below:
$ cat SimpleJob.java 
import org.ggf.drmaa.DrmaaException;
import org.ggf.drmaa.JobTemplate;
import org.ggf.drmaa.Session;
import org.ggf.drmaa.SessionFactory;

public class SimpleJob {
  public static void main(String[] args) {
    SessionFactory factory = SessionFactory.getFactory();
    Session session = factory.getSession();
    try {
      // An empty contact string selects the default DRM session.
      session.init("");
      // Describe the job: here just the script to run.
      JobTemplate jt = session.createJobTemplate();
      jt.setRemoteCommand("/home/veseli/simple_job.sh");
      String id = session.runJob(jt);
      System.out.println("Your job has been submitted with id " + id);
      // Release the template and close the session.
      session.deleteJobTemplate(jt);
      session.exit();
    } 
    catch (DrmaaException e) {
      System.out.println("Error: " + e.getMessage());
    }
  }
}
One can compile and run the above example using something like the following:
$ javac -classpath /usr/local/unicluster/sge/lib/drmaa.jar SimpleJob.java 
$ java -classpath .:/usr/local/unicluster/sge/lib/drmaa.jar SimpleJob
Your job has been submitted with id 14
$ qstat -f 
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@horatio.psvm.univa.com   BP    1/1       0.36     lx24-amd64  
14 0.55500 simple_job veseli       r     06/20/2008 12:24:59     	1          
----------------------------------------------------------------------------
all.q@romeo.psvm.univa.com     BP    0/1       0.39     lx24-amd64    
----------------------------------------------------------------------------
all.q@yorick.psvm.univa.com    BP    0/1       0.45     lx24-amd64    
----------------------------------------------------------------------------
headnodes.q@petruchio.psvm.uni IP    0/1       0.15     lx24-amd64    
----------------------------------------------------------------------------
special.q@horatio.psvm.univa.c BIP   0/1       0.36     lx24-amd64    

I should point out that DRMAA is designed to be independent of any particular DRM. Users who need job submission features or flags specific to Grid Engine can either use the “native specification” attribute, or use the “job category” attribute together with “qtask” files. To set the native specification attribute in Java, one would use the setNativeSpecification() method of the JobTemplate class (before the job submission line in the code):
jt.setNativeSpecification("-q special.q");
This method, however, makes your application dependent on the specific DRM you are working with at the moment. The above line will be interpreted correctly by Grid Engine, but may not be understood by other DRMs. In most cases a better solution is to use the job category attribute instead, and specify the DRM-dependent flags in the qtask file. For example, in order to submit your job to a particular Grid Engine queue, the Java code would contain something like
jt.setJobCategory("special");
and use the qtask file to translate the “special” job category into appropriate Grid Engine flags:
$ cat ~/.qtask
special -q special.q
The cluster global qtask file (which defines cluster-wide defaults) in UCE resides at $GLOBUS_LOCATION/sge/default/common/qtask. As shown above, user-specific qtask files that override and enhance the cluster-wide definitions are found at ~/.qtask.
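To make the category-to-flags translation a little more concrete, here is a small self-contained sketch of the lookup idea. It is not Grid Engine's code and it deliberately simplifies the real rules: it assumes one "category flags" entry per line and lets a user entry simply replace the cluster-wide entry for the same category. The class name QtaskLookup, its methods, and the "batch" entry are invented for this illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a simplified model of qtask lookups, not Grid Engine's implementation.
class QtaskLookup {

    /** Parses lines of the form "<category> <flags...>" into a category-to-flags map. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = trimmed.split("\\s+", 2);
            entries.put(parts[0], parts.length > 1 ? parts[1] : "");
        }
        return entries;
    }

    /** Assumption: a user entry wins over the cluster-wide entry for the same category. */
    static String resolve(String category, Map<String, String> global, Map<String, String> user) {
        if (user.containsKey(category)) {
            return user.get(category);
        }
        return global.getOrDefault(category, "");
    }

    public static void main(String[] args) {
        // Hypothetical cluster-wide defaults and a user's ~/.qtask content.
        Map<String, String> global = parse(Arrays.asList("special -q all.q", "batch -q all.q"));
        Map<String, String> user = parse(Arrays.asList("special -q special.q"));

        System.out.println("special -> " + resolve("special", global, user));
        System.out.println("batch   -> " + resolve("batch", global, user));
    }
}
```

The point of the design is that the Java application only ever names the category; which queue or flags that maps to can change per cluster or per user without recompiling anything.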

July 02, 2008

Aromatic Clouds?

If you weren’t at OSGC you missed a number of interesting presentations. From my perspective, one of the most intriguing technologies was EUCALYPTUS: Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems.

Before I go on, I would like you to notice that anybody who is able to make an acronym out of eucalyptus has some time on their hands. Fortunately, they used this time to implement an open-source infrastructure for Elastic Computing. In particular, the goal of the project is to, "foster community research and development of Elastic/Utility/Cloud service implementation technologies, resource allocation strategies, service level agreement (SLA) mechanisms and policies, and usage models."

In my opinion, the most interesting facets of this project are:

  • It is compatible with the Amazon EC2 tools out of the box yet it is agnostic and thus is capable of supporting any number of client interfaces;
  • Any team can assemble a development environment for tools that they wish to deploy to the EC2 Cloud;
  • A group could create their own Cloud system which could use EC2 for Utility computing resources;
  • It is the first step towards creating an open-standard for Cloud computing.

My hope is that this project will not only get us all thinking about what we really need from a Cloud but also what we could improve... I plan to start working with this software as soon as it is available later this month.

June 30, 2008

About Grid Engine Advance Reservations

Advance reservation (AR) capability is one of the most important new features of the upcoming Grid Engine 6.2 release. New command line utilities allow users and administrators to submit resource reservations (qrsub), view granted reservations (qrstat), or delete reservations (qrdel). Also, some of the existing commands are getting new switches. For example, the “-ar <AR id>” option for qsub indicates that the submitted job is part of an existing advance reservation. Given that AR is new functionality, I thought that it might be useful to describe how it works with a simple example (using 6.2 Beta software). Advance reservations can be submitted to Grid Engine by queue operators and managers, and also by a designated set of privileged users. Those users are defined in the ACL “arusers”, which by default looks as follows:

$ qconf -sul
arusers
deadlineusers
defaultdepartment
$ qconf -su arusers
name    arusers
type    ACL
fshare  0
oticket 0
entries NONE

The “arusers” ACL can be modified via the “qconf -mu” command:

$ qconf -mu arusers
veseli@tolkien.ps.uud.com modified "arusers" in userset list
$ qconf -su arusers
name    arusers
type    ACL
fshare  0
oticket 0
entries veseli

Once designated as a member of this list, the user is allowed to submit ARs to Grid Engine:

[veseli@tolkien]$ qrsub -e 0805141450.33 -pe mpi 2
Your advance reservation 3 has been granted
[veseli@tolkien]$ qrstat
ar-id   name       owner        state start at             end at               duration
-----------------------------------------------------------------------------------------
      3            veseli       r     05/14/2008 14:33:08  05/14/2008 14:50:33  00:17:25
[veseli@tolkien]$ qstat -f 
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@tolkien.ps.uud.com       BIP   2/0/4          0.04     lx24-x86      

For the sake of simplicity, in the above example we have a single queue (all.q) that has 4 job slots and a parallel environment (PE) mpi assigned to it. After reserving 2 slots for the mpi PE, there are only 2 slots left for running regular jobs until the AR shown above expires. Note that the "-e" switch for qrsub designates the requested reservation end time in the format YYMMDDhhmm.ss. It is also worth pointing out that the qstat output has changed slightly with respect to previous software releases in order to accommodate the display of existing reservations. If we now submit several regular jobs, only 2 of them will be able to run:

[veseli@tolkien]$ qsub regular_job.sh 
Your job 15 ("regular_job.sh") has been submitted
...
[veseli@tolkien]$ qsub regular_job.sh
Your job 19 ("regular_job.sh") has been submitted
[veseli@tolkien]$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@tolkien.ps.uud.com       BIP   2/2/4          0.03     lx24-x86      
     15 0.55500 regular_jo veseli       r     05/14/2008 14:34:32     1       
     16 0.55500 regular_jo veseli       r     05/14/2008 14:34:32     1       
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
     17 0.55500 regular_jo veseli       qw    05/14/2008 14:34:22     1       
     18 0.55500 regular_jo veseli       qw    05/14/2008 14:34:23     1       
     19 0.55500 regular_jo veseli       qw    05/14/2008 14:34:24     1       

However, if we submit jobs that are part of the existing AR, those are allowed to run, while jobs submitted earlier are still pending:

[veseli@tolkien]$ qsub -ar 3 reserved_job.sh 
Your job 20 ("reserved_job.sh") has been submitted
[veseli@tolkien]$ qsub -ar 3 reserved_job.sh
Your job 21 ("reserved_job.sh") has been submitted
[veseli@tolkien]$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@tolkien.ps.uud.com       BIP   2/4/4          0.02     lx24-x86      
     15 0.55500 regular_jo veseli       r     05/14/2008 14:34:32     1       
     16 0.55500 regular_jo veseli       r     05/14/2008 14:34:32     1       
     20 0.55500 reserved_j veseli       r     05/14/2008 14:35:02     1       
     21 0.55500 reserved_j veseli       r     05/14/2008 14:35:02     1       
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
     17 0.55500 regular_jo veseli       qw    05/14/2008 14:34:22     1       
     18 0.55500 regular_jo veseli       qw    05/14/2008 14:34:23     1       
     19 0.55500 regular_jo veseli       qw    05/14/2008 14:34:24     1       

The above example illustrates how ARs work. As long as a particular reservation is valid, only jobs that are designated as part of it can utilize the resources that have been reserved. I think that AR will prove to be an extremely valuable tool for planning grid resource usage, and I’m very pleased to see it in the new Grid Engine release.
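If reservations are requested from a program rather than typed by hand, the YYMMDDhhmm.ss end time passed to the qrsub -e switch has to be generated somehow. Here is a minimal sketch of one way to build that string in Java; the class name ReservationTime and its helper method are made up for this example, and only the time format itself comes from the description above.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Illustration only: helper names are hypothetical.
class ReservationTime {

    // YYMMDDhhmm.ss expressed as a java.time pattern (24-hour clock).
    private static final DateTimeFormatter QRSUB_TIME =
            DateTimeFormatter.ofPattern("yyMMddHHmm.ss");

    /** Formats a timestamp the way the -e switch of qrsub expects it. */
    static String format(LocalDateTime time) {
        return time.format(QRSUB_TIME);
    }

    public static void main(String[] args) {
        // The end time used in the example reservation: 05/14/2008 14:50:33.
        LocalDateTime end = LocalDateTime.of(2008, 5, 14, 14, 50, 33);
        System.out.println("qrsub -e " + format(end) + " -pe mpi 2");
    }
}
```

For the reservation shown earlier this yields the same 0805141450.33 argument that was typed on the command line.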

June 06, 2008

Steaming Java

When Rich asked us to walk through a software development process, I immediately thought back to a conversation that I had with my friend Leif Wickland about building high-performance Java applications, so I emailed him asking for his best practices. We have both produced code that is as fast as, if not faster than, C compiled with optimization (for me it was using a 64-bit JRE on an x86_64 architecture with multiple cores).

That is not to say that the equivalent C code could not be made to go faster if you were to spend time optimizing it. Rather, the main point is that Java is a viable HPC language. On a related note, Brian Goetz of Sun has a very interesting discussion on IBM's developerWorks, "Urban performance legends, revisited", on how garbage collection allows faster raw allocation performance.

However I digress… Here is a summary of what we both came up with (in no particular order):

  1. It is vitally important to "measure, measure, measure," everything you do. We can offer any set of helpful hints, but the likelihood that all of them should be applied is extremely low.
  2. It is equally important to remember to only optimize areas in the program that are bottlenecks. Optimizing anything else is a waste of development time for no real gain.
  3. One of the simplest and most overlooked things that help your application is to overtly specify method parameters that are read-only using the final modifier. Not only can it help the compiler with optimization, but it is also a good way of communicating your intentions to your teammates. Furthermore, if you can make your method parameters final, this will help even more. One thing to be aware of is that not all things that are declared final behave as expected (see Is that your final answer? for more detail).
  4. If you have state shared between threads, make whatever you can final so that the VM takes no steps to ensure consistency. This is not something that we would have expected to make a difference, but it seems to help.
  5. An equally ignored practice is using the finally clause. It is very important to clean up after the code in a try block. Otherwise you could leave open streams, SQL queries, or perhaps other objects lying around taking up space.
  6. Create your data structures and declare your variables early. A core goal is to avoid allocating short-lived variables. While it is true that the garbage collector may reserve memory for variables that are declared often, why make it have to try to guess your intentions? For example, if a loop is called repeatedly, there is no need to say, for (int i = 0; … when you could have declared i earlier. Of course you have to be careful not to reset counters from inside of loops.
  7. Use static for values that are constants. This may seem obvious, but not everybody does.
  8. For loops embedded within other loops:
    • Replace your outer loop with a fixed pool of threads. In the next release of Java, this will be even easier using the fork-join framework. This has become increasingly important with processors with many cores.
    • Make sure that your innermost loop is the longest even if it doesn't necessarily map directly to the business goals. You shouldn't force the program to create a new loop too often as it wastes cycles.
    • Unroll your inner loops. This can save an enormous amount of time even if it isn't pretty. The quick test I just ran was 300% faster. If you haven't unrolled a loop before, it is pretty simple:
              unrollRemainder = count%LOOP_UNROLL_COUNT;

              for( n = 0; n < unrollRemainder; n++ ) {
                  // do some stuff here.
              }

              for( n = unrollRemainder; n < count; n+=LOOP_UNROLL_COUNT ) {
                  // do stuff for n here
                  // do stuff for n+1 here
                  // do stuff for n+2 here
                  …
                  // do stuff for n+LOOP_UNROLL_COUNT - 1 here
              }
      Notice that both n and unrollRemainder were declared earlier as recommended previously.
  9. Preload all of your input data and then operate on it later. There is absolutely no reason that you should be loading data of any kind inside of your main calculation code. If the data doesn't fit or belong on one machine, use a Map-Reduce approach to distribute it across the Grid.
  10. Use the factory pattern to create objects.
    • Data structures can be created ahead of time and only the necessary pieces are passed to the new object.
    • Any preloaded data can also be segmented so that only the necessary parts are passed to the new object.
    • You can avoid the allocation of short-lived variables by using constructors with the final keyword on their parameters.
    • The factory can perform some heuristic calculations to see if a particular object should even be created for future processing.
  11. When doing calculations on a large number of floating-point values, use a byte array to store the data and a ByteBuffer to convert it to floats. This should primarily be used for read-only (input) data. If you are writing floating-point values you should do this with caution, as it may take more time than using a float array. One major advantage that Java has when you use this approach is that you can switch between big- and little-endian data rather easily.
  12. Pass fewer parameters to methods. This results in less overhead. If you can make a value static instead, you pass one fewer parameter.
  13. Use static methods if possible. For example, a FahrenheitToCelsius(float fahrenheit); method could easily be made static. The main advantage here is that the compiler will likely inline the function.
  14. There is some debate whether you should make particular methods final if they are called often. There is a strong argument to not do this because the enhancement is small or nonexistent (see Urban Performance Legends or once again Is that your final answer?). However, my experience is that a small enhancement on a calculation that is run thousands of times can make a significant difference. Both Leif and I have seen measurable differences here. The key is to benchmark your code to be certain.
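As a concrete version of the loop-unrolling skeleton in the list above, here is a small runnable sketch that sums an array both ways. The class name UnrollDemo, the method names and the unroll factor of 4 are arbitrary choices for this example, and no speed-up is claimed for it: whether unrolling pays off on a given JVM is exactly the kind of thing the first tip says to measure.

```java
// Illustration only: names and the unroll factor are arbitrary choices.
class UnrollDemo {

    static final int LOOP_UNROLL_COUNT = 4;

    /** Plain loop, used as the reference result. */
    static double sumSimple(final double[] data) {
        double sum = 0.0;
        for (int n = 0; n < data.length; n++) {
            sum += data[n];
        }
        return sum;
    }

    /** Same sum with the main loop unrolled LOOP_UNROLL_COUNT times. */
    static double sumUnrolled(final double[] data) {
        final int count = data.length;
        final int unrollRemainder = count % LOOP_UNROLL_COUNT;
        double sum = 0.0;
        int n;

        // Handle the leftover elements first so the rest divides evenly.
        for (n = 0; n < unrollRemainder; n++) {
            sum += data[n];
        }

        // Process LOOP_UNROLL_COUNT elements per iteration.
        for (n = unrollRemainder; n < count; n += LOOP_UNROLL_COUNT) {
            sum += data[n];
            sum += data[n + 1];
            sum += data[n + 2];
            sum += data[n + 3];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] data = new double[10];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println("simple:   " + sumSimple(data));
        System.out.println("unrolled: " + sumUnrolled(data));
    }
}
```

Both methods add the elements in the same order, so they return identical results; the only difference is how many loop-condition checks are executed along the way.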

June 02, 2008

Grid Interoperability and Interoperation

The high expectations raised by grid computing have favored the development and deployment of a growing number of grid infrastructures and middlewares. However, the interaction between these grids is still limited, which reduces the potential large-scale application of grid technology, in spite of the efforts made by the grid community. In this sense, the Open Grid Forum (OGF) is developing open standards for grid software interoperability, while the OGF's Grid Interoperation Now Community Group (GIN-CG) is coordinating a set of interoperation efforts among production grids. According to OGF (as Laurence Field explains in his article entitled "Getting Grids to work together: interoperation is key to sharing"), there is a big difference between these two terms:

  • Interoperability is the native ability of grids and grid technologies to interact directly via common open standards.
  • Interoperation is a set of techniques to get production grid infrastructures to work together in the short term.

Since most common open standards to provide grid interoperability are still being defined and only a few have been consolidated, grid interoperation techniques, like adapters and gateways, are needed. An adapter is, according to different dictionaries of computer terms, “a device that allows one system to connect to and work with another”. On the other hand, a gateway is conceptually similar to an adapter, but it is implemented as an independent service, acting as a bridge between two systems. The main drawback of adapters is that grid middleware or tools must be modified to insert the adapters. Gateways can be accessed without changes on grid middleware or tools, but they can become a single point of failure or a scalability bottleneck.
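The difference between the two techniques can be pictured with a toy sketch. Everything in it is hypothetical (the class and method names are invented for this illustration and it is not GridWay's driver interface): the adapter is a plug-in that lives inside the scheduling tool, while the gateway is a separate front door that forwards requests to a tool behind it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: a toy model of adapters and gateways, not any real middleware API.
class InteropSketch {

    /** Common interface the scheduling tool codes against. */
    interface MiddlewareAdapter {
        String submit(String job);
    }

    /** Adapter: plugged into the tool, translates calls for one middleware. */
    static class PrefixAdapter implements MiddlewareAdapter {
        private final String middleware;

        PrefixAdapter(String middleware) {
            this.middleware = middleware;
        }

        @Override
        public String submit(String job) {
            // A real adapter would call the middleware's client API here.
            return middleware + " accepted " + job;
        }
    }

    /** The tool itself: it must be modified (extended) to host each new adapter. */
    static class MetaScheduler {
        private final Map<String, MiddlewareAdapter> adapters = new LinkedHashMap<>();

        void register(String name, MiddlewareAdapter adapter) {
            adapters.put(name, adapter);
        }

        String submit(String name, String job) {
            MiddlewareAdapter adapter = adapters.get(name);
            if (adapter == null) {
                throw new IllegalArgumentException("No adapter for " + name);
            }
            return adapter.submit(job);
        }
    }

    /** Gateway: an independent service; clients stay unchanged, but all traffic passes through it. */
    static class Gateway {
        private final MetaScheduler backend;
        private final String target;

        Gateway(MetaScheduler backend, String target) {
            this.backend = backend;
            this.target = target;
        }

        String handleRequest(String job) {
            return backend.submit(target, job);
        }
    }

    public static void main(String[] args) {
        MetaScheduler scheduler = new MetaScheduler();
        scheduler.register("gridA", new PrefixAdapter("gridA"));
        scheduler.register("gridB", new PrefixAdapter("gridB"));
        System.out.println(scheduler.submit("gridB", "job-1"));

        Gateway gateway = new Gateway(scheduler, "gridA");
        System.out.println(gateway.handleRequest("job-2"));
    }
}
```

The trade-off from the paragraph above shows up directly: supporting a new middleware means touching the scheduler's adapter registry, whereas the gateway needs no client changes but every request funnels through that one object.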

GridWay provides support for some of the few established standards, like DRMAA, JSDL or WSRF, to achieve interoperability but, in the meantime, it also provides components to allow interoperation, like Middleware Access Drivers (MADs) acting as adapters for different grid services, and the GridGateWay, which is a WSRF GRAM service encapsulating an instance of GridWay, thus providing a gateway for resource management services.

GridWay 4.0.2, coinciding with the release of Globus Toolkit 4 and its new WS GRAM service, introduced an architecture for the execution manager module based on a MAD (Middleware Access Driver) to interface several grid execution services, like pre-WS GRAM and WS GRAM, even simultaneously. That architecture was presented in the paper entitled "A modular meta-scheduling architecture for interfacing with pre-WS and WS Grid resource management services" (E. Huedo, R. S. Montero and I. M. Llorente). GridWay 5.0 took advantage of this modular architecture to implement an information manager module with a MAD to interface several grid information services, and a transfer manager module with a MAD to interface several grid data services. Moreover, the scheduling process was decoupled from the dispatch manager through the use of an external and selectable scheduler module.

GridWay components

The resulting architecture, which is shown above, provides direct interoperation between different middleware stacks. In fact, at OGF22 we demonstrated the interoperation of three important grid infrastructures, namely EGEE (gLite-based), TeraGrid and OSG (both Globus-based), used in a coordinated way through a single GridWay instance by means of the appropriate adapters. To set an example, the application was written using the DRMAA OGF standard. The GridWay documentation provides a lot of information on how to integrate GridWay with the main middleware stacks, like gLite, pre-WS and WS Globus, or ARC, as well as on how to develop new drivers for other middlewares.

OGF22 interoperation demo

Regarding the GridGateWay, it is being used for provisioning resources from several infrastructures. For example, the German Astronomy Community Grid (GACG or AstroGrid-D) uses a GridGateWay as a central resource broker, providing metascheduling functionality to Globus-based submission tools (e.g. for workflow execution) without modification. GridAustralia also uses a GridGateWay as a WSRF interface for its central GridWay Metascheduler instance, allowing reliable, remote job submission.

Astrogrid-D metascheduling architecture
Picture by AstroGrid-D

More information about the GridGateWay component is provided on its web page, as well as in this blog entry, which shows how to build Utility Computing infrastructures with this Globus-based gateway technology.


Eduardo Huedo

Reprinted from blog.dsa-research.org

May 14, 2008

Grid Engine 6.2 Beta Release

Grid Engine 6.2 will come with some interesting new features. In addition to advance resource reservations and array job interdependencies, this release will also contain a new Service Domain Manager (SDM) module, which will allow computational resources to be distributed between different services, such as different Grid Engine clusters or application servers. For example, SDM will be able to withdraw unneeded machines from one cluster (or application server) and either assign them to a different one or keep them in its “spare resource pool”. It is also worth mentioning that the Grid Engine (and SDM) documentation is moving to Sun’s wiki. The 6.2 beta release is available for download here.

May 05, 2008

About Parallel Environments in Grid Engine

Support for parallel jobs in distributed resource management software is probably one of those features that most people do not use, but those who do appreciate it a lot. Grid Engine supports parallel jobs via parallel environments (PE) that can be associated with cluster queues. A new parallel environment is created by running the qconf -ap <environment name> command and editing the configuration file that pops up. Here is an example of a PE slightly modified from the default configuration:
$ qconf -sp simple_pe
pe_name           simple_pe
slots             4
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $round_robin
control_slaves    FALSE
job_is_first_task FALSE
urgency_slots     min
In the above example, “slots” defines the number of parallel tasks that can run concurrently. The “user_lists” (“xuser_lists”) parameter should be a comma-separated list of user access lists whose members are allowed (denied) use of the given PE. If “user_lists” is set to NONE, any user who is not explicitly disallowed via the “xuser_lists” parameter can use the PE.

The “start_proc_args” and “stop_proc_args” parameters hold the command lines of the startup and shutdown procedures for the parallel environment. These commands are usually scripts customized for the specific parallel library that a given PE is intended for. They get executed for each parallel job and are used, for example, to start any daemons needed for parallel job execution. The standard output (error) of these commands is redirected into <job name>.po<job id> (<job name>.pe<job id>) files in the job’s working directory, which is usually the user’s home directory. It is worth noting that customized PE startup and shutdown scripts can make use of several internal variables, such as $pe_hostfile and $job_id, that are relevant for the parallel job. The $pe_hostfile variable in particular points to a temporary file that contains the list of machines and parallel slots allocated for the given job. For example, setting “start_proc_args” to “/bin/cp $pe_hostfile /tmp/machines.$job_id” would copy $pe_hostfile to the /tmp directory. Some of those internal variables are also available to job scripts as environment variables. In particular, the $PE_HOSTFILE and $JOB_ID environment variables will be set and will correspond to $pe_hostfile and $job_id, respectively.

The “allocation_rule” parameter helps the scheduler decide how to distribute parallel processes among the available machines. It can take an integer that fixes the number of processes per host, or special rules like $pe_slots (all processes have to be allocated on a single host), $fill_up (start filling up slots on the best suitable host, and continue until all slots are allocated), and $round_robin (allocate slots one by one on each allocated host in a round robin fashion until all slots are filled).

The “control_slaves” parameter is slightly confusing. It indicates whether or not the Grid Engine execution daemon creates the parallel tasks for a given application. In most cases (e.g., for MPI or PVM) this parameter should be set to FALSE, because letting Grid Engine control the parallel tasks requires a custom Grid Engine PE interface for the given library. Similarly, the “job_is_first_task” parameter is only relevant if “control_slaves” is set to TRUE. It indicates whether or not the submitted job script itself counts as part of the parallel program.

The “urgency_slots” parameter is used for jobs that request a range of parallel slots. If an integer value is specified, that number is used as the prospective slot amount. If “min”, “max”, or “avg” is specified, the prospective slot amount will be determined as the minimum, maximum or average of the slot range, respectively.

After a parallel environment is configured and added to the system, it can be associated with any existing queue by setting the “pe_list” parameter in the queue configuration, and at that point users should be able to submit parallel jobs. On the GE project site one can find a number of nice How-To documents related to integrating various parallel libraries. If you do not have the patience to build and configure one of those, but would still like to see how things work, you can try adding a simple PE (like the one shown above) to one of your queues, and use a simple ssh-based master script to spawn and wait on the slave tasks:
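#!/bin/sh
#$ -S /bin/sh
# Start one ssh-based "slave" task per slot listed in the PE host file.
slaveCnt=0
while read host slots q procs; do
  slotCnt=0
  while [ "$slotCnt" -lt "$slots" ]; do
    slotCnt=`expr $slotCnt + 1`
    slaveCnt=`expr $slaveCnt + 1`
    # -n keeps ssh from reading the host file on stdin.
    ssh -n $host "/bin/hostname; sleep 10" > /tmp/slave.$slaveCnt.out 2>&1 &
  done
done < "$PE_HOSTFILE"
# Wait for the background slave tasks to finish.
while [ $slaveCnt -gt 0 ]; do
  wait
  slaveCnt=`expr $slaveCnt - 1`
done
echo "All done!"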
After saving this script as "master.sh" and submitting your job using something like "qsub -pe simple_pe 3 master.sh" (where 3 is the number of parallel slots requested), you should be able to see your "slave" tasks running on the allocated machines. Note, however, that you must have password-less ssh access to the designated parallel compute hosts in order for the above script to work.
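The “start_proc_args” mechanism can also be illustrated with a small startup helper. What follows is only a sketch and not part of Grid Engine: the function name make_machines, the sample host and queue names, and the temporary file locations are all made up for the example. It also assumes that the host file has one "<host> <slots> <queue> <processor range>" line per allocated host, which is the layout the master script above relies on. The helper expands such a host file into a flat machine file with one line per slot, a format that some MPI launchers accept.

```shell
#!/bin/sh
# Sketch of a PE startup helper (hypothetical, not shipped with Grid Engine).
# It expands a PE host file into a machine file with one line per slot.

make_machines() {
  hostfile=$1   # path to the PE host file (e.g. the value of $pe_hostfile)
  outfile=$2    # path of the machine file to create
  : > "$outfile"
  # Assumed line layout: <host> <slots> <queue> <processor range>
  while read -r host slots queue procs; do
    [ -n "$host" ] || continue        # skip empty lines
    i=0
    while [ "$i" -lt "$slots" ]; do
      echo "$host" >> "$outfile"      # one entry per allocated slot
      i=$((i + 1))
    done
  done < "$hostfile"
}

# Demo with a made-up host file: two slots on node01, one on node02.
demo_dir=$(mktemp -d)
cat > "$demo_dir/pe_hostfile" <<EOF
node01 2 all.q@node01 UNDEFINED
node02 1 all.q@node02 UNDEFINED
EOF

make_machines "$demo_dir/pe_hostfile" "$demo_dir/machines"
cat "$demo_dir/machines"
```

In a real PE, a helper like this could be saved as a standalone script and referenced from “start_proc_args”, for example as “/path/to/make_machines.sh $pe_hostfile /tmp/machines.$job_id” (a hypothetical path), with a matching “stop_proc_args” script that removes the machine file once the job is finished.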