« November 2007 | Main | January 2008 »

December 2007

December 27, 2007

The Lonely API: You Can Help!

code_000000237891Small.jpg

I remain surprised that more grid developers and architects who rely on GridFTP for moving bits around have not tightly integrated data movement into their applications and infrastructure using the GridFTP Client API/library. A lot of folks spend a great deal of time wrapping scripts around the default GridFTP client globus-url-copy (guc) and begging the GridFTP developers to add features and functionality to guc.

Having a strong default client like guc is great but it cannot be the perfect tool for every scenario. I am convinced that a lot of workflow taking place on the grid could benefit from having applications and data placement more tightly integrated and one slick way to do it is to write custom GridFTP clients or add the functionality directly to existing applications and workflows.

I think grid developers and architects don't leverage the GridFTP client API/library because they think it is difficult. It is also hard to find comprehensive but straightforward examples (a GridFTP client API cookbook sure would be nice...). The code for the guc client is too hard for me to really get my head around because it is so entangled with some of the older Globus legacy code that isn't used much anymore (think GASS).

So below is a simple but useful client in C. This is all you need if you want to move a file from one GridFTP server to another (a third-party transfer) with some parallel data streams and some tuned TCP buffers.

The structure is simple. You need a function that will be used as a "callback" when the transfer is complete. It needs to take the form shown.

You need to create a "handle" and probably set some attributes on it (I like to set the cache_all attribute so that when moving multiple files from site to site the control connection stays open between files).

You need to create an "operation" attribute and decorate it with the details for data stream parallelism and TCP buffers sizes.

Then there is some boilerplate to initialize the library and wait for the callback to be called and the transfer to finish but it is all pretty straightforward.

So here a simple but useful client that you could easily extend for yourself:

#include "globus_ftp_client.h"
#include "globus_ftp_control.h"

static globus_mutex_t lock;
static globus_cond_t cond;
static globus_bool_t done;
static globus_bool_t error = GLOBUS_FALSE;

/* this function will be called by the library
   when the transfer completes */
static void fileTransferCompleteCallback(
    void *user_arg,
    globus_ftp_client_handle_t *handle,
    globus_object_t *err)
{
    char * tmpstr;

    if(err){
        fprintf(stdout,
          "Transfer completed with error: %s\n",
          tmpstr,
          globus_object_printable_to_string(err));
    }
    globus_mutex_lock(&lock);
    done = GLOBUS_TRUE;
    globus_cond_signal(&cond);
    globus_mutex_unlock(&lock);
}

int main(int argc, char **argv)
{

    globus_ftp_client_handle_t  handle;
    globus_ftp_client_handleattr_t handle_attr;
    globus_ftp_client_operationattr_t   attr;
    globus_result_t result;

    globus_ftp_control_parallelism_t  parallelism;
    globus_ftp_control_tcpbuffer_t  tcpbuffer;

    /* source and destination URLs for the 3rd
    party transfer */
    char *src = "gsiftp://server1.com/file1";
    char *dst = "gsiftp://server2.com/file2";

    globus_module_activate(GLOBUS_FTP_CLIENT_MODULE);

    globus_ftp_client_handleattr_init(&handle_attr);
    globus_ftp_client_operationattr_init(&attr);

    globus_mutex_init(&lock, GLOBUS_NULL);
    globus_cond_init(&cond, GLOBUS_NULL);

    /* cache control channel connections for efficiency */
    globus_ftp_client_handleattr_set_cache_all(
      &handle_attr, (globus_bool_t) GLOBUS_TRUE);

    globus_ftp_client_handle_init(&handle,&handle_attr);

    globus_ftp_client_operationattr_set_mode(
      &attr, GLOBUS_FTP_CONTROL_MODE_EXTENDED_BLOCK);

    /* use 4 parallel data streams for high throughput */
    parallelism.mode = GLOBUS_FTP_CONTROL_PARALLELISM_FIXED;
    parallelism.fixed.size = 4;

    globus_ftp_client_operationattr_set_parallelism(
      &attr,&parallelism);

    /* use large TCP windows for WANs with latency */
    tcpbuffer.mode =  GLOBUS_FTP_CONTROL_TCPBUFFER_FIXED;
    tcpbuffer.fixed.size = 1024 * 1024 * 1;

    globus_ftp_client_operationattr_set_tcp_buffer(
       &attr, &tcpbuffer);

    done = GLOBUS_FALSE;

    /* tell the servers to transfer the file */
    result = globus_ftp_client_third_party_transfer(
        &handle,
        src,
        &attr,
        dst,
        &attr,
        GLOBUS_NULL,
        fileTransferCompleteCallback,
        0
        );

    /* wait until notified the transfer is complete */
    globus_mutex_lock(&lock);
    while(!done){
        globus_cond_wait(&cond, &lock);
    }
    globus_mutex_unlock(&lock);

    /* shut it all down and clean up nicely */
    globus_ftp_client_handle_destroy(&handle);
    globus_module_deactivate_all();

    return 0;

}

Before compiling this you need to generate a file to be included with your Makefile. Just do

export GLOBUS_LOCATION=/path/to/globus/installation
source $GLOBUS_LOCATION/etc/globus-user-env.sh

globus-makefile-header --flavor=gcc32pthr globus_ftp_client globus_ftp_control > makefile.include

And finally here is a simple Makefile for compiling your custom client:

include makefile.include

myClient: myClient.o
$(GLOBUS_CC) -o myClient myClient.o \
$(GLOBUS_LDFLAGS) $(GLOBUS_LIBS) $(GLOBUS_PKG_LIBS)

myClient.o: myClient.c
$(GLOBUS_CC) $(GLOBUS_CFLAGS) $(GLOBUS_INCLUDES) \
-c -o myClient.o myClient.c

That's all it takes to create a custom but high performance GridFTP client--less than 100 lines of C code (even less in Python or Java).

So if you have to move bits around your grid and you ever thought 'it sure would be nice if globus-url-copy worked exactly how I need it to work' then try creating your own customized client. It's easier than you might have thought.

December 21, 2007

Grid Truths

telecommute_000003564223XSmall.jpgThanks to Brand Autopsy, I'm reminded of Google's Ten Things statement. Some of the things they consider to be core to their success are also core to making grid computing work.

Focus on the user and all else will follow

For Google this means a clean interface, fast response and honest results. Globus Toolkit and Cluster Express follow this idiom really well.

In the case of Globus Toolkit, their focus has long been allowing users to gain access to a wide variety of resources in a user centric model. The philosophy should be that the user get a single sign-on and be able to use that to move data, get monitoring information, submit jobs and delegate authority to other entities on the grid.

For Cluster Express the idea is to take Globus Toolkit and make it dead simple to install even while combining it with some of the other best open source tools around like Grid Engine and Ganglia. These are the tools users and administrators need to operate a cluster. The result has been a thousand downloads in the first few weeks this user focused package has been available.

It's best to do one thing really, really well

Grid computing is about managing resources effectively. That's it. We're not about making hardware, finding oil reserves, curing cancer or projecting financial markets. Because of that application agnosticism, grid computing works in all those domains and hundreds more.

Democracy on the web works

Grid computing, especially in the open source world, works via standards and many people working together on many different pieces of a complete solution. Scientific users like those at Argonne National Lab and Fermilab collaborate on data storage and movement standards. The best ideas really do win and the results are solutions that scale to billions of files and transfer rates at theoretical maximums even on multi-gigabit rates.

You don't need to be at your desk to need an answer.

With previous generations of cluster management tools, an admin or user had to be at their own computer to understand in any detail what state their jobs were in. Even though the grid allows them to access resources across multiple computers, it still also allows them to manage their workload from web portals, releasing them from any particular desk.

The need for information crosses all borders.

One of the largest stumbling blocks users used to face in HPC was getting access to the data they needed. Tools like Reliable File Transfer Service, Replica Location Service and GridFTP allow information to be scheduled and moved on a global basis.

You can be serious without a suit.

Amen.

Great just isn't good enough.

Grid computing is 11 years old now and has been helping to do great science for most of them. But the developers and users keep pushing the envelope and finding new and better ways to get more done.

Expect that to continue next year.

December 17, 2007

Top Four Things Cisco Learned Working on Open MPI

cables_000003656956XSmall.jpgThis entry was written by guest blogger Jeff Squyres from Cisco Systems. I met him at SC07 when I attended his Open MPI presentation in the Mellanox booth. He did a great job, much better than most of the presentations at tech conferences, and agreed to share some of his thoughts on how big companies can work effectively in an open source project with our readers.

The general idea of my talk is to help address the answer "Why is Cisco contributing to open source in HPC?" Indeed, much of Cisco's code is closed source. Remember that our crown jewels are the various flavors of IOS (the operating system that powers Cisco Ethernet routers); many people are initially puzzled as to why Cisco is involved in open source projects in HPC.

The short/obvious answer is: it helps us.

Cisco is a company that needs to make money and has a responsibility to its stockholders. We sell products in the HPC space and therefore need a rock-solid, high-performance MPI that works well on our networks. Many customers demand an open source solution, so it is in our best interests to help provide one rather than partially or wholly rely on someone else to provide one. In particular, some of these interests include (but are not limited to):

  • Having engineers at Cisco who can providing direct support to our customers who use open source products
  • Being able to participate in the process and direction of open source projects that are important to us (vs. being an outsider)
  • Leveraging the development and QA resources of both our partners and competitors -- effectively having our efforts magnified by the open source community (and vice versa)
  • Shortening the time between research and productization; working directly with our academic partners to turn today's whacky ideas into tomorrow's common technology

Think of it this way: only certain parties can mass-produce high quality hardware for HPC (i.e., vendors). But *many* people can help produce high quality software -- not just vendors. In the context of this talk, customers (including research and academic customers) have the expertise and capability to *directly* contribute to the software that runs on our hardware. HPC history has proven this point. We'd therefore be foolish to *not* engage HPC-smart customers, researchers, academics, partners, competitors, ...anyone who has an HPC expertise to help make our products better. I certainly cannot speak for others, but I suspect that this rationale is similar to why other vendors participate in HPC open source as well.

Let's not forget that participation in HPC open source helps everyone -- to include the overall size of the HPC market. Here's one example: inter-vendor collaboration, standardization, and better interoperability means happy customers. And happy customers lead to more [happy] customers.

We have learned many things while participating in large open source projects. Below are a few of the nuggets of wisdom that we have earned (and learned). In hindsight, some are obvious, but some are not:


  • Open source is not "free" -- someone has to pay. By spreading the costs among many organizations, we can all get a much better rate of return on our investment.
  • Consensus is good among the members of an open source community (e.g., some members are only participating out of good will), but not always possible. Conflict resolution measures are necessary (and sometimes critical).
  • Just because a project is open source does not guarantee that it is high quality. Those who are interested in a particular part of a project (especially large, complex projects where no single member knows or cares about every aspect of the code base) need to look after it and ensure its quality over time.
  • Differences are good. The entire first year of the Open MPI project was a struggle because the members came from different backgrounds, biases, and held different core fundamentals to be true. It took a long time to realize that exactly these differences are what make a good open source project strong. Heterogeneity is good; differences of opinion are good. They lead to discussion and resolution down to hard, technical facts (vs. religion or "it's true because I've always thought it was true"), which results in better code.

True open source collaboration is like a marriage: it takes work. A lot of hard, hard work. Disagreements occur, mistakes happen, and misunderstandings are inevitable (particularly when not everyone speaks the same native language). But when it all clicks together, the overall result is really, really great. It makes all the hard work well worth it.

December 14, 2007

Ten Years of Distributed Computing with distributed.net

distributed.jpg

Grids come in many shapes and forms, and one of them is the Global Public Grid. Often presented in a philanthropic wrapper, these grids harness the power of thousands, if not hundreds of thousands of computers, often residential personal PCs. Among the most well-known to the general public are Seti@Home, Folding@Home, IBM's World Community Grid and United Devices' now-retired Cancer Research project. Apart from running the Cancer Research project as an employee of United Devices and my involvement with the World Community Grid as a vendor to IBM, I have been involved in a somewhat lesser known but much longer-running public grid project.

One of the longest-running public grid projects is distributed.net, a project that I have been part of since the inception in early 1997. Earlier this year, we celebrated 10 years of "crunching", contributing to various projects in such fields as cryptology and mathematics. It might be a lesser known project in the eyes of the larger public, but still has generated a lot of participation amongst computer enthusiasts and it even won a few awards, most notably a recognition of Jeff Lawson, my coworker and one of the founders of distributed.net, as the most notable person in IT for 1997 by CIO Magazine.

Back in 1997, distributed computing was a very novel concept, but it got jumpstarted by RSA's "RC5-32 Secret Key Challenge", which set out to prove that 56-bit RC5 was not a secure algorithm anymore due to increased speed of computers. In early 1999, distributed.net also proved that DES, another 56-bit algorithm, was getting weak by brute-forcing a secret message in 22 hours, 15 minutes and 4 seconds.

Over the years, distributed.net has undertaken 3 RC5 projects, 2 OGR projects and 3 DES projects, by utilizing over 300,000 participants running on 23 different hardware and software platforms, and it's still going strong. In October 2007, various staff members of our global team came to Austin Texas for a "Code-a-thon", working on the statistics back end to provide our "members" with better individual stats, and a new project that we're planning to roll out in the next couple months.

So if you have any spare cycles on your home computers to spare, why not give distributed.net a try.

December 12, 2007

Proper Testing Environments

Inspect your systems carefully before putting them in production!

It continues to amaze me how many businesses do not have tiered development environments.   Moreover, many of these same companies maintain a very sophisticated production environment with strict change management procedures.  Yet somehow they feel it is apt to keep inadequate staging environments. 

However, we know better: a truly supportive development environment, in the parlance of the agile-development community, must contain a series of independent infrastructures each serving a specific risk-reduction purpose. The idea is that by the time a release reaches the production environment, all of its desired functionality should have been proven as operational.  A typical set of tiers might include:

  • Development – an uncontrolled environment
  • Integration – a loosely controlled environment where the individual pieces of a product are brought together.
  • Quality Assurance – a tightly controlled environment that mirrors production as closely as possible.
  • Production – where your customers access the final product.

Check out Scott Ambler’s diagram on Dr. Dobb's for a supportive environment to see a logical organization of this concept.

So what happens if you cut corners along the way?   I am sure you all know what I am talking about.  Here are a couple of my past favorites (in non-grid environments):

Situation:  You combine quality assurance with the integration and/or the development environments.
Result: New releases for your key products inexplicably fail in production (despite having passed QA testing) because your developers made changes to the operating environment for their latest product.

Situation: You test a load balanced n-tier product on a small (two to three machines) QA environment.
Result:  The application exhibits infrequent but unacceptable data loss because updates from one system are overprinted by those from another.  This is particular onerous to uncover because the application does not fail in the QA environment.

Presumably, we grid managers do all that we can to provide test frameworks adequate enough to avoid problems such as these. There are many texts that discuss the best practices for supportive development environments. Unfortunately, I have found that many of us forget one of our core lessons: everything becomes much more complicated at grid-scales.  Consequently, we are perfectly willing to use a QA environment scoped for a small cluster.

In particular many of us prefer to limit our QA environments to a few computation nodes.  Thus we choose to run our load tests on our production infrastructure.  Conceptually, this makes sense: we cannot realistically maintain the same number of nodes in QA as are in production, so why keep a significant number when we will end up running some of our tests out there anyway?  Sadly, this approach severely complicates performance measures.

For example, assume that my test plans dictate that I run tests ranging from one to sixty-four nodes for a particular application.  If I run this in production, I am essentially getting random loads on the SAN, network, and even the individual servers to which I am assigned.  Consequently I have to run each individual test from the plan repeatedly until I am certain that I have a statistically significant sample of grid states.  Yet I have only defined my capacity on the grid for the average of the utilization rates during my testing.  Any changes to capacity on the grid such as a change in usage patterns or the addition of resources will invalidate my results.

Clearly, I need to run the application on a segregated infrastructure to get proper theoretical performance estimates.  The segregated infrastructure, like any QA environment, should match production as closely as possible.  However, in order to eliminate external factors that seriously affect performance, it is imperative that you use isolated network equipment as well as storage.  Another advantage of this approach is that we reduce the risk of impacting production capacity with a runaway job.  Similarly it takes a large number of test runs to produce numbers that hope to ignore current load factors and thus approach theory.    Obviously this may impact the grid users’ productivity.

As we noted earlier, we cannot justify a QA environment that is anything more than a fraction of production.  However I am certain that eight nodes is not enough. Certainly QA should contain enough nodes to adequately model the speed-up that your business proponents are looking for in their typical application sets.  It would not hurt to do some capacity planning at this point.  In absence of that, thirty-two computation nodes is the minimum size I would use for a grid which is expected to contain several hundred nodes.   

Finally, once we have a reasonable understanding of the theoretical capabilities of the application, then we should re-run the performance tests under production loads.  This will help us understand the lost productivity of our applications under load.  In turn this could help justify the expense of additional resources even if utilization rates cannot. 

I know you are asking, “how do I justify the expense of a large QA environment?”  Well, just think about the time you will save during your next major change to your operating systems and how you have to test ALL of the production applications affected before you migrate that change into production.  Would you prefer to do this on a few nodes, take several out of production, or just get it done on your properly sized test environment?

December 10, 2007

Cupboard Computing

Grid technology will help keep you from pulling your hair out (offer void where prohibited)

In a recent project, I worked with a large company as an impartial outsider and advisor, evaluating their grid computing effort(s). In their various business units, they used various kinds of distributed computing solutions that had spontaneously emerged over time, mostly out of necessity. Basically, they had a mix of everything from quick hacks in perl to specialized commercial solutions, where the annual software licenses cost more than the hardware it run on.

Understandably, the IT department wanted to do some pruning in this vast flora of solutions, take control of the situation and define a strategy and roadmap for a single, enterprise-wide grid computing infrastructure.

The investigation that followed gave me a fresh reminder how much cool stuff you are able to do with little effort: I found several examples where a couple of days of work had cut processing times by factors of 20 or 50, and how the additional processing power had quickly transformed the way in which people did business altogether, now that reports and computing results come back after an hour instead of the next morning, after a night of mainframe batch processing.

But, these deployments were made by the "wrong" people in the organization, and most often in response to urgencies, and these solutions have a tendency to quickly become permanent. This was exactly what had happened here. My personal favorite was one of their mission-critical applications where processing had increased in volume over time and in response people had added additional equipment by way of spare test equipment and old desktop clients that the IT department wasn't aware still existed (so no OS patches applied in the past 2 years).

Next, all this scrap equipment was connected via a series of power strips to a single electrical wall outlet and stacked up on a wooden IKEA shelf at the back of a non-ventilated small cabinet, originally intended for spare office supplies. Security, monitoring and failover were missing altogether, messages were exchanged via serialized python objects over raw TCP sockets. And so on. I don't know how many aspects of the company's established operations policies that this installation broke. But their self-invented load balancing algorithm worked flawlessly!

(I should point out that this had gradually evolved over a long period of time: initially, a small policy violation was excused in the panic of trying to find a quick solution to an urgent problem. Then the snowball started rolling and eventually everyone assumed that things were more or less OK as noone else reacted at the situation. Don't think this cannot happen at your workplace.)

Lessons? Probably many, but perhaps most important: Grid computing is less about technology than it is about operations and management. And, don't take the current trend of business-driven IT to an extreme. :-)

Footnote: One of the repeated complaints from the users that I've heard on general-purpose grid middlewares (the Globus Toolkit in particular) has been the size and complexity of the software distribution. I have begun to re-appreciate the complexity and rich feature set as it is required if you want to do grid computing, with security and monitoring and policies and lifecycle management and whistles and bells, as opposed to cupboard computing, which solves the fundamental business needs but will make your security officer and IT colleagues scream and jump out the window.

December 07, 2007

Get Get Get Get Get: Moving Files Faster Using GridFTP Pipelining

GridFTP moves data along multiple paths, like a highway!

Globus GridFTP moves bytes down the network pipe fast, especially when you have tuned the TCP window size and are using multiple parallel data channels. One can most easily see the performance gains when moving large files across WANs with larger latencies.

If you have tried, however, moving lots of small files (LOSF) across a WAN with larger latencies you have noticed that the GridFTP performance degrades considerably. Depending on just how small the files being transferred are you can easily lose an order of magnitude in total throughput. This performance hit for moving many small files has come to be known as the GridFTP LOSF problem.

The problem with moving LOSF on high latency networks is that after the last bytes for a file have arrived on the client side no more bytes are transferred from the server until the client can send the next FTP 'GET'. The delay between the time when the server sends the last bytes of file until it has received the next GET and sends the first bytes of the next file can be long enough so that the TCP stack(s) begin to close the TCP window. When the next file transfer finally starts the TCP stack(s) have to ramp up again to the appropriate size and "fill the pipe" anew.

The obvious workaround is a good one--if you can move one file that is one terabyte (TB) in size rather than a million files each 1 megabyte in size you should do it. If you can easily structure your data flow to move larger files than GridFTP will reward you.

But what if you don't have the flexibility to chunk you data into larger files and you are stuck needing to move LOSF?

The GridFTP guys have a solution for you in the development release of the Globus Toolkit (GT). The solution is client support for pipelining--many 'GET' commands can be sent to server at one time so that the server just keeps sending bytes down the pipe. With the constant stream of bytes going into the pipe the TCP window doesn't get a chance to close and the throughput remains high.

You can find some nice papers to read about pipelining in GridFTP but if you are like me you want to try this on your own network and get the hard numbers.

Don't be afraid that the pipelining is only available in the development release of the GT. It is easy to try it out, especially since the server side functionality has been in place for a while in the GT 4.0.x series. So you can use the GridFTP servers you already have deployed for testing the new client pipelining functionality.

Follow this easy recipe to give pipelining with GridFTP a test run. You will need two production GridFTP servers (one on each side of the WAN) and the appropriate credentials and authorization in place (at this time the client in the development version only works with third-party transfers between two servers):

- download the latest development release of GT

- tar -zvxf gt4.1.2-all-source-installer.tar.gz

- cd gt4.1.2-all-source-installer

- export GLOBUS_LOCATION=/home/pipelinetest

- ./configure --prefix=$GLOBUS_LOCATION

- make gridftp

- make install

- source $GLOBUS_LOCATION/etc/globus-user-env.sh

To see the pipelining in action pick a directory of "small files" on the server side and use the globus-url-copy client from the development GT to move the directory(ies) to a location on the second server:

globus-url-copy -vb -pipeline -p 8 -tcp-bs 1048576 -recurse gsiftp://server01.com/path/to/data/directory/ gsiftp://server02.com/path/to/dest/directory/

After the usual handshake is done you will see output like this:

Pipelining:     gsiftp://server01.com/path/to/data/directory/file01     gsiftp://server02.com/path/to/dest/directory/file01

Pipelining:     gsiftp://server01.com/path/to/data/directory/file02     gsiftp://server02.com/path/to/dest/directory/file02

Pipelining:     gsiftp://server01.com/path/to/data/directory/file03     gsiftp://server02.com/path/to/dest/directory/file03

Then the bytes will start moving at the rates you would normally see for a single large file and your throughput for LOSF will be much improved across the WAN with high latency.

December 06, 2007

Core Concepts: Virtual Cluster Appliances

robot-handshake_000003470462XSmall.jpg

In part two of the series, I want to flash forward to the present to discuss some recent challenges we are addressing for users of the Workspace Service. There will be plenty of time to talk about more fundamental things like "why are people even interested in using VMs for grid computing?" (for now, you could have a look at this paper for some general arguments).

To the fun stuff. Virtual Cluster Appliances are VMs that automatically and securely work together in new contexts. A virtual appliance is commonly defined as a VM that can be downloaded and used for a specific purpose with minimum intervention after booting locally. We could say then that a virtual cluster appliance is a set of VMs that can be deployed and used for a specific purpose with minimum (most usefully ZERO) intervention after booting.

Say you would like to start a VM-based cluster and treat it like a new, dynamically-added site on your grid. This will probably have several specialized nodes to do cross-network/organizational work and any number of "work" nodes to run jobs.

A very simple example cluster might have a head node VM image with GRAM, Torque, and GridFTP. And then N compute nodes with Torque's agent on them, all started from a copy of the same VM image.

You already have obtained or created some VM template images, you have access to hypervisors (and time reserved on them), and all of the middleware is in place to deploy everything securely and correctly. (As we will discuss in the series, all of these things might be safe to compartmentalize conceptually but are not trivial to get "right")

But even with all of this in place, how do you start these disparate VMs as a working cluster at a new grid site? Some issues:

  • The network addresses assigned to the VMs in all likelihood be different on different deployment sites, so any configurations based on IPs or hostnames will not work. How does the shared filesystem know which hosts to authorize? How does the resource manager (job scheduler) know where to dispatch jobs? What about X509 certificates that need hostnames embedded in their subjects? Keep adding questions in this vein as the cluster topology becomes more complex.

    You could conceivably try and allow the same image bits to work anywhere by using DNS facades and IP tunnels on the "outside" of the VMs. But this runs into a number of problems, the main ones being performance and the fact that in many applications the local network addresses that the VM "sees" are placed high in the stack of many messages. Using NAT or ALG technologies would either be cumbersome, require a lot of application casing, or would not even work in many application cases (for example when message payloads are signed/encrypted).

  • There may often be site provided services a) that the VM will need to know how to contact, b) that the VM will need credentials in order to contact, and c) that the VM will itself need to authenticate once contacted.

  • There are many cases where a particular onboard policy or configuration is not related to the network or the current deployment site. But it is still very helpful to have the ability to start different clusters with different information specified at deployment time.

    For example, trusted VM template images could be made available to a wide number of users who "personalize" them at deployment, for example only allowing the correct entities to submit work. This could be described as a VM being "adapted to an organization."

    Another example: with contextualization technology the very roles the VM is playing can be picked at deployment time and the same VM image (equal bits) can differentiate into different things after booting based on the instructions provided by the client.

The main idea behind the core contextualization technology is not hard to get. When deploying a workspace that is involved in contextualization, a piece of metadata is included that specifies labels which we call roles. A role is either provided or required. Specific roles are opaque to the contextualization infrastructure which only treats them as unique keys. The only place they are meaningful is inside the VM and when the provided metadata is constructed.

Let's use Torque as an example. The two roles could be named for example "torquemaster" and "torqueslave". You deploy one workspace that provides the "torquemaster" role and 100 that provide the "torqueslave" role. And vice versa for requires. For example, each of the slaves requires knowledge of the torquemaster role.

As you initially deploy different VMs, the metadata about their required and provided roles is added to a secure context. This context serves as a "blackboard" for incoming information that fills in the "answers" to these requirements.

As information becomes available from the deployment infrastructure (e.g. network addresses), from the VMs themselves (e.g. individual SSHd public host keys that were generated at boot which is necessary in the Torque case), or from the grid client (e.g. grid-mapfile contents), it is added to the context. The VMs then retrieve information from the context service specifically tailored for them.

For example the VM destined to be the Torque server contacts the context service and retrieves its list of Torque clients. The contextualization script on the VM looks for a local script called 'torqueclient' and invokes it with a known parameter list. In this manner, you can enable new roles on your VMs without needing to bother with any of the context service interaction. And this is why you can use one VM image to play several different roles (you can house config scripts for many roles on the VM but only a certain subset of them could be 'activated' by the deployment metadata specification).

I left out a lot of the mechanics/ordering here which will be covered in subsequent entries. Right off the bat you may be asking yourself "how does this all happen safely without trusted keys set up out of band" and the answer is that it doesn't, that is set up before the call to the context service. You're also asking "isn't it bad to require the VM to be rigged with a context service address to contact ahead of time?" and the answer is that it isn't actually rigged but supplied at boot.

The crux of the solution for these issues is that a) one can bootstrap VM deployments with enough information to set up secure channels and b) one can do this bootstrapping at deployment time thus not interfering with the very powerful mechanism of using secret-less, policy-less VM templates. We'll look at these topics in future entries.

The current implementation is being used to start 100 node clusters deployed by the STAR community on Amazon's Elastic Compute Cloud (EC2). The on-VM context script is a standalone Python file which is slim on dependencies which should suit all but the most extremely bare VMs. As Kate mentions in the recent TP1.3 release announcement, we are working on including this technology in an upcoming release!

December 04, 2007

What You Need to Know About Cluster Express 3.0

volunteers_000000723468XSmall.jpg

At Supercomputing 2007, Univa UD launched Cluster Express 3.0 beta. If you were at SC'07, you might have attended one of my demos on Cluster Express, but if you missed it, then this blog post is for you. I will cut through the marketing speak for you and, as an engineer who worked on the CE3.0 release, tell you what Cluster Express can mean to you. 

Cluster Express is designed to be your one-stop-shop for a full cluster software stack. This means we bundle the scheduler, the security framework, the cluster monitoring and an easy installer that will configure everything out of the box. On top of that, the whole solution is open sourced, including all the code that Univa UD contributed to the stack. You can go to our new community at www.grid.org and download the CE3.0 beta and its sources right now.

So let's go through all the components in more detail.

Installer
Our installer is an very simple utility that will ask you less than 5 questions, after which it will go off and install the main nodes, the execution nodes, and any remote login nodes. It then will tie all these nodes together through a bootstrap service that is installed on the main node. This lets all the other nodes retrieve configuration information from the main node. The end result is that a fully configured cluster emerges, with sensible default configuration for the Grid Engine scheduler and the Ganglia monitoring, and all the certificates for security and authentication set up properly.

Scheduler
We bundle Grid Engine and the installer will configure all the nodes in such a way that after running the installer on an execution node, this node will be part of the cluster automatically, including sensible defaults for queues, and communication and scheduling settings.

Monitoring
We bundle, install, configure, and use various cluster monitoring tools such as Ganglia and ARCo and tie everything together in a custom Monitoring UI that we wrote and delivered as part of the CE3.0 release. The Monitoring UI is not a third-party bundled tool but really a new add-on to our solution. It brings together the system level statistics that Ganglia offers with the job level statistics that ARCo logs from Grid Engine. By presenting them together in one UI, you can cross reference jobs with the nodes that they ran on, and  the loads on that host. This will allow you, for instance, to instantly realize what the impact of running a job or task is on a certain nodes, in real-time and through an easy-to-use graphical UI.

Security
We bundle and pre-configure many Globus Toolkit components such as MyProxy, Auto-CA, RFT, WS-GRAM, GridFTP and GSI-OpenSSH. Auto-CA and MyProxy are completely configured out of the box, so that the only thing you need to do is a simple myproxy-logon to acquire a token that is valid for use with all the other Globus commands such as globus-url-copy or globusrun-ws. The level of integration that we accomplished for all the GT components will definitely impress you, especially if you've been a Globus user before.

Putting it all together
As said, the full bundling of all the above mentioned components in a tarball with an easy-to-use installer now makes setting up a fully featured cluster as simple as downloading one file and running one command. This is really as easy as we could make it! And  on top of that, everything is open-sourced, including our own add-ons such as the installer and configuration scripts, and the Monitoring UI.

I hope that I can welcome you soon on our new community website around Cluster Express at www.grid.org. You can download the CE3.0 tarball there, and participate in forums, add to our wiki, or get support through our mailing lists.

I'm user "Leto" on grid.org, please don't hesitate to send me a private message there if you need any help at all.

December 03, 2007

How to Enable Rescheduling of Grid Engine Jobs after Machine Failures

Jobs are packaged up and shipped to available resources in the case of a failureCheckpointing is one of the most useful features that Grid Engine (GE) offers. As status of checkpointed jobs is periodically saved to disk, those jobs can be restarted from the checkpoint in case they do not finish for some reason (e.g., due to a system crash). In this way, any possible loss of processing for long running jobs is limited to a few minutes, as opposed to hours or even days.

When learning about Grid Engine checkpointing I found the corresponding HowTo to be extremely useful. However, this document does not contain all the details necessary to enable checkpointed job rescheduling after machine failure. If you'd like to enable that feature, you should do the following:

1) Configure your checkpointing environment using “qconf -mckpt” command (use “qconf -ackpt” for adding a new environment), and make sure that the environment’s “when” parameter includes letter ‘r’ (for “reschedule”). Alternatively, if you are using the “qmon” GUI, make sure that the “Reschedule Job” box is checked in the checkpoint object dialog box.

2) Use “qconf -mconf” command (or the “qmon” GUI) to edit the global cluster configuration and set the “reschedule_unknown” parameter to a non-zero time. This parameter determines whether jobs on hosts in unknown state are rescheduled and thus sent to other hosts. The special (default) value of 00:00:00 means that jobs will not be rescheduled from the host on which they were originally running.

3) Rescheduling is only initiated for jobs that have activated the rerun flag. Therefore, you must make sure that checkpointed jobs are submitted with “-r y” option of the “qsub” command, in addition to the “-ckpt < ckpt_env_name >” option.

Note that jobs that are not using checkpointing will be rescheduled only if they are running in queues that have the “rerun” option set to true, in addition to being submitted with “-r y” option. Parallel jobs are only rescheduled if the host on which their master task executes gets into an unknown state.