
November 2007

November 27, 2007

Unleash the Monster: Distributed Virtual Resource Management


Recently we explored the concept of a virtualized grid: a system where the computation environment resides on virtualized operating environments. This approach simplifies the support of the grid-user community’s specialized needs. Further, we discussed the networking difficulties that arise from instantiating systems on the fly, including routing and the increased network load that distributing images to hypervisors would create. However, we have not yet discussed how these virtualized computational environments would come to exist for the users at the right time.

The dominant distributed resource management (DRM) products do not interact with hypervisors to create virtual machines (VMs). Two notable exceptions are Moab from Cluster Resources and GridMP from Univa UD.  Moab supports virtualization on specific nodes using the node control command (mnodectl).  However, the VMs are not created on available nodes as needed.

Consequently, grid users who wish to execute their jobs in a custom execution environment will have to follow this procedure:

  • Determine which nodes were provided by the DRM's scheduler.  If any of these nodes are running default VMs for other processes, those VMs may need to be modified or suspended in order to free up resources;
  • Create a set of virtual machines on the provided nodes;
  • Distribute the computation jobs to each of those machines once the machines are confirmed to have entered a usable state;
  • Monitor the computation jobs for completion; and
  • Finally, once the jobs are certain to be complete, tear down the VMs.  Any VMs that existed beforehand may need to be restored.
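The procedure above can be sketched as a short shell script. This is only an illustration of the bookkeeping left to the user: every helper below (list_granted_nodes, create_vm, vm_is_ready, run_job_on, destroy_vm) is a hypothetical stub that prints what a real, site-specific command would do, and the node names are made up.

```shell
# Hypothetical stubs: stand-ins for site-specific DRM and hypervisor commands.
list_granted_nodes() { printf '%s\n' node01 node02; }   # ask the DRM which nodes we got
create_vm()          { echo "create vm on $1 from $2"; }
vm_is_ready()        { return 0; }                      # readiness probe, always "ready" here
run_job_on()         { echo "run job on $1"; }
destroy_vm()         { echo "destroy vm on $1"; }

run_virtualized_job() {
    local image=$1 node nodes
    nodes=$(list_granted_nodes)

    # Create one VM per granted node.
    for node in $nodes; do
        create_vm "$node" "$image"
    done

    # Start the job on each VM only once that VM is usable.
    for node in $nodes; do
        until vm_is_ready "$node"; do sleep 5; done
        run_job_on "$node" &
    done

    # Monitor for completion: block until every job has finished.
    wait

    # Tear the VMs down again.
    for node in $nodes; do
        destroy_vm "$node"
    done
}
```

Calling run_virtualized_job custom.img walks through the steps in order. Note what the sketch does not do: it never tells the DRM which resources the VMs consume, and it does not restore VMs that were on the nodes beforehand.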

Sadly, the onus is on the user to guarantee that there are sufficient images for the number of requested nodes.  The user is also required to tell the DRM which resources the job will take during the computational process.  If this is not done, additional processes could be started on the same node and resource contention could result.

Beyond shouldering these extra responsibilities, grid users will also lose many of the advantages that resource managers typically offer.  No efficiency is gained from managing the VMs' resources beyond their single use.  If a particular environment could be used repeatedly, that reuse must be managed by the user.  Also, the DRM can only preempt the job that started the virtual machines, which in turn started the computational jobs.  If this process is preempted, then neither the computational job nor the VMs will be affected.  If other jobs are typically run on a default VM, there could be issues.  Finally, the user may lose some of the more sophisticated capabilities built into the resource manager (such as control over parallel environments).

All of these issues could be solved by tightly integrating the DRM with the dominant VM hypervisors (managers).  The DRM should be able to start, shut down, suspend, and modify virtual environments on any of the nodes under its control.  It should also be able to query the state of the physical machine and all of its operating VMs.  Ideally, the industry or our community would come to consensus on an interface that all hypervisors should expose to the DRM.  If we put our minds to it, we could describe any number of useful features that a DRM could provide when integrated with virtual machine managers; these concepts simply need to be realized to make this architecture feasible.

Here are my thoughts about what a resource manager in a virtualized environment might provide:

  • It could roll back an image to its start state after a secure process was executed on it.
  • It could be aware of the resources each VM is limited to so that it could most efficiently schedule multiple machines per physical node.
  • It should distinguish between access-controlled VMs and public instances to which it may schedule any jobs.
  • It should stage the booting of VMs so that we do not flood the network by transferring operating system images.  A sophisticated DRM might even transport images to local storage before the node's primary resources are free.  Readers of the previous posts will recall that the hypervisor interactions should be on a segregated network so as not to interfere with the computational traffic. 
  • It could suspend VMs as an alternative to preempting jobs.  Similarly, it could suspend a VM, transport its image to another physical node, and restart it.  If the DRM managed output files as resources, it could prohibit other processes from writing to the files still open from the suspended systems.
  • It could run specialized servers for two-tier applications and modify the resource allocation for the VM should it become resource constrained.
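To make the staged-booting idea concrete, here is a minimal sketch, assuming a hypothetical boot_vm command that transfers an image to a node and boots it. The function name boot_in_batches and the batch size are illustrative choices, not features of any existing DRM.

```shell
# Hypothetical stand-in for whatever transfers an image and boots the VM.
boot_vm() { echo "booting VM on $1"; }

# Boot VMs on the given nodes, at most batch_size at a time, so that
# only batch_size image transfers share the network at once.
boot_in_batches() {
    local batch_size=$1 count=0 node
    shift
    for node in "$@"; do
        boot_vm "$node" &
        count=$((count + 1))
        if [ $((count % batch_size)) -eq 0 ]; then
            wait                    # let the current batch finish first
            echo "batch complete"
        fi
    done
    wait                            # finish any partial final batch
}
```

For example, boot_in_batches 2 n1 n2 n3 n4 n5 starts two boots, waits for both, starts the next two, and then handles the last node on its own. A real DRM would presumably size the batches from measured network capacity rather than a fixed number.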

I am sure that other grid managers could improve on this list and add other excellent ideas to it.

In summary, we have examined the flexibility that a grid with virtualized nodes provides.  As clusters evolve from dedicated systems for a homogeneous user community into grids serving a diverse set of user requirements, I believe that grid managers will require the virtualized environment that we have been exploring.  Clearly the key to creating this capability is to integrate hypervisors into our resource managers; without that integration, VM management is simply too complicated for the scale we are targeting.

Thus far, nothing that we have explored helps us manage and describe the dynamic system that this framework requires (as I am sure you have noticed).  Is this architecture a Frankenstein's monster that will turn on its creators?  Next time we will explore how we might monitor and create reports for a system that changes from one moment to the next.

November 19, 2007

Five Ways to Improve Your Hiring Tactics

The company I work for, Univa UD, is hiring and I was sitting down with one of the managers to talk about approaches. Since long before I joined Univa UD, I've been very interested in recruiting as I ran a few small companies and hired on an international basis. Recruiting is the single most important thing that we do. Everything else -- serving our customers, building insanely great products, profiting or creating a fun workplace -- is the result of hiring well.

Talking with that hiring manager, we put together five essential tactics for managing the candidate acquisition process:

  • When using an online system, buy a multi-month plan. Even if you are feeling a cost crunch, it's unrealistic to believe that the right candidate will walk through the door during the first couple of weeks that you're engaged in a search. If you are staffing more than one position, this is even more true. We will be hiring for more than a month, so we're going to buy access for more than a month.
  • Spread the work across multiple hiring managers. Recruiting is work. This is important, so I'll say it again. Recruiting is work. Treat it like the important work that it is and make sure the right people are involved in the process. If there are folks who are wordsmithing geniuses, get them involved in the production of the posts. If you have people who are brilliant at interviewing, make sure they are talking with candidates even if those candidates will report to another manager. Conversely, your company's success depends on everyone performing well. Be generous with your time and add value to the hiring processes of the other teams in your company.
  • Spend a couple hours reading sites like Copy Blogger. It has great tips on making your writing better and, let's face it, a help wanted ad is marketing. We need to stand out among the thousands of other companies if we want to attract the best people.
  • Update your ads at least once each week. This shouldn't be a rewrite, but each time you update your ad it pops back to the top of the search stack. This may seem like gaming the system, but it works. When I had a break in hiring this summer and I stopped doing this I noticed immediately the tail off in responses as each week ticked by.
  • Don't post a requirements list, tell a story. The posting may well be the only contact you have with people. The posting needs to draw them in and compel them to make the next step and respond to your ad.

And, in the spirit of giving out a free lunch for Thanksgiving, a bonus tip:

  • Find more ways to get the word out. Use your blog, LinkedIn account or Facebook to let people in your community know that you are hiring. Reach out to people in as many ways as you can think of. The IT market is competitive again, and you can't stand still waiting for people to come to you and expect to make great hires. Get out there and be great at recruiting; it's the most important thing you can do!


November 16, 2007

How to Improve qconf Productivity


qconf is without a doubt already a very powerful tool when it comes to administering your Grid Engine installation, but with a little shell foo, it can become even more powerful.

How often have you started to type 'qconf -mq ' before realizing you don't know the exact name of your queue, necessitating a quick 'qconf -sql' first? After Dan Templeton shared a very useful Grid Engine Cheat Sheet with me a few weeks ago (also see this announcement on gridengine.info), I realized that many commands share the same drawback.

Well, bash's autocompletion framework can be put to good use here. See, many people don't know that bash's autocompletion can complete just about anything, not just filenames.

You can download my qconf_completion.sh script from my website. Simply drop it in a directory such as $SGE_ROOT/util and add the following line to the end of your $SGE_ROOT/$SGE_CELL/common/settings.sh:

. $SGE_ROOT/util/qconf_completion.sh

Now you go from the unwieldy:
qconf -mq
qconf -mq^H^H
qconf -sql
qconf -mq thisodd.q

To:
qconf -mq t[TAB] and you're done.

So far I've only implemented most of the interactive options, and have not worked on autocompleting options such as -aattr or -mattr yet, although I see huge possibilities for productivity improvement there. By the way, this exercise was a great way to re-appreciate the intricate regularity that underlies the option set.
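For readers curious how this kind of completion works, here is a minimal sketch for bash. It is not the qconf_completion.sh script mentioned above: the function name _qconf_queue_names is made up for illustration, and it only completes queue names after -mq, -sq and -dq, using the output of 'qconf -sql'.

```shell
# Sketch only (requires bash): complete queue names after a few qconf options.
_qconf_queue_names() {
    local cur=${COMP_WORDS[COMP_CWORD]}       # word being completed
    local prev=${COMP_WORDS[COMP_CWORD-1]}    # option typed just before it
    COMPREPLY=()
    case $prev in
        -mq|-sq|-dq)
            # 'qconf -sql' prints one queue name per line; keep those
            # that start with what has been typed so far.
            COMPREPLY=( $(compgen -W "$(qconf -sql 2>/dev/null)" -- "$cur") )
            ;;
    esac
}

# Ask bash to call the function whenever qconf arguments are completed.
complete -F _qconf_queue_names qconf
```

The same pattern should extend to other options by adding case branches that call the matching list command, for instance one of the other 'show list' options of qconf.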

Enjoy, and leave a comment if you like the script or have suggestions for improvement.

November 14, 2007

HealthGrid Comes to Chicago, June 2-4 2008


The opportunities to apply grid  computing methods in health care are, simply put, enormous. (Irving Wladawsky-Berger refers to it as the "ASCI of Grid" to imply that the challenges are comparable in their extreme scale to those tackled by the DOE ASCI program in simulation. That is an understatement.) There is an urgent need for community, best practices, standards, and the like.

These considerations motivated the formation of the HealthGrid.US Alliance (HG.US), a partnership of scientific, medical and technology professionals from academia, industry and government, whose shared mission is to promote the application of advanced information technology to solve cutting-edge problems in Biomedical Science and Healthcare. HG.US is an affiliate of the international HealthGrid Association.

As a first action, HG.US is sponsoring the first HealthGrid Annual Meeting to be held outside of Europe, in Chicago, Illinois, USA, June 2-4 2008. See the announcement (pdf). The previous five meetings (2003-2007), held in Europe, have formal published proceedings that are also available from the website.

Many biomedical and health related problems are characterized by diverse collaborators needing access to great quantities of complex heterogeneous data, which is distributed across multiple computing systems, maintained by loosely connected institutions, often across international boundaries. Example projects addressing these challenges include sharing datasets to enable a cure for cancer (caBIG, ACGT) and science portals that enable neuroscientists to better visualize the morphology of the brain (BIRN). These and other projects have begun to demonstrate the power and potential of the Grid approach in biomedicine.

Initially, Grid technology development was driven by the computing needs of the particle physics research community and enabled by the availability of high-performance networks. The term "grid" rapidly evolved toward a concept of ubiquitous and transparent computing to support a wide variety of applications, building on the well-known metaphor of the pervasive "electricity grid". Today, the HealthGrid space represents some of the most interesting drivers for progress in knowledge-based ubiquitous and transparent computing.

The international HealthGrid Association, based in Europe, provides a firm conceptual foundation for efforts in the US and is fully supportive of the HealthGrid.US Alliance. A HealthGrid white paper articulates the broad scope of the concept. US government agencies have begun to develop complementary strategies. These have been captured in TATRC's Integrated Research Team strategic report on HealthGrid: Grid Technologies for Biomedicine and the US Government interagency HealthGrid Core Strategic Planning Group.

November 13, 2007

Grid.org Relaunches

For a number of years United Devices operated grid.org as a philanthropic site for cancer research. That mission was completed earlier this year and Univa UD has relaunched the domain to expand the scope of the project to open source cluster and grid management. This will allow many people who want to do large scale computing, but haven't had the ability to use existing tools, to download an easy-to-use cluster management suite that will allow them to run a variety of applications.

The press release associated with the launch follows:

RENO, Nev. (Nov. 13, 2007) – An online community for open source grid and cluster users, administrators and developers debuted today on the Internet at http://grid.org

Grid.org provides the single aggregation point for information and interaction by the community of users, administrators, and developers interested in a complete open source grid and cluster stack.

The site sponsor, Univa UD, unveiled Grid.org during the Supercomputing ’07 conference.

“The site has been built to support the needs of users of the open source Cluster Express release from Univa UD, which includes many open source components including Grid Engine, Globus and Ganglia,” said Steve Tuecke, co-founder and chief technology officer at Univa UD and a primary architect of the Grid.org community.  “By aggregating information from many distinct open source grid and cluster efforts and facilitating interaction between users who have historically been left on their own to struggle with the integration of these components, Grid.org should be a valuable additional resource not just for new grid and cluster users but also for members of the current open source communities.”

Grid.org is designed as a destination for community members who want to connect easily and productively with those who have similar interests and who want to engage in a vibrant, functioning community of active participants.  At Grid.org, community members can engage with others to discuss issues as well as give and receive help and contribute to the Cluster Express open source software project.

It also will be a resource for professionals who want to learn more about open source grid and cluster computing in general.

Besides providing links to other open source grid and cluster sites, Grid.org will include areas for participants to build their personal professional networks, participate in forums and blogs, access white papers and case studies, explore upcoming events and download free Univa UD Cluster Express open source software for integrated cluster management.

“We are hosting Grid.org to promote the broad adoption of open source grid and cluster technologies,” said Dr. Ian Foster, co-founder and chief open source strategist at Univa UD.  “There are many very good resources relating to open source today, and we want to provide a single site that lets the community navigate this wealth of information and build on it.”

Grid and cluster pioneers will recognize Grid.org as the Web site where Univa UD precursor United Devices operated a public interest Internet research grid with connections to more than 3.6 million devices worldwide, with its primary mission being to demonstrate the power of early grid technology. In this capacity, Grid.org processed data related to cancer, smallpox, and human genome research among other projects.

About Grid.org
Established in 2001, Grid.org is an online community for open source grid and cluster software users, administrators and developers. The site’s current mission is to work with community members to broaden the reach of the site and encourage use of open source technologies for grid and cluster computing at large.  The site provides a single location where open source grid and cluster information can be aggregated so that people with a similar range of interests can easily exchange information, experiences and ideas related to Univa UD’s complete open source grid and cluster software stack.

About Univa UD
Univa UD is the leading provider of open source products for grid and cluster computing environments.  The company’s industrial-strength offerings range from departmental and HPC cluster management to enterprise-wide grids, and represent the proven and cost-effective alternative to traditional proprietary products that customers have been waiting for.  Based on a combination of open source and proprietary components, Univa UD offerings include a downloadable open source cluster management product, a proprietary cluster product with rich functionality, and a comprehensive enterprise grid product based on award-winning technology.  All Univa UD products are run by Fortune 1000 companies in large-scale, production environments.  Univa UD is headquartered in Lisle, Ill. with offices in Austin, Texas. For more information, contact Univa UD at 1-800-370-5320 or visit us at www.univaud.com.

November 07, 2007

Hookin' Up is Hard to Do


Previously we discussed the tension that grid managers face when supporting various stakeholders on an enterprise grid.  In particular we concluded that providing isolated virtual operating environments to each of the business units operating in your environment would be the easiest way to meet their competing and divergent needs.  In this post we will explore the networking challenges that a grid of virtualized systems poses.

The primary challenge you face in this architecture is how to connect it all together.  At first glance it seems simple enough: take your current grid, install a hypervisor on each of its nodes, and then start implementing your users’ specific environments.  Sadly, this will probably not work.

In a typical grid you already have to consider the challenges of connecting several hundred compute nodes to one another and a storage network while keeping network latency low. 

In order to illustrate the networking problems you would have in a virtualized grid, consider a system with a significant number of nodes used by several operational units.  For example, imagine a large financial services company that provides banking, brokerage services, insurance, mortgage, and financing.  Each of these business lines, while related, has its own distinct set of business application workflows.  While there may be some overlap of the specific applications used by each of the units, there is little guarantee that each group will use those applications in the same way, let alone use the same versions.  Worse yet, a business unit may have multiple operational workflows which do not operate in similar environments (e.g. Windows- versus Linux-specific application suites).  Finally, we grid managers would like to have development, test, and production instances segregated but running on the same hardware.

It is easy to project having to support at least ten times more virtual than physical operating environments.  The actual number should be proportional to the number of unique operating environments required by the users.  In a standard grid you have a fixed set of computational resources that are reasonably static; in other words, systems do not appear and disappear on a regular basis.  However, in the virtualized grid, operating environments are going to appear and disappear as a function of the business workflows scheduled by your users.  You can imagine how quickly this can become complicated.

What is the best way to deliver these operating environments to the physical hardware?  If we keep all of the images on local disk, then we need to guarantee that there is sufficient disk space on each node, a practice which not only can be costly but also does not scale well.  If, for each operating environment, we keep no more images than the maximum number of nodes used by any application in that environment, we can reduce the number of virtual machines we require.  Of course this implies that these images are either stored on a SAN or are transported to the individual physical nodes before booting the virtualized environment.  Sadly, both of these approaches significantly increase network loads.  We will discuss scheduling and managing individual virtual machines in subsequent posts.

How do we connect these virtual environments? If these systems were on segregated physical hardware (think Microsoft Windows versus Linux) we would likely keep them on their own network and/or VLANs.  After all, these environments generally should not interact with one another.  Consequently, shouldn’t we also do this for the virtualized grid?  If we chose not to and instead used DHCP based upon physical topology to provide addresses to the virtualized environments, we could quickly run into trouble.  Specifically, a single job executed on n nodes could conceivably land on n distinct networks and/or VLANs.  This would significantly increase the size of the broadcast domain as well as require more work from your network switches.  Therefore it would add significant latency to all communications between the nodes. Clearly this is a poor choice unless you are always using most of your nodes for each job.

Thus my preferred solution is to segregate operational environments, so that every physical node bridges traffic for several distinct networks over the same interface.  Addresses would be assigned by virtual MAC address rather than physical location.  As the counter-example shows, this is necessary because we will not be able to guarantee where on the physical network topology a particular job is scheduled.  In fact, we probably want to use VLAN tags on our packets so that our switches can operate more efficiently.  Additionally, if your grid nodes have secondary interfaces, all communication with the hypervisor should be segregated to its own management network.

If this has not scared you away from the concept of the virtualized grid (I hope it hasn’t), we will continue to explore other hurdles inherent in this architecture in future posts.

November 05, 2007

How to Decipher Grid Engine Statuses – Part II

In Part I of this article I discussed the meanings of the various queue states that one might see after invoking the Grid Engine qstat command. The list of possible job states is just as long as the list of queue states:

• d (deletion) — Indicates that a job has been deleted using qdel.

• r (running) — Indicates that a job is executing.

• R (restarted) — Indicates that the job was restarted. This state can be caused by a job migration or because of one of the reasons described in the -r section of the qsub man page.

• s (suspended) — Shows that an already running job has been suspended using qmod.

• S (suspended) — Shows that an already running job has been suspended because the queue that it belongs to has been suspended.

• t (transferring) — Indicates that a job is being transferred to an execution host and is about to be executed.

• T (threshold) — Shows that an already running job has been suspended because at least one suspend threshold of the corresponding queue was exceeded.

• w (waiting) — Indicates that the job is suspended pending the availability of a critical resource or specified condition.

• q (queued) — Indicates that the job has been queued.

• E (error) — Indicates that the job is in the error state. You can find the reason for this state using the qstat command with the “-explain E” option.

• h (hold) — Indicates that the job is not eligible for execution due to a hold state assigned to it via the qhold, qalter, or qsub -h command.

Just like with queue states, one also frequently encounters various combinations of the above job states.
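As an illustration of reading such combinations, the sketch below expands a combined state string letter by letter using the names from the list above. The function name decode_job_state is hypothetical, and the label "suspended(queue)" is only there to tell S apart from s.

```shell
# Expand a combined qstat job state such as "Rr" or "hqw" into the
# names listed above, one letter at a time.
decode_job_state() {
    local state=$1 letter names=""
    while [ -n "$state" ]; do
        letter=${state%"${state#?}"}    # first character
        state=${state#?}                # remaining characters
        case $letter in
            d) names="$names deletion" ;;
            r) names="$names running" ;;
            R) names="$names restarted" ;;
            s) names="$names suspended" ;;
            S) names="$names suspended(queue)" ;;
            t) names="$names transferring" ;;
            T) names="$names threshold" ;;
            w) names="$names waiting" ;;
            q) names="$names queued" ;;
            E) names="$names error" ;;
            h) names="$names hold" ;;
            *) names="$names unknown($letter)" ;;
        esac
    done
    echo "${names# }"                   # drop the leading space
}
```

For example, decode_job_state hqw prints "hold queued waiting", which reads as a queued job that is waiting and has a hold on it.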

November 01, 2007

Grids, grids, grids: Which side of the pond wins?

Dan Ciruli at West Coast Grid writes:

Europe is years ahead of the US in terms of large grids...

Is Europe years ahead of the US?

Open questions that come to mind include:

  • What is a "large" grid?
  • What makes one region "ahead" of another?
  • What makes one region "years" ahead?
  • If one region is years ahead, what are the reasons for it?
  • What of other regions outside of Europe and the US?

Certainly the US and Europe both have some very large grids, so the question is: what was Dan taking into account when making his claim?