LA-MPI
The Los Alamos Message Passing Interface

What's New: Open MPI

LA-MPI is no longer in active development, but is being maintained for use on production systems at LANL, and we welcome other users.

Our future development is focused on the Open MPI project, a new component-based, extensible implementation of MPI-2.

Introduction

LA-MPI is an implementation of the Message Passing Interface (MPI) motivated by a growing need for fault tolerance at the software level in large high-performance computing (HPC) systems.

This need is caused by the vast number of components present in modern HPC systems, particularly clusters. The individual components -- processors, memory modules, network interface cards (NICs), etc. -- are typically manufactured to tolerances adequate for small or desktop systems. When they are aggregated into a large HPC system, however, the system-wide error rate may be too high for a long application run to complete successfully. For example, a network device may have an error rate that is perfectly acceptable in a desktop system but not in a cluster of thousands of nodes, which must run error-free for many hours or even days to complete a scientific calculation.
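The scaling argument can be made concrete with a back-of-the-envelope calculation. The sketch below is only an illustration: it assumes errors are independent across devices and hours, and the per-device error probability and system sizes are hypothetical values, not measurements from any LANL system.

```python
def prob_at_least_one_error(p_per_device_hour: float, devices: int, hours: float) -> float:
    """Chance that some device errs during the run, assuming independent errors."""
    # The run stays clean only if every device survives every hour.
    p_clean_run = (1.0 - p_per_device_hour) ** (devices * hours)
    return 1.0 - p_clean_run


# Hypothetical numbers, chosen for illustration only.
P_ERR = 1e-6  # assumed chance of an undetected error per device per hour

desktop = prob_at_least_one_error(P_ERR, devices=1, hours=48)
cluster = prob_at_least_one_error(P_ERR, devices=4000, hours=48)

print(f"desktop: {desktop:.6f}  cluster: {cluster:.3f}")
```

Under these assumptions the same per-device error rate that is negligible on one machine gives a cluster-wide failure probability orders of magnitude larger over the same run length.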

LA-MPI has two primary goals: network fault tolerance and high performance.

Network fault tolerance is achieved by implementing a highly efficient checksum/retransmission protocol. The integrity of delivered data is (optionally) verified at the user level using a checksum or CRC. Data that is corrupt (or never delivered) is retransmitted.
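The general idea of a checksum/retransmission scheme can be sketched in a few lines. This is not LA-MPI's implementation or wire format: the frame layout, the simulated channel, and all function names below are hypothetical, and CRC-32 from Python's standard zlib module stands in for whichever checksum or CRC is configured.

```python
import random
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (hypothetical frame layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes):
    """Return the payload if its CRC matches, otherwise None."""
    if len(frame) < 4:
        return None
    payload, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == crc else None


def unreliable_channel(frame: bytes, rng: random.Random,
                       drop_prob: float = 0.2, corrupt_prob: float = 0.3):
    """Simulated link: may drop the frame (None) or flip one bit in it."""
    r = rng.random()
    if r < drop_prob:
        return None
    if r < drop_prob + corrupt_prob:
        damaged = bytearray(frame)
        damaged[rng.randrange(len(damaged))] ^= 1 << rng.randrange(8)
        return bytes(damaged)
    return frame


def send_reliably(payload: bytes, rng: random.Random, max_attempts: int = 50):
    """Retransmit until the receiver verifies the checksum."""
    frame = make_frame(payload)
    for attempt in range(1, max_attempts + 1):
        received = unreliable_channel(frame, rng)
        if received is None:
            continue  # never delivered: treat as a timeout and resend
        data = check_frame(received)
        if data is not None:
            return data, attempt  # verified copy delivered
        # corrupt copy: fall through and resend
    raise RuntimeError("gave up after max_attempts")


data, attempts = send_reliably(b"halo exchange block", random.Random(0))
print(data, attempts)
```

In this sketch the sender and receiver live in one function for brevity; a real protocol would carry acknowledgements and sequence numbers between separate processes.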

As for high performance, LA-MPI's lightweight checksum/retransmission protocol allows us to achieve low-latency messaging. Furthermore, the flexible approach taken to the use of redundant data paths in a network-device-rich system leads to high network bandwidth, since different messages and/or message fragments can be sent in parallel along different paths. Also, since LA-MPI is developed for use on the large systems at Los Alamos National Laboratory, we have verified that LA-MPI is scalable to over 3,500 processes.
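Sending fragments of one message along several paths can be pictured as follows. The round-robin policy, the fragment size, and the path names are assumptions made for this sketch; the text above does not say how LA-MPI actually schedules fragments across paths.

```python
def fragment(message: bytes, fragment_size: int):
    """Split a message into (offset, data) fragments."""
    return [(offset, message[offset:offset + fragment_size])
            for offset in range(0, len(message), fragment_size)]


def stripe(fragments, paths):
    """Assign fragments to paths round-robin (assumed policy)."""
    schedule = {path: [] for path in paths}
    for i, frag in enumerate(fragments):
        schedule[paths[i % len(paths)]].append(frag)
    return schedule


def reassemble(schedule, total_length: int) -> bytes:
    """Rebuild the message; offsets make arrival order irrelevant."""
    buffer = bytearray(total_length)
    for frags in schedule.values():
        for offset, data in frags:
            buffer[offset:offset + len(data)] = data
    return bytes(buffer)


message = bytes(range(256)) * 4                # 1024-byte example message
paths = ["path-a", "path-b", "path-c"]         # hypothetical path names
schedule = stripe(fragment(message, 100), paths)
restored = reassemble(schedule, len(message))
print({path: len(frags) for path, frags in schedule.items()})
```

Because each fragment carries its own offset, the receiver can place fragments as they arrive, regardless of which path delivered them or in what order.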

An alternative solution to the network fault tolerance problem is to use the TCP/IP protocol. We believe, however, that this protocol -- developed to handle unreliable, inhomogeneous and oversubscribed networks -- performs poorly and is overly complex for HPC system messaging, and that LA-MPI's lightweight checksum/retransmission protocol is a more appropriate choice.

Features

Platforms

Processors
Intel IA32, Intel IA64, AMD Opteron, PowerPC (G4, G5), Alpha, MIPS
Operating systems
Linux, Linux/Clustermatic, MacOS X, Tru64, IRIX
Interconnects
Shared memory, Ethernet (TCP, UDP), Myrinet (GM), QSNet (Quadrics Elan3), InfiniBand (VAPI), HIPPI-800

Download

The current release of LA-MPI is

Some earlier releases are also available:

LA-MPI is installed in the usual way:

  ./configure [OPTIONS]
  make
  make install

where configure options include

  --enable-debug          enable debugging
  --enable-lsf            use LSF
  --enable-rms            use RMS
  --enable-bproc          use BPROC
  --enable-udp            enable UDP path
  --enable-tcp            enable TCP path
  --enable-qsnet          enable QSNET path
  --enable-gm             enable Myrinet GM path
  --enable-ib             enable InfiniBand path
  --with-romio            include MPI-IO support

Research

LA-MPI is developed by the Application Communications and Performance Research Team of the Advanced Computing Laboratory at LANL. We are actively investigating other aspects of fault tolerance and performance optimization. Topics of current interest include

Many of these ideas will be explored as part of the Open MPI project.

Papers

Also see our Open MPI papers.

Contact

LA-MPI is developed at the Advanced Computing Laboratory of Los Alamos National Laboratory. For more information contact lampi-support@lanl.gov


LANL © 2002-2004 University of California