Stampede Deployment

Warning

This documentation is outdated and included only for reference; it was already out of date and no longer working on Stampede 1.

It also does not cover the Python QueueRunner.

Described here is TonyB’s deployment of ADAPT as a QueueRunner on the Stampede supercomputer. A similar setup can be run on any system, whether a cluster or a stand-alone machine.

MediaFiles

The ADAPT QueueRunner will need access to the MediaFiles for which it will be running jobs.

A copy of the MediaFiles is stored in a shared location for the ‘hipstas’ group on Corral, the long-term redundant storage system. However, compute nodes on Stampede have no access to files on Corral, so the files in use need to be manually copied to the user’s ${SCRATCH} file system.

These files must be manually synced from ARLO to Corral on a periodic basis.

When a test is ready to be run, manually update the copy on Corral, then copy the files in use over from Corral to ${SCRATCH}.

File Locations

  • Shared Mirror of Files on Corral
    • /corral-repl/utexas/hipstas/arlo/user-files
  • User’s local copy for running jobs on Stampede
    • ${SCRATCH}/user-files/

To sync the files, I use a variation of the following script:

#!/bin/bash

# ARLO user whose MediaFiles will be synced
TEMPUSER="PennSound"

echo "=============================================="
echo " Syncing !!! $TEMPUSER !!! from BigD to Corral"
echo "=============================================="

# Mirror the user's .wav files (skipping cache directories) from the ARLO
# server to the shared Corral copy; --bwlimit throttles the transfer
rsync -rlptvz --bwlimit=50000 --stats --progress --delete --include="*.wav" --exclude='*cache*' --filter='-! */' arlo@bigd.ncsa.illinois.edu:/data/user-files/$TEMPUSER /corral-repl/utexas/hipstas/arlo/user-files/

echo "================================================="
echo " Syncing !!! $TEMPUSER !!! from Corral to SCRATCH"
echo "================================================="

# Copy from the Corral mirror into the ${SCRATCH} file system that the
# Stampede compute nodes can reach
rsync -rlptv --stats --progress --delete --include="*.wav" --exclude='*cache*' --filter='-! */' /corral-repl/utexas/hipstas/arlo/user-files/$TEMPUSER ${SCRATCH}/user-files/

ADAPT

Code

ADAPT lives in my home directory at ~/arlo/adapt.

For the initial checkout:

$ mkdir -p ~/arlo/adapt
$ git clone https://bitbucket.org/arloproject/arlo-adapt.git ~/arlo/adapt/

Update as necessary with:

$ cd ~/arlo/adapt/ && git pull

Misc

Apache Ant

Our build scripts use Apache Ant to build the executables. We need to manually download the Ant binaries; I am currently using 1.8.4 (apache-ant-1.8.4-bin.tar.gz), extracted to ~/apache-ant/apache-ant-1.8.4/.
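
A minimal sketch of that setup, assuming the 1.8.4 binary archive is still available from the Apache archive mirror (the URL is an assumption and may have moved):

mkdir -p ~/apache-ant
cd ~/apache-ant
# Assumed download location - adjust if the archive has moved
wget https://archive.apache.org/dist/ant/binaries/apache-ant-1.8.4-bin.tar.gz
tar xzf apache-ant-1.8.4-bin.tar.gz   # extracts to apache-ant-1.8.4/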

Aparapi

We need to install the Aparapi library. I keep this at ~/arlo/aparapi/aparapi-2013-01-23/ (currently using version 2013-01-23).
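
A sketch of the expected layout, assuming the Aparapi binary archive for that date has already been downloaded (the archive name below is hypothetical; what matters is that aparapi.jar and the native libraries end up in the versioned directory):

mkdir -p ~/arlo/aparapi
# Hypothetical archive name - unpack so that aparapi.jar and the
# native libraries land in ~/arlo/aparapi/aparapi-2013-01-23/
unzip aparapi-2013-01-23.zip -d ~/arlo/aparapi/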

Logs

Create our log storage directory:

mkdir -p ~/arlo/log/

Scripts

I have added several scripts for convenience:

  • ~/arlo/clean-java.sh

    #!/bin/bash
    
    CWD="${HOME}/arlo"
    
    # Remove compiled class files left over from a previous build
    rm -r ${CWD}/adapt/bin/arlo/*.class
    
  • ~/arlo/build-java.sh

    #!/bin/bash
    
    ANT_BIN="${HOME}/apache-ant/apache-ant-1.8.4/bin/ant"
    
    CWD="${HOME}/arlo"
    
    # Run the ADAPT Ant build and propagate its exit status
    cd ${CWD}/adapt && ${ANT_BIN} -f adapt_ant_build.xml
    exit $?
    
  • ~/arlo/start-java-oneshot.sh

    #!/bin/bash
    
    ##############################
    # Start just the Java service
    
    #set umask so group can share files
    umask 002
    
    # directory contains nester, adapt, log, etc.
    CWD="${HOME}/arlo"
    
    # Garbage Collector Settings
    GC_SETTINGS=" "
    GC_SETTINGS="${GC_SETTINGS} -XX:MaxHeapFreeRatio=20"
    GC_SETTINGS="${GC_SETTINGS} -XX:MinHeapFreeRatio=5"
    GC_SETTINGS="${GC_SETTINGS} -Xincgc"
    GC_SETTINGS="${GC_SETTINGS} -verbose:gc"
    GC_SETTINGS="${GC_SETTINGS} -XX:+PrintGCDetails"
    
    # Log Files
    NOW=$(date +"%d%b%y.%H%M%S")
    ARLO_JAVA_LOG=${CWD}/log/arlo.java-${NOW}.out
    ARLO_JAVA_LOG_CURRENT=${CWD}/log/arlo.java-current.out
    touch $ARLO_JAVA_LOG
    ln -sf $ARLO_JAVA_LOG $ARLO_JAVA_LOG_CURRENT
    
    # Java Paths
    export JAVA_HOME=/usr/java/default/
    APARAPI_LIB_PATH="${CWD}/aparapi/aparapi-2013-01-23"
    export PATH=$PATH:${JAVA_HOME}/bin:$APARAPI_LIB_PATH
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$APARAPI_LIB_PATH
    JAVA_BIN=${JAVA_HOME}/bin/java
    
    export CLASSPATH="${CLASSPATH}:${APARAPI_LIB_PATH}/aparapi.jar"
    
    # Build Java ClassPaths
    export CWD
    source ${CWD}/adapt/classpath-defs.sh
    
    RUN_ONE_SHOT="-DADAPT.QueueRunnerOneShot=true"
    
    PROCESS_DESCRIPTION="Stampede OneShot QueueRunner - JobId: ${SLURM_JOB_ID}"
    
    ${JAVA_BIN} -server -Xms16G -Xmx16G \
      -DARLO_CONFIG_FILE=ArloSettings.properties \
      -DADAPT.processDescription="${PROCESS_DESCRIPTION}" \
      -DADAPT.userFilesDirectoryPath=${SCRATCH}/user-files \
      -DADAPT.mediaRootDirectoryPath=${HOME}/arlo/nester/media \
      ${RUN_ONE_SHOT} \
      ${GC_SETTINGS} \
      -Djava.library.path=${APARAPI_LIB_PATH} \
      -Dcom.amd.aparapi.executionMode=JTP \
      -classpath ${CLASSPATH} \
      arlo.ServiceHead
    

Updating and Building

The Java build has to happen on one of the Stampede compute nodes, not on the login nodes themselves. The following session shows the steps that run on the login node and, via an interactive session, on a compute node.

login4.stampede(50)$ cd ~/arlo/adapt/
login4.stampede(51)$ git pull
Already up-to-date.
# Now we login to the compute node
login4.stampede(52)$ srun --pty -p development -t 10:00 -n1 /bin/bash -l
-----------------------------------------------------------------
              Welcome to the Stampede Supercomputer
-----------------------------------------------------------------

# <snip>

# Clean if we have an existing build
c557-402.stampede(7)$ ${HOME}/arlo/clean-java.sh
# Build
c557-402.stampede(8)$ ${HOME}/arlo/build-java.sh

# <snip>

BUILD SUCCESSFUL
Total time: 3 seconds
c557-402.stampede(9)$ exit
logout
# Back to the Login Node
login4.stampede(53)$

Settings

As shown in the script above, I keep the ARLO settings in ~/arlo/adapt/ArloSettings.properties. Note that it can be useful to maintain several different settings files, along with a matching launch script for each configuration.
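
As a hypothetical example (the file names below are illustrative; the -DARLO_CONFIG_FILE property is the one already used in the launch script):

# Create a variant settings file for an alternate configuration
cp ~/arlo/adapt/ArloSettings.properties ~/arlo/adapt/ArloSettings-dev.properties

# Copy the launch script and edit its java invocation to pass
#   -DARLO_CONFIG_FILE=ArloSettings-dev.properties
cp ~/arlo/start-java-oneshot.sh ~/arlo/start-java-oneshot-dev.sh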

Launching Stampede Jobs

I use Slurm’s batch processing tools for launching jobs on Stampede.

However, due to memory leaks in ADAPT, we don’t want to run a large number of tasks in one process. Instead, we use a workaround that re-launches ADAPT for each task.

You need to ensure that you have specified a limited number of OneShot tasks in ArloSettings.properties:

QueueRunnerOneShotMaxTasks=1

Run Script

  • ~/arlo/run-oneshot-x25.sh

    #!/bin/bash
    
    # Run 25 iterations of OneShot
    
    cd ${HOME}/arlo
    for i in {1..25}
    do
            ./start-java-oneshot.sh
    done
    
  • ~/arlo_batch_oneshot.sh

    #!/bin/bash
    #SBATCH -J arlo           # job name
    #SBATCH -o arloout.o%j       # output and error file name (%j expands to jobID)
    #SBATCH -p normal     # queue (partition) -- normal, development, etc.
    #SBATCH -n 1
    #SBATCH -N 1
    #SBATCH -t 03:55:00        # run time (hh:mm:ss)
    #SBATCH --mail-user=<ENTER YOUR EMAIL HERE>
    #SBATCH --mail-type=all
    ${HOME}/arlo/run-oneshot-x25.sh
    

Now we can launch a node and run our job with:

sbatch arlo_batch_oneshot.sh

Each invocation queues one node. We can submit several at once with something like:

for i in {1..5}; do sbatch arlo_batch_oneshot.sh; sleep 1; done
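
Once submitted, the jobs can be monitored with the standard Slurm tools, for example:

# List your queued and running jobs
squeue -u $USER

# Cancel a job by ID if needed
scancel <jobid>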