This tutorial is designed to teach you about 90% of what you need to know about LoonyBin as quickly as possible. You should pick and choose the topics that best fit your interests. Also, you might want to consider printing this tutorial unless you have a dual monitor setup since you might be flipping back and forth fairly often. If you run into any problems, please let us know on the Mailing List.
In general, I find that using the visual designer for initially setting up complex workflows is much more efficient and results in fewer design errors. And I normally don’t like GUIs. The text-based workflow format and console interface is then the best way to go when small changes need to be made during development.
WARNING: This tutorial is currently in flux in preparation for the V0.6.0 release. You have been warned.
Tutorial | This tutorial is designed to teach you about 90% of what you need to know about LoonyBin as quickly as possible. |
Downloading and Installing on the Design Machine | This information is also available in the YouTube video: http://www.youtube.com/watch?v=HEm4Mj72LDM |
Compiling and Running a Workflow in a Single Command on the Home Machine | Okay, almost a single command... |
Opening and Understanding an Example Workflow | After Downloading and Installing on the Design Machine... |
Running a Workflow Synchronously on the Design Machine | This is perhaps the fastest way of seeing if you’re interested in LoonyBin, but not something you’ll probably want to do with it day-to-day. |
Understanding Workflow Output | Okay, so it worked! |
Downloading and Installing on the Home Machine | This section explains how to get the Home Machine (the machine where you will actually execute the generated script) ready for running the example. |
Preparing to Run a Workflow Synchronously on the Home Machine | The following information is also available in the YouTube video: http://www.youtube.com/watch?v=_akylEUCIQU |
Running a Workflow Synchronously on the Home Machine | Though typically you will probably want to run workflows asynchronously (all vertices with their dependencies satisfied will be run in parallel), there are some situations where you might want to run the workflow synchronously (one vertex at a time). |
Running a Workflow Asynchronously on the Home Machine | For a comparison of synchronous vs asynchronous workflows, read the introduction to Running a Workflow Synchronously on the Home Machine. |
Creating a Trivial Workflow | In this section, we will recreate the example workflow that we’ve been using earlier in the tutorial. |
Creating a Trivial HyperWorkflow | You should have already looked at Running a Workflow Synchronously on the Home Machine or Running a Workflow Asynchronously on the Home Machine. |
Dealing with Lots of Realizations | By now, you should have seen Creating a Trivial HyperWorkflow, or else you won’t have many realizations to work with. |
Downloading and Installing on Other Remote Machines | LoonyBin distinguishes the design machine (where you define your workflow) from the home machine (the primary machine that executes your workflow, such as Your Favorite Server) and from other remote machines. |
Creating a Workflow that Runs on Multiple Machines and/or Schedulers | By now you should have seen Creating a Trivial Workflow and Running a Workflow Asynchronously on the Home Machine, and you should have just followed Downloading and Installing on Other Remote Machines... |
Creating and Using Your Own Tool Descriptor | New tools are written in the Python programming language by implementing the Tool interface. |
Creating and Maintaining a Toolpack for Your Organization | TODO... |
This information is also available in the YouTube video: http://www.youtube.com/watch?v=HEm4Mj72LDM
We will be installing LoonyBin on the Design Machine, the machine where you will design workflows (e.g. a personal laptop or a workstation).
(Updated as of V0.5.0)
Okay, almost a single command...
# Go to wherever you installed LoonyBin
# (e.g. /Applications/LoonyBin)
$ cd $LOONYBIN

# Edit the example hyperworkflow to reflect your preferred directory structure
$ emacs tool-packs/example/workflows/example.hwork
# Compile and run the workflow
loon tool-packs/example/workflows/example.hwork
If you want to see what’s going on in a nifty web interface on port 4242 and keep that web UI running even after the workflow completes, you can use:
loon tool-packs/example/workflows/example.hwork --launchUI 4242 --runForever
(Updated as of V0.6.0)
After Downloading and Installing on the Design Machine...
You might have noticed that each of these vertices’ names begin with a number. This is just to make the final output on the file system easy to view in chronological order. The numbering scheme is similar to BASIC; this is only a preference of the author and LoonyBin does not prevent you from using whatever convention you like.
Next: Running a Workflow Synchronously on the Design Machine
This is perhaps the fastest way of seeing if you’re interested in LoonyBin, but not something you’ll probably want to do with it day-to-day. In this scenario, the design machine will also be used as the home machine, where the workflow actually executes. However, since LoonyBin requires that the home machine have some flavor of UNIX installed, it only works on Mac and Linux. Sorry Windows users, you’ll need to go to Running a Workflow Synchronously on the Home Machine to give LoonyBin a try.
LoonyBin dynamically determines where the necessary binaries are for each tool in a workflow. This allows the binaries to live at different paths on each machine you might execute your workflow on. The only tools LoonyBin needs for this example workflow are its helper scripts, which are already installed on the Design Machine at $LOONYBIN/scripts. The generated script will look for this path inside a file called $PATH_DIRECTORY/loonybin-scripts.path, so we’ll create this file using a UNIX shell:
# Go to wherever you installed LoonyBin
# (e.g. /Applications/LoonyBin)
$ cd $LOONYBIN

# Make a $PATH_DIRECTORY
$ mkdir paths
$ cd scripts

# Put the current path into the loonybin-scripts.path file
$ pwd > ../paths/loonybin-scripts.path
$ cd ../paths

# Take a look at the path file
$ cat loonybin-scripts.path

# Remember your path directory, you'll need it later!
$ pwd
You only need to do this once per machine you install LoonyBin on.
For more information on path files, see How does LoonyBin find the tools installed on each machine?.
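As a mental model, the lookup a generated script performs can be sketched in a few lines. Only the path-file layout (one directory path per *.path file in $PATH_DIRECTORY) comes from this tutorial; the function name and error handling below are our own illustration, not LoonyBin code.

```python
import os

def resolve_tool_path(path_directory, tool_name):
    """Read <path_directory>/<tool_name>.path and return the directory it names."""
    path_file = os.path.join(path_directory, tool_name + '.path')
    with open(path_file) as f:
        tool_dir = f.read().strip()
    if not os.path.isdir(tool_dir):
        raise IOError('%s points to a missing directory: %s' % (path_file, tool_dir))
    return tool_dir
```

Since the file just contains the output of `pwd`, this is essentially `cat` plus a sanity check.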
Okay, so it worked! ...what did it do? To see, go to the “base directory” you specified when generating the script.
First, it created an output directory for each vertex. And a “default” realization directory under each vertex directory since there was only one realization in our workflow. Using your trusty shell, have a look at:
ls -lh FS-input-files/default/final/example.txt
which is a symlink to the input file you specified. Similarly, you can find the output of the next 2 vertices with:
less 100-head-10/default/final/corpusOut
less 200-head-5/default/final/corpusOut
They should contain the first 10 and first 5 lines of the previous file. Next to the “final” directory, you will also see a “working” directory, which contains all of the intermediate files the vertex produced while running. There is also an “inputs” and “outputs” directory under “working.”
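The per-vertex layout just described can be written down as a tiny helper, purely for illustration (LoonyBin itself does not expose such a function; the directory names come from this tutorial):

```python
import os

def vertex_dirs(base_dir, vertex, realization='default'):
    """Return the directories LoonyBin creates under one vertex/realization."""
    root = os.path.join(base_dir, vertex, realization)
    return {
        'final':   os.path.join(root, 'final'),               # finished outputs
        'working': os.path.join(root, 'working'),             # intermediate files
        'inputs':  os.path.join(root, 'working', 'inputs'),
        'outputs': os.path.join(root, 'working', 'outputs'),
    }
```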
For more info, see Where did the output go? What directory structure does LoonyBin create?.
LoonyBin also produced log files for each step. Each log file contains the log files of all previous vertices as well, since determining a vertex’s dependencies can be non-trivial in HyperWorkflows. For example, have a look at the last loon log:
less 200-head-5-default.loon
Notice that the log file contains information about what machine the vertex ran on, any parameters of the vertex, how long it took the vertex to run, and any logging information from pre/post-analyzers (discussed later).
For more info, see A few words on loon logs.
This section explains how to get the Home Machine (the machine where you will actually execute the generated script) ready for running the example. This section assumes you have already followed Downloading and Installing on the Design Machine.
scp -r $LOONYBIN/scripts user@my.machine:~/loonybin/scripts
ssh $HOME_MACHINE

# $SCRIPTS is the directory where you
# copied the scripts (e.g. ~/loonybin/scripts)
chmod +x $SCRIPTS/*

# Remember the location of this $PATH_DIRECTORY for later
mkdir -p ~/myPathDirectory
cd $SCRIPTS
pwd > ~/myPathDirectory/loonybin-scripts.path
loonybin-scripts.path is a special file that LoonyBin reads to determine the location of its helper scripts on each machine.
For more information on path files, see How does LoonyBin find the tools installed on each machine?.
For non-trivial workflows (such as those in the Machine Translation Toolpack), you will also need to install other dependencies on the home machine (for example, using the “install-dependencies.py” script included in the MT Toolpack). Doing this yourself is covered in more detail in Creating and Using Your Own Tool Descriptor.
(If you won’t be invoking generated workflow scripts directly from this machine -- i.e. this is a “Remote Machine” -- you can stop here)
Next: Preparing to Run a Workflow Synchronously on the Home Machine
The following information is also available in the YouTube video: http://www.youtube.com/watch?v=_akylEUCIQU
After following Opening and Understanding an Example Workflow and Downloading and Installing on the Home Machine...
cd $LOONYBIN
scp -r tool-packs/example/example-input user@myMachine.org:/usr13/jhclark/example-input
You might also want to open example.txt to see for yourself what’s inside (just line numbers).
Next: Running a Workflow Synchronously on the Home Machine or Running a Workflow Asynchronously on the Home Machine
Though you will typically want to run workflows asynchronously (all vertices whose dependencies are satisfied run in parallel), there are some situations where you might want to run a workflow synchronously (one vertex at a time). First, a synchronous workflow runs from a single monolithic generated bash script, whereas asynchronous workflows are a zipped collection of shell scripts and metadata that must be run by the LoonyBin workflow executor. Since asynchronous workflows are run using the executor, they also gain the executor’s web UI, which displays the current status of the workflow and allows easy access to its output.
After following Preparing to Run a Workflow Synchronously on the Home Machine...
cd /directory/where/i/put/the/generated/script
scp example.sh user@myMachine.org:~
ssh user@myMachine.org
cd ~
bash example.sh -run
For a comparison of synchronous vs asynchronous workflows, read the introduction to Running a Workflow Synchronously on the Home Machine.
After following Preparing to Run a Workflow Synchronously on the Home Machine...
NOTE: To run the asynchronous version, we will use the generated .async file instead of the .sh file.
cd /directory/where/i/put/the/generated/script
scp example.async user@myMachine.org:~
ssh user@myMachine.org
cd ~
$LOONYBIN/LoonyBinWorkflowExecutor.sh example.async 4242
In this section, we will recreate the example workflow that we’ve been using earlier in the tutorial.
The filesystem vertex is also one point of integration with HDFS (the Hadoop Distributed File System). You can tell LoonyBin that it should look for a file on HDFS instead of on the UNIX filesystem by just prefixing the input name with “hdfs:” -- For example, if example.txt were located on HDFS, we could write “hdfs:example.txt” and then just put its path on HDFS in the path box to its right.
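The prefix convention is easy to picture: strip the leading “hdfs:” if present and treat the remainder as the file name on that filesystem. The helper below is a hypothetical sketch of that interpretation; the real parsing lives inside LoonyBin.

```python
def parse_input_name(name):
    """Split an input name into (filesystem, filename) per the hdfs: convention."""
    if name.startswith('hdfs:'):
        return 'hdfs', name[len('hdfs:'):]
    return 'unix', name
```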
You should be seeing some strong parallels with Opening and Understanding an Example Workflow by this point.
You should have already looked at Running a Workflow Synchronously on the Home Machine or Running a Workflow Asynchronously on the Home Machine. Now we’ll pick up where we left off after Creating a Trivial Workflow.
In the context of research, a workflow typically encodes a set of steps under a fixed set of conditions that leads to a single experimental result (e.g. a vector of 3 numbers). However, good empirical research should always include at least a control group or a baseline system that represents the current state-of-the-art, and usually there is not a fixed set of experimental conditions, but many different tools, parameters, and hyperparameters are tried before settling on a result. This is where HyperWorkflows become useful. They encode multiple experimental conditions while not requiring you to duplicate shared paths in your definition of the workflow nor while the workflow executes.
In LoonyBin, HyperWorkflows are just workflows that contain packing vertices (a.k.a. OR vertices), which act like switches (or multiplexers, for the electrical engineers among you) between their incoming edges. We call a path through a workflow in which we have selected one named input for each packing vertex a realization. A HyperWorkflow encodes many paths (a.k.a. realizations) through the workflow in a compact form (like forests or hypergraphs, for the natural language processing people among you). When there are multiple packing vertices in a row, LoonyBin takes the cross-product between them, producing results for all combinations of these different conditions. You also have the option of restricting the cross-product to some set of experiments you care about by using the Realization Selection mechanism, described in Dealing with Lots of Realizations.
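The cross-product behavior can be illustrated with plain Python. The vertex and edge names below are made up for illustration; this is not LoonyBin code.

```python
from itertools import product

# Named inputs to two hypothetical packing vertices
first_vertex = ['head-10', 'head-2', 'head-4']
second_vertex = ['baseline', 'experimental']

# Each combination of choices is one realization
realizations = ['+'.join(combo) for combo in product(first_vertex, second_vertex)]
# 3 choices x 2 choices = 6 realizations
```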
In the past, workflow management systems have dealt with this issue by having the user (that’s you), copy their entire configuration file for the workflow, change a variable or two and rerun. The best of these systems can automatically determine what dependencies were run in previous experiments and the worst rerun the entire workflow again. However, they all suffer from the problem of clone-and-modify programming (ridiculed as “inheritance” in Abject-Oriented Programming: http://typicalprogrammer.com/?p=8). If a bug was in the original configuration, it is now in all of its clones, too. Good luck finding them all. (AND all of the bug-ridden files they created!)
Going back to our trivial example, let’s say we want to test the effect of head-5 on 3 inputs, one with head-10 of example.txt, one with head-2 of example.txt, and one with head-4 of example.txt.
At this point, you have exactly recreated the trivial workflow from before. Because the incoming edge of the packing vertex has the special name “default” it will not affect the output directory structure of the workflow. Now let’s add the other two realizations so that we get more results:
Your HyperWorkflow is now complete. To generate a script that implements some or all of the realizations it contains, you will need to create more Realization Selections, which we will discuss in the next section of the tutorial.
After you execute this workflow, you will notice that the 200-head-5 vertex directory under the base directory now contains multiple subdirectories, one for each realization. Also, if that vertex had child vertices, they would also contain a subdirectory for each realization.
By now, you should have seen Creating a Trivial HyperWorkflow, or else you won’t have many realizations to work with.
If a workflow contains 3 packing vertices with 4 input names each, this is 4 x 4 x 4 = 64 realizations for the final vertex. If that vertex takes a day to run, running all possible experiments may be unacceptable.
Realization Sets are the way that LoonyBin handles the exponential explosion in the number of realizations that you get when introducing multiple packing vertices into a workflow. A realization set allows you to select a few vertices as “goals” so that all vertices necessary to run those goals will also be run, as well as select a few realization instances (the names of the edges going into packing vertices) that will be run for those goals. However, you must select at least one realization instance per packing vertex or else there will be no complete path to the goal and you won’t get any output.
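The restriction can be pictured as a filter over the cross-product. The instance names and the particular selection below are invented for illustration:

```python
from itertools import product

# 3 packing vertices with 4 named instances each: 4**3 = 64 realizations
instances = ['a', 'b', 'c', 'd']
all_realizations = list(product(instances, instances, instances))

# A realization set keeps at least one instance per packing vertex
keep = ({'a'}, {'a', 'b'}, {'a'})
selected = [r for r in all_realizations
            if all(choice in allowed for choice, allowed in zip(r, keep))]
# Only 1 x 2 x 1 = 2 realizations actually run
```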
Now you’re ready to generate and run a bash script, just as we did in Preparing to Run a Workflow Synchronously on the Home Machine and Running a Workflow Asynchronously on the Home Machine.
Next: Downloading and Installing on Other Remote Machines (In preparation for Creating a Workflow that Runs on Multiple Machines and/or Schedulers)
LoonyBin distinguishes the design machine (where you define your workflow) from the home machine (the primary machine that executes your workflow, such as Your Favorite Server) and from other remote machines. A remote machine is reachable from the home machine via passwordless SSH.
Follow the instructions for Downloading and Installing on the Home Machine, but you don’t need to copy the workflow executor to a remote machine, since the executor only runs on the Home Machine.
You’ll also need to make sure the remote machine can be reached from the home machine via passwordless SSH. If you need help with this, I recommend the tutorial at http://www.debian-administration.org/articles/152. You can either use a blank passphrase if you’re confident of your filesystem’s security or the more complicated ssh-agent solution. It’s up to you. But in the end, you should be able to issue these commands from the home machine without being prompted for a password:
you@homeMachine$ ssh you@remoteMachine
you@remoteMachine$ echo 'Look ma, no password'
Next: Creating a Workflow that Runs on Multiple Machines and/or Schedulers
By now you should have seen Creating a Trivial Workflow and Running a Workflow Asynchronously on the Home Machine, and you should have just followed Downloading and Installing on Other Remote Machines...
Modern scientific computing needs a lot of horsepower. More than most research groups or even small companies can afford on their own. This often leads to getting grants for or renting computing power on academic or commercial clusters. Sometimes even multiple clusters. For instance, you might choose to run one hackish tool that requires you to have root on your local server, run a MapReduce job on a Hadoop cluster with 1000 nodes and 1GB RAM per node, and then run another tool on a cluster with 20 nodes and 128GB RAM per node. Each cluster might even have its own job scheduler, which requires you to submit your jobs to a head node in different ways. Yet it’s painful to copy all of these files between machines and clusters while still keeping your workflow organized. Not so with LoonyBin.
In LoonyBin, you can define multiple machine configurations. Each machine configuration can specify (1) a hostname, (2) a scheduler, and (3) any parameters the scheduler/ssh client needs to know to run the job, such as walltime or physical memory required. Currently, LoonyBin supports the Torque scheduler and has some experimental support for Sun Grid Engine (SGE) and Condor. It is possible and common to have multiple machine configurations for a single machine, especially when using a scheduler that requires you to specify how much walltime will be used. Also, when using a scheduler, either your home machine or the remote machine should be a head node of the cluster (a node where you can submit jobs to the scheduler). You can easily write your own submitter by writing some Java code and inheriting from loony.submitters.ChildSubmitter; complaining on the mailing list might also produce results.
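To make the three pieces concrete, here is a hypothetical illustration of a few machine configurations as plain Python data. The field names and values are ours, not LoonyBin’s actual configuration format.

```python
machine_configs = {
    'local-server':      {'hostname': 'my.server.org',
                          'scheduler': None, 'params': {}},
    'big-cluster-short': {'hostname': 'head.cluster.org',
                          'scheduler': 'torque',
                          'params': {'walltime': '2:00:00', 'mem': '4gb'}},
    'big-cluster-long':  {'hostname': 'head.cluster.org',
                          'scheduler': 'torque',
                          'params': {'walltime': '48:00:00', 'mem': '120gb'}},
}
```

Note the two cluster entries share a hostname: multiple configurations per machine are exactly what you want when a scheduler requires walltime up front.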
Now you’re ready to generate and run a bash script, just as we did in Preparing to Run a Workflow Synchronously on the Home Machine and Running a Workflow Asynchronously on the Home Machine.
After executing the workflow, you’ll notice that the log files for the 100-head-10 step that ran remotely still got copied back to your base directory on the home machine. Also, there will be a 100-head-10 directory in the base directory, which contains a “list-files” script that records the machine and directory where that vertex was run. For kicks, you can ssh to the remote machine and poke around the target work directory to see what was created there as well; it very closely resembles the base directory on the home machine.
New tools are written in the Python programming language by implementing the Tool interface. For simple tools, this means just copying the example template and making 1-2 minutes worth of modifications.
These tools should be placed in a subdirectory of your LoonyBin tool-packs directory. We recommend having a tool pack subdirectory for each genre of project you work on or for each group you wish to share your tools with. That way, each subdirectory can be synchronized and shared via a version control system.
For example, if you’re working with the Machine Translation toolpack, you will have a subdirectory of tool-packs called machine-translation. Tool descriptors you wish to contribute back to the community should be placed in this tool pack. If there are tools you also use internally for your organization and do not wish to release, or you have tools for some other purpose (e.g. Computational Biology), you should place these in a separate tool pack directory.
We will walk through an example of a simple tool, commenting on each section of the file.
from loonybin import Tool

class MyTool(Tool):
Just some boilerplate code.
    def getName(self):
        return 'Machine Translation/Parallel Corpus/Head'
Gives the tool a name for use in the LoonyBin Workflow Designer GUI. The slash-separated elements are used to create a hierarchical tree structure for selecting tools.
    def getDescription(self):
        return ("Takes an input parallel corpus and provides the "
                "first n lines of each side of the corpus as output.")
More information for display in the GUI.
    def getRequiredPaths(self):
        return ['location-of-head']
Returns a list of strings that correspond to the path files that will be queried for the location of the required paths to execute this tool on the target machine. All files in the required directory will be symlinked into the vertex working directory of this tool so that they can be used by the commands generated in getPreAnalyzers(), getCommands(), and getPostAnalyzers().
    def getParamNames(self):
        return [('nLines', 'the number of lines that should be taken from the '
                           'head of each side of the parallel corpus')]
Returns a list of pair tuples. The first element of each tuple is the name of the parameter that the workflow designer must provide a value for. The second element of each tuple is a short description that will be used in the GUI.
    def getInputNames(self, params):
        return [('fCorpusIn', 'source side of the corpus to be read'),
                ('eCorpusIn', 'target side of the corpus to be read')]
Similar to getParamNames(), returns a list of pair tuples. The params argument is a dictionary (Python-ese for a map or hash table) with the param names specified in getParamNames() as its keys and the user-specified values as its values.
    def getOutputNames(self, params):
        return [('fCorpusOut', 'source side of the corpus to be written'),
                ('eCorpusOut', 'target side of the corpus to be written')]
The output counterpart to getInputNames().
    def getPreAnalyzers(self, params, inputs):
        return ['AnalyzeParCorpus.sh %(fCorpusIn)s %(eCorpusIn)s' % inputs]
Returns a list of strings that should be executed on the command line, given dictionaries of the params and input files. This example uses standard Python string formatting. For more information on formatting strings in Python visit http://www.python.org
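If the dictionary-based %-formatting is unfamiliar, here is what it does (the file names are made up):

```python
# Each %(key)s placeholder is looked up in the dictionary on the right of %
outputs = {'fCorpusOut': 'out.f', 'eCorpusOut': 'out.e'}
cmd = 'AnalyzeParCorpus.sh %(fCorpusOut)s %(eCorpusOut)s' % outputs
# cmd is now 'AnalyzeParCorpus.sh out.f out.e'
```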
    def getCommands(self, params, inputs, outputs):
        return ['head.sh -n %(nLines)s ' % params +
                '< %(fCorpusIn)s ' % inputs +
                '> %(fCorpusOut)s' % outputs,
                'head.sh -n %(nLines)s ' % params +
                '< %(eCorpusIn)s ' % inputs +
                '> %(eCorpusOut)s' % outputs]
Returns a list of strings, the command lines that are the “main event” of this tool.
    def getPostAnalyzers(self, params, inputs, outputs):
        return ['AnalyzeParCorpus.sh %(fCorpusOut)s %(eCorpusOut)s' % outputs]
The post-command execution counterpart to getPreAnalyzers().
if __name__ == '__main__':
    MyTool().handle()
A bit of boilerplate code that tells the Python interpreter to use the handle() method in LoonyBin’s Tool class, which reads stdin and writes stdout to communicate information about this tool to the LoonyBin framework.
Note that you will now need to install the software for this tool descriptor on each home and remote machine that you wish to run it on. In this example, you would need to install the “head.sh” command on the home machine in any directory you choose and then create a path file for it. It might look something like this:
INSTALL_DIR=/some/path
mkdir $INSTALL_DIR
cd $INSTALL_DIR

# Get the head.sh from some source
wget head.sh

pwd > $PATH_DIR/location-of-head.path
Here, PATH_DIR is the LoonyBin path directory for this machine. Notice that location-of-head was defined in the getRequiredPaths() function of the tool descriptor we just created.
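Putting the snippets together, here is a self-contained sketch of the whole descriptor. The Tool base class below is only a stub standing in for loonybin.Tool (the real handle() protocol is not reproduced), so this sketch runs without LoonyBin installed; the method names follow the tutorial.

```python
class Tool(object):
    """Stub for loonybin.Tool -- for illustration only."""

class MyTool(Tool):
    def getName(self):
        return 'Machine Translation/Parallel Corpus/Head'

    def getRequiredPaths(self):
        return ['location-of-head']

    def getParamNames(self):
        return [('nLines', 'lines to take from the head of each side')]

    def getInputNames(self, params):
        return [('fCorpusIn', 'source side'), ('eCorpusIn', 'target side')]

    def getOutputNames(self, params):
        return [('fCorpusOut', 'source side'), ('eCorpusOut', 'target side')]

    def getCommands(self, params, inputs, outputs):
        # One head.sh invocation per side of the parallel corpus
        return ['head.sh -n %(nLines)s ' % params +
                '< %(fCorpusIn)s ' % inputs +
                '> %(fCorpusOut)s' % outputs,
                'head.sh -n %(nLines)s ' % params +
                '< %(eCorpusIn)s ' % inputs +
                '> %(eCorpusOut)s' % outputs]

# Exercise the descriptor with hypothetical file names
tool = MyTool()
cmds = tool.getCommands({'nLines': '10'},
                        {'fCorpusIn': 'in.f', 'eCorpusIn': 'in.e'},
                        {'fCorpusOut': 'out.f', 'eCorpusOut': 'out.e'})
```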
(Updated as of V0.5.0)