This tutorial is designed to teach you about 90% of what you need to know about LoonyBin as quickly as possible. You should pick and choose the topics that best fit your interests. Also, you might want to consider printing this tutorial unless you have a dual monitor setup since you might be flipping back and forth fairly often. If you run into any problems, please let us know on the Mailing List.
In general, I find that using the visual designer for initially setting up complex workflows is much more efficient and results in fewer design errors. And I normally don’t like GUIs. The text-based workflow format and console interface is then the best way to go when small changes need to be made during development.
WARNING: This tutorial is currently in flux in preparation for the V0.6.0 release. You have been warned.
Tutorial | This tutorial is designed to teach you about 90% of what you need to know about LoonyBin as quickly as possible. |
Downloading and Installing on the Design Machine | This information is also available in the YouTube video: http://www.youtube.com/watch?v=HEm4Mj72LDM |
Compiling and Running a Workflow in a Single Command on the Home Machine | Okay, almost a single command... |
Opening and Understanding an Example Workflow | After Downloading and Installing on the Design Machine... |
Running a Workflow Synchronously on the Design Machine | This is perhaps the fastest way of seeing if you’re interested in LoonyBin, but not something you’ll probably want to do with it day-to-day. |
Understanding Workflow Output | Okay, so it worked! |
Downloading and Installing on the Home Machine | This section explains how to get the Home Machine (the machine where you will actually execute the generated script) ready for running the example. |
Preparing to Run a Workflow Synchronously on the Home Machine | The following information is also available in the YouTube video: http://www.youtube.com/watch?v=_akylEUCIQU |
Running a Workflow Synchronously on the Home Machine | Though typically you will probably want to run workflows asynchronously (all vertices with their dependencies satisfied will be run in parallel), there are some situations where you might want to run the workflow synchronously (one vertex at a time). |
Running a Workflow Asynchronously on the Home Machine | For a comparison of synchronous vs asynchronous workflows, read the introduction to Running a Workflow Synchronously on the Home Machine. |
Creating a Trivial Workflow | In this section, we will recreate the example workflow that we’ve been using earlier in the tutorial. |
Creating a Trivial HyperWorkflow | You should have already looked at Running a Workflow Synchronously on the Home Machine or Running a Workflow Asynchronously on the Home Machine. |
Dealing with Lots of Realizations | By now, you should have seen Creating a Trivial HyperWorkflow, or else you won’t have many realizations to work with. |
Downloading and Installing on Other Remote Machines | LoonyBin distinguishes the design machine (where you define your workflow) from the home machine (the primary machine that executes your workflow, such as Your Favorite Server) and from other remote machines. |
Creating a Workflow that Runs on Multiple Machines and/or Schedulers | By now you should have seen Creating a Trivial Workflow and Running a Workflow Asynchronously on the Home Machine, and you should have just followed Downloading and Installing on Other Remote Machines... |
Creating and Using Your Own Tool Descriptor | New tools are written in the Python programming language by implementing the Tool interface. |
Creating and Maintaining a Toolpack for Your Organization | TODO... |
This information is also available in the YouTube video: http://www.youtube.com/watch?v=HEm4Mj72LDM
We will be installing LoonyBin on the Design Machine, the machine where you will design workflows (e.g. a personal laptop or a workstation).
(Updated as of V0.5.0)
Okay, almost a single command...
# Go to wherever you installed LoonyBin
# (e.g. /Applications/LoonyBin)
$ cd $LOONYBIN

# Edit the example hyperworkflow to reflect your preferred directory structure
$ emacs tool-packs/example/workflows/example.hwork
# Compile and run the workflow
loon tool-packs/example/workflows/example.hwork
If you want to see what’s going on in a nifty web interface on port 4242 and keep that web UI running even after the workflow completes, you can use:
loon tool-packs/example/workflows/example.hwork --launchUI 4242 --runForever
(Updated as of V0.6.0)
After Downloading and Installing on the Design Machine...
You might have noticed that each of these vertices’ names begin with a number. This is just to make the final output on the file system easy to view in chronological order. The numbering scheme is similar to BASIC; this is only a preference of the author and LoonyBin does not prevent you from using whatever convention you like.
Next: Running a Workflow Synchronously on the Design Machine
This is perhaps the fastest way of seeing if you’re interested in LoonyBin, but not something you’ll probably want to do with it day-to-day. In this scenario, the design machine will also be used as the home machine, where the workflow actually executes. However, since LoonyBin requires that the home machine have some flavor of UNIX installed, it only works on Mac and Linux. Sorry Windows users, you’ll need to go to Running a Workflow Synchronously on the Home Machine to give LoonyBin a try.
LoonyBin dynamically determines where the necessary binaries are for each tool in a workflow. This allows the binaries to live at different paths on each machine you might execute your workflow on. The only tools LoonyBin needs for this example workflow are its helper scripts, which are already installed on the Design Machine at $LOONYBIN/scripts. The generated script will look for this path inside a file called $PATH_DIRECTORY/loonybin-scripts.path, so we’ll create this file using a UNIX shell:
# Go to wherever you installed LoonyBin
# (e.g. /Applications/LoonyBin)
$ cd $LOONYBIN

# Make a $PATH_DIRECTORY
$ mkdir paths
$ cd scripts

# Put the current path into the loonybin-scripts.path file
$ pwd > ../paths/loonybin-scripts.path
$ cd ../paths

# Take a look at the path file
$ cat loonybin-scripts.path

# Remember your path directory, you'll need it later!
$ pwd
You only need to do this once per machine you install LoonyBin on.
For more information on path files, see How does LoonyBin find the tools installed on each machine?.
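As a mental model, the lookup a generated script performs can be sketched in a few lines. Only the path-file layout (one directory path per *.path file in $PATH_DIRECTORY) comes from this tutorial; the function name and error handling below are our own illustration, not LoonyBin code.

```python
import os

def resolve_tool_path(path_directory, tool_name):
    """Read <path_directory>/<tool_name>.path and return the directory it names."""
    path_file = os.path.join(path_directory, tool_name + '.path')
    with open(path_file) as f:
        tool_dir = f.read().strip()
    if not os.path.isdir(tool_dir):
        raise IOError('%s points to a missing directory: %s' % (path_file, tool_dir))
    return tool_dir
```

Since the file just contains the output of `pwd`, this is essentially `cat` plus a sanity check.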
Okay, so it worked! ...what did it do? To see, go to the “base directory” you specified when generating the script.
First, it created an output directory for each vertex. And a “default” realization directory under each vertex directory since there was only one realization in our workflow. Using your trusty shell, have a look at:
ls -lh FS-input-files/default/final/example.txt
which is a symlink to the input file you specified. Similarly, you can find the output of the next 2 vertices with:
less 100-head-10/default/final/corpusOut
less 200-head-5/default/final/corpusOut
They should contain the first 10 and first 5 lines of the previous file. Next to the “final” directory, you will also see a “working” directory, which contains all of the intermediate files the vertex produced while running. There is also an “inputs” and “outputs” directory under “working.”
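The per-vertex layout just described can be written down as a tiny helper, purely for illustration (LoonyBin itself does not expose such a function; the directory names come from this tutorial):

```python
import os

def vertex_dirs(base_dir, vertex, realization='default'):
    """Return the directories LoonyBin creates under one vertex/realization."""
    root = os.path.join(base_dir, vertex, realization)
    return {
        'final':   os.path.join(root, 'final'),               # finished outputs
        'working': os.path.join(root, 'working'),             # intermediate files
        'inputs':  os.path.join(root, 'working', 'inputs'),
        'outputs': os.path.join(root, 'working', 'outputs'),
    }
```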
For more info, see Where did the output go? What directory structure does LoonyBin create?.
LoonyBin also produced log files for each step. Each log file contains the log files of all previous vertices as well, since determining a vertex’s dependencies can be non-trivial in HyperWorkflows. For example, have a look at the last loon log:
less 200-head-5-default.loon
Notice that the log file contains information about what machine the vertex ran on, any parameters of the vertex, how long it took the vertex to run, and any logging information from pre/post-analyzers (discussed later).
For more info, see A few words on loon logs.
This section explains how to get the Home Machine (the machine where you will actually execute the generated script) ready for running the example. This section assumes you have already followed Downloading and Installing on the Design Machine.
scp -r $LOONYBIN/scripts user@my.machine:~/loonybin/scripts
ssh $HOME_MACHINE

# $SCRIPTS is the directory where you
# copied the scripts (e.g. ~/loonybin/scripts)
chmod +x $SCRIPTS/*

# Remember the location of this $PATH_DIRECTORY for later
mkdir -p ~/myPathDirectory
cd $SCRIPTS
pwd > ~/myPathDirectory/loonybin-scripts.path
loonybin-scripts.path is a special file that LoonyBin reads to determine the location of its helper scripts on each machine.
For more information on path files, see How does LoonyBin find the tools installed on each machine?.
For non-trivial workflows (such as those in the Machine Translation Toolpack), you will also need to install other dependencies on the home machine (for example, using the “install-dependencies.py” script included in the MT Toolpack). Doing this yourself is covered in more detail in Creating and Using Your Own Tool Descriptor.
(If you won’t be invoking generated workflow scripts directly from this machine -- i.e. this is a “Remote Machine” -- you can stop here)
Next: Preparing to Run a Workflow Synchronously on the Home Machine
The following information is also available in the YouTube video: http://www.youtube.com/watch?v=_akylEUCIQU
After following Opening and Understanding an Example Workflow and Downloading and Installing on the Home Machine...
cd $LOONYBIN
scp -r tool-packs/example/example-input user@myMachine.org:/usr13/jhclark/example-input
You might also want to open example.txt to see for yourself what’s inside (just line numbers).
Next: Running a Workflow Synchronously on the Home Machine or Running a Workflow Asynchronously on the Home Machine
Though you will typically want to run workflows asynchronously (all vertices whose dependencies are satisfied run in parallel), there are some situations where you might want to run a workflow synchronously (one vertex at a time). First, a synchronous workflow runs from a single monolithic generated bash script, whereas asynchronous workflows are a zipped collection of shell scripts and metadata that must be run by the LoonyBin workflow executor. Since asynchronous workflows are run using the executor, they also gain the executor’s web UI, which displays the current status of the workflow and allows easy access to its output.
After following Preparing to Run a Workflow Synchronously on the Home Machine...
cd /directory/where/i/put/the/generated/script
scp example.sh user@myMachine.org:~
ssh user@myMachine.org
cd ~
bash example.sh -run
For a comparison of synchronous vs asynchronous workflows, read the introduction to Running a Workflow Synchronously on the Home Machine.
After following Preparing to Run a Workflow Synchronously on the Home Machine...
NOTE: To run the asynchronous version, we will use the generated .async file instead of the .sh file.
cd /directory/where/i/put/the/generated/script
scp example.async user@myMachine.org:~
ssh user@myMachine.org
cd ~
$LOONYBIN/LoonyBinWorkflowExecutor.sh example.async 4242
In this section, we will recreate the example workflow that we’ve been using earlier in the tutorial.
The filesystem vertex is also one point of integration with HDFS (the Hadoop Distributed File System). You can tell LoonyBin that it should look for a file on HDFS instead of on the UNIX filesystem by just prefixing the input name with “hdfs:” -- For example, if example.txt were located on HDFS, we could write “hdfs:example.txt” and then just put its path on HDFS in the path box to its right.
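The prefix convention is easy to picture: strip the leading “hdfs:” if present and treat the remainder as the file name on that filesystem. The helper below is a hypothetical sketch of that interpretation; the real parsing lives inside LoonyBin.

```python
def parse_input_name(name):
    """Split an input name into (filesystem, filename) per the hdfs: convention."""
    if name.startswith('hdfs:'):
        return 'hdfs', name[len('hdfs:'):]
    return 'unix', name
```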
You should be seeing some strong parallels with Opening and Understanding an Example Workflow by this point.
You should have already looked at Running a Workflow Synchronously on the Home Machine or Running a Workflow Asynchronously on the Home Machine. Now we’ll pick up where we left off after Creating a Trivial Workflow.
In the context of research, a workflow typically encodes a set of steps under a fixed set of conditions that leads to a single experimental result (e.g. a vector of 3 numbers). However, good empirical research should always include at least a control group or a baseline system that represents the current state-of-the-art, and usually there is not a fixed set of experimental conditions, but many different tools, parameters, and hyperparameters are tried before settling on a result. This is where HyperWorkflows become useful. They encode multiple experimental conditions while not requiring you to duplicate shared paths in your definition of the workflow nor while the workflow executes.
In LoonyBin, HyperWorkflows are just workflows that contain packing vertices (a.k.a. OR vertices), which act like switches (or multiplexers, for the electrical engineers among you) between their incoming edges. We call a path through a workflow in which we have selected one named input for each packing vertex a realization. A HyperWorkflow encodes many paths (a.k.a. realizations) through the workflow in a compact form (like forests or hypergraphs, for the natural language processing people among you). When there are multiple packing vertices in a row, LoonyBin takes the cross-product between them, producing results for all combinations of these different conditions. You also have the option of restricting the cross-product to some set of experiments you care about by using the Realization Selection mechanism, described in Dealing with Lots of Realizations.
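The cross-product behavior can be illustrated with plain Python. The vertex and edge names below are made up for illustration; this is not LoonyBin code.

```python
from itertools import product

# Named inputs to two hypothetical packing vertices
first_vertex = ['head-10', 'head-2', 'head-4']
second_vertex = ['baseline', 'experimental']

# Each combination of choices is one realization
realizations = ['+'.join(combo) for combo in product(first_vertex, second_vertex)]
# 3 choices x 2 choices = 6 realizations
```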
In the past, workflow management systems have dealt with this issue by having the user (that’s you), copy their entire configuration file for the workflow, change a variable or two and rerun. The best of these systems can automatically determine what dependencies were run in previous experiments and the worst rerun the entire workflow again. However, they all suffer from the problem of clone-and-modify programming (ridiculed as “inheritance” in Abject-Oriented Programming: http://typicalprogrammer.com/?p=8). If a bug was in the original configuration, it is now in all of its clones, too. Good luck finding them all. (AND all of the bug-ridden files they created!)
Going back to our trivial example, let’s say we want to test the effect of head-5 on 3 inputs, one with head-10 of example.txt, one with head-2 of example.txt, and one with head-4 of example.txt.
At this point, you have exactly recreated the trivial workflow from before. Because the incoming edge of the packing vertex has the special name “default” it will not affect the output directory structure of the workflow. Now let’s add the other two realizations so that we get more results:
Your HyperWorkflow is now complete. To generate a script that implements some or all of the realizations it contains, you will need to create more Realization Selections, which we will discuss in the next section of the tutorial.
After you execute this workflow, you will notice that the 200-head-5 vertex directory under the base directory now contains multiple subdirectories, one for each realization. Also, if that vertex had child vertices, they would also contain a subdirectory for each realization.
By now, you should have seen Creating a Trivial HyperWorkflow, or else you won’t have many realizations to work with.
If a workflow contains 3 packing vertices with 4 input names each, this is 4 x 4 x 4 = 64 realizations for the final vertex. If that vertex takes a day to run, running all possible experiments may be unacceptable.
Realization Sets are the way that LoonyBin handles the exponential explosion in the number of realizations that you get when introducing multiple packing vertices into a workflow. A realization set allows you to select a few vertices as “goals” so that all vertices necessary to run those goals will also be run, as well as select a few realization instances (the names of the edges going into packing vertices) that will be run for those goals. However, you must select at least one realization instance per packing vertex or else there will be no complete path to the goal and you won’t get any output.
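The restriction can be pictured as a filter over the cross-product. The instance names and the particular selection below are invented for illustration:

```python
from itertools import product

# 3 packing vertices with 4 named instances each: 4**3 = 64 realizations
instances = ['a', 'b', 'c', 'd']
all_realizations = list(product(instances, instances, instances))

# A realization set keeps at least one instance per packing vertex
keep = ({'a'}, {'a', 'b'}, {'a'})
selected = [r for r in all_realizations
            if all(choice in allowed for choice, allowed in zip(r, keep))]
# Only 1 x 2 x 1 = 2 realizations actually run
```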
Now you’re ready to generate and run a bash script, just as we did in Preparing to Run a Workflow Synchronously on the Home Machine and Running a Workflow Asynchronously on the Home Machine.
Next: Downloading and Installing on Other Remote Machines (In preparation for Creating a Workflow that Runs on Multiple Machines and/or Schedulers)
LoonyBin distinguishes the design machine (where you define your workflow) from the home machine (the primary machine that executes your workflow, such as Your Favorite Server) and from other remote machines. A remote machine is reachable from the home machine via passwordless SSH.
Follow the instructions for Downloading and Installing on the Home Machine, but you don’t need to copy the workflow executor to a remote machine, since the executor only runs on the Home Machine.
You’ll also need to make sure the remote machine can be reached from the home machine via passwordless SSH. If you need help with this, I recommend the tutorial at http://www.debian-administration.org/articles/152. You can either use a blank passphrase if you’re confident of your filesystem’s security or the more complicated ssh-agent solution. It’s up to you. But in the end, you should be able to issue these commands from the home machine without being prompted for a password:
you@homeMachine$ ssh you@remoteMachine
you@remoteMachine$ echo 'Look ma, no password'
Next: Creating a Workflow that Runs on Multiple Machines and/or Schedulers
By now you should have seen Creating a Trivial Workflow and Running a Workflow Asynchronously on the Home Machine, and you should have just followed Downloading and Installing on Other Remote Machines...
Modern scientific computing needs a lot of horsepower. More than most research groups or even small companies can afford on their own. This often leads to getting grants for or renting computing power on academic or commercial clusters. Sometimes even multiple clusters. For instance, you might choose to run one hackish tool that requires you to have root on your local server, run a MapReduce job on a Hadoop cluster with 1000 nodes and 1GB RAM per node, and then run another tool on a cluster with 20 nodes and 128GB RAM per node. Each cluster might even have its own job scheduler, which requires you to submit your jobs to a head node in different ways. Yet it’s painful to copy all of these files between machines and clusters while still keeping your workflow organized. Not so with LoonyBin.
In LoonyBin, you can define multiple machine configurations. Each machine configuration can specify (1) a hostname, (2) a scheduler, and (3) any parameters the scheduler/ssh client needs to know to run the job, such as walltime or physical memory required. Currently, LoonyBin supports the Torque scheduler and has some experimental support for Sun Grid Engine (SGE) and Condor. It is possible and common to have multiple machine configurations for a single machine, especially when using a scheduler that requires you to specify how much walltime will be used. Also, when using a scheduler, either your home machine or the remote machine should be a head node of the cluster (a node where you can submit jobs to the scheduler). You can easily write your own submitter by writing some Java code and inheriting from loony.submitters.ChildSubmitter; complaining on the mailing list might also produce results.
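To make the three pieces concrete, here is a hypothetical illustration of a few machine configurations as plain Python data. The field names and values are ours, not LoonyBin’s actual configuration format.

```python
machine_configs = {
    'local-server':      {'hostname': 'my.server.org',
                          'scheduler': None, 'params': {}},
    'big-cluster-short': {'hostname': 'head.cluster.org',
                          'scheduler': 'torque',
                          'params': {'walltime': '2:00:00', 'mem': '4gb'}},
    'big-cluster-long':  {'hostname': 'head.cluster.org',
                          'scheduler': 'torque',
                          'params': {'walltime': '48:00:00', 'mem': '120gb'}},
}
```

Note the two cluster entries share a hostname: multiple configurations per machine are exactly what you want when a scheduler requires walltime up front.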
Now you’re ready to generate and run a bash script, just as we did in Preparing to Run a Workflow Synchronously on the Home Machine and Running a Workflow Asynchronously on the Home Machine.
After executing the workflow, you’ll notice that the log files for the 100-head-10 step that ran remotely still got copied back to your base directory on the home machine. Also, there will be a 100-head-10 directory in the base directory, which contains a “list-files” script that records the machine and directory where that vertex was run. For kicks, you can ssh to the remote machine and poke around the target work directory to see what was created there as well; it very closely resembles the base directory on the home machine.
New tools are written in the Python programming language by implementing the Tool interface. For simple tools, this means just copying the example template and making 1-2 minutes worth of modifications.
These tools should be placed in a subdirectory of your LoonyBin tool-packs directory. We recommend having a tool pack subdirectory for each genre of project you work on or for each group you wish to share your tools with. That way, each subdirectory can be synchronized and shared via a version control system.
For example, if you’re working with the Machine Translation toolpack, you will have a subdirectory of tool-packs called machine-translation. Tool descriptors you wish to contribute back to the community should be placed in this tool pack. If there are tools you also use internally for your organization and do not wish to release, or you have tools for some other purpose (e.g. Computational Biology), you should place these in a separate tool pack directory.
We will walk through an example of a simple tool, commenting on each section of the file.
from loonybin import Tool

class MyTool(Tool):
Just some boilerplate code.
    def getName(self):
        return 'Machine Translation/Parallel Corpus/Head'
Gives the tool a name for use in the LoonyBin Workflow Designer GUI. The slash-separated elements are used to create a hierarchical tree structure for selecting tools.
    def getDescription(self):
        return ("Takes an input parallel corpus and provides the "
                "first n lines of each side of the corpus as output.")
More information for display in the GUI.
    def getRequiredPaths(self):
        return ['location-of-head']
Returns a list of strings that correspond to the path files that will be queried for the location of the required paths to execute this tool on the target machine. All files in the required directory will be symlinked into the vertex working directory of this tool so that they can be used by the commands generated in getPreAnalyzers(), getCommands(), and getPostAnalyzers().
    def getParamNames(self):
        return [('nLines', 'the number of lines that should be taken from the '
                           'head of each side of the parallel corpus')]
Returns a list of pair tuples. The first element of each tuple is the name of the parameter that the workflow designer must provide a value for. The second element of each tuple is a short description that will be used in the GUI.
    def getInputNames(self, params):
        return [('fCorpusIn', 'source side of the corpus to be read'),
                ('eCorpusIn', 'target side of the corpus to be read')]
Similar to getParamNames(), returns a list of pair tuples. The params argument is a dictionary (Python-ese for a map or hash table) with the param names specified in getParamNames() as its keys and the user-specified values as its values.
    def getOutputNames(self, params):
        return [('fCorpusOut', 'source side of the corpus to be written'),
                ('eCorpusOut', 'target side of the corpus to be written')]
The output counterpart to getInputNames().
    def getPreAnalyzers(self, params, inputs):
        return ['AnalyzeParCorpus.sh %(fCorpusIn)s %(eCorpusIn)s' % inputs]
Returns a list of strings that should be executed on the command line, given dictionaries of the params and input files. This example uses standard Python string formatting. For more information on formatting strings in Python visit http://www.python.org
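If the dictionary-based %-formatting is unfamiliar, here is what it does (the file names are made up):

```python
# Each %(key)s placeholder is looked up in the dictionary on the right of %
outputs = {'fCorpusOut': 'out.f', 'eCorpusOut': 'out.e'}
cmd = 'AnalyzeParCorpus.sh %(fCorpusOut)s %(eCorpusOut)s' % outputs
# cmd is now 'AnalyzeParCorpus.sh out.f out.e'
```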
    def getCommands(self, params, inputs, outputs):
        return ['head.sh -n %(nLines)s ' % params +
                '< %(fCorpusIn)s ' % inputs +
                '> %(fCorpusOut)s' % outputs,
                'head.sh -n %(nLines)s ' % params +
                '< %(eCorpusIn)s ' % inputs +
                '> %(eCorpusOut)s' % outputs]
Returns a list of strings, the command lines that are the “main event” of this tool.
    def getPostAnalyzers(self, params, inputs, outputs):
        return ['AnalyzeParCorpus.sh %(fCorpusOut)s %(eCorpusOut)s' % outputs]
The post-command execution counterpart to getPreAnalyzers().
if __name__ == '__main__':
    MyTool().handle()
A bit of boilerplate code that tells the Python interpreter to use the handle() method in LoonyBin’s Tool class, which reads stdin and writes stdout to communicate information about this tool to the LoonyBin framework.
Note that you will now need to install the software for this tool descriptor on each home and remote machine that you wish to run it on. In this example, you would need to install the “head.sh” command on the home machine in any directory you choose and then create a path file for it. It might look something like this:
INSTALL_DIR=/some/path
mkdir $INSTALL_DIR
cd $INSTALL_DIR

# Get the head.sh from some source
wget head.sh

pwd > $PATH_DIR/location-of-head.path
Here, PATH_DIR is the LoonyBin path directory for this machine. Notice that location-of-head was defined in the getRequiredPaths() function of the tool descriptor we just created.
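Putting the snippets together, here is a self-contained sketch of the whole descriptor. The Tool base class below is only a stub standing in for loonybin.Tool (the real handle() protocol is not reproduced), so this sketch runs without LoonyBin installed; the method names follow the tutorial.

```python
class Tool(object):
    """Stub for loonybin.Tool -- for illustration only."""

class MyTool(Tool):
    def getName(self):
        return 'Machine Translation/Parallel Corpus/Head'

    def getRequiredPaths(self):
        return ['location-of-head']

    def getParamNames(self):
        return [('nLines', 'lines to take from the head of each side')]

    def getInputNames(self, params):
        return [('fCorpusIn', 'source side'), ('eCorpusIn', 'target side')]

    def getOutputNames(self, params):
        return [('fCorpusOut', 'source side'), ('eCorpusOut', 'target side')]

    def getCommands(self, params, inputs, outputs):
        # One head.sh invocation per side of the parallel corpus
        return ['head.sh -n %(nLines)s ' % params +
                '< %(fCorpusIn)s ' % inputs +
                '> %(fCorpusOut)s' % outputs,
                'head.sh -n %(nLines)s ' % params +
                '< %(eCorpusIn)s ' % inputs +
                '> %(eCorpusOut)s' % outputs]

# Exercise the descriptor with hypothetical file names
tool = MyTool()
cmds = tool.getCommands({'nLines': '10'},
                        {'fCorpusIn': 'in.f', 'eCorpusIn': 'in.e'},
                        {'fCorpusOut': 'out.f', 'eCorpusOut': 'out.e'})
```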
(Updated as of V0.5.0)