User Documentation

NEWS: LoonyBin is no longer under active development.
All users are encouraged to upgrade to ducttape:
https://github.com/jhclark/ducttape

LoonyBin is hosted on sourceforge at http://sourceforge.net/projects/loonybin/

Download it now (V0.6.0 Release Candidate 1 now available -- documentation coming soon).

New to LoonyBin? Consider the Tutorial for V0.5.0.

Summary

What is LoonyBin?	The short answer: A workflow management system with the added notion of “HyperWorkflows.”
Mailing List	You sign up at https://lists.sourceforge.net/lists/listinfo/loonybin-users.
Publications	See http://www.cs.cmu.edu/~jhclark/index.htm#pubs.
FAQ
Is it really safe to use this for my research yet?	Yes.
Do I need to know Java to incorporate my own tools into LoonyBin?	No.
Do I need to know Python to use LoonyBin?	If you want to integrate your own tools into LoonyBin (which is very likely), you will need to know some very basic Python.
What are the requirements for running LoonyBin?	You will need Java 5 installed on the machine that you wish to design workflows on.
Does LoonyBin handle relocation of files and execution on multiple machines?	Yes.
Does LoonyBin support schedulers?	Yes.
Do you support HDFS?	Yes.
Can workflows be resumed after a failure?	Yes.
How do I rerun a vertex?	Delete the appropriate loon log file from the base directory on the home machine before rerunning the workflow bash script.
Is there built-in support for high-level operations?	No.
Getting Started on V0.4.0	New to LoonyBin?
Creating a Basic Workflow Template	First, we will define the programs in our workflow and how their inputs and outputs depend on each other.
Telling your workflow which files to use	We will now define the inputs to our workflow.
Creating New Machine Configurations	Machine Configurations roll multiple aspects of a vertex’s execution into one concept: on what machine the tool will execute, under what directory the directory structure for the step will be created, and what scheduler will be used to submit the job (if any).
How does LoonyBin find the tools installed on each machine?
Generating and executing your workflow as a bash script	LoonyBin can compile the graphic representation of your workflow into an executable Bash script.
Where did the output go? What directory structure does LoonyBin create?
A few words on loon logs	Each loon log is a complete record of how the output at a given workflow vertex was created.
Sanity checking	During the Preanalyzer and Postanalyzer stages, analyzer programs can be run to check the sanity of the data.
Creating a HyperWorkflow, A Not-So-Basic Workflow with Multiple Realizations	LoonyBin supports running a workflow multiple times with variations in each run.
Getting A Bit More Advanced
Navigation	You can zoom in and out of your workflow graphs using your scroll wheel or equivalent.
Deleting Edges and Vertices	To delete an edge or vertex, right-click on it and then select the delete option.
Moving multiple vertices at once	Change to the Selecting mouse mode and then drag a box around the vertices you wish to move.
Parameter Boxes	A parameter box is a special tool that runs no commands, but instead only holds arbitrary parameters.
Future Work
Who Wrote It?	LoonyBin was written by Jonathan Clark (Visit http://www.cs.cmu.edu/~jhclark) to address many of his frustrations with inefficiencies in the way that empirical machine learning research (specifically machine translation research) is conducted.
How does LoonyBin compare to...
Condor / PBS / Sun GridEngine	These are all schedulers intended to find an available machine for your job, run it there and return the results.
DAGMan	Like LoonyBin, DAGMan manages dependencies between jobs.
Pegasus	Like Pegasus, LoonyBin is also a workflow management system.
Dryad	Dryad (like Pegasus) is another workflow management system.

What is LoonyBin?

The short answer: A workflow management system with the added notion of “HyperWorkflows.”

The long answer

A set of scripts for extracting experimental results and other statistics from a series of workflow steps and collecting them into a single, easy-to-manipulate format (plain text: records are newline-delimited and key-value pairs are tab-delimited)
A set of scripts for analyzing said statistics and raising errors (including email/SMS notifications) when heuristics indicate that a step failed
A set of scripts for analyzing said data format and generating LaTeX charts, graphs, and reports
A Java GUI application for visually designing workflows (as Directed Acyclic Graphs) and multiple experiments with HyperWorkflows (as Directed Acyclic Hypergraphs), generating BASH scripts that implement these workflows and execute them over multiple computers (while maintaining a neatly organized directory structure)
A Python (Jython) interface for describing the inputs, outputs, and parameters of various Tools, which can serve as steps, to this GUI
A way of encoding complex workflows so that they can be reliably reproduced, even by those who do not have an intimate knowledge of their inner workings

(Updated as of V0.5.0)

Mailing List

You sign up at https://lists.sourceforge.net/lists/listinfo/loonybin-users.

You can read the archives at https://sourceforge.net/mailarchive/forum.php?forum_name=loonybin-users.

(Updated as of V0.5.0)

Publications

See http://www.cs.cmu.edu/~jhclark/index.htm#pubs.

(Updated as of V0.5.0)

FAQ

Is it really safe to use this for my research yet?

Yes. LoonyBin has been tested by several people now and the author has used it in his research for about a year. However, several testers means only about a dozen; some small bugs may remain. If you find one, send in a report and it will be handled promptly.

(Updated as of V0.5.0)

Do I need to know Java to incorporate my own tools into LoonyBin?

No.

(Updated as of V0.5.0)

Do I need to know Python to use LoonyBin?

If you want to integrate your own tools into LoonyBin (which is very likely), you will need to know some very basic Python. Painfully basic even. You can probably copy and paste from the tutorial and be just fine.

(Updated as of V0.5.0)

What are the requirements for running LoonyBin?

You will need Java 5 installed on the machine that you wish to design workflows on. You will need the Bash shell installed on the machine you wish to execute your compiled workflows on (since workflows are compiled into Bash scripts). If you want to run tools on remote machines, you will also need to have passwordless SSH set up.

You do not need Python installed on the design machine. LoonyBin uses the (included) Jython interpreter for its Python tool descriptors.

(Updated as of V0.5.0)

Does LoonyBin handle relocation of files and execution on multiple machines?

Yes. When describing your workflow, you tell LoonyBin which machine (e.g. a local machine or a headnode of a cluster) you want each tool to run on. The workflow is executed from a “base machine” that can contact (via SSH) the other machines involved in the workflow. Input files are automatically retrieved and named so that you don’t have to worry about it (read: so that you don’t waste time misspelling filenames).

(Updated as of V0.5.0)

Does LoonyBin support schedulers?

Yes. LoonyBin supports schedulers via “Machine Configurations.” You can choose for LoonyBin to run each vertex directly at the command line or on a scheduler. Currently, LoonyBin supports the Torque scheduler, which is derived from PBS. It is likely this also works on Sun Grid Engine and Condor, but this functionality is still under testing. If your favorite scheduler isn’t available in LoonyBin, you simply need to implement the ChildSubmitter Java interface. Or complain on the mailing list.

(Updated as of V0.5.0)

Do you support HDFS?

Yes. Since much of the intended audience of LoonyBin (read: impatient MT researchers) uses Apache Hadoop for running MapReduce jobs, it made sense to natively support copying files to and from HDFS. Tools may mark inputs and outputs that reside on HDFS by prefixing the input or output name with “hdfs:”. Also, files are efficiently streamed directly from HDFS rather than being written to a local disk first.

(Updated as of V0.5.0)

Can workflows be resumed after a failure?

Yes. If the workflow fails or is interrupted by the user, the next time the workflow is run, it will automatically detect which tool vertices and realizations have already successfully completed by checking which loon log files exist in the base directory on the home machine. If there is any partial output, the directories will be purged before rerunning.
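
The resume check described above can be pictured with a short Python sketch. This is not LoonyBin's actual code: the function name, the step names, and in particular the log file naming scheme ("<vertex>.<realization>.loon") are assumptions made only for illustration.

```python
import os


def pending_steps(base_dir, steps):
    """Return the (vertex, realization) pairs that have no loon log in base_dir.

    Assumption: logs are named "<vertex>.<realization>.loon". The naming
    LoonyBin really uses may differ.
    """
    pending = []
    for vertex, realization in steps:
        log_name = "%s.%s.loon" % (vertex, realization)
        if not os.path.exists(os.path.join(base_dir, log_name)):
            pending.append((vertex, realization))
    return pending


if __name__ == "__main__":
    # Hypothetical step names; only steps without a log would be run again.
    steps = [("1000-tokenize", "default"), ("2000-align", "default")]
    print(pending_steps(".", steps))
```

Under this picture, deleting a step's log file is what marks that step as pending again on the next run.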

(Updated as of V0.5.0)

How do I rerun a vertex?

Delete the appropriate loon log file from the base directory on the home machine before rerunning the workflow bash script.

(Updated as of V0.5.0)

Is there built-in support for high-level operations?

No. LoonyBin leaves the implementation of all high-level tasks to the user to be defined as separate tools. For instance, “take the k-best-scoring outputs of component X”, “sort the outputs of component Y by some criterion”, or “multiply the corresponding output scores of components A and B” would all be candidates for separate tools that perform these functions.

(Updated as of V0.5.0)

Getting Started on V0.4.0

New to LoonyBin? Consider the Tutorial, which is a bit more up-to-date than the current Getting Started docs.

Creating a Basic Workflow Template

First, we will define the programs in our workflow and how their inputs and outputs depend on each other.

1) Select a tool from the toolbox on the left

2) Change the mouse mode to editing in the top toolbar

3) Click in the center work area to create a tool vertex in your workflow

4) Change the mouse mode to selecting in the top toolbar

5) Use the right panel to give the vertex a name and specify the parameters for the tool. The name of the vertex will be used in the directory structure created when running the workflow and in execution logs to help you trace what happened in the workflow. If you wish to inherit the parameters from another tool, just leave them blank; you will have the option to do this in step 9. Parameter inheritance is especially useful for some of the more common parameters such as machineName, username, tgtWorkDir, and pathDir.

A note on vertex names: If the name begins with a string of numbers (e.g. 5000-do-something), the Bash script generator will use these numbers to order the steps. If two vertices have the same dependencies, then the vertex with the lower prefix number will be executed first.

6) Select the machine configuration under which you wish this vertex to run (initially, you will only have the “localhost” machine configuration)

7) Repeat 1-6 to create another tool vertex

8) Click on the first tool vertex and drag to the other tool vertex to create a dependency edge

9) A dialog box will appear presenting the outputs of the first tool and the inputs of the next tool (circles) as well as the parameters of both tools (squares). Drag to specify which output files will become the inputs of the next tool. Also, you can draw dependencies between the parameters to specify inheritance.

We now have a template for our workflow that can take in arbitrary input. Vertices that still require inputs are shown in red. If you want a workflow template that isn’t tied to any particular files, now is a good time to save.
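
To make the note on numeric vertex name prefixes concrete, here is a minimal Python sketch of one way such an ordering could be computed: a topological sort that breaks ties by the leading number in the vertex name. It is an illustration under assumptions, not the Bash script generator's real algorithm. The vertex names are made up, and sorting unprefixed names last is our own choice.

```python
import heapq
import re


def numeric_prefix(name):
    """Return the leading digits of a vertex name as an int, or None if absent."""
    match = re.match(r"\d+", name)
    return int(match.group()) if match else None


def sort_key(name):
    prefix = numeric_prefix(name)
    # Assumption: names without a numeric prefix sort after prefixed ones.
    return (prefix is None, prefix or 0, name)


def order_vertices(dependencies):
    """Order vertices so dependencies come first, ties broken by numeric prefix.

    `dependencies` maps every vertex name to the set of vertex names it
    depends on; every vertex must appear as a key.
    """
    remaining = {vertex: set(deps) for vertex, deps in dependencies.items()}
    children = {vertex: [] for vertex in dependencies}
    for vertex, deps in dependencies.items():
        for dep in deps:
            children[dep].append(vertex)

    ready = [sort_key(v) for v, deps in remaining.items() if not deps]
    heapq.heapify(ready)
    ordered = []
    while ready:
        vertex = heapq.heappop(ready)[2]
        ordered.append(vertex)
        for child in children[vertex]:
            remaining[child].discard(vertex)
            if not remaining[child]:
                heapq.heappush(ready, sort_key(child))
    if len(ordered) != len(dependencies):
        raise ValueError("dependency cycle detected")
    return ordered


if __name__ == "__main__":
    workflow = {
        "1000-tokenize": set(),
        "3000-align": {"1000-tokenize"},
        "2000-build-lm": {"1000-tokenize"},
        "4000-decode": {"3000-align", "2000-build-lm"},
    }
    print(order_vertices(workflow))
```

Here 2000-build-lm and 3000-align have the same dependencies, so the lower prefix puts 2000-build-lm first.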

(Updated as of V0.4.0)

Telling your workflow which files to use

We will now define the inputs to our workflow.

1) Select the filesystem tool from the toolbox on the left

2) Change mouse mode to editing in the top toolbar

3) Add the filesystem tool to the workflow

4) Create a dependency edge from the filesystem vertex to each vertex that still needs inputs, using the dialog to connect edges as necessary

Now is probably a good time to save a separate copy of your workflow.

(Updated as of V0.4.0)

Creating New Machine Configurations

Machine Configurations roll multiple aspects of a vertex’s execution into one concept: on what machine the tool will execute, under what directory the directory structure for the step will be created, and what scheduler will be used to submit the job (if any).

1) Select “Machine Configurations” from the Workflow menu

(TODO)

How does LoonyBin find the tools installed on each machine?

Path files. At each vertex, you specify a directory where LoonyBin should look for its path files. When a tool is defined in its Python descriptor file, a list of required paths is specified (e.g. “jons-scripts”). When a workflow is first executed, LoonyBin checks to make sure all path files and paths exist. First, it looks in the specified path directory on the target execution machine (e.g. “/home/jhclark/paths”) for a path file with the name of the path required by the tool (e.g. “/home/jhclark/paths/jons-scripts.path”). Next, it reads the first (and only) line of the file to find the one directory where the required path exists on the target machine. All of the files in that directory are then symlinked into a unique working directory for each workflow vertex along with all required input files.

This allows tools to be installed at different locations on different target machines while not requiring binaries to be copied over the wire for each execution. You must create these path files before you can execute your workflow.
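
The path file lookup can be sketched in a few lines of Python. This is only an approximation of the behavior described above, not the code LoonyBin generates (which is Bash). The function name and its arguments are hypothetical.

```python
import os


def link_tool_files(path_dir, path_name, working_dir):
    """Symlink every file from the directory named in a path file into working_dir."""
    path_file = os.path.join(path_dir, path_name + ".path")
    with open(path_file) as handle:
        # The path file holds a single line: the directory where the tool lives.
        tool_dir = handle.readline().strip()
    if not os.path.isdir(tool_dir):
        raise FileNotFoundError("path %r from %s does not exist" % (tool_dir, path_file))
    os.makedirs(working_dir, exist_ok=True)
    linked = []
    for entry in sorted(os.listdir(tool_dir)):
        link = os.path.join(working_dir, entry)
        os.symlink(os.path.join(tool_dir, entry), link)
        linked.append(link)
    return linked
```

For example, calling link_tool_files("/home/jhclark/paths", "jons-scripts", some_working_dir) would read /home/jhclark/paths/jons-scripts.path and link the files it points to into the working directory.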

(Updated as of V0.4.0)

Generating and executing your workflow as a bash script

LoonyBin can compile the graphic representation of your workflow into an executable Bash script. From the menu, select Workflow -> Generate Bash Script.

LoonyBin allows you to design your workflow on one machine (the design machine) and then execute the generated bash script on another machine such as a server (the home machine). Because of this, you will need to have a copy of the LoonyBin scripts directory on the home machine (obviously, this is already done if the design machine and the home machine are the same).

The dialog you now see will ask you for the path of the LoonyBin scripts directory on the home machine. Also, you need to tell LoonyBin a base directory on the home machine where log data and pointers to output data generated during workflow execution will be placed. You should also specify the path and name of the bash script that will be generated.

Finally, you can give LoonyBin a space-separated list of email addresses to notify when the workflow either fails or succeeds. Most cell phone carriers provide email addresses that forward directly to your phone’s SMS if you prefer to be notified there.

Now just copy the generated bash script to the home machine you specified and execute it. All required input files for each step will automatically be transferred to the proper machine before the tool is executed.

(Updated as of V0.4.0)

Where did the output go? What directory structure does LoonyBin create?

You should always start looking for your output in the base directory of the home machine.

Even if executed on a remote machine or in a different working directory, LoonyBin stores a script called ls-step-name in the base directory that will ssh to the right machine and ls the directory where the output was placed.

In general, LoonyBin creates three levels of directories. First, it creates a directory with the vertex name in both the base directory on the home machine and in the target work directory on the target execution machine. Second, below these vertex directories are realization directories named for the realization they contain (see a Not-So-Basic Workflow). If your workflow has only one realization, there will be only one directory, called “default”. Finally, the third level consists of a “working” and a “final” directory. The vertex working directory contains symlinks to all of the files in each of the required path directories specified by the tool and symlinks to all of the input files required by the step. The vertex final directory is created after execution of the tool has completed successfully. It contains symlinks to the output files of the step and to the .loon log file containing the workflow history.
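
As a quick illustration of the layout (vertex directory, then realization directory, then “working” and “final”), the following Python sketch builds the two leaf paths for one vertex. The helper name and the example root and vertex name are hypothetical.

```python
from pathlib import PurePosixPath


def vertex_dirs(root, vertex, realization="default"):
    """Return the (working, final) directories for one realization of a vertex."""
    base = PurePosixPath(root) / vertex / realization
    return base / "working", base / "final"


if __name__ == "__main__":
    working, final = vertex_dirs("/home/user/base", "1000-tokenize")
    print(working)  # /home/user/base/1000-tokenize/default/working
    print(final)    # /home/user/base/1000-tokenize/default/final
```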

(Updated as of V0.4.0)

A few words on loon logs

Each loon log is a complete record of how the output at a given workflow vertex was created. It contains all the information from all parent vertices, starting from the beginning of the workflow through the tool vertex that was just executed. LoonyBin automatically keeps detailed logs of workflow execution by writing loon log files during the execution of each tool. These logs always include the start time, finish time, elapsed time, time spent copying files, execution hostname, and execution username. Also, tools can add their own information to the logs during the Preanalyzer (run just before tool commands) and Postanalyzer (run just after tool commands) phases.

To remain as lightweight and simple as possible, loon logs are plaintext files containing tab-separated key-value pairs, one per line. This uniform format allows for easy extraction and formatting of data into tables, graphs, and reports.
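
Because the format is just tab-separated key-value pairs, one per line, reading a log takes only a few lines of Python. The sketch below is not part of LoonyBin, and the keys in the sample text are invented for illustration; they are not the key names LoonyBin actually writes.

```python
def parse_loon_log(text):
    """Parse tab-separated key-value lines into a dict (later keys overwrite earlier ones)."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _, value = line.partition("\t")
        record[key] = value
    return record


if __name__ == "__main__":
    # Made-up keys and values, used only to show the shape of the format.
    sample = "hostname\tnode01\nelapsedTime\t42\n"
    print(parse_loon_log(sample))  # {'hostname': 'node01', 'elapsedTime': '42'}
```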

TODO: Example log here

(Updated as of V0.4.0)

Sanity checking

During the Preanalyzer and Postanalyzer stages, analyzer programs can be run to check the sanity of the data. If one of these programs returns a non-zero exit code, the workflow execution will be halted and LoonyBin can notify you of the issue.
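
The halt-on-failure rule can be sketched in Python as follows. This is an illustration of the behavior described above, not LoonyBin's generated Bash. The function name is hypothetical, and the example analyzers are stand-in commands that simply exit with a chosen code.

```python
import subprocess
import sys


def run_analyzers(commands):
    """Run analyzer commands in order and stop at the first non-zero exit code.

    Returns None if every analyzer succeeds, otherwise a tuple of the
    failing command and its exit code.
    """
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            return command, result.returncode
    return None


if __name__ == "__main__":
    # Stand-in analyzers: the first passes, the second reports a problem.
    passing = [sys.executable, "-c", "import sys; sys.exit(0)"]
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    failure = run_analyzers([passing, failing])
    if failure is not None:
        print("halting workflow: analyzer exited with code", failure[1])
```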

(Updated as of V0.4.0)

Creating a HyperWorkflow, A Not-So-Basic Workflow with Multiple Realizations

LoonyBin supports running a workflow multiple times with variations in each run. This is useful for conducting parameter sweeps and testing the effect of different tools (or the ordering of tools) on the rest of a workflow. In this example, assume we want to determine the effect of running the workflow ending at tool vertex A versus the workflow ending at tool vertex B.

1) Create an OR vertex using the OR tool

2) Give the vertex a unique name

3) Create a realization edge from tool vertex A to the OR vertex by dragging from the tool vertex to the OR vertex

4) In the dialog box that appears, connect all of the files that will be required by subsequent tools to the OR vertex

5) Create a realization edge from tool vertex B to the OR vertex by dragging from the tool vertex to the OR vertex

6) In the dialog box, match the names of the output files from tool vertex B to their counterparts in the OR vertex. All tools feeding into the OR vertex must give it the same number of outputs.

7) Give each of the realization edges names. These will be used in both log files and in the created directory structure to keep the various realizations separate.

8) Create another tool vertex and create a dependency edge from the OR vertex to the new tool vertex.

You will notice that all of the realization names now appear under the new tool vertex. The tool will be run once for each realization using the inputs from each realization edge.

If you wish multiple tool vertices to feed into the same realization, you can give multiple OR vertices and their outgoing realization edges the same name.

Similarly, you can conduct parameter sweeps using multiple Parameter Boxes.
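
Conceptually, a tool placed after an OR vertex behaves like the Python sketch below: each realization edge supplies the same set of named files, and the tool runs once per realization. This is a simplified model for intuition only. The function name, realization names, and file names are hypothetical.

```python
def run_downstream(realizations, tool):
    """Run `tool` once per realization feeding an OR vertex.

    `realizations` maps each realization edge name to a dict of
    {OR vertex input name: file path} supplied by that edge.
    """
    expected = None
    for name, files in realizations.items():
        if expected is None:
            expected = set(files)
        elif set(files) != expected:
            # Every edge must supply the same inputs to the OR vertex.
            raise ValueError("realization %r supplies different inputs" % name)
    return {name: tool(files) for name, files in realizations.items()}


if __name__ == "__main__":
    realizations = {
        "from-A": {"corpus": "A/final/corpus.txt"},
        "from-B": {"corpus": "B/final/corpus.txt"},
    }
    results = run_downstream(realizations, lambda files: "processed " + files["corpus"])
    print(results)
```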

(Updated as of V0.4.0)

Getting A Bit More Advanced

Navigation

You can zoom in and out of your workflow graphs using your scroll wheel or equivalent. Also, you can select the Scrolling mode to change what part of the graph you view.

(Updated as of V0.5.0)

Deleting Edges and Vertices

To delete an edge or vertex, right-click on it and then select the delete option.

(Updated as of V0.5.0)

Moving multiple vertices at once

Change to the Selecting mouse mode and then drag a box around the vertices you wish to move. You can then click and drag any selected vertex to move the whole group.

(Updated as of V0.5.0)

Parameter Boxes

A parameter box is a special tool that runs no commands, but instead only holds arbitrary parameters. These are useful for sharing parameters across various tools or conducting parameter sweeps via packing tools.

(Updated as of V0.5.0)

Future Work

Here are a few features we’re still planning for future releases:

Support for dynamic “HyperEdge Generation” to provide a more flexible alternative to the OR tool
Support for compiling code out of version control systems such as SVN, CVS, Mercurial, and Git
Parallel execution of workflow (using a Java server on the home machine)
Interactive feedback about execution progress
Interactive feedback of logged information
Better scripts for generating charts, tables, and graphs from LoonyBin .loon log files
Dynamic changing of execution ordering of vertices and realizations while the workflow is running
Selecting a subset of possible realizations via a GUI checkbox
Storing subgraphs as tools that can be composed in larger workflows
A sexier UI

(Updated as of V0.4.0)

Who Wrote It?

LoonyBin was written by Jonathan Clark (Visit http://www.cs.cmu.edu/~jhclark) to address many of his frustrations with inefficiencies in the way that empirical machine learning research (specifically machine translation research) is conducted.

The Machine Translation Toolpack for LoonyBin was written by Jonathan Clark, Jonny Weese, Byung Gyu Ahn, Qin Gao, Kenneth Heafield, and Andreas Zollmann.

(Updated as of V0.5.0)

How does LoonyBin compare to...

Condor / PBS / Sun GridEngine

These are all schedulers intended to find an available machine for your job, run it there and return the results. LoonyBin is not itself concerned with scheduling, but it can submit vertices to these schedulers if desired.

Links: http://www.cs.wisc.edu/condor/, http://www.openpbs.org/, http://gridengine.sunsource.net/

(Updated as of V0.4.0)

DAGMan

Like LoonyBin, DAGMan manages dependencies between jobs. However, DAGMan does not allow multiple realizations of the same workflow (HyperDAGs) nor does it provide the advanced sanity-checking and logging capabilities of LoonyBin.

Link: http://www.cs.wisc.edu/condor/dagman

(Updated as of V0.4.0)

Pegasus

Like Pegasus, LoonyBin is a workflow management system. However, Pegasus is intended for workflows with thousands of nodes being distributed over a national-level scientific computing grid such as TeraGrid. Also, Pegasus abstracts a bit farther away from the specifics of the workflow than LoonyBin does. Thus, we believe LoonyBin to be more appropriate for the scale of tasks usually found in empirical machine learning research.

Link: http://pegasus.isi.edu

(Updated as of V0.4.0)

Dryad

Dryad (like Pegasus) is another workflow management system. However, Dryad requires Microsoft Windows Server HPC to run and is only available as an academic release. In general, Dryad takes a bit more of a heavyweight approach to solving the workflow problem.

Link: http://research.microsoft.com/en-us/projects/Dryad

(Updated as of V0.4.0)