Setting Up a Neuroscience HTC Workload 2: setupJobs and stub.yaml

Overview

Teaching: 20 min
Exercises: 0 min
Questions
  • How does setupJobs turn a stub.yaml into a populated directory tree?

Objectives
  • Understand how to use setupJobs with a YAML file to set up HTC workloads.

Overview

Now that you have worked through one preset example, the goal of this section is to show how InputSetup, and specifically setupJobs, can be used to set up any HTC analysis.

The simplest example: No special fields

The stub.yaml we worked with previously contained many parameters that were specific to the analysis we wanted to run and the WISC MVPA workflow. However, the YAML + setupJobs workflow is not specific to WISC MVPA or those parameters. Let’s boil this down and consider the simplest example.

# simple-stub.yaml
A: 1
B: 2
C: [1,2]
D: [3,4,5]

In this file, I define four parameters: two scalar values and two lists.

Let’s run setupJobs.

$ ~/InputSetup/setupJobs simple-stub.yaml
$ find ./ -type f -name "params.json" -print
./0/params.json
$ cat ./0/params.json
{
    "A": 1,
    "B": 2,
    "C": [ 1, 2 ],
    "D": [ 3, 4, 5 ],
    "SearchWithHyperband": false
}

Nothing interesting here: we basically translated YAML to JSON and added one field that we can ignore.
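To make the translation concrete, here is a minimal Python sketch (not the actual setupJobs implementation) of what this no-special-fields case amounts to: the parsed YAML mapping is written out as a JSON object, with the SearchWithHyperband default added.

```python
import json

# Minimal sketch (not the real setupJobs code): with no special fields,
# the YAML mapping is carried over to JSON unchanged, and one default
# field, SearchWithHyperband, is added.
stub = {"A": 1, "B": 2, "C": [1, 2], "D": [3, 4, 5]}  # parsed simple-stub.yaml
params = dict(stub, SearchWithHyperband=False)
print(json.dumps(params, indent=4))
```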

The EXPAND field

Now, let’s modify simple-stub.yaml slightly:

# simple-stub.yaml
A: 1
B: 2
C: [1,2]
D: [3,4,5]
EXPAND:
    - C

EXPAND is a special field. Rather than both elements of C going into a single job as before, two parameter files will be written: in the first, C will be set to 1; in the second, C will be set to 2.

$ ~/InputSetup/setupJobs simple-stub.yaml
$ find ./ -type f -name "params.json" -print | xargs grep "C\|D"
./0/params.json:    "C": 1,
./0/params.json:    "D": [3,4,5],
./1/params.json:    "C": 2,
./1/params.json:    "D": [3,4,5],

Now, what would happen if we listed both C and D under EXPAND? Let’s run setupJobs again and see:

# simple-stub.yaml
A: 1
B: 2
C: [1,2]
D: [3,4,5]
EXPAND:
    - C
    - D
$ ~/InputSetup/setupJobs simple-stub.yaml
$ find ./ -type f -name "params.json" -print | xargs grep "C\|D"
./0/params.json:    "C": 1,
./0/params.json:    "D": 3,
./1/params.json:    "C": 2,
./1/params.json:    "D": 3,
./2/params.json:    "C": 1,
./2/params.json:    "D": 4,
./3/params.json:    "C": 2,
./3/params.json:    "D": 4,
./4/params.json:    "C": 1,
./4/params.json:    "D": 5,
./5/params.json:    "C": 2,
./5/params.json:    "D": 5,

It “crossed” the lists in C and D, creating one parameter file where C:1 and D:3, another where C:1 and D:4, and so on. EXPAND can cross as many fields as you like. In this way, a single stub file can contain the instructions for building many parameter files.
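The crossing behaviour can be sketched in a few lines of Python. This is not how setupJobs is actually implemented; the sketch just reproduces the six parameter sets listed above, with the iteration order (earlier EXPAND fields varying fastest) inferred from the observed output.

```python
from itertools import product

# Sketch of EXPAND: cross the listed fields, producing one parameter
# set per combination. Order assumption: earlier EXPAND fields vary
# fastest, matching the job numbering in the output above.
stub = {"A": 1, "B": 2, "C": [1, 2], "D": [3, 4, 5], "EXPAND": ["C", "D"]}
order = list(reversed(stub["EXPAND"]))  # later EXPAND fields vary slowest
jobs = []
for combo in product(*[stub[f] for f in order]):
    params = {k: v for k, v in stub.items() if k != "EXPAND"}
    params.update(zip(order, combo))  # replace lists with scalar values
    jobs.append(params)
print(len(jobs))                      # 6 parameter files
print(jobs[0]["C"], jobs[0]["D"])     # 1 3
```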

The COPY field

There are two additional special fields that help with organizing a workload and managing file transfers. First, let’s introduce COPY. Try the following at the bash shell:

$ touch cows.txt ducks.txt pigs.txt

And now we will again slightly revise simple-stub.yaml:

# simple-stub.yaml
A: 'cows.txt'
B: 2
C: ['ducks.txt','pigs.txt']
D: [3,4,5]
EXPAND:
    - C
    - D
COPY:
    - A
    - C
$ ~/InputSetup/setupJobs simple-stub.yaml
$ find . -mindepth 2 -type f -name "cows.txt" -print
./shared/cows.txt
$ find . -mindepth 2 -type f -name "ducks.txt" -print
./0/ducks.txt
./2/ducks.txt
./4/ducks.txt
$ find . -mindepth 2 -type f -name "pigs.txt" -print
./1/pigs.txt
./3/pigs.txt
./5/pigs.txt

Notice that ducks.txt and pigs.txt are copied into the individual job directories, while cows.txt is copied only into the shared directory. Since A is constant across jobs, every job needs cows.txt; the whole shared folder is sent to every job, so there is no reason to make six copies of cows.txt.
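The placement rule can be sketched as follows. This is a hypothetical reconstruction, not the real setupJobs code: COPY fields that are also EXPANDed go into each job directory that uses their value; constant COPY fields go into ./shared.

```python
from itertools import product

def copy_destinations(stub):
    """Sketch (not the real implementation) of where COPY places files:
    expanded fields go into per-job directories, constants into ./shared."""
    expand = stub.get("EXPAND", [])
    order = list(reversed(expand))  # later EXPAND fields vary slowest
    combos = list(product(*[stub[f] for f in order]))
    dests = {}
    for field in stub.get("COPY", []):
        if field in expand:
            for job, combo in enumerate(combos):
                dests.setdefault(combo[order.index(field)], []).append(f"./{job}")
        else:
            dests.setdefault(stub[field], []).append("./shared")
    return dests

stub = {"A": "cows.txt", "B": 2,
        "C": ["ducks.txt", "pigs.txt"], "D": [3, 4, 5],
        "EXPAND": ["C", "D"], "COPY": ["A", "C"]}
print(copy_destinations(stub)["ducks.txt"])  # ['./0', './2', './4']
```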

The URLS field

The final special field is URLS. This field controls what gets written into the queue_input.csv file. You’ll notice that, up to this point, that file was not being generated. Once this field is included in simple-stub.yaml, it will be.

# simple-stub.yaml
A: 'cows.txt'
B: 'http://host.org/metadata.mat'
C: ['ducks.txt','pigs.txt']
D: ['http://host.org/data1.mat','http://host.org/data2.mat','http://host.org/data3.mat']
EXPAND:
    - C
    - D
COPY:
    - A
    - C
URLS:
    - B
    - D
$ ~/InputSetup/setupJobs simple-stub.yaml
$ cat queue_input.csv
./0,http://host.org/metadata.mat,http://host.org/data1.mat
./1,http://host.org/metadata.mat,http://host.org/data1.mat
./2,http://host.org/metadata.mat,http://host.org/data2.mat
./3,http://host.org/metadata.mat,http://host.org/data2.mat
./4,http://host.org/metadata.mat,http://host.org/data3.mat
./5,http://host.org/metadata.mat,http://host.org/data3.mat
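Assembling queue_input.csv can be sketched the same way. Again, this is an illustrative reconstruction rather than the real code: each row is a job directory followed by the value of each URLS field for that job (per-job values for EXPANDed fields, the constant otherwise).

```python
from itertools import product

# Sketch of queue_input.csv assembly: one row per job directory,
# followed by that job's value for each URLS field.
stub = {"A": "cows.txt",
        "B": "http://host.org/metadata.mat",
        "C": ["ducks.txt", "pigs.txt"],
        "D": ["http://host.org/data1.mat",
              "http://host.org/data2.mat",
              "http://host.org/data3.mat"],
        "EXPAND": ["C", "D"], "COPY": ["A", "C"], "URLS": ["B", "D"]}
order = list(reversed(stub["EXPAND"]))  # later EXPAND fields vary slowest
rows = []
for job, combo in enumerate(product(*[stub[f] for f in order])):
    values = [combo[order.index(f)] if f in order else stub[f]
              for f in stub["URLS"]]
    rows.append(",".join([f"./{job}"] + values))
print(rows[0])  # ./0,http://host.org/metadata.mat,http://host.org/data1.mat
```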

Conclusion

The combination of setupJobs and the YAML file is a flexible and efficient way to compose multiple jobs by specifying a single file. Let’s now open the stub.yaml file from the previous step. It should now be clearer how we ended up with 400 jobs by parsing that single file.

# LASSO
# =====
# Parameters
# ----------
regularization: lasso
bias: 0
lamSOS: 0
lamL1:
- [0.16687552227835278, 0.020069366363865006, 0.18605028276377383, 0.1160671419239185,
  0.05027903676541319, 0.05301263021328249, 0.08898321480436269, 0.19149487996440576,
  0.03863616962062233, 0.19844605125195458, 0.038222774885923606, 0.01134860758246039,
  0.07207951935171564, 0.1569288549227326, 0.08437103366668688, 0.11532100132517042,
  0.014342930023995604, 0.02219483305873733, 0.19340350195403588, 0.049748251283921574,
  0.007778915353824867, 0.05497631187383272, 0.12640523573331897, 0.03180902389766489,
  0.0087411996985588, 0.022514732785065862, 0.034995585070114334]
- [0.18421609026156463, 0.033616575166403484, 0.13755222238717435, 0.009691174124414337,
  0.14285117261650526, 0.15769277613300337, 0.0981156438116499, 0.017188754085093173,
  0.032335241791300984, 0.15654268929018636, 0.17858265979155477, 0.10840231512873322]
- [0.08507121250050183, 0.05178006666804946, 0.051063662121274604, 0.1150261844640534,
  0.11917568063331192, 0.044331415903182014]
- [0.16613800107047444, 0.10872062155741498, 0.0772650783211931, 0.007407881025943853]
normalize_data: zscore
normalize_target: none
normalize_wrt: training_set

# HYPERBAND
# =========
BRACKETS:
- n: [27, 9, 3, 1]
  r: [1, 3, 9, 27]
  s: 3
- n: [12, 4, 1]
  r: [5, 15, 45]
  s: 2
- n: [6, 2]
  r: [16, 48]
  s: 1
- n: [4]
  r: [50]
  s: 0
SearchWithHyperband: true

# Data and Metadata
# =================
data:
  - "http://proxy.chtc.wisc.edu/SQUID/crcox/MRI/CogSci2018/jlp01.mat"
  - "http://proxy.chtc.wisc.edu/SQUID/crcox/MRI/CogSci2018/jlp02.mat"
  - "http://proxy.chtc.wisc.edu/SQUID/crcox/MRI/CogSci2018/jlp03.mat"
  - "http://proxy.chtc.wisc.edu/SQUID/crcox/MRI/CogSci2018/jlp04.mat"
  - "http://proxy.chtc.wisc.edu/SQUID/crcox/MRI/CogSci2018/jlp05.mat"
  - "http://proxy.chtc.wisc.edu/SQUID/crcox/MRI/CogSci2018/jlp06.mat"
  - "http://proxy.chtc.wisc.edu/SQUID/crcox/MRI/CogSci2018/jlp07.mat"
  - "http://proxy.chtc.wisc.edu/SQUID/crcox/MRI/CogSci2018/jlp08.mat"
  - "http://proxy.chtc.wisc.edu/SQUID/crcox/MRI/CogSci2018/jlp09.mat"
  - "http://proxy.chtc.wisc.edu/SQUID/crcox/MRI/CogSci2018/jlp10.mat"
data_var: X
metadata: http://proxy.chtc.wisc.edu/SQUID/crcox/MRI/CogSci2018/metadata.mat
metadata_var: metadata


# Metadata Field References
# =========================
cvscheme: 1
cvholdout: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
finalholdout: 0
filters: [rowfilter, colfilter]

# Targets
# -------
target_label: faces
target_type: category

# Coordinates
# -----------
orientation: tlrc

# WISC_MVPA Options
# =================
subject_id_fmt: jlp%02d.mat
executable: WISC_MVPA
wrapper: run_WISC_MVPA_OSG.sh

# setupJob Special Fields
# =======================
EXPAND:
  - data
  - [cvholdout]
  - [BRACKETS, lamL1]
COPY: [executable, wrapper]
URLS: [data, metadata]
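The job count follows from the EXPAND entries, assuming (as the Addendum's Hyperband functionality suggests) that a bracketed entry like [BRACKETS, lamL1] yokes its fields so they advance in lockstep rather than being crossed with each other:

```python
# Hedged arithmetic for the 400 jobs, assuming bracketed EXPAND entries
# yoke their fields so they advance together.
n_data = 10       # ten jlp*.mat files listed under "data"
n_cvholdout = 10  # cvholdout: [1, ..., 10]
n_yoked = 4       # four BRACKETS entries, paired with four lamL1 lists
print(n_data * n_cvholdout * n_yoked)  # 400
```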

Addendum

There are some features of setupJobs that were not reviewed today that may be of interest. One is special functionality for constructing a hyperparameter search using the Hyperband procedure (paper). Its usage is exemplified in the stub.yaml file from the previous step (which is what `stub_hb.yaml` was generated from), and it will soon be documented at github.com/crcox/InputSetup. In the meantime, feel free to direct questions to Chris Cox.

Key Points

  • The setupJobs workflow is not project-specific. The parameters listed in the stub file can be tailored to any needs.

  • The EXPAND, COPY, and URLS fields in stub.yaml define three operations that can be applied to (combinations of) parameter fields listed elsewhere in the YAML file.