2015-04-03

Experiment Manager

There is a lot of repetitious work involved in running experiments, not only in the lab but also computationally. Much of the computational drudgery can be minimized using the following approach.

First establish an on-disk convention for data organization and management. I organize by year/project/experiment. I assume experiments are of two types: either part of a project, or one-offs. In a drug development environment, a project would be a specific therapeutic target, e.g. “HER2” or “IL2”, and would be added to the “data” directory. Under each project are the individual experiments, so you can think of a project as a collection of related experiments.

One-offs, as with technology development or assay optimization experiments, are collected under “misc” (miscellaneous). Experiments are directly under misc with no project container.

--2015
    |--data
    |    |--Project1
    |    |     |--P1Experiment1
    |    |     |--P1Experiment2
    |    |     |--P1Experiment3
    |    |     |--P1Experiment4
    |    |     |--P1Experiment5
    |    |
    |    |--Project2
    |    |     |--P2Experiment1
    |    |     |--P2Experiment2
    |    |--Project3
    |    |--Project4
    |    |--Project5
    |
    |--misc
         |--Experiment1
         |--Experiment2
         |--Experiment3
         |--Experiment4
         |--Experiment5

An experiment, whether in “data” or in “misc” is further subdivided into four directories as illustrated below:

--2015
    |--data
         |--Project1
               |--P1Experiment1
                      |--code
                      |--input
                      |--output
                      |--results

code: scripts used to process data
input: raw data off instruments. Could be numerical, images, sequences or other
output: results of running the code
results: formatted results suitable for notebooks, slides in presentations etc.

--2015
    |--data
         |--Project1
               |--P1Experiment1
                      |--code
                      |    |--process-data.R
                      |
                      |--input
                      |    |--instrument-data.txt
                      |    |--annotation.txt
                      |    |--image1.jpg
                      |    |--image2.jpg
                      |    |--sequences.fasta
                      |
                      |
                      |--output
                      |    |--Proj1Exp1-out.txt
                      |
                      |
                      |--results
                           |--Proj1Exp1Data.xlsx
                           |--Proj1Exp1Slides.ppt

A consistent layout makes it easy to find a specific piece of information. Scripts know to read data from “../input/” and write to “../output”. When searching for processed information, look in results.

Setting up the work environment outlined above can be handled easily by an EMACS-lisp script. Lisp can create the directory structure and then populate with template files that are renamed prior to copying into the destination directory. Lisp can write code directly into the R script. For example you can write code to set the working directory and set a prefix variable used to name files. All this can be attached to a function name that is launched when it is time to set up a new experiment.

To begin, choose a location on disk as I outline above. Starting with a top level directory that is the year helps to further categorize experiments. Create the “data” and “misc” subdirectories. You don’t need to create project directories as that will be managed by the lisp method. I will name the method “create-project”.

(defun create-project ( project exp script )
  (interactive "sEnter project name:  \nsEnter experiment name: \nsEnter script number:  ")
  (if (> (length project) 1)
    (setq working-dir (concat "~/2015/data/" project "/" exp "/"))
    (setq working-dir (concat "~/2015/misc/" exp "/")))
  ...

)

create-project is an interactive method and so must be identified as such with a call to interactive. The interactive form must come first in the body of create-project (only a docstring may precede it). create-project takes three arguments, for which the user will be prompted; the first may be left blank.

Project name: an optional argument. Enter an existing project to have the experiment added to an existing project directory. Enter a novel project to have a new directory created under the “data” directory. Leave blank to have the experiment directory created under the “misc” directory, i.e. this experiment is not part of a project.
Experiment name: enter the experiment name using your personally established naming convention. For example, I use my initials followed by the date followed by some descriptive text, e.g. PL20150216pcr2. This entry will also be written to the R script to be used as a prefix for all created files.
Script number: an integer indicating which script to copy into the “code” directory.
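
The choice between “data” and “misc” can be pulled into a small helper so that it can be tried out on its own. This is a minimal sketch, not part of the original function: the names my-experiment-root and my-experiment-dir are hypothetical, and the sketch tests for an empty project name rather than testing its length.

```elisp
;; Hypothetical helper, not part of the original create-project.
(defvar my-experiment-root "~/2015/"
  "Assumed top-level directory for the current year.")

(defun my-experiment-dir (project exp)
  "Return the directory for experiment EXP.
A blank PROJECT places the experiment under misc/ instead of data/."
  (if (string= project "")
      (concat my-experiment-root "misc/" exp "/")
    (concat my-experiment-root "data/" project "/" exp "/")))

;; (my-experiment-dir "HER2" "PL20150216pcr2")
;;   => "~/2015/data/HER2/PL20150216pcr2/"
;; (my-experiment-dir "" "PL20150216pcr2")
;;   => "~/2015/misc/PL20150216pcr2/"
```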

create-project not only creates directories but will copy and rename template files inserted into those directories. Next step is to set up the templates. Create a “templates” directory somewhere accessible:

--2015
    |--data
    |--misc
    |--templates
          |--htc
          |   |--htc1.R
          |   |--htc2.R
          |   |--htc3.R
          |
          |--ngs
          |   |--ngs1.R
          |   |--ngs2.R
          |   |--ngs3.R
          |--mut
          |   |--mut1.R
          |   |--mut2.R
          |
          |--proj
              |--template1.xlsx
              |--template2.xlsx
              |--template3.xlsx
              |--template.pptx

My templates include a PowerPoint and an Excel template that will be renamed with the value of the “exp” variable, which is supplied by the user when create-project is invoked. I also have a series of R scripts, some useful for high throughput cloning (htc), next gen sequencing (ngs), mutagenesis (mut) etc. These will be copied into the “code” subdirectory of my experiment. I don’t change the names of the R scripts; I like to know what a script will do just by looking at its name.

First create the directories and add the powerpoint file to the results directory. The file is renamed using the value of “exp” queried from the user:


(makunbound 'project)  ;clear the variable for the next run

  (setq template-dir "~/2015/templates/proj/")

  (make-directory working-dir t)
  (make-directory (concat working-dir "input/") t)
  (make-directory (concat working-dir "code/") t)
  (make-directory (concat working-dir "output/") t)
  (make-directory (concat working-dir "results/") t)

  (copy-file (concat template-dir "template.pptx") (concat working-dir "results/" exp ".pptx"))
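
The make-directory calls above all follow one pattern, so they can also be written as a loop. This is a minimal sketch, assuming a hypothetical function name my-make-experiment-dirs. When make-directory is given a non-nil second argument it creates missing parent directories and does nothing if the directory already exists.

```elisp
;; Hypothetical helper; the subdirectory names follow the layout
;; described earlier (code, input, output, results).
(defun my-make-experiment-dirs (working-dir)
  "Create WORKING-DIR with its code, input, output and results subdirectories."
  (dolist (subdir '("code" "input" "output" "results"))
    ;; The second argument t creates parents and tolerates existing directories.
    (make-directory (expand-file-name subdir working-dir) t)))

;; Example call:
;; (my-make-experiment-dirs "~/2015/misc/PL20150216pcr2/")
```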

Next we want to copy over the Excel and R script files, depending on the value of the integer entered by the user. Since there are potentially many (changing) scripts and Excel templates suitable for many different experiments, I print out an association list that I can refer to when creating a new experiment. The list might look like:

1. cytotoxicity
2. mutagenesis
3. ELISA
4. sequencing
5. cloning

Depending on what I am doing I enter the appropriate number and the proper scripts/excel files are copied over. The code uses a conditional statement to select amongst the possibilities:


(cond ((equal script "1")
       (copy-file (concat template-dir "cytotox.xlsx") (concat working-dir "results/" exp ".xlsx"))
       (modify-file-with-wkdir (concat template-dir "process-cytotox-prism.R") working-dir "code/" exp)
       (save-modified-file working-dir "code/" "process-cytotox-prism.R"))

      ((equal script "2")
       (copy-file (concat template-dir "template.xlsx") (concat working-dir "results/" exp ".xlsx"))
       (modify-file-with-wkdir (concat template-dir "process-envision.R") working-dir "code/" exp)
       (save-modified-file working-dir "code/" "process-mutagenesis.R"))

      ((equal script "3")
       (copy-file (concat template-dir "template.xlsx") (concat working-dir "results/" exp ".xlsx"))
       (modify-file-with-wkdir (concat template-dir "process-victor.R") working-dir "code/" exp)
       (save-modified-file working-dir "code/" "process-victor.R"))

      ((equal script "4")
       (copy-file (concat template-dir "template.xlsx") (concat working-dir "results/" exp ".xlsx"))
       (modify-file-with-wkdir (concat template-dir "polyreactivity.R") working-dir "code/" exp)
       (save-modified-file working-dir "code/" "eval-seqs.R"))

      ((equal script "5")
       (copy-file (concat template-dir "template.xlsx") (concat working-dir "results/" exp ".xlsx"))
       (modify-file-with-wkdir (concat template-dir "process-competition.R") working-dir "code/" exp)
       (save-modified-file working-dir "code/" "process-competition.R"))))

(In retrospect, an association list might have been a cleaner option.)
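
A minimal sketch of that association-list variant, reusing the file names from the conditional above. The variable my-script-table and the function my-script-entry are hypothetical names introduced for illustration. Each entry maps the script number to the Excel template, the R template, and the name the R script is saved under.

```elisp
;; Hypothetical table: (NUMBER EXCEL-TEMPLATE R-TEMPLATE SAVED-R-NAME).
(defvar my-script-table
  '(("1" "cytotox.xlsx"  "process-cytotox-prism.R" "process-cytotox-prism.R")
    ("2" "template.xlsx" "process-envision.R"      "process-mutagenesis.R")
    ("3" "template.xlsx" "process-victor.R"        "process-victor.R")
    ("4" "template.xlsx" "polyreactivity.R"        "eval-seqs.R")
    ("5" "template.xlsx" "process-competition.R"   "process-competition.R")))

(defun my-script-entry (script)
  "Return the (EXCEL R-TEMPLATE SAVED-NAME) list for SCRIPT, or nil if unknown."
  ;; assoc compares keys with equal, so string keys work.
  (cdr (assoc script my-script-table)))

;; Sketch of how the lookup could replace the cond inside create-project:
;; (let ((entry (my-script-entry script)))
;;   (when entry
;;     (copy-file (concat template-dir (nth 0 entry))
;;                (concat working-dir "results/" exp ".xlsx"))
;;     (modify-file-with-wkdir (concat template-dir (nth 1 entry))
;;                             working-dir "code/" exp)
;;     (save-modified-file working-dir "code/" (nth 2 entry))))
```

With this layout, adding a new experiment type means adding one line to the table rather than another branch to the conditional.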

The above code uses a couple of custom methods: modify-file-with-wkdir and save-modified-file. First modify-file-with-wkdir:

(defun modify-file-with-wkdir ( the-file working-dir subdir exp)
  (set-buffer (get-buffer-create "script-buffer"))
  (insert-file-contents the-file)
  (forward-line 11)
  (insert (concat "setwd(\"" working-dir "\")\n"))
  (insert (concat "exp.id <-\"" exp "\"\n")))

Let’s look at the first few lines of one of the template R scripts:

rm(list=ls(all=TRUE))
library(ggplot2)
library(seqinr)



#######################################
##line 10; wrkdir() inserted here



########################################



file.prefix <- "raw"

# Quad pattern
# 1  2
# 3  4


file.name <- paste( getwd(), "/input/", file.prefix, ".txt", sep="")
d <- read.table (file = file.name, n = -1, sep = "\t", dec = ".",   header=TRUE, skip = 0, na.strings = "NA", strip.white = FALSE )

out.file <- paste( getwd(),"/output/",file.prefix, "_results.txt", sep="")

In the region between the hashtags I use modify-file-with-wkdir to write R code that sets the working directory as well as the experiment id prefix. That way when I open the R script in EMACS, I don’t have to set these manually, which would involve looking up the location of the file. After Lisp has modified the file it looks like this:


#######################################
##line 10; wrkdir() inserted here

setwd("~/2015/data/HER2/MBC150330composite/")
exp.id <-"MBC150330composite"


########################################
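
Counting lines with forward-line ties the Lisp code to the exact layout of each template. One alternative is to search for the marker comment instead. This is a minimal sketch, not the method used here: the function name my-insert-wkdir is hypothetical, and it assumes every template contains the text “wrkdir() inserted here”. It reads the template into a temporary buffer, inserts the two R lines directly after the marker line, and writes the result to a new file.

```elisp
;; Hypothetical alternative that modifies and saves in one step.
(defun my-insert-wkdir (template-file target-file working-dir exp)
  "Copy TEMPLATE-FILE to TARGET-FILE, adding setwd() and exp.id lines.
The lines are inserted directly after the line containing the marker text."
  (with-temp-buffer
    (insert-file-contents template-file)
    (goto-char (point-min))
    (if (search-forward "wrkdir() inserted here" nil t)
        (progn
          (forward-line 1)                  ; go to the line after the marker
          (unless (bolp) (insert "\n"))     ; marker was on an unterminated last line
          (insert "setwd(\"" working-dir "\")\n"
                  "exp.id <- \"" exp "\"\n")
          (write-region (point-min) (point-max) target-file))
      (error "No marker found in %s" template-file))))

;; Example call (paths are illustrative):
;; (my-insert-wkdir "~/2015/templates/htc/htc1.R"
;;                  "~/2015/misc/PL20150216pcr2/code/htc1.R"
;;                  "~/2015/misc/PL20150216pcr2/" "PL20150216pcr2")
```

Because the temporary buffer is discarded automatically, no separate cleanup of a named buffer is needed in this variant.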

The next method just saves the modified file.

(defun save-modified-file (working-dir subdir script-file-name)

  (append-to-file nil nil (concat working-dir subdir script-file-name ))
  (set-buffer-modified-p nil)
  (kill-buffer))

Because create-project was declared interactive, I can use M-x create-project to launch the method. I could even bind it to a function key if I have any to spare.
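
A minimal sketch of such a key binding, assuming F9 is free (the key choice is arbitrary). The line would go in the Emacs init file:

```elisp
;; F9 is an arbitrary choice; any unused key works.
(global-set-key (kbd "<f9>") #'create-project)
```
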
In a future post I will show how to extend this system to manage a multi-step process.