Bagphenotype

This page is under construction. Moreover, this code has undergone incremental evolution similar to that seen in natural species. At each increment it got faster, less complex, more complex, more general but slower, less general but better adapted, etc, all while staying mostly backwards compatible with previous versions. Some of it is ugly. Don't be surprised if there's the odd tailbone in there.

This software is provided free with absolutely no guarantees.

Installation

Installing the supporting R package bagphenotype.library

  1. Download the latest bagphenotype.library.tar.gz file from here
  2. Open a UNIX shell and go to the directory containing the tar.gz file
  3. Type R CMD INSTALL bagphenotype.library.tar.gz

Installing the Bagphenotype scripts

  1. Download the latest bagphenotype.tgz file from here and put it in the directory you'd like to run it from (referred to below as dir).
  2. Open a UNIX shell and go to dir
  3. Unzip and unarchive by typing tar -zxf bagphenotype.tgz
  4. Set an environment variable BAGPHENOTYPE_LIBS to dir/bagphenotype/libs/
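
Steps 2 to 4 above can be sketched in a Bourne-style shell as follows. The install location used for dir is a hypothetical placeholder, and the export syntax assumes sh or bash (csh users would use setenv instead).

```shell
# Hypothetical install location: substitute the directory you chose as dir.
dir="${TMPDIR:-/tmp}/bagphenotype_install"
mkdir -p "$dir"

# Step 3: unzip and unarchive, if the archive has been downloaded into dir.
if [ -f "$dir/bagphenotype.tgz" ]; then
    tar -zxf "$dir/bagphenotype.tgz" -C "$dir"
fi

# Step 4: point bagphenotype at its library directory.
export BAGPHENOTYPE_LIBS="$dir/bagphenotype/libs/"
echo "BAGPHENOTYPE_LIBS=$BAGPHENOTYPE_LIBS"
```

To make the setting permanent, the export line can be added to your shell startup file.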

Command line options

What the descriptions of arguments mean:
bool_int "0" or "1", meaning false or true, respectively.
dir A directory name.
model_string One of:
  • "additive" for the additive genetic model
  • "full" for the additive + dominance genetic model
  • "additive,full" or "both" for both the additive and full models (fit separately)
  • "1" for the models specified in the model field of the config file
number A number.

Options and their arguments follow the config file on the command line:

	bagphenotype.pl configfile ...
--dryrun bool_int
Print out the command lines invoking the analysis Rscripts, but don't actually run them. This will run --nomenuconfig and create any directories necessary for any other options specified.
--gscan model_string
Do single locus genome scans of the specified genetic models
--memory number
Manage the I/O vs memory tradeoff: specify the maximum amount of memory in MB to be used by bagphenotype for loading probability matrices. For example, if --memory 100 then bagphenotype will load necessary markers into memory until the total R object size exceeds 100 MB. This is only useful (and typically only implemented) for those parts of the analysis where it is expected that the same marker could be loaded from disk multiple times during an Rscript run. It is not useful when it is expected that each marker will be loaded only once. For example it would not be useful for:
  • Single locus scans (each marker loaded at most once)
  • Permutations or null simulations where a simple linear model is being fitted for the locus effects (optimization of linear model means each marker loaded at most once)
  • Plotting or summarizing (does not use probability matrices)
But it may be useful for:
  • Permutations or null simulations where the linear model optimization is not possible, so that 200 null scans amount to running --gscan 200 times.
  • Multilocus modeling.
  • Positional bootstrapping.
The default value is 0, which means that marker probabilities are loaded from disk every time they are needed.
--nomenuconfig bool_int
Write a "complete" version of the config file to configfile.update in the directory specified by --outdir. If the config file specifies a menu.file, the information from the menu file is incorporated in the updated config file. When a non-blank field occurs in both config and menu files, the value in the config file is used. This treats the menu file as a template of common options, which the config file fills in and/or overrides.
--null model_string
Do single locus null simulations of the specified genetic models and calculate quantiles. In each null simulation, fake phenotypes are generated by parametric bootstrapping from the fitted null model.
--outdir dir
Specify directory for all output. This is ./ by default.
--peaks model_string
For each specified model, generate a file of high scoring "peak" loci from the single locus scans. The behavior of this option depends on the peak.separation field of the config file.
--perm model_string
Do single locus permutation tests of the specified genetic models and calculate quantiles.
--plot model_string
Plot available single locus scans and significance thresholds.
--plotsummary model_string
Plot available single locus scans, significance thresholds and multilocus scans. This plots in a slightly different format from the --plot option.
--rma model_string
Perform resample model averaging of multilocus models using the specified genetic model.
--scandir dir
Specify directory for output of single locus scans, null scans and permutation scans. This will be a subdirectory of the directory specified by --outdir.
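
As an illustration of how the options above combine, here is a minimal sketch of an invocation. The config file name and output directory are hypothetical, and it assumes bagphenotype.pl can be found on your PATH; if it cannot, the sketch only prints the command.

```shell
# Hypothetical names: replace with your own config file and output directory.
config="GlucoseAUC_first.config"
outdir="results"

# Additive-model genome scan; --dryrun 1 prints the Rscript command lines
# instead of running the analyses.
cmd="bagphenotype.pl $config --outdir $outdir --gscan additive --dryrun 1"

if command -v bagphenotype.pl >/dev/null 2>&1; then
    $cmd            # bagphenotype.pl is on the PATH: run it
else
    echo "$cmd"     # otherwise just show the command that would be run
fi
```

Dropping --dryrun 1 (or setting it to 0) would run the scan itself.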

Config file options

chromosomes space_separated_list
Consider only these chromosomes. E.g., "chromosomes 1 2 3 X".
genome.cache.dir pathname
Pathname for the HAPPY genome cache
phenotype.dir dir
Directory in which the phenotype file is kept
phenotype string
Name tag given to all files produced from using this config file. Not necessarily the name of the actual phenotype. For example, if you were analyzing the phenotype GlucoseAUC two ways, you might write two different config files GlucoseAUC_first.config with the phenotype filled in as GlucoseAUC_model1, and GlucoseAUC_second.config with the phenotype filled in as GlucoseAUC_model2. Running bagphenotype on both configs in the same directory would yield two separate sets of results files prefixed with GlucoseAUC_model1 and GlucoseAUC_model2 respectively.
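
The override rule described under --nomenuconfig can be illustrated with a small sketch. It assumes, going by the chromosomes example above, that config and menu files hold one field per line as a field name followed by its value. The file names, paths and the awk merge itself are hypothetical; this is not bagphenotype's own code, and the real configfile.update may be laid out differently.

```shell
# Work in a scratch directory so the example leaves no files behind.
work=$(mktemp -d)

# Hypothetical menu file: a template of common options.
cat > "$work/common.menu" <<'EOF'
chromosomes 1 2 3 X
genome.cache.dir /data/happy/cache
phenotype.dir /data/phenotypes
EOF

# Hypothetical config file: fills in or overrides fields from the menu file.
# phenotype.dir is left blank here, so the menu value should survive.
cat > "$work/GlucoseAUC_first.config" <<'EOF'
menu.file common.menu
phenotype GlucoseAUC_model1
chromosomes 1 2
phenotype.dir
EOF

# Read the menu first, then the config. A config value replaces the menu
# value only when it is non-blank.
merged=$(awk '
    NF == 0 { next }                           # skip empty lines
    {
        field = $1
        value = $0
        sub(/^[ \t]+/, "", value)              # drop leading whitespace
        sub(/^[^ \t]+[ \t]*/, "", value)       # drop the field name
        if (!(field in seen)) { seen[field] = 1; order[++n] = field }
        if (FNR == NR || value != "") val[field] = value
    }
    END { for (i = 1; i <= n; i++) print order[i] " " val[order[i]] }
' "$work/common.menu" "$work/GlucoseAUC_first.config")

printf '%s\n' "$merged"
rm -rf "$work"
```

In the merged output, chromosomes takes the config value (1 2), while phenotype.dir keeps the menu value because the config left it blank.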

FAQ

The items in this FAQ are based on email correspondence and will be gradually incorporated into the main document.

Q: Are the thresholds in the *.thresh.permscan.additive files generated empirically or from the extreme value distribution? If the latter, how does one get from the parameter estimates in pheno.gev.permscan to the threshold values?
A: They're calculated from the GEV (of which the EVD is an example), not empirically. Bagphenotype puts maxima from the permutations in the PERMUTATIONS/ directory. There is one such file for each chromosome, and it lists the maximum logP for the chromosome for each permutation. Bagphenotype then reads all these chromosome files, and for each permutation picks the maximum logP across chromosomes. It then fits a generalized extreme value distribution to these maxima using the R package evd (function fgev). This distribution has three parameters, and the estimates of those parameters are written to the .gev.additive file. Bagphenotype then plugs those parameter estimates into the R function qgev to get the upper 0.05 quantile of that fitted distribution (and the 0.2 quantile, or whatever is asked for in the config file). All the information you need to calculate your own thresholds is in the .gev.additive file, provided you're happy with using the qgev() function.
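
For readers who want to check a threshold by hand, the quantile that qgev() evaluates can be written in closed form. The expression below assumes the standard GEV parameterization used by the evd package, with location \mu, scale \sigma > 0 and shape \xi. These symbols are introduced here for illustration and are not the column names of the .gev.additive file.

```latex
G(z) = \exp\left\{ -\left[ 1 + \xi \, \frac{z - \mu}{\sigma} \right]^{-1/\xi} \right\},
\qquad
z_p =
\begin{cases}
\mu + \dfrac{\sigma}{\xi} \left[ (-\ln p)^{-\xi} - 1 \right], & \xi \neq 0, \\[1ex]
\mu - \sigma \ln(-\ln p), & \xi = 0.
\end{cases}
```

Here G is the fitted distribution function of the genome-wide maximum logP and z_p is the value with G(z_p) = p, so the upper 0.05 quantile corresponds to p = 0.95 and the upper 0.2 quantile to p = 0.8.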