Bagpipe: genomewide LD mapping

Bagpipe is a program for performing genomewide linkage disequilibrium mapping of quantitative trait loci in populations whose genome structure can be accommodated in the HAPPY framework [Mott00]. This includes most diploid crosses where the founders of the individuals have known genotypes.

About

Quick facts about Bagpipe:

  • Bagpipe is a simplified and streamlined version of Bagphenotype that does not currently include resample model averaging (RMA) capabilities.
  • Bagpipe can help fit single locus regression models (with or without random effects) to marker intervals whose genetic ancestry is inferred using the HAPPY software.
  • Bagpipe cannot help you decide what is a sensible model to fit.
  • Bagpipe does not currently accommodate populations with significant population structure, except through the specification of simple random intercepts based on unpatterned covariance matrices.
  • Bagpipe is named after the Scottish wind instrument “the bagpipes” and after Bagphenotype, a PIPEline for BAGging-based multiple QTL analysis of phenoTYPEs. Bagphenotype was in turn based on software written by Richard Mott and William Valdar to analyze heterogeneous stock mice in Valdar06.
  • Bagpipe is experimental software, is provided free of charge subject to copyleft restrictions, and comes with no guarantees whatsoever.

For constructive comments and suggestions email the author William Valdar. The model and optimizations implemented in bagpipe are mostly described in the single locus sections and appendix of Valdar09. The basic HAPPY regression model was introduced by Mott00.

Current limitations

Bagpipe currently only implements mapping using a standard fixed effects linear model and assessment of thresholds by unrestricted permutation. The author is in the process of incorporating random effects, generalized linear modeling, and other features that were present in Bagphenotype.

Current bugs

2013-12-10: Bagpipe is currently incompatible with the new version of lme4 (1.0 or greater). It should still work under pre-1.0 lme4. This bug is currently being fixed.

Updates

Last webpage update 2015-08-14 at 16:49. Major updates are listed below:

2013-02-04: Major webpage update, using the Sphinx document generator.

2012-02-08: bagpipe.bundle renamed to bagpipe.backend to avoid confusion with *.bundle files on Mac OS X.

2011-06-19: experimental functions for positional bootstrap and for saving model fits at a few specified loci.

2011-05-29: due to the author's recently improved understanding of R packages, the bagpipe program is now supported by a single R package rather than two.

2011-05-28: strain-merging, ie, collapsing \(J\) haplotypes into \(Q < J\) groups, is now implemented in the config file formulas.

Installation

R/bagpipe.backend

The R package R/bagpipe.backend contains the core R functions for fitting models in bagpipe.

  1. Download the latest bagpipe.backend.tar.gz.
  2. Open a UNIX shell and go to the directory containing this file.
  3. Type R CMD INSTALL --clean --no-docs followed by the name of the file.

Note: bagpipe.backend was previously called bagpipe.bundle. The name was changed because *.bundle files have a special interpretation on Mac OS X.

bagpipe.pl

Bagpipe is organized as a central Perl script that calls a number of R scripts for specific tasks and that relies upon several Perl modules. All these components are provided in a single download.

  1. Download the latest bagpipe.tar.gz and put it in a directory you would like to run it from, eg, ~/bin/bagpipe/.
  2. Open a UNIX shell and go to ~/bin/bagpipe/.
  3. Unzip and unarchive by typing tar -zxf followed by the filename.
  4. Set an environment variable BAGPIPE_LIBS that the shell will interpret as ~/bin/bagpipe/lib/. For example, if you are using the bash shell then in your .bash_profile add the line export BAGPIPE_LIBS=~/bin/bagpipe/lib/.
  5. Set an alias for the bagpipe program so you can run it from anywhere. For example, if you are using the bash shell then in your .bash_profile add the line alias bagpipe=~/bin/bagpipe/bagpipe.pl.

Stages of Analysis

A typical analysis sequence

Before using Bagpipe:

A1. Haplotype reconstruction
Using HAPPY with or without the aid of Prephappy
A2. Phenotype analysis
Identify suitable covariates and transformations

Using Bagpipe:

B1. Single locus genomewide scans

B2. Significance thresholds for single locus analysis (using permutation or null simulation)
  1. Performing null scans
  2. Calculating thresholds from null scans

B3. Visualizing single locus results

B4. QTL characterization
  1. Rough confidence intervals by positional bootstrapping
  2. Estimating effects at the QTL [not yet implemented]

Description of each stage

[Under construction]

Significance by Null Simulation (parametric bootstrap)

In the config file

nullsim.formula     y ~ 1 + (1|Family)

On the command line

bagpipe pheno.config --nullsim --nullsim_collate
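
Conceptually, null simulation (a parametric bootstrap) fits the null model given by nullsim.formula to the observed data, simulates new phenotypes from that fitted model, and then treats each simulated phenotype as the response in a fresh genome scan. Below is a minimal R sketch of that idea; it is not the bagpipe code, and it assumes a data frame pheno containing columns y and Family.

    library(lme4)

    # Fit the null model from nullsim.formula above to the observed phenotype
    fit0 <- lmer(y ~ 1 + (1 | Family), data = pheno)

    # Parametric bootstrap: draw new phenotypes from the fitted null model.
    # Each column of ysim is one simulated "null" phenotype, which would then
    # be rescanned to build up the null distribution of the scan statistic.
    ysim <- simulate(fit0, nsim = 200, seed = 1)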

Significance by Permutation

Permutation tests should be used with caution and are absolutely not recommended when any of the following conditions are met:

  • Individuals are not equally related
  • There are random effects in the formulae

Moreover, only unrestricted permutation of the phenotypes is currently implemented, which can be inappropriate for some complex cases that arise even in the absence of the conditions above. The author plans to implement residual permutation in due course.
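
For intuition, the sketch below shows the logic of unrestricted permutation in plain R. It is only a schematic, not the bagpipe implementation, and scan.logP() is a hypothetical function standing in for one full genome scan that returns the per-locus -log10 p-values.

    # Schematic of unrestricted permutation: shuffle the phenotype, rescan,
    # and keep the genomewide maximum statistic from each null scan.
    # scan.logP(y, covariates, loci) is hypothetical: one genome scan
    # returning a vector of -log10 p-values across loci.
    permutation.threshold <- function(y, covariates, loci,
                                      n.perm = 200, alpha = 0.05) {
      max.null <- replicate(n.perm, {
        y.perm <- sample(y)                        # unrestricted shuffle
        max(scan.logP(y.perm, covariates, loci))   # genomewide maximum under the null
      })
      quantile(max.null, probs = 1 - alpha)        # genomewide alpha-level threshold
    }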

Visualizing Results

Input files

Config file

The first file bagpipe needs is the config file (configuration file). This file specifies the model to be fitted and tells bagpipe where to find the relevant data. Below is a basic example:

analysis.id                 body_weight_withsex
genome.cache.dir            /home/wvaldar/data/cc_cache_happy/
phenotype.file              /home/wvaldar/data/exercise_pheno.txt
scan.formula.null           bodyweight ~ 1 + sex
scan.formula.test           bodyweight ~ 1 + sex + interval.additive(THE.LOCUS)
scan.function               linear
num.nullscans               200
nullscan.seed               1
locus.groups                1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,X

Basic options

The fields in the Config file are described below.

analysis.id (string; required for: all)
The name of the analysis, which will be used to prefix all output files.
genome.cache.dir (string; required for: all)
The directory where the HAPPY genome cache is stored. The HAPPY cache should include directories for the genetic models (eg, additive, full, full.asymmetric) that are intended to be used in the modeling process, and must always include the genotype directory.
phenotype.file (string; required for: all)
A tab-delimited file with a single header line, such as would be readable in R using read.delim(). The file should contain a table with columns for the variables defined in the *.formula.* fields (including the phenotype) and a column entitled SUBJECT.NAME containing the individual ids as they appear in the HAPPY genome cache.
scan.formula.null (formula; required for: all)
The null model fitted at a locus in the genome scans, against which scan.formula.test is compared.
scan.formula.test (formula; required for: all)
A formula specifying the test (or alternative) model fitted at a locus. For p-values to be meaningful, scan.formula.test should incorporate and extend scan.formula.null, typically duplicating scan.formula.null and adding one or more predictors.
scan.function (code; required for: all)
The type of model fitted. Choose linear for linear regression with Gaussian errors, or binary for binary logistic regression (currently experimental).
scan.function.options (comma-separated strings; optional)
Special options passed to the scanning function. See advanced options.
num.nullscans (integer; required for: perm, nullsim)
The number of scans of “null” phenotype data used to estimate significance thresholds. Null phenotype data correspond to phenotypes generated under an appropriate null model, which may include permutations, null simulations or other fake phenotypes.
nullscan.seed (string; required for: perm, nullsim)
Where “null” phenotypes are generated using random numbers, this integer sets the random seed.
locus.groups (comma-separated strings; required for: all)
Groups of loci to be considered in the analysis. A string that matches the name of a chromosome in the HAPPY cache will be interpreted as all marker intervals on that chromosome. Otherwise, it will be interpreted as the name of a text file containing marker names; the file should contain only marker names matching those in the HAPPY cache, separated by whitespace, with no other text.

Specifying Formulae

Formulae follow the R language formula specification, and generally anything that can be passed to the R functions lm() or lmer() (from the lme4 package) is acceptable, since these are the functions generally called in the code. However, there are some special functions and variables that act as signals to bagpipe. These are listed below.
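
As a rough illustration of what happens at each step of a scan (this is not the bagpipe internals): the interval.*() term is replaced by the HAPPY design-matrix columns for the current interval, and the null and test models are fitted and compared. Using the config example above, and writing G for a hypothetical matrix of HAPPY additive probabilities at one interval, the comparison amounts to something like:

    # Illustration only: compare null and test models at a single locus.
    # pheno holds the phenotype and covariates; G is the (hypothetical)
    # n x k matrix of HAPPY additive probabilities for the current interval.
    fit.null <- lm(bodyweight ~ 1 + sex,     data = pheno)
    fit.test <- lm(bodyweight ~ 1 + sex + G, data = pheno)
    anova(fit.null, fit.test)   # partial F-test for the locus term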

First we define the following arguments and special values:

locus.name
The name of a genotyped marker. For interval.*() functions, this is interpreted as the interval between marker locus.name and the next (ie, downstream) linked marker. Because of this interpretation, when locus.name is used as an argument to an interval.*() function it may not be the last marker on the chromosome.
THE.LOCUS
A special value for locus.name that is substituted at each point in the scan with the current locus.
sdp.string

An alphanumeric string indicating how to combine the \(J\) founder haplotypes into \(Q < J\) super-haplotypes. This grouping is often called a “strain distribution pattern” (SDP; see Yalcin05). The string can be either:

  • A string of letters or numbers that is \(J\) characters in length. For example, assuming \(J=8\), the string 00011222 would be interpreted as “founders 1-3 are in the first group, founders 4 and 5 are in the second, and founders 6-8 are in the third”.
  • A string of words separated by . (full stop or period). For example, the string Wild.Wild.Wild.Dom.Dom.Mus.Mus.Mus would have the same interpretation as 00011222 above. (A sketch of how this merging works appears after the function list below.)

These variables are used in the following special functions:

interval.additive(locus.name [, sdp=sdp.string])
The HAPPY additive model for locus.name. If sdp is specified, founder haplotypes are first merged according to sdp.string.
interval.dominance(locus.name [, sdp=sdp.string])
The HAPPY dominance model for locus.name. If sdp is specified, founder haplotypes are first merged according to sdp.string.
interval.full(locus.name [, sdp=sdp.string])
The HAPPY full model for locus.name. If sdp is specified, founder haplotypes are first merged according to sdp.string.
interval.prob.het(locus.name)
The sum of the HAPPY probabilities that correspond to a heterozygote diplotype at locus.name.
interval.prob.hom(locus.name)
The sum of the HAPPY probabilities that correspond to a homozygote diplotype at locus.name.
genotype(locus.name)
The genotype at the locus locus.name. Note: any animals with missing genotypes at locus.name will be removed from the analysis. [EXPERIMENTAL: do not use with THE.LOCUS as argument]
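
To make the sdp argument concrete, here is a minimal R sketch of the kind of column merging it implies; this is illustrative only and not the bagpipe code.

    # Hypothetical illustration of SDP-based strain merging.
    # G is an n x J matrix of additive founder probabilities; the SDP string
    # assigns one group label per founder, and columns are summed within group.
    merge.by.sdp <- function(G, sdp.string) {
      groups <- strsplit(sdp.string, "")[[1]]
      stopifnot(length(groups) == ncol(G))
      sapply(unique(groups), function(g)
        rowSums(G[, groups == g, drop = FALSE]))
    }

    # eg, J = 8 founders collapsed into Q = 3 super-haplotype groups:
    # G.merged <- merge.by.sdp(G, "00011222")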

Examples

Here are some example formulae and their interpretations:

  1. Example formulae:

    scan.formula.null   log(weight) ~ 1 + as.factor(age)
    scan.formula.test   log(weight) ~ 1 + as.factor(age) + interval.additive(THE.LOCUS)
Interpretation: test at each marker interval for the additive effect of founder haplotypes on the logarithm of weight, controlling for age (encoded as a factor).
  2. Example formulae:

    scan.formula.null   log(weight) ~ 1 + as.factor(age) + genotype(rs232934)
    scan.formula.test   log(weight) ~ 1 + as.factor(age) + genotype(rs232934) + interval.additive(THE.LOCUS)
Interpretation: test at each marker interval for the additive effect of founder haplotypes on the logarithm of weight, controlling for age (encoded as a factor) and the genotype at marker rs232934 (encoded as a factor).
  3. Example formulae:

    scan.formula.null   y ~ 1 + interval.additive(THE.LOCUS, sdp=00011222)
    scan.formula.test   y ~ 1 + interval.additive(THE.LOCUS)
Interpretation: test at each marker interval for the additive effect of all eight founder haplotypes, controlling for the additive effect of a hypothetical variant in the same interval that has three alleles, distinct for founders 1-3, founders 4-5 and founders 6-8. This can be seen as an additive test at each locus for a haplotype effect that is not subsumed by this SDP.

  4. Example formulae:

    scan.formula.null   y ~ 1 + (1|day) + interval.additive(THE.LOCUS)
    scan.formula.test   y ~ 1 + (1|day) + interval.additive(THE.LOCUS) + interval.dominance(THE.LOCUS)
Interpretation: test at each marker interval for dominant haplotype effects, controlling for additive haplotype effects and random effects of day. Note that in the HAPPY regression framework this happens to be equivalent to:

Equivalent formulae:

scan.formula.null   y ~ 1 + (1|day) + interval.additive(THE.LOCUS)
scan.formula.test   y ~ 1 + (1|day) + interval.full(THE.LOCUS)
  5. Example formulae (from SolbergWoods10):

    scan.formula.null   log(AUC) ~ 1 + location + injector + collector + num.injections + (1|sibship)
    scan.formula.test   log(AUC) ~ 1 + location + injector + collector + num.injections + (1|sibship) + interval.full(THE.LOCUS)
Interpretation: test at each marker interval for the additive + dominance effect of founder haplotypes on the logarithm of glucose area under curve, controlling for experiment location (2 sites), injector (person 1 or 2 who injected glucose into the subject), collector (person 1 or 2 who collected the blood for assay), and the number of injections that were given (1, or 2 if the first injection failed to produce a response).
  6. Example formulae (from Johnsen11):

    scan.formula.null       MAT ~ sex + (1|FamilyID) + (1|CohortID) + (1|CageID)
    scan.formula.test       MAT ~ sex + (1|FamilyID) + (1|CohortID) + (1|CageID) + interval.full(THE.LOCUS)
Interpretation: test at each marker interval for the additive + dominance effect of founder haplotypes on mean ankle thickening, controlling for random effects of family id (45 sibships), cohort (38 cohorts), cage id (173 cages) and sex (2 sexes).
  7. Example formulae:

    scan.formula.null   y ~ 1 + interval.additive(THE.LOCUS)
    scan.formula.test   y ~ 1 + interval.additive(THE.LOCUS) + interval.dominance(THE.LOCUS, sdp=01001111)
Interpretation: test at each marker interval for the dominance effect of a hypothetical biallelic variant whose SDP groups founders 1, 3 and 4 against founders 2 and 5-8, controlling for the additive effects of all founder haplotypes.

Advanced options

scan.function.options

Example:

scan.function.options reduce.dmat=0.01

reduce.dmat=float
Discards redundant and near-redundant locus information that might otherwise cause computational problems for some formulae (eg, those including random effect terms). The argument specifies a cutoff for what is discarded; it should be in the range (0,1) but closer to 0. For example, reduce.dmat=0.001 discards little information while reduce.dmat=0.25 discards a lot (although it results in a very stable fit). Generally, reduce.dmat=0.1 is a reasonable compromise for 8-way crosses. Details: this option reduces the dimension of the HAPPY matrix in order to handle multicollinearity that would otherwise cause numerical instability for some model fits, specifically those fits in which the underlying R functions do not perform an automatic dimension reduction of the design matrix. Specifying the argument \(\lambda_\text{min} > 0\) converts the non-orthogonal \(n \times k\) HAPPY matrix \(\mathbf{G}(m)\) at each locus \(m\) into an orthogonal \(n \times r\) matrix containing the \(r < k\) eigenvectors whose scaled eigenvalues exceed \(\lambda_\text{min}\).
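
The sketch below illustrates the sort of eigendecomposition-based reduction described above. It is not the bagpipe code; in particular, the function name and the scaling of the eigenvalues (here scaled to sum to one) are assumptions made for illustration only.

    # Hypothetical illustration of reduce.dmat-style dimension reduction.
    # G is the n x k locus design matrix; lambda.min is the cutoff applied to
    # the scaled eigenvalues of G'G (scaled here, by assumption, to sum to one).
    reduce.dmat.sketch <- function(G, lambda.min = 0.1) {
      G <- scale(G, center = TRUE, scale = FALSE)   # centre the columns
      e <- eigen(crossprod(G), symmetric = TRUE)    # eigendecomposition of G'G
      scaled <- e$values / sum(e$values)
      keep <- scaled > lambda.min                   # retain dominant directions only
      G %*% e$vectors[, keep, drop = FALSE]         # orthogonal n x r matrix
    }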

Phenotype file

The phenotype file should be a tab-delimited file such as would be readable by R using the read.delim() function, should have exactly one row per individual, and should include a column SUBJECT.NAME containing a unique identifier for each individual, plus columns for any covariates used in the model formulae. The file may contain columns not used in the formulae. Missing values should be coded as NA, although note that bagpipe will drop an individual from the analysis if a variable required for model fitting is missing for that individual. The first few lines of an example phenotype file (using faked data based on the analysis in Johnsen11) are shown below:

SUBJECT.NAME  sex   FamilyID    CageID  CohortID    IAT     MAT
mouse.82      M     Family.9    Cage.2  Cohort.1    5.553   1.054
mouse.84      F     Family.9    Cage.3  Cohort.1    6.369   0.856
mouse.85      F     Family.9    Cage.3  Cohort.1    NA      0.482
mouse.87      F     Family.10   Cage.4  Cohort.2    11.718  0.85
mouse.83      F     Family.9    Cage.3  Cohort.1    8.733   1.077
mouse.79      M     Family.6    Cage.1  Cohort.1    5.377   0.465
mouse.80      M     Family.9    Cage.2  Cohort.1    8.55    0.409
mouse.77      M     Family.6    Cage.1  Cohort.1    5.373   0.655
mouse.78      M     Family.6    Cage.1  Cohort.1    13.488  0.719
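
As a quick sanity check before running bagpipe, a phenotype file in this format can be inspected in R. This is just a suggested check, with the file name taken from the config example earlier.

    # Read the phenotype file and run a few basic checks
    pheno <- read.delim("exercise_pheno.txt")

    stopifnot("SUBJECT.NAME" %in% names(pheno))       # required id column
    stopifnot(!any(duplicated(pheno$SUBJECT.NAME)))   # one row per individual
    colSums(is.na(pheno))                             # missing values per column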

Command line options

The text below is from running bagpipe --help

Usage:
  bagpipe configfile [options ...]

Requirements:
  Config file: a file of configuration parameters
  HAPPY genome cache: a directory containing cached HAPPY matrices
  Phenotype file: a tab delimited table with phenotype and covariate
          values for each animal. The table must include
          1) a column SUBJECT.NAME giving the id of the animal in
          the corresponding HAPPY data;
          2) a column with the name of the phenotype as listed in the
          Phenotype column of the model menu file;
          3) columns corresponding to any covariates used in the
          model menu formulae.
Key:
n    - an integer
d    - a decimal
s    - a string
b    - 1 for true and 0 for false.
xL   - a comma separated list of type x
locus.group - either the name of a chromosome (eg, 1 or X) or a file of marker names.

Actions:
  scan             - Genome scans
  perm             - Permutations
  perm_collate     - Calculate permutation thresholds
  nullsim          - Null simulations (not yet implemented)
  nullsim_collate  - Calculate null simulation thresholds (not yet implemented)
  plot             - Plot QTL scans to pdf file.

  scanloci=sL      - [Experimental] Scan individual loci, saving all fit information
  posboot=sL       - [Experimental] Perform k positional bootstraps between marker
                      intervals m1 and m2, where s=m1,m2,k

Settings:
  outdir           - Directory for output. Default is ./
  scandir          - Directory for output of scan data files within outdir (./)
  locus_groups=sL  - Run analyses on certain locus groups only [Default: do all in configfile]
  save.at.loci=sL  - [Experimental] Save model fit when scanning at specified loci.
  tailprobs=dL     - Calculate nullscan tail probabilities (ie, significance thresholds).
                     Eg, 0.05,0.01. [Default nL=0.001,0.01,0.05,0.1,0.2]

Plotting:
  plot_box         - Draw box around each plot
  plot_scale=s     - Scale used for plot summary s={cM,bp,Mb}
  plot_score=s     - Specify locus score to plot s={LOD,modelcmp}
  plot_type=s      - s = {chromosomes,genome}
  plot_tailprob=sL - Include threshold lines as calculated by the *_collate options.
                     Each string s should have format nullscantype_tailprob, where
                     nullscantype={perm,nullsim} and tailprob is the tail probability looked up
                     in the corresponding thresh file. Eg, s=perm_0.05 plots the 0.05 threshold
                     calculated by permutation.
  plot_height=d    - Notional height of plotsummary output in inches
  plot_width=d     - Notional width of plotsummary output in inches

Performance:
  cleanup          - Delete Rout files that do not report errors or warnings
  dryrun           - Print commands but do not execute them
  memory=d         - Store marker matrices in memory upto n Mb [default n=0]
  nomenuconfig     - Write the complete config file for an analysis
                ie, incorporating all the information from the menu file
                but minus the menu file
  overwrite=b      - Overwrite existing analysis files [Default overwrite=0]

Help:
  examples         - Give examples
  example_config   - Write an example config file to config.example

To do list

Send suggestions to William Valdar. Here is the current (and incomplete) list, in no particular order:

  • Implement residual permutation
  • Allow csv file format for the phenotype file

Frequently Asked Questions

  • Does Bagpipe take into account correlations between chromosomes if I run them all together?

    During the scan and nullsim/permscan (ie, nullscan) actions, Bagpipe analyzes each chromosome (and each locus) separately, without explicitly considering correlations between chromosomes (or loci). The scan and nullscan actions can therefore be performed piecemeal and in any order. In order to calculate the actual significance thresholds (the perm_collate and nullsim_collate actions), however, Bagpipe looks at the nullscan results from all chromosomes simultaneously. For this collation step, correlations between loci are implicitly taken into account and results for all tested loci must be present.

  • My genome scans look very noisy, with lots of sharp peaks and troughs. Yet the population I’m using is young enough to have relatively strong linkage disequilibrium between nearby markers, and would therefore be expected to have smooth-looking scans. What am I doing wrong?

    This is most commonly due to a suboptimal HMM reconstruction, specifically the HMM inferring too many recombinations and as a result making the ancestry of nearby loci look less linked than they actually are. It can also arise from overfitting in the scan procedure, when a complex locus association model is fitted to a small number of data points (eg, a full diplotype effects model with 36 parameters fitted to phenotypes for 69 mice). These problems, and their remedies, are explained below.

    Suboptimal HMM reconstruction by Happy can result from suboptimal specification of the alleles file. The two most common causes are marker locations being specified in bp rather than cM, and/or failure to accommodate genotyping error in the haplotype probabilities (which can be resolved using Prephappy).

    The problem of incorrect specification of distances can be understood by recalling that an HMM needs two pieces of information: 1) what symbols (alleles) can be emitted (be observed in the genotypes) from each possible state (haplotype), and 2) the probability of transitions (recombination events) between those states (haplotypes) moving from one node (locus) to the next.

    In the Happy HMM, the probability of a state transition is equal to the probability of a recombination event, which in turn depends jointly on the distance between the markers and the number of times that chromosome would have been subject to recombination (ie, the number of meioses, which is equal to the number of generations). This transition probability is implied by the genetic distance between two markers, that is, a scale where “d Morgans” means that you would expect on average d recombinations between those markers in one generation. In model organism genetics, these expected recombination frequencies are key to figuring out locus-specific ancestry, and for convenience genetic distances are expressed in centiMorgans. On the cM scale, therefore, a distance of 100cM implies that you would expect 1 recombination event in each generation, 10 recombination events in 10 generations, and so on.

    The Happy alleles file requires marker locations to be specified in cM distances. It uses the difference between consecutive locations to work out marker-to-marker genetic distances and hence marker-to-marker recombination frequencies. As a rough guide, if two markers are separated by 10cM then recombination events over 10 generations would erode their linkage to the point that they might as well be on different chromosomes for all the information they provide about each other. Suppose that base pair positions have been provided instead of cM positions, such that, for example, the first marker is at 3081519 and the second at 3099580. This tells the Happy HMM that we would expect (3099580 - 3081519)/100 ≈ 180 recombination events on average between the two markers in each generation, effectively forcing inheritance at the two loci to be treated as if independent, and reducing the Happy inference to little better than raw genotyping.

    Getting hold of the cM locations for each marker can be tricky, but there are good approximations. The genetic distance (cM) does not scale linearly with the physical distance (bp) because, apart from the minor mathematical discrepancy, different regions of the genome have different rates of recombination. Nonetheless, if you have bp positions but not ready access to the genetic distances, then an imperfect but often useful tactic is simply to rescale based on global lengths. For example, in mouse 2Mb = 1cM approximately, so dividing the bp positions by 2 million gives a workable cM scale. Regardless of whether you base your marker cM on an actual genetic map or on rescaled bp, be sure to avoid markers on the same chromosome having identical locations (implying a zero distance) or having locations that are distinguishable only at the 5th (or higher) decimal place, since computationally this can lead to truncation-to-zero or numerical instability in the Happy code. In avoiding these pitfalls you may need to space apart the locations of markers that are tightly clustered, but unlike the alternatives this will have a negligible impact on the fidelity of the HMM inference.
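
    For example, a minimal R sketch of this rescaling and tie-breaking, using made-up positions:

        # Convert mouse bp positions to an approximate cM scale (assuming roughly
        # 2 Mb per cM, as above), then nudge apart any exactly tied locations.
        bp <- c(3081519, 3099580, 3099580, 4200000)   # illustrative positions in bp
        cm <- bp / 2e6                                # approximate cM

        # Separate exact ties by 0.001 cM, well above the precision at which
        # truncation-to-zero becomes a risk in the Happy code.
        while (any(duplicated(cm))) {
          cm[duplicated(cm)] <- cm[duplicated(cm)] + 1e-3
        }
        round(cm, 4)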

    Failure to account for genotyping error: [to do]

    Fitting of overcomplex models to small data sets: [to do]

  • I’m using a random effect and I get an error about Downdated X’X:

    Error in mer_finalize(ans) : Downdated X'X is not positive definite, 10.
    Warning in general.scan(h, data = d$pdata, markers = markers,
    null.formula = d$scan.options$null.formula,:
    Warning: could not fit alternative model for D1Rat32

    This problem usually arises because the design matrix specifying genetic effects at a locus is multicollinear (ie, the matrix has d columns and therefore purports to be d-dimensional, but the data in the matrix actually fit into a space of fewer than d dimensions; see Appendix B of Valdar09) and the random effects software used to fit the model cannot work around this multicollinearity. In this case, you have to help the software by reducing the dimensionality of the locus matrix using the reduce.dmat option in scan.function.options.

References

[Johnsen11]Johnsen AK, Valdar W, Golden L, Ortiz-Lopez A, Hitzemann R, Flint J, Mathis D, Benoist C (2011) Genetics of arthritis severity: a genome-wide and species-wide dissection in HS mice. Arthritis & Rheumatism 63(9):2630-2640. PMID:21565963
[Mott00]Mott R, Talbot CJ, Turri MG, Collins AC, Flint, J (2000) A method for fine mapping quantitative trait loci in outbred animal stocks. Proceedings of the National Academy of Sciences of the United States of America, 97(23), 12649-54.
[SolbergWoods10]Solberg Woods L, Holl K, Tschannen M, Valdar W (2010) Fine-mapping a locus for glucose tolerance using heterogeneous stock rats. Physiological Genomics 41(1):102-8. PMCID:PMC2841497.
[Valdar06]Valdar W, Solberg LC, Gaugier D, Burnett S, Klenerman P, Cookson WO, Taylor M, Rawlins JNP, Mott R, Flint J (2006) Genome-wide genetic association of complex traits in outbred mice. Nature Genetics 38(8):879-87. PMID:16832355
[Valdar09]Valdar W, Holmes CC, Mott R, Flint J (2009) Mapping in structured populations by resample model averaging. Genetics 182(4):1263-77.
[Yalcin05]Yalcin B, Flint J, Mott R (2005) Using progenitor strain information to identify quantitative trait nucleotides in outbred mice. Genetics, 171(2), 673-81.