.. _latest bagpipe.backend.tar.gz: http://valdarlab.unc.edu/software/bagpipe/install/bagpipe.backend_0.34.tar.gz
.. _latest bagpipe.tar.gz: http://valdarlab.unc.edu/software/bagpipe/install/bagpipe_2012_02_15.tar.gz
.. _HAPPY: http://www.well.ox.ac.uk/happy/happyR.shtml
.. _Prephappy: http://valdarlab.unc.edu/software.html
.. _William Valdar: http://valdarlab.unc.edu/contact.html
*******************************
Bagpipe: genomewide LD mapping
*******************************
.. figure:: bagpipe_mouse.png
:height: 150px
:scale: 100 %
:align: right
:alt:
Bagpipe is a program for performing genomewide linkage disequilibrium mapping of quantitative trait loci in populations whose genome structure can be accommodated in the HAPPY_ framework [Mott00]_. This includes most diploid crosses where the founders of the individuals have known genotypes.
About
=====
Quick facts about Bagpipe:
- Bagpipe is a simplified and streamlined version of Bagphenotype_ that does not currently include resample model averaging (RMA) capabilities.
- Bagpipe can help fit single locus regression models (with or without random effects) to marker intervals whose genetic ancestry is inferred using the HAPPY_ software.
- Bagpipe cannot help you decide what is a sensible model to fit.
- Bagpipe does not currently accommodate populations with significant population structure, except through the specification of simple random intercepts based on unpatterned covariance matrices.
- Bagpipe is named after the Scottish wind instrument "the bagpipes" and after Bagphenotype, which in turn was a *PIPEline* for *BAGging-based multiple QTL analysis of phenoTYPEs*. Bagphenotype was in turn based on software written by Richard Mott and William Valdar to analyze heterogeneous stock mice in Valdar06_.
- Bagpipe is experimental software, is provided free of charge subject to copyleft restrictions, and comes with no guarantees whatsoever.
.. _Bagphenotype: bagphenotype.html
For constructive comments and suggestions email the author `William Valdar`_. The model and optimizations implemented in bagpipe are mostly described in the single locus sections and appendix of Valdar09_. The basic HAPPY regression model was introduced by Mott00_.
Current limitations
-------------------
Bagpipe currently only implements mapping using a standard fixed effects linear model and assessment of thresholds by unrestricted permutation. The author is incorporating random effects and generalized linear modeling and other features that were present in Bagphenotype_.
Current bugs
------------
2013-12-10: Bagpipe is currently incompatible with the new version of lme4 (1.0 or greater). It should still work under pre-1.0 lme4. This bug is currently being fixed.
Updates
=======
.. |date| date::
.. |time| date:: %H:%M
Last webpage update |date| at |time|. Major updates are listed below:
2013-02-04: Major webpage update, using the `Sphinx `_ document generator.
2012-02-08: bagpipe.bundle renamed to bagpipe.backend to avoid confusion
with \*.bundle files on Mac OS X.
2011-06-19: experimental functions for positional bootstrap and for
saving model fits at a few specified loci.
2011-05-29: due to the authors recent improved understanding of R
packages, the bagpipe program is now supported by a single R package
rather than two.
2011-05-28: strain-merging, ie, collapsing :math:`J` haplotypes into :math:`Q < J`
groups, is now implemented in the config file formulas.
Installation
============
R/bagpipe.backend
-----------------
The R package R/bagpipe.backend contains the core R functions for fitting models in bagpipe.
#. Download the `latest bagpipe.backend.tar.gz`_
#. Open a UNIX shell and go to the directory containing this file
#. Type ``R CMD install --clean --no-docs`` followed by the name of the file
Note: bagpipe.backend was previously called bagpipe.bundle. The name was changed because ``*.bundle`` files have a special interpretation on Mac OS X.
bagpipe.pl
----------
Bagpipe is organized as a central perl script that calls a number of R scripts for specific tasks and that relies upon several perl modules. All these components are provided in a single download.
#. Download the `latest bagpipe.tar.gz`_ and put it in a directory you would like to run it from, eg, ``~/bin/bagpipe/``.
#. Open a UNIX shell and go to ``~/bin/bagpipe/``.
#. Unzip and unarchive by typing ``tar -zxf`` followed by the filename
#. Set an environmental variable ``BAGPIPE_LIBS`` that will be interpreted by the shell as ``~/bin/bagpipe/``. For example, if you are using the bash shell then in your .bash_profile add the line ``export BAGPIPE_LIBS=~/bin/bagpipe/lib/``.
#. Set an alias to the bagpipe program so you can run it from anywhere. For example, if you are using the bash shell then in your ``.bash_profile`` add the line ``alias bagpipe=~/bin/bagpipe/bagpipe.pl``.
Stages of Analysis
==================
A typical analysis sequence
---------------------------
Before using Bagpipe:
A1. Haplotype reconstruction
Using HAPPY_ with or without the aid of Prephappy_
A2. Phenotype analysis
Identify suitable covariates and transformations
Using Bagpipe:
B1. Single locus genomewide scans
B2. Significance thresholds for single locus analysis (using permutation or null simulation)
i. Performing null scans
ii. Calculating thresholds from null scans
B3. Visualizing single locus results
B4. QTL characterization
i. Rough confidence intervals by positional bootstrapping
ii. Estimating effects at the QTL [not yet implemented]
Description of each stage
-------------------------
[Under construction]
Significance by Null Simulation (parametric bootstrap)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the config file
::
nullsim.formula y ~ 1 + (1|Family)
On the command line
::
bagpipe pheno.config --nullsim --nullsim_collate
Significance by Permutation
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Permutation tests should be used with caution and are absolutely **not
recommended** when any of the following conditions are met:
- Individuals are not equally related
- There are random effects in the formulae
Moreover, **only unrestricted permutation of the phenotypes is currently
implemented**, which can be inappropriate for some complex cases that
arise even in the absence of the conditions above. The author plans to
implement residual permutation in due course.
Visualizing Results
-------------------
Input files
===========
Config file
-----------
The first file bagpipe needs is the config file (configuration file). This file specifies the model to be fitted and tells bagpipe where to find the relevant data. Below is a basic example::
analysis.id body_weight_withsex
genome.cache.dir /home/wvaldar/data/cc_cache_happy/
phenotype.file /home/wvaldar/data/exercise_pheno.txt
scan.formula.null bodyweight ~ 1 + sex
scan.formula.test bodyweight ~ 1 + sex + interval.additive(THE.LOCUS)
scan.function linear
num.nullscans 200
nullscan.seed 1
locus.groups 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,X
Basic options
~~~~~~~~~~~~~
The fields in the Config file are described below.
===================== =========== ============ =============
Field name Type Required For Description
===================== =========== ============ =============
analysis.id string all The name of the analysis, which will be used to prefix all output files
--------------------- ----------- ------------ -------------
genome.cache.dir string all The directory where the HAPPY genome cache is stored. The HAPPY cache should include directories for genetic models (eg, additive, full, full.asymmetric, etc) that are intended to be used in the modeling process, and must always include the *genotype* directory.
--------------------- ----------- ------------ -------------
phenotype.file string all A tab-delimited file with a single header line, such as would be readable in R using ``read.delim()``. The file should contain a table with columns variables defined in the ``*.formula.*`` fields (including the phenotype) and a column entitled SUBJECT.NAME containing the individual ids as in the HAPPY genome cache.
--------------------- ----------- ------------ -------------
scan.formula.null formula all The null model fitted at a locus in the genome scans, to which the scan.formula.test is compared.
--------------------- ----------- ------------ -------------
scan.formula.test formula all A formula specifying the test (or alternative) model fitted at a locus. For p-values to be meaningful, scan.formula.test should incorporate and extend scan.formula.null, typically duplicating scan.formula.null and adding one or more predictors.
--------------------- ----------- ------------ -------------
scan.function code all The type of model fitted. Choose from ``linear`` for linear regression with Gaussian errors, or ``binary`` for binary logistic regression (currently experimental).
--------------------- ----------- ------------ -------------
scan.function.options comma
separated
strings optional special options passed to the scanning function. See `advanced options`_.
--------------------- ----------- ------------ -------------
num.nullscans integer perm
nullsims Number of scans of "null" phenotype data used to estimate significance thresholds. Null phenotype data corresponds to phenotypes generated under an appropriate null model, which may include permutations, null simulations or other fake phenotypes.
--------------------- ----------- ------------ -------------
nullscan.seed string perm
nullsims Where "null" phenotypes are generated using random numbers, this integer sets the random seed.
--------------------- ----------- ------------ -------------
locus.groups comma
separated
strings all Groups of loci to be considered in the analysis. A string that matches the name of a chromosome in the HAPPY cache will be interpreted as all marker intervals on that chromosome. Otherwise, the it will be interpreted as a text file containing the names of markers (the file should contain only marker names matching those in the HAPPY cache and these should be separated by white with no other text).
===================== =========== ============ =============
Specifying Formulae
~~~~~~~~~~~~~~~~~~~
Formulas follow the R language formula specification and generally anything that can be passed to the R functions lm() or lmer() from the lme4 package are acceptable (as this is what generally happens in the code). However, there are some special functions and variables that are signals to bagpipe. These are listed below.
First we define the following arguments and special values:
*locus.name*
The name of a genotyped marker. For ``interval.*()`` functions, this is interpreted as the interval between marker *locus.name* and the next (ie, downstream) linked marker. Because of this interpretation, when *locus.name* is used as an argument to an ``interval.*()`` function it may not be the last marker on the chromosome.
``THE.LOCUS``
A special value for *locus.name* that is substituted at each point in the scan with the current locus.
*sdp.string*
An alphanumeric string indicating how to combine the *J* foundervhaplotypes into *Q < J* super-haplotypes. This grouping is often called a "strain distribution pattern" (SDP; see Yalcin05_). The string can be either:
- A string of letter or numbers that is :math:`J` characters in length. For
example, assuming :math:`J=8`, the string ``00011222`` would be interpreted as "founders 1-3 are in the first group, founders 4 and 5 are in the second, and founders 6-8 are in the third".
- A string of words separated by ``.`` (full stop or period). For
example, the string ``Wild.Wild.Wild.Dom.Dom.Mus.Mus.Mus`` would have
the same interpretation as ``00011222`` above.
These variables are used in the following special functions:
``interval.additive(locus.name [, sdp=sdp.string])``
The HAPPY additive model for *locus.name*. If sdp is specified, founder haplotypes are first merged according to *sdp.string*.
``interval.dominance(locus.name [, sdp=sdp.string])``
The HAPPY dominance model for *locus.name*. If sdp is specified, founder haplotypes are first merged according to *sdp.string*.
``interval.full(locus.name [, sdp=sdp.string])``
The HAPPY full model for *locus.name*. If sdp is specified, founder haplotypes are first merged according to *sdp.string*.
``interval.prob.het(locus.name)``
The sum of the HAPPY probabilities that correspond to a heterozygote diplotype at *locus.name*.
``interval.prob.hom(locus.name)``
The sum of the HAPPY probabilities that correspond to a homozygote diplotype at *locus.name*.
``genotype(locus.name)``
The genotype at the locus *locus.name*. Note: any animals with missing genotypes at the *locus.name* will be removed from the analysis. [EXPERIMENTAL: do not use with THE.LOCUS as argument]
Examples
~~~~~~~~
Here are some example formulae and their interpretations:
#. Example formulae::
scan.formula.null log(weight) ~ 1 + as.factor(age)
scan.formula.test log(weight) ~ 1 + as.factor(age) + interval.additive(THE.LOCUS)
*Interpretation:* test at each marker interval for the additive effect of founder haplotypes on the logarithm of weight, controlling for age (encoded as a factor).
#. Example formulae::
scan.formula.null log(weight) ~ 1 + as.factor(age) + genotype(rs232934)
scan.formula.test log(weight) ~ 1 + as.factor(age) + genotype(rs232934) + interval.additive(THE.LOCUS)
*Interpretation:* test at each marker interval for the additive effect of founder haplotypes on the logarithm of weight, controlling for age (encoded as a factor) and the genotype at marker rs232934 (encoded as a factor).
#. Example formulae::
scan.formula.null y ~ 1 + interval.additive(THE.LOCUS, sdp=00011222)
scan.formula.test y ~ 1 + interval.additive(THE.LOCUS)
*Interpretation:* test at each marker interval for the additive effect of all eight founder haplotypes, controlling for the additive effect of a hypothetical variant at in the same interval that has three alleles that are distinct for founders 1-3, founders 4-5 and founders 6-8. This can be seen as an additive test at each locus for a haplotype effect that is not subsumed by this SDP.
#. Example formulae::
scan.formula.null y ~ 1 + (1|day) + interval.additive(THE.LOCUS)
scan.formula.test y ~ 1 + (1|day) + interval.additive(THE.LOCUS) + interval.dominance(THE.LOCUS)
*Interpretation:* test at each marker interval for dominant haplotype effects, controlling for additive haplotype effects, and random effects of day. Note that in the HAPPY regression framework this happens to be equivalent to:
Equivalent formulae::
scan.formula.null y ~ 1 + (1|day) + interval.additive(THE.LOCUS)
scan.formula.test y ~ 1 + (1|day) + interval.full(THE.LOCUS)
#. Example formulae (from SolbergWoods10_)::
scan.formula.null log(AUC) ~ 1 + location + injector + collector + num.injections + (1|sibship)
scan.formula.test log(AUC) ~ 1 + location + injector + collector + num.injections + (1|sibship) + interval.full(THE.LOCUS)
*Interpretation:* test at each marker interval for the additive + dominance effect of founder haplotypes on the logarithm of glucose area under curve, controlling for experiment location (2 sites), injector (person 1 or 2 who injected glucose into the subject), collector (person 1 or 2 who collected the blood for assay), and the number of injections that were given (1, or 2 if the first injection failed to produce a response).
#. Example formulae (from Johnsen11_)::
scan.formula.null MAT ~ sex + (1|FamilyID) + (1|CohortID) + (1|CageID)
scan.formula.test MAT ~ sex + (1|FamilyID) + (1|CohortID) + (1|CageID) + interval.full(THE.LOCUS)
*Interpretation:* test at each marker interval for the additive + dominance effect of founder haplotypes on mean ankle thickening, controlling for random effects of family id (45 sibships), cohort (38 cohorts), cage id (173 cages) and sex (2 sexes).
#. Example formulae::
scan.formula.null y ~ 1 + interval.additive(THE.LOCUS)
scan.formula.test y ~ 1 + interval.additive(THE.LOCUS) + interval.dominance(THE.LOCUS, sdp=01001111)
*Interpretation:* test at each marker interval for the additive + dominance effect of founder haplotypes on mean ankle thickening, controlling for random effects of family id (45 sibships), cohort (38 cohorts), cage id (173 cages) and sex (2 sexes).
Advanced options
~~~~~~~~~~~~~~~~
scan.function.options
"""""""""""""""""""""
Example:
::
scan.function.options reduce.dmat=0.01
Key Name = Value Type
Description
reduce.dmat=\ *float*
Discards redundant and near-redundant locus information that might otherwise cause computational problems for some formulas (eg, those including random effect terms). The argument specifies a cutoff for what is discarded. This should be in the range (0,1) but closer to 0. For example, ``reduce.dmat=0.001`` discards little information while ``reduce.dmat=0.25`` discards a lot (although results in a very stable fit). Generally, ``reduce.dmat=0.1`` is a reasonable compromise for 8-way crosses. *Details:* reduces the dimension of the HAPPY matrix in order to handle multicollinearity would otherwise cause numerical instability for some model fits, specifically those fits in which the underlying R functions do not perform an automatic dimension reduction of the design matrix. Specifying the argument :math:`\lambda_\text{min} > 0` converts the non-othogonal :math:`n \times k` HAPPY matrix :math:`\mathbf{G}(m)` at each locus :math:`m` into an orthogonal :math:`n\times r` matrix :math:`\mathbf{G}(m)` containing the :math:`rd`-dimensional space; see Appendix B of Valdar09_) and the random effects software used to fit the model cannot work around this multicollinearity. In this case, you have to help the software by reducing the dimensionality of the locus matrix using the *reduce.dmat* option in scan.function.options_ .
References
==========
.. [Johnsen11] Johnsen AK, Valdar W, Golden L, Ortiz-Lopez A, Hitzemann R, Flint J, Mathis D, Benoist C (2011) Genetics of arthritis severity: a genome-wide and species-wide dissection in HS mice. Arthritis & Rheumatism 63(9):2630-2640. PMID:21565963
.. [Mott00] Mott R, Talbot CJ, Turri MG, Collins AC, Flint, J (2000) A method for fine mapping quantitative trait loci in outbred animal stocks. Proceedings of the National Academy of Sciences of the United States of America, 97(23), 12649-54.
.. [SolbergWoods10] Solberg Woods L, Holl K, Tschannen M, Valdar W (2010) Fine-mapping a locus for glucose tolerance using heterogeneous stock rats. Physiological Genomics 41(1):102-8. PMCID:PMC2841497.
.. [Valdar06] Valdar W, Solberg LC, Gaugier D, Burnett S, Klenerman P, Cookson WO, Taylor M, Rawlins JNP, Mott R, Flint J (2006) Genome-wide genetic association of complex traits in outbred mice. Nature Genetics 38(8):879-87. PMID:16832355
.. [Valdar09] Valdar W, Holmes CC, Mott R, Flint J (2009) Mapping in structured populations by resample model averaging. Genetics 182(4):1263-77.
.. [Yalcin05] Yalcin B, Flint J, Mott R (2005) Using progenitor strain information to identify quantitative trait nucleotides in outbred mice. Genetics, 171(2), 673-81.