BCB720: Introduction to Statistical Modeling

Fall 2013 Syllabus

Last Updated: 2013-10-21

Basic Information

Course identifiers: This document describes the syllabus for BCB720 in the Bioinformatics and Computational Biology Curriculum of BBSP.

Time: 10am - 11.15am, Tue/Thu

Location: MacNider 322 (primary) and Bondurant G74 (Oct 8 and Oct 15 only).

Materials: All learning materials will be posted on Sakai.

Restrictions: Class is limited to 25 students.

Instructors

Lead instructor: Prof William Valdar, Room 5113, 120 Mason Farm Road, Genetic Medicine Building, Campus Box # 7264, Chapel Hill NC 27599. Tel: +1 919 843 2833. Email: <william.valdar@unc.edu>. Web: <http://valdarlab.unc.edu>

Co-instructor: Prof Ethan Lange, Room 5111, 120 Mason Farm Road, Genetic Medicine Building, Campus Box # 7264, Chapel Hill NC 27599. Tel: +1 919 966 3356. Email: <ethan_lange@med.unc.edu>. Web: <http://genetics.unc.edu/faculty/ethan-lange>

Teaching Assistants: Andrew Morgan <andrew_morgan@med.unc.edu>, Greg Keele <gkeele@email.unc.edu>. Office hours for the TAs will be arranged and posted at the beginning of the first class and updated as necessary.

Course Description

This module introduces foundational statistical concepts and models that motivate a wide range of analytic methods in bioinformatics, statistical genetics, statistical genomics, and related fields. It is an intensive course, packing a year’s worth of probability and statistics into 2/3 of a semester. It covers probability, common distributions, Bayesian inference, maximum likelihood and frequentist inference, linear models, generalized and hierarchical linear models, and causal inference.

Target Audience

This course is targeted at graduate students in BBSP with either a quantitative background or strong quantitative interests who would like to understand and/or develop statistical methods for analyzing complex biological/biomedical data. In particular, it is intended to provide a springboard for BBSP students who would subsequently like to take graduate-level statistical courses elsewhere on campus.

Course Pre-requisites

Students are expected to know single-variable calculus (differentiation and integration in 1 dimension), be familiar with matrix algebra and have some programming experience. The course will include material on partial differentiation of multiparameter functions, and use the statistical package R extensively. Familiarity with these will be an advantage but is not assumed. Introductory statistics may or may not be an advantage (depending on how it was taught), but is not assumed.

Restrictions

The course is open to all graduate students of the Biological and Biomedical Sciences Program (BBSP) at UNC Chapel Hill. Other students, staff, or faculty may attend for credit, on an auditor basis or informally only if:

• They have prior permission from the lead instructor, and

• There is space: that is, if they are not taking up a spot that would be otherwise used by a non-auditing (i.e., full credit) BBSP student.

Moreover, graduate students from the Department of Biostatistics (BIOS) or the Department of Statistics and Operations Research (STOR) may audit only, and may not receive credit for this course.

Course Goals and Key Learning Objectives

1. Probability and distributions

2. Properties of random variables

3. Bayesian and frequentist approaches to statistical inference

4. Hypothesis testing

5. Linear models

6. Generalized linear models

7. Hierarchical models

Course Requirements

To obtain full credit, students must attend at least 80% of the lectures, complete all homeworks, and achieve at least a passing overall grade.

Dates

Homework assignments will typically be distributed on Tuesdays or Wednesdays after class, with a deadline for electronic submission at least a week later, typically noon on the Friday of the following week. Anonymous student evaluations, which account for 5% of the course marks, will be made available on Sakai within approximately a week of the end of the course. Students will have a week to complete the evaluation.

Grades

Grades for the course (F, L, P, H) will be based on performance in the homeworks and on completion of the course evaluation. Specifically, the homeworks collectively account for 95% of the course marks, and completion of the anonymous evaluation accounts for the remaining 5%. Each homework will include multiple questions, each carrying a stated maximum number of points. A student's total points across all homeworks, divided by the total possible, will be scaled to the range 0 to 95 to give the homework component of the overall percentage. There is no final exam.
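As an informal illustration of the arithmetic described above, the following Python sketch computes an overall course percentage. The function name and the example numbers are hypothetical and are not part of the official grading procedure; the mapping from percentage to F, L, P, or H is not specified here and is therefore not modeled.

```python
def course_percentage(points_earned, points_possible, evaluation_completed):
    """Return the overall course percentage on a 0 to 100 scale.

    Homework points are scaled to the range 0 to 95, and completing the
    anonymous evaluation contributes the remaining 5.
    """
    if points_possible <= 0:
        raise ValueError("points_possible must be positive")
    homework_component = 95.0 * points_earned / points_possible
    evaluation_component = 5.0 if evaluation_completed else 0.0
    return homework_component + evaluation_component


# Made-up example: 412 of 480 homework points, evaluation completed.
print(round(course_percentage(412, 480, True), 1))  # prints 86.5
```

In this example the homework component is 95 x 412/480, or about 81.5, and the completed evaluation adds 5.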

Course Policies

Students must attend the entire duration of at least 80% of the lectures unless they have permission of the lead instructor to do otherwise. Students are expected to be prompt, polite, collaborative when (and only when) asked, and to answer questions in class. Failure to hand in a homework on time without reasonable justification (e.g., sickness) will result in automatic loss of 10% of that homework’s maximum allowable points for each day past the deadline.
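The late-homework rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: lateness is counted in whole days, the penalty is subtracted from the points the student earned, and a score cannot drop below zero. None of these details is spelled out in the policy itself, and the function names are hypothetical.

```python
def late_penalty_points(max_points, days_late):
    """Points lost: 10% of the homework's maximum for each day late.

    Assumption: days_late is a whole number of days, and the penalty
    is capped at the homework's maximum.
    """
    if days_late < 0:
        raise ValueError("days_late must not be negative")
    return min(max_points * days_late / 10, max_points)


def score_after_penalty(raw_points, max_points, days_late):
    """Earned points minus the late penalty, floored at zero (assumption)."""
    return max(0.0, raw_points - late_penalty_points(max_points, days_late))


# Made-up example: 34 points earned on a 40-point homework, 2 days late.
print(score_after_penalty(34, 40, 2))  # prints 26.0
```

Here the penalty is 2 x 10% x 40 = 8 points, so 34 earned points become 26.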

Time Table

Key: (B) = Class held in Bondurant G74 rather than MacNider 322; (C) = Students should bring (or be prepared to share) a laptop; WV = William Valdar; EL = Ethan Lange; AM = Andrew Morgan; GK = Greg Keele

Week  Date         Instructor  Lecture  Description                               Homework
----  -----------  ----------  -------  ----------------------------------------  ----------------
1     Tue, Aug 20  AM          1 (C)    Introduction to R                         Homework 1 (WV)
1     Thu, Aug 22  WV          2        Set theory and probability
2     Tue, Aug 27  WV          3        Conditional Probability                   Homework 2 (WV)
2     Thu, Aug 29  WV          4        Distribution, Mass and Density functions
3     Tue, Sep 3   WV          5        Expectation and Variance                  Homework 3 (WV)
3     Thu, Sep 5   WV          6        Discrete distributions
4     Tue, Sep 10  WV          7        Continuous distributions                  Homework 4 (WV)
4     Thu, Sep 12  WV          8        Bayesian inference
5     Tue, Sep 17  WV          9        Estimation                                Homework 5 (WV)
5     Thu, Sep 19  EL          10       MLEs, bias
6     Tue, Sep 24  EL          11       Confidence intervals
6     Thu, Sep 26  EL          12       Hypothesis testing                        Homework 6 (EL)
7     Tue, Oct 1   EL          13       Introduction to regression
7     Thu, Oct 3   EL          14       Multiple regression/ANOVA
8     Tue, Oct 8   EL          15 (B)   Multiple regression/ANOVA
8     Thu, Oct 10  EL          16       Multiple regression in-class examples     Homework 7 (EL)
9     Tue, Oct 15  GK          17 (B)   Logistic regression
9     Thu, Oct 17  -           -        FALL BREAK                                Homework 8 (WV)
10    Tue, Oct 22  WV          18       Causal inference
10    Thu, Oct 24  WV          19 (C)   Bayesian and frequentist regression       Homework 10 (WV)
11    Tue, Oct 29  WV          20 (C)   Decisions about modeling
11    Thu, Oct 31  WV          21       Hierarchical and penalized regression

Syllabus Changes

The lead and/or co-instructors reserve the right to make changes to the syllabus, including homework due dates.

Course Resources

There is no course textbook as such because no textbook covers all the material in this course. Some textbooks that may be useful for supplemental reading are given below.

1st half of the course:

Westfall & Henning (2013) "Understanding Advanced Statistical Methods" -- chatty, top recommendation

DeGroot & Schervish (2011) "Probability and Statistics" -- thorough explanation of first half of course

Wasserman (2009) "All of Statistics" -- last year's recommendation but can be a bit terse

2nd half of the course:

Gelman & Hill (2007) "Data Analysis Using Regression and Multilevel/Hierarchical Models" -- fantastic for understanding linear models and estimation, but not hypothesis testing

More basic than this course:

Verzani (2004) "Using R for introductory statistics" -- friendly chatty book on R, used for GNET course

Also useful:

Johnson & Wichern (2004) "Applied Multivariate Statistical Analysis" -- good intro to matrix algebra (chapter 2)

Venables & Ripley (2002) "Modern Applied Statistics with S" -- very terse but comprehensive on R (available free online)

More references (eg, for specific subjects) will be given during and at the end of the course. Students are encouraged to ask the instructors for recommendations for books/resources on specific subjects or books/resources aimed at different levels.

Honor Code

Students may collaborate in class, but each student’s homework should be their own. In completing the homework, students are nonetheless encouraged to consult the lecture notes, online material, books and any other “passive” sources. They may discuss general strategies and concepts with their classmates and with the TAs, and may ask the TAs for clarification about the content of questions. The TAs may provide guidance as to where students might be able to find example material that addresses problems similar (but not identical) to those posed in the homework.

Selected descriptions of the course from last year's students

"BCB 720 is an accelerated and concise overview of probability and statistics from both a frequentist and bayesian perspective. This course introduces students to probability theory, probability distributions, hypothesis testing, and linear modeling."

"Strenuous course which outlines many of the fundamental elements of statistics used in common biological problems with large datasets. Includes both a Bayesian perspective and traditional methods as well, including hypothesis testing and linear and logistic regression formulation."

"A really hard crash-course in probability and statistics for modern-day bioinformatics."

"The course has heavy workload but the material does come up in research and in other classes therefore it can be very valuable."