1 Installing Bioconductor

Bioconductor is the bioinformatics project for R. The best way to install and maintain Bioconductor and associated packages is using a function called biocLite(). The latest version of this function is obtained from the Bioconductor website using the source() function. You then use biocLite() to install the base Bioconductor packages, i.e. Biobase and any additional packages that you might want. Below is the code that will get the latest version of biocLite() then install Bioconductor and the other packages needed for this session.

How do you know which packages to install? One way is to learn how to "shop" at Bioconductor. Go to http://www.bioconductor.org.

All of this code is currently "commented out" using "#" to indicate that a line is a comment. R will ignore any line that starts with one or more "#", so adding this to a statement that would normally execute is a good way to include optional steps. Commenting-out can also be useful when troubleshooting. For this class, we have installed all required R/Bioconductor packages, so you should not un-comment these lines. However, if you want to run this code on your own machine, you must un-comment these lines so that the packages will be installed. Any line that starts with "##" is an actual comment and should be left as it is.

Two packages must be installed from CRAN (The Comprehensive R Archive Network) using the install.packages() function. Do not try to use install.packages() for Bioconductor packages or biocLite() with CRAN packages. It generally will not work.

Finally, please note that the installation of a single package should also install any dependencies (additional packages that are required for a package to function). As you build your R library, you will find that you rarely install a single package--multiple dependencies are common.

In [1]:
# cd scratch
# mkdir unit1
# cd unit1
# mkdir MicroarrayAnalysis
# cd MicroarrayAnalysis
# cp /depot/nihomics/data/microarray/Rscripts/IntroToMicroArrayAnalysis.R .
# rstudio
In [2]:
# #get biocLite()
# source("http://bioconductor.org/biocLite.R")
# #install base Bioconductor
# biocLite()
# #install packages for this analysis
# biocLite("GEOquery")
# biocLite("affy")
# biocLite("limma")
# biocLite("hgu133plus2.db")
# biocLite("Homo.sapiens")
# #install package from CRAN
# install.packages("hexbin")
# install.packages("RColorBrewer")

2 The Gene Expression Omnibus

The Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information is a data repository that contains a wealth of information. The data in GEO is available in both processed and raw formats. This data can be downloaded and manipulated manually, but it is more efficient to handle it programmatically with R. First, we are going to take a quick tour of GEO and our example dataset for this unit.

Open Chrome and go to http://www.ncbi.nlm.nih.gov/geo/.

3 Using GEOquery to Obtain Microarray Data

The GEOquery package allows you to programmatically access data from GEO. Depending on your needs, you can download only the processed data and metadata provided by the depositor. In some cases, you may want to download the raw data as well, if it was provided by the depositor. The GEOquery package allows you to do both.

In [6]:
library(GEOquery)
my.gse <- "GSE15947"
##get published data and metadata
##this step is slow because of download and reformatting
##create directory for files downloaded from GEO
if(!file.exists("geo_downloads")) dir.create("geo_downloads")
if(!file.exists("results"))  dir.create("results", recursive=TRUE)
##get data from GEO
my.geo.gse <- getGEO(GEO=my.gse, filename=NULL, destdir="./geo_downloads", GSElimits=NULL, GSEMatrix=TRUE, AnnotGPL=FALSE, getGPL=FALSE)
##explore structure of data
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE15nnn/GSE15947/matrix/
Found 1 file(s)
GSE15947_series_matrix.txt.gz
Using locally cached version: ./geo_downloads/GSE15947_series_matrix.txt.gz
In [7]:
class(my.geo.gse)
'list'
In [8]:
length(my.geo.gse)
1
In [9]:
names(my.geo.gse)
'GSE15947_series_matrix.txt.gz'
In [10]:
##get rid of list structure
my.geo.gse <- my.geo.gse[[1]]
##object is now an ExpressionSet
class(my.geo.gse)
'ExpressionSet'
In [11]:
##you can see the structure of the object
str(my.geo.gse)
Formal class 'ExpressionSet' [package "Biobase"] with 7 slots
  ..@ experimentData   :Formal class 'MIAME' [package "Biobase"] with 13 slots
  .. .. ..@ name             : chr ""
  .. .. ..@ lab              : chr ""
  .. .. ..@ contact          : chr ""
  .. .. ..@ title            : chr ""
  .. .. ..@ abstract         : chr ""
  .. .. ..@ url              : chr ""
  .. .. ..@ pubMedIds        : chr ""
  .. .. ..@ samples          : list()
  .. .. ..@ hybridizations   : list()
  .. .. ..@ normControls     : list()
  .. .. ..@ preprocessing    : list()
  .. .. ..@ other            : list()
  .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. .. .. ..@ .Data:List of 2
  .. .. .. .. .. ..$ : int [1:3] 1 0 0
  .. .. .. .. .. ..$ : int [1:3] 1 1 0
  ..@ assayData        :<environment: 0x000000001bf2b4b0> 
  ..@ phenoData        :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
  .. .. ..@ varMetadata      :'data.frame':	36 obs. of  1 variable:
  .. .. .. ..$ labelDescription: chr [1:36] NA NA NA NA ...
  .. .. ..@ data             :'data.frame':	24 obs. of  36 variables:
  .. .. .. ..$ title                  : Factor w/ 24 levels "RWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep1",..: 21 22 23 24 9 10 11 12 13 14 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ geo_accession          : Factor w/ 24 levels "GSM400051","GSM400052",..: 1 2 3 4 5 6 7 8 9 10 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ status                 : Factor w/ 1 level "Public on Dec 18 2009": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ submission_date        : Factor w/ 1 level "May 04 2009": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ last_update_date       : Factor w/ 1 level "Dec 18 2009": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ type                   : Factor w/ 1 level "RNA": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ channel_count          : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ source_name_ch1        : Factor w/ 6 levels "RWPE1 cells treated for 24 hours with 100 nM 1,25(OH)2D",..: 6 6 6 6 5 5 5 5 2 2 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ organism_ch1           : Factor w/ 1 level "Homo sapiens": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ characteristics_ch1    : Factor w/ 1 level "cell line: RWPE-1": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ characteristics_ch1.1  : Factor w/ 1 level "cell type: Immortalized normal prostate epithelial cells.": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ characteristics_ch1.2  : Factor w/ 2 levels "treatment: 100 nM 1,25(OH)2D",..: 2 2 2 2 1 1 1 1 2 2 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ characteristics_ch1.3  : Factor w/ 3 levels "time point: 24 hours",..: 3 3 3 3 3 3 3 3 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ treatment_protocol_ch1 : Factor w/ 1 level "RWPE1 cells were treated with medium containing 100 nM of 1,25(OH)2D or vehicle (0.1% ethanol) for 6, 24 or 48 hours (n=4 per t"| __truncated__: 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ growth_protocol_ch1    : Factor w/ 1 level "RWPE1 cells were plated in T75 flasks (1x10^6 cells per flask) and grown until cells reached 60% confluence. For the 48 hours t"| __truncated__: 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ molecule_ch1           : Factor w/ 1 level "total RNA": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ extract_protocol_ch1   : Factor w/ 1 level "Total RNA was isolated from the cells using TriReagent (Molecular Research Center, Inc., Cincinnati, OH) in accordance with the"| __truncated__: 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ label_ch1              : Factor w/ 1 level "biotin": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ label_protocol_ch1     : Factor w/ 1 level "The transcripts levels in each sample were measured by using the Affymetrix HU133 plus 2.0 GeneChip (Affymetrix, Santa Clara, C"| __truncated__: 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ taxid_ch1              : Factor w/ 1 level "9606": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ hyb_protocol           : Factor w/ 1 level "Chip hybridization was carried by using standard Affymetrix protocols (Affymetrix, Santa Clara, CA) at the Ohio State Universit"| __truncated__: 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ scan_protocol          : Factor w/ 1 level "Chip scanning was carried out by using standard Affymetrix protocols (Affymetrix, Santa Clara, CA) at the Ohio State University"| __truncated__: 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ description            : Factor w/ 6 levels "Gene expression of 1,25(OH)2D treated RWPE1 at 24h.",..: 6 6 6 6 3 3 3 3 4 4 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ data_processing        : Factor w/ 1 level "Microarray data from CEL files for all 24 chips was normalized simultaneously and expression levels were generated using the gc"| __truncated__: 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ platform_id            : Factor w/ 1 level "GPL570": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ contact_name           : Factor w/ 1 level "Pavlo,,Kovalenko": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ contact_email          : Factor w/ 1 level "pkovalen@purdue.edu": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ contact_department     : Factor w/ 1 level "Foods and Nutrition": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ contact_institute      : Factor w/ 1 level "Purdue University": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ contact_address        : Factor w/ 1 level "700 W State st": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ contact_city           : Factor w/ 1 level "West Lafayette": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ contact_state          : Factor w/ 1 level "IN": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ contact_zip/postal_code: Factor w/ 1 level "47906": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ contact_country        : Factor w/ 1 level "USA": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ supplementary_file     : Factor w/ 24 levels "ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/samples/GSM400nnn/GSM400051/GSM400051.CEL.gz",..: 1 2 3 4 5 6 7 8 9 10 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. .. ..$ data_row_count         : Factor w/ 1 level "54675": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:24] "V2" "V3" "V4" "V5" ...
  .. .. ..@ dimLabels        : chr [1:2] "sampleNames" "sampleColumns"
  .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. .. .. ..@ .Data:List of 1
  .. .. .. .. .. ..$ : int [1:3] 1 1 0
  ..@ featureData      :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
  .. .. ..@ varMetadata      :'data.frame':	0 obs. of  1 variable:
  .. .. .. ..$ labelDescription: chr(0) 
  .. .. ..@ data             :'data.frame':	54675 obs. of  0 variables
  .. .. ..@ dimLabels        : chr [1:2] "featureNames" "featureColumns"
  .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. .. .. ..@ .Data:List of 1
  .. .. .. .. .. ..$ : int [1:3] 1 1 0
  ..@ annotation       : chr "GPL570"
  ..@ protocolData     :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
  .. .. ..@ varMetadata      :'data.frame':	0 obs. of  1 variable:
  .. .. .. ..$ labelDescription: chr(0) 
  .. .. ..@ data             :'data.frame':	24 obs. of  0 variables
  .. .. ..@ dimLabels        : chr [1:2] "sampleNames" "sampleColumns"
  .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. .. .. ..@ .Data:List of 1
  .. .. .. .. .. ..$ : int [1:3] 1 1 0
  ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. ..@ .Data:List of 4
  .. .. .. ..$ : int [1:3] 3 3 0
  .. .. .. ..$ : int [1:3] 2 32 0
  .. .. .. ..$ : int [1:3] 1 3 0
  .. .. .. ..$ : int [1:3] 1 0 0
In [12]:
colnames(pData(my.geo.gse))
  1. 'title'
  2. 'geo_accession'
  3. 'status'
  4. 'submission_date'
  5. 'last_update_date'
  6. 'type'
  7. 'channel_count'
  8. 'source_name_ch1'
  9. 'organism_ch1'
  10. 'characteristics_ch1'
  11. 'characteristics_ch1.1'
  12. 'characteristics_ch1.2'
  13. 'characteristics_ch1.3'
  14. 'treatment_protocol_ch1'
  15. 'growth_protocol_ch1'
  16. 'molecule_ch1'
  17. 'extract_protocol_ch1'
  18. 'label_ch1'
  19. 'label_protocol_ch1'
  20. 'taxid_ch1'
  21. 'hyb_protocol'
  22. 'scan_protocol'
  23. 'description'
  24. 'data_processing'
  25. 'platform_id'
  26. 'contact_name'
  27. 'contact_email'
  28. 'contact_department'
  29. 'contact_institute'
  30. 'contact_address'
  31. 'contact_city'
  32. 'contact_state'
  33. 'contact_zip/postal_code'
  34. 'contact_country'
  35. 'supplementary_file'
  36. 'data_row_count'
In [13]:
pData(my.geo.gse)$data_processing[1]
V2: Microarray data from CEL files for all 24 chips was normalized simultaneously and expression levels were generated using the gcRMA package in Bioconductor.
In [14]:
head(exprs(my.geo.gse))
GSM400051GSM400052GSM400053GSM400054GSM400055GSM400056GSM400057GSM400058GSM400059GSM400060...GSM400065GSM400066GSM400067GSM400068GSM400069GSM400070GSM400071GSM400072GSM400073GSM400074
1007_s_at10.63166410.54671110.56106010.66515010.67578010.64677410.48355410.57587211.37974811.198587... 10.95281411.20299511.23331111.16382511.26212611.22997011.05971311.31689711.17250911.158658
1053_at10.62391310.42580110.22304010.17371010.54692210.61161810.13522910.16382810.79441010.298981... 10.59123410.77034610.76723810.84581610.77645210.81946810.89031310.58218810.74141010.920544
117_at 2.569199 2.512180 2.494549 2.500320 2.509043 2.622045 2.494224 2.565476 2.515906 2.470113... 2.577456 2.601457 2.504493 2.495275 2.497523 2.484600 2.490614 2.471972 2.502093 2.502444
121_at 2.160805 2.058715 2.109419 2.119830 2.120287 2.133397 2.112262 2.105836 2.117822 2.121420... 2.108140 2.120151 2.121188 2.111085 2.144093 2.101178 2.106881 2.093042 2.114431 2.109476
1255_g_at 2.069468 2.017146 1.998913 1.993419 2.011952 2.028284 1.988180 1.985185 2.023822 1.996898... 2.003733 2.036607 2.010530 2.004277 2.004327 1.988096 1.994639 1.978753 2.008432 2.012706
1294_at 2.858531 2.727086 2.699526 2.682323 2.713459 2.725523 2.601727 2.669674 3.132987 2.688966... 2.703650 3.287122 2.711345 2.705624 2.713630 2.977792 2.626712 2.658313 2.717842 2.690030
In [15]:
summary(exprs(my.geo.gse))
   GSM400051          GSM400052          GSM400053          GSM400054      
 Min.   :-0.04032   Min.   :-0.06048   Min.   :-0.05769   Min.   :-0.0585  
 1st Qu.: 2.44883   1st Qu.: 2.37708   1st Qu.: 2.35818   1st Qu.: 2.3503  
 Median : 3.20058   Median : 3.07566   Median : 3.05323   Median : 3.0401  
 Mean   : 4.67734   Mean   : 4.67873   Mean   : 4.66919   Mean   : 4.6715  
 3rd Qu.: 6.66206   3rd Qu.: 6.92132   3rd Qu.: 6.92480   3rd Qu.: 6.9677  
 Max.   :16.20607   Max.   :16.24553   Max.   :16.21347   Max.   :16.1824  
   GSM400055          GSM400056          GSM400057          GSM400058       
 Min.   :-0.07132   Min.   :-0.06193   Min.   :-0.07322   Min.   :-0.07779  
 1st Qu.: 2.37692   1st Qu.: 2.39200   1st Qu.: 2.34312   1st Qu.: 2.34282  
 Median : 3.08044   Median : 3.11137   Median : 3.03306   Median : 3.02703  
 Mean   : 4.64716   Mean   : 4.64933   Mean   : 4.68469   Mean   : 4.66858  
 3rd Qu.: 6.79471   3rd Qu.: 6.74264   3rd Qu.: 7.02963   3rd Qu.: 6.98022  
 Max.   :16.23244   Max.   :16.21246   Max.   :16.22061   Max.   :16.18700  
   GSM400059          GSM400060          GSM400061          GSM400062       
 Min.   :-0.08268   Min.   :-0.06612   Min.   :-0.07924   Min.   :-0.07219  
 1st Qu.: 2.38973   1st Qu.: 2.35336   1st Qu.: 2.35211   1st Qu.: 2.35103  
 Median : 3.10833   Median : 3.04310   Median : 3.04706   Median : 3.04168  
 Mean   : 4.63192   Mean   : 4.67598   Mean   : 4.65120   Mean   : 4.65916  
 3rd Qu.: 6.67514   3rd Qu.: 6.98008   3rd Qu.: 6.89032   3rd Qu.: 6.92568  
 Max.   :16.26214   Max.   :16.21131   Max.   :16.26009   Max.   :16.25586  
   GSM400063         GSM400064          GSM400065          GSM400066       
 Min.   :-0.0862   Min.   :-0.07695   Min.   :-0.08907   Min.   :-0.08167  
 1st Qu.: 2.3702   1st Qu.: 2.39048   1st Qu.: 2.36268   1st Qu.: 2.41451  
 Median : 3.0713   Median : 3.11363   Median : 3.06338   Median : 3.15442  
 Mean   : 4.6750   Mean   : 4.64014   Mean   : 4.64594   Mean   : 4.58824  
 3rd Qu.: 6.9277   3rd Qu.: 6.68166   3rd Qu.: 6.81570   3rd Qu.: 6.31774  
 Max.   :16.2468   Max.   :16.26531   Max.   :16.25743   Max.   :16.24372  
   GSM400067          GSM400068          GSM400069          GSM400070       
 Min.   :-0.06966   Min.   :-0.07631   Min.   :-0.06768   Min.   :-0.07905  
 1st Qu.: 2.36945   1st Qu.: 2.36281   1st Qu.: 2.36311   1st Qu.: 2.34718  
 Median : 3.07328   Median : 3.06482   Median : 3.05484   Median : 3.04105  
 Mean   : 4.66163   Mean   : 4.66295   Mean   : 4.65384   Mean   : 4.64597  
 3rd Qu.: 6.88171   3rd Qu.: 6.89218   3rd Qu.: 6.86562   3rd Qu.: 6.86173  
 Max.   :16.26009   Max.   :16.26406   Max.   :16.24683   Max.   :16.26732  
   GSM400071         GSM400072          GSM400073          GSM400074       
 Min.   :-0.0821   Min.   :-0.09285   Min.   :-0.08206   Min.   :-0.08166  
 1st Qu.: 2.3548   1st Qu.: 2.33524   1st Qu.: 2.36989   1st Qu.: 2.37066  
 Median : 3.0522   Median : 3.02468   Median : 3.07043   Median : 3.08013  
 Mean   : 4.6281   Mean   : 4.65085   Mean   : 4.64905   Mean   : 4.64660  
 3rd Qu.: 6.7707   3rd Qu.: 6.91728   3rd Qu.: 6.83167   3rd Qu.: 6.80790  
 Max.   :16.2872   Max.   :16.24600   Max.   :16.24548   Max.   :16.25306  

We now have the metadata, phenodata (sample data), and experimental data (microarray intensities) for this experiment. The expression data is in format that could be used for differential gene expression (DGE) analysis. However, it is generally a good idea to start with the raw data and process it yourself. For our example dataset, the expression levels were determined by the gcRMA method. Initially this method was attractive because it attempts to model the effect of GC probe content on hybridization, and thus, apparent gene expression level. However, gcRMA did not perform well on actual--as opposed to simulated--data, so the method wasn't widely adopted. We'll use RMA (Robust Multiarray Average) instead. Currently, it is the most common method to determine probeset expression level for Affymetrix arrays. Another potential reason to reprocess published data is so that you can combine data from different projects.

The function getGEOSuppFiles() downloads the raw data files to your computer. For Affymetrix microarrays, the raw data format is a CEL file.

In [16]:
if(!file.exists(paste0("./geo_downloads/",my.gse)))
getGEOSuppFiles(my.gse, makeDirectory=T, baseDir="geo_downloads")
In [17]:
##list files in working directory
list.files("geo_downloads")
  1. 'GSE10718'
  2. 'GSE10718_series_matrix.txt.gz'
  3. 'GSE15947'
  4. 'GSE15947_series_matrix.txt.gz'
  5. 'GSE56421'
  6. 'GSE56421_series_matrix.txt.gz'
  7. 'GSE72167'
  8. 'GSE72167_series_matrix.txt.gz'
  9. 'GSE75465'
  10. 'GSE75465_series_matrix.txt.gz'
In [18]:
list.files(paste0("geo_downloads/",my.gse))
  1. 'CEL'
  2. 'filelist.txt'
  3. 'GSE15947_RAW.tar'
In [130]:
file.list <- read.delim(paste0("geo_downloads/",my.gse,"/filelist.txt"), as.is=T)
file.list
X.Archive.FileNameTimeSizeType
Archive GSE15947_RAW.tar 01/17/2013 14:43:05121016320 TAR
File GSM400051.CEL.gz 05/04/2009 15:37:24 5146337 CEL
File GSM400052.CEL.gz 05/04/2009 15:37:27 5155871 CEL
File GSM400053.CEL.gz 05/04/2009 15:37:30 5127160 CEL
File GSM400054.CEL.gz 05/04/2009 15:37:33 5144438 CEL
File GSM400055.CEL.gz 05/04/2009 15:37:36 5039072 CEL
File GSM400056.CEL.gz 05/04/2009 15:37:39 5018677 CEL
File GSM400057.CEL.gz 05/04/2009 15:37:42 5159244 CEL
File GSM400058.CEL.gz 05/04/2009 15:37:45 5120104 CEL
File GSM400059.CEL.gz 05/04/2009 15:37:48 5039729 CEL
File GSM400060.CEL.gz 05/04/2009 15:37:51 5097760 CEL
File GSM400061.CEL.gz 05/04/2009 15:37:54 5037262 CEL
File GSM400062.CEL.gz 05/04/2009 15:37:57 5078931 CEL
File GSM400063.CEL.gz 05/04/2009 15:37:59 5075765 CEL
File GSM400064.CEL.gz 05/04/2009 15:38:03 4930236 CEL
File GSM400065.CEL.gz 05/04/2009 15:38:06 4999966 CEL
File GSM400066.CEL.gz 05/04/2009 15:38:09 4916778 CEL
File GSM400067.CEL.gz 05/04/2009 15:38:12 4982129 CEL
File GSM400068.CEL.gz 05/04/2009 15:38:15 4949826 CEL
File GSM400069.CEL.gz 05/04/2009 15:38:17 5037998 CEL
File GSM400070.CEL.gz 05/04/2009 15:38:20 4959679 CEL
File GSM400071.CEL.gz 05/04/2009 15:38:23 4955561 CEL
File GSM400072.CEL.gz 05/04/2009 15:38:26 5058585 CEL
File GSM400073.CEL.gz 05/04/2009 15:38:29 4992002 CEL
File GSM400074.CEL.gz 05/04/2009 15:38:32 4972331 CEL
In [131]:
rm(file.list)
In [17]:
##function creates a directory and downloads the "tarball" that contains
##gzipped CEL files. We need to untar the tarball but we do not need to unzip
##the CEL files. R can read them in compressed format.
untar(paste0("geo_downloads/",my.gse,"/",my.gse,"_RAW.tar"), exdir=paste0("geo_downloads/",my.gse,"/CEL"))
list.files(paste0("geo_downloads/",my.gse,"/CEL"))
  1. 'GSE15947_SelectPhenoData.txt'
  2. 'GSM400051.CEL.gz'
  3. 'GSM400052.CEL.gz'
  4. 'GSM400053.CEL.gz'
  5. 'GSM400054.CEL.gz'
  6. 'GSM400055.CEL.gz'
  7. 'GSM400056.CEL.gz'
  8. 'GSM400057.CEL.gz'
  9. 'GSM400058.CEL.gz'
  10. 'GSM400059.CEL.gz'
  11. 'GSM400060.CEL.gz'
  12. 'GSM400061.CEL.gz'
  13. 'GSM400062.CEL.gz'
  14. 'GSM400063.CEL.gz'
  15. 'GSM400064.CEL.gz'
  16. 'GSM400065.CEL.gz'
  17. 'GSM400066.CEL.gz'
  18. 'GSM400067.CEL.gz'
  19. 'GSM400068.CEL.gz'
  20. 'GSM400069.CEL.gz'
  21. 'GSM400070.CEL.gz'
  22. 'GSM400071.CEL.gz'
  23. 'GSM400072.CEL.gz'
  24. 'GSM400073.CEL.gz'
  25. 'GSM400074.CEL.gz'
In [18]:
my.cels <- list.files(paste0("geo_downloads/",my.gse,"/CEL"), pattern=".CEL")
my.cels
  1. 'GSM400051.CEL.gz'
  2. 'GSM400052.CEL.gz'
  3. 'GSM400053.CEL.gz'
  4. 'GSM400054.CEL.gz'
  5. 'GSM400055.CEL.gz'
  6. 'GSM400056.CEL.gz'
  7. 'GSM400057.CEL.gz'
  8. 'GSM400058.CEL.gz'
  9. 'GSM400059.CEL.gz'
  10. 'GSM400060.CEL.gz'
  11. 'GSM400061.CEL.gz'
  12. 'GSM400062.CEL.gz'
  13. 'GSM400063.CEL.gz'
  14. 'GSM400064.CEL.gz'
  15. 'GSM400065.CEL.gz'
  16. 'GSM400066.CEL.gz'
  17. 'GSM400067.CEL.gz'
  18. 'GSM400068.CEL.gz'
  19. 'GSM400069.CEL.gz'
  20. 'GSM400070.CEL.gz'
  21. 'GSM400071.CEL.gz'
  22. 'GSM400072.CEL.gz'
  23. 'GSM400073.CEL.gz'
  24. 'GSM400074.CEL.gz'
In [19]:
my.cels <- sort(my.cels)
my.cels
  1. 'GSM400051.CEL.gz'
  2. 'GSM400052.CEL.gz'
  3. 'GSM400053.CEL.gz'
  4. 'GSM400054.CEL.gz'
  5. 'GSM400055.CEL.gz'
  6. 'GSM400056.CEL.gz'
  7. 'GSM400057.CEL.gz'
  8. 'GSM400058.CEL.gz'
  9. 'GSM400059.CEL.gz'
  10. 'GSM400060.CEL.gz'
  11. 'GSM400061.CEL.gz'
  12. 'GSM400062.CEL.gz'
  13. 'GSM400063.CEL.gz'
  14. 'GSM400064.CEL.gz'
  15. 'GSM400065.CEL.gz'
  16. 'GSM400066.CEL.gz'
  17. 'GSM400067.CEL.gz'
  18. 'GSM400068.CEL.gz'
  19. 'GSM400069.CEL.gz'
  20. 'GSM400070.CEL.gz'
  21. 'GSM400071.CEL.gz'
  22. 'GSM400072.CEL.gz'
  23. 'GSM400073.CEL.gz'
  24. 'GSM400074.CEL.gz'

Now that we have the CEL files and associated metadata, we can reprocess these microarrays and analyze the results with our preferred methods.

4 Processing Microarray Data

Arguably the first assay platform for biological "big data" was the microarray. Microarrays are mature technology and still in use. Bioconductor provides a host of packages to work with microarray data. We are going to use the package "affy" to process the CEL files and obtain normalized expression measures for DGE analysis.

The first step to process CEL files with "affy" is to tell the program the directory that it can find the CEL files and the corresponding file with phenoData (sample data). Be careful at this step because you don't want to assign the wrong experimental data to a sample!

The following code will make a "data frame" (equivalent to a table) that links the phenodata (sample) to the correct CEL file. The affy package requires that the data frame and the CEL files have the same the rownames. Generally, you need to make some adjustments at this point. This can be done manually (e.g. with Excel), but better to do this programmatically with R so that the process is documented.

4.1 Preparing the Phenodata

In [20]:
##make data frame of phenoData
my.pdata <- as.data.frame(pData(my.geo.gse), stringsAsFactors=F)
head(my.pdata)
titlegeo_accessionstatussubmission_datelast_update_datetypechannel_countsource_name_ch1organism_ch1characteristics_ch1...contact_emailcontact_departmentcontact_institutecontact_addresscontact_citycontact_statecontact_zip/postal_codecontact_countrysupplementary_filedata_row_count
GSM400051RWPE1 cells, vehicle, 6h, biological rep1 GSM400051 Public on Dec 18 2009 May 04 2009 Dec 18 2009 RNA 1 RWPE1 cells treated for 6 hours with vehicle (0.1% ethanol) Homo sapiens cell line: RWPE-1 ... pkovalen@purdue.edu Foods and Nutrition Purdue University 700 W State st West Lafayette IN 47906 USA ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/samples/GSM400nnn/GSM400051/GSM400051.CEL.gz54675
GSM400052RWPE1 cells, vehicle, 6h, biological rep2 GSM400052 Public on Dec 18 2009 May 04 2009 Dec 18 2009 RNA 1 RWPE1 cells treated for 6 hours with vehicle (0.1% ethanol) Homo sapiens cell line: RWPE-1 ... pkovalen@purdue.edu Foods and Nutrition Purdue University 700 W State st West Lafayette IN 47906 USA ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/samples/GSM400nnn/GSM400052/GSM400052.CEL.gz54675
GSM400053RWPE1 cells, vehicle, 6h, biological rep3 GSM400053 Public on Dec 18 2009 May 04 2009 Dec 18 2009 RNA 1 RWPE1 cells treated for 6 hours with vehicle (0.1% ethanol) Homo sapiens cell line: RWPE-1 ... pkovalen@purdue.edu Foods and Nutrition Purdue University 700 W State st West Lafayette IN 47906 USA ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/samples/GSM400nnn/GSM400053/GSM400053.CEL.gz54675
GSM400054RWPE1 cells, vehicle, 6h, biological rep4 GSM400054 Public on Dec 18 2009 May 04 2009 Dec 18 2009 RNA 1 RWPE1 cells treated for 6 hours with vehicle (0.1% ethanol) Homo sapiens cell line: RWPE-1 ... pkovalen@purdue.edu Foods and Nutrition Purdue University 700 W State st West Lafayette IN 47906 USA ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/samples/GSM400nnn/GSM400054/GSM400054.CEL.gz54675
GSM400055RWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep1 GSM400055 Public on Dec 18 2009 May 04 2009 Dec 18 2009 RNA 1 RWPE1 cells treated for 6 hours with 100 nM 1,25(OH)2D Homo sapiens cell line: RWPE-1 ... pkovalen@purdue.edu Foods and Nutrition Purdue University 700 W State st West Lafayette IN 47906 USA ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/samples/GSM400nnn/GSM400055/GSM400055.CEL.gz54675
GSM400056RWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep2 GSM400056 Public on Dec 18 2009 May 04 2009 Dec 18 2009 RNA 1 RWPE1 cells treated for 6 hours with 100 nM 1,25(OH)2D Homo sapiens cell line: RWPE-1 ... pkovalen@purdue.edu Foods and Nutrition Purdue University 700 W State st West Lafayette IN 47906 USA ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/samples/GSM400nnn/GSM400056/GSM400056.CEL.gz54675
In [21]:
##unnecessary columns included
dim(my.pdata)
  1. 24
  2. 36
In [22]:
colnames(my.pdata)
  1. 'title'
  2. 'geo_accession'
  3. 'status'
  4. 'submission_date'
  5. 'last_update_date'
  6. 'type'
  7. 'channel_count'
  8. 'source_name_ch1'
  9. 'organism_ch1'
  10. 'characteristics_ch1'
  11. 'characteristics_ch1.1'
  12. 'characteristics_ch1.2'
  13. 'characteristics_ch1.3'
  14. 'treatment_protocol_ch1'
  15. 'growth_protocol_ch1'
  16. 'molecule_ch1'
  17. 'extract_protocol_ch1'
  18. 'label_ch1'
  19. 'label_protocol_ch1'
  20. 'taxid_ch1'
  21. 'hyb_protocol'
  22. 'scan_protocol'
  23. 'description'
  24. 'data_processing'
  25. 'platform_id'
  26. 'contact_name'
  27. 'contact_email'
  28. 'contact_department'
  29. 'contact_institute'
  30. 'contact_address'
  31. 'contact_city'
  32. 'contact_state'
  33. 'contact_zip/postal_code'
  34. 'contact_country'
  35. 'supplementary_file'
  36. 'data_row_count'
In [23]:
head(my.pdata[, c("title", "geo_accession", "description")], 10)
titlegeo_accessiondescription
GSM400051RWPE1 cells, vehicle, 6h, biological rep1 GSM400051 Gene expression of vehicle treated RWPE1 at 6h.
GSM400052RWPE1 cells, vehicle, 6h, biological rep2 GSM400052 Gene expression of vehicle treated RWPE1 at 6h.
GSM400053RWPE1 cells, vehicle, 6h, biological rep3 GSM400053 Gene expression of vehicle treated RWPE1 at 6h.
GSM400054RWPE1 cells, vehicle, 6h, biological rep4 GSM400054 Gene expression of vehicle treated RWPE1 at 6h.
GSM400055RWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep1GSM400055 Gene expression of 1,25(OH)2D treated RWPE1 at 6h.
GSM400056RWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep2GSM400056 Gene expression of 1,25(OH)2D treated RWPE1 at 6h.
GSM400057RWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep3GSM400057 Gene expression of 1,25(OH)2D treated RWPE1 at 6h.
GSM400058RWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep4GSM400058 Gene expression of 1,25(OH)2D treated RWPE1 at 6h.
GSM400059RWPE1 cells, vehicle, 24h, biological rep1 GSM400059 Gene expression of vehicle treated RWPE1 at 24h.
GSM400060RWPE1 cells, vehicle, 24h, biological rep2 GSM400060 Gene expression of vehicle treated RWPE1 at 24h.
In [24]:
my.pdata <- my.pdata[, c("title", "geo_accession", "description")]
my.pdata <- my.pdata[order(rownames(my.pdata)), ]
head(my.pdata, 10)
titlegeo_accessiondescription
GSM400051RWPE1 cells, vehicle, 6h, biological rep1 GSM400051 Gene expression of vehicle treated RWPE1 at 6h.
GSM400052RWPE1 cells, vehicle, 6h, biological rep2 GSM400052 Gene expression of vehicle treated RWPE1 at 6h.
GSM400053RWPE1 cells, vehicle, 6h, biological rep3 GSM400053 Gene expression of vehicle treated RWPE1 at 6h.
GSM400054RWPE1 cells, vehicle, 6h, biological rep4 GSM400054 Gene expression of vehicle treated RWPE1 at 6h.
GSM400055RWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep1GSM400055 Gene expression of 1,25(OH)2D treated RWPE1 at 6h.
GSM400056RWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep2GSM400056 Gene expression of 1,25(OH)2D treated RWPE1 at 6h.
GSM400057RWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep3GSM400057 Gene expression of 1,25(OH)2D treated RWPE1 at 6h.
GSM400058RWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep4GSM400058 Gene expression of 1,25(OH)2D treated RWPE1 at 6h.
GSM400059RWPE1 cells, vehicle, 24h, biological rep1 GSM400059 Gene expression of vehicle treated RWPE1 at 24h.
GSM400060RWPE1 cells, vehicle, 24h, biological rep2 GSM400060 Gene expression of vehicle treated RWPE1 at 24h.
In [25]:
##the rownames of the data frame must be the same as the CEL file names.
##For this data, we need to do a bit of tinkering.
rownames(my.pdata) == my.cels
  1. FALSE
  2. FALSE
  3. FALSE
  4. FALSE
  5. FALSE
  6. FALSE
  7. FALSE
  8. FALSE
  9. FALSE
  10. FALSE
  11. FALSE
  12. FALSE
  13. FALSE
  14. FALSE
  15. FALSE
  16. FALSE
  17. FALSE
  18. FALSE
  19. FALSE
  20. FALSE
  21. FALSE
  22. FALSE
  23. FALSE
  24. FALSE
In [26]:
table(rownames(my.pdata) == my.cels)
FALSE 
   24 
In [27]:
head(my.cels)
  1. 'GSM400051.CEL.gz'
  2. 'GSM400052.CEL.gz'
  3. 'GSM400053.CEL.gz'
  4. 'GSM400054.CEL.gz'
  5. 'GSM400055.CEL.gz'
  6. 'GSM400056.CEL.gz'
In [28]:
head(rownames(my.pdata))
  1. 'GSM400051'
  2. 'GSM400052'
  3. 'GSM400053'
  4. 'GSM400054'
  5. 'GSM400055'
  6. 'GSM400056'
In [29]:
temp.rownames <- paste(rownames(my.pdata), ".CEL.gz", sep="")
table(temp.rownames == my.cels)
TRUE 
  24 
In [30]:
rownames(my.pdata) <- temp.rownames
rm(temp.rownames)
table(rownames(my.pdata) == my.cels)
TRUE 
  24 
In [31]:
my.pdata
titlegeo_accessiondescription
GSM400051.CEL.gzRWPE1 cells, vehicle, 6h, biological rep1 GSM400051 Gene expression of vehicle treated RWPE1 at 6h.
GSM400052.CEL.gzRWPE1 cells, vehicle, 6h, biological rep2 GSM400052 Gene expression of vehicle treated RWPE1 at 6h.
GSM400053.CEL.gzRWPE1 cells, vehicle, 6h, biological rep3 GSM400053 Gene expression of vehicle treated RWPE1 at 6h.
GSM400054.CEL.gzRWPE1 cells, vehicle, 6h, biological rep4 GSM400054 Gene expression of vehicle treated RWPE1 at 6h.
GSM400055.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep1 GSM400055 Gene expression of 1,25(OH)2D treated RWPE1 at 6h.
GSM400056.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep2 GSM400056 Gene expression of 1,25(OH)2D treated RWPE1 at 6h.
GSM400057.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep3 GSM400057 Gene expression of 1,25(OH)2D treated RWPE1 at 6h.
GSM400058.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep4 GSM400058 Gene expression of 1,25(OH)2D treated RWPE1 at 6h.
GSM400059.CEL.gzRWPE1 cells, vehicle, 24h, biological rep1 GSM400059 Gene expression of vehicle treated RWPE1 at 24h.
GSM400060.CEL.gzRWPE1 cells, vehicle, 24h, biological rep2 GSM400060 Gene expression of vehicle treated RWPE1 at 24h.
GSM400061.CEL.gzRWPE1 cells, vehicle, 24h, biological rep3 GSM400061 Gene expression of vehicle treated RWPE1 at 24h.
GSM400062.CEL.gzRWPE1 cells, vehicle, 24h, biological rep4 GSM400062 Gene expression of vehicle treated RWPE1 at 24h.
GSM400063.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep1GSM400063 Gene expression of 1,25(OH)2D treated RWPE1 at 24h.
GSM400064.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep2GSM400064 Gene expression of 1,25(OH)2D treated RWPE1 at 24h.
GSM400065.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep3GSM400065 Gene expression of 1,25(OH)2D treated RWPE1 at 24h.
GSM400066.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep4GSM400066 Gene expression of 1,25(OH)2D treated RWPE1 at 24h.
GSM400067.CEL.gzRWPE1 cells, vehicle, 48h, biological rep1 GSM400067 Gene expression of vehicle treated RWPE1 at 48h.
GSM400068.CEL.gzRWPE1 cells, vehicle, 48h, biological rep2 GSM400068 Gene expression of vehicle treated RWPE1 at 48h.
GSM400069.CEL.gzRWPE1 cells, vehicle, 48h, biological rep3 GSM400069 Gene expression of vehicle treated RWPE1 at 48h.
GSM400070.CEL.gzRWPE1 cells, vehicle, 48h, biological rep4 GSM400070 Gene expression of vehicle treated RWPE1 at 48h.
GSM400071.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep1GSM400071 Gene expression of 1,25(OH)2D treated RWPE1 at 48h.
GSM400072.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep2GSM400072 Gene expression of 1,25(OH)2D treated RWPE1 at 48h.
GSM400073.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep3GSM400073 Gene expression of 1,25(OH)2D treated RWPE1 at 48h.
GSM400074.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep4GSM400074 Gene expression of 1,25(OH)2D treated RWPE1 at 48h.
In [33]:
write.table(my.pdata, file=paste0("geo_downloads/",my.gse,"/CEL/",my.gse,"_SelectPhenoData.txt"), sep="\t", quote=F)

4.2 Reading the CEL Files

Now that we have directory of CEL files and a corresponding data frame with the phenoData, we can read the CEL files into R with the ReadAffy function. This formats the data as an AffyBatch object that includes the experimental and phenodata.

To simplify downstream analysis, we are going to add some additional columns to the phenoData. One column will have the treatment (e.g. dose, age, time, etc) and another will have simplified labels (treatment and replicate) that we will use in data plots later in the exercise. We could just as easily have added this information to the data frame in the previous section, but we chose to do it this way so that you could learn how to work with the phenoData in the ExpressionSet object.

In [36]:
##Perform affy normalization
library(affy)
cel.path <- paste0("geo_downloads/",my.gse,"/CEL")
my.affy <- ReadAffy(celfile.path=cel.path, phenoData=paste(cel.path, paste0(my.gse,"_SelectPhenoData.txt"), sep="/"))
show(my.affy)
AffyBatch object
size of arrays=1164x1164 features (30 kb)
cdf=HG-U133_Plus_2 (54675 affyids)
number of samples=24
number of genes=54675
annotation=hgu133plus2
notes=
In [37]:
head(exprs(my.affy))
GSM400051.CEL.gzGSM400052.CEL.gzGSM400053.CEL.gzGSM400054.CEL.gzGSM400055.CEL.gzGSM400056.CEL.gzGSM400057.CEL.gzGSM400058.CEL.gzGSM400059.CEL.gzGSM400060.CEL.gz...GSM400065.CEL.gzGSM400066.CEL.gzGSM400067.CEL.gzGSM400068.CEL.gzGSM400069.CEL.gzGSM400070.CEL.gzGSM400071.CEL.gzGSM400072.CEL.gzGSM400073.CEL.gzGSM400074.CEL.gz
1 106 121 108 94 126 96 107 143 123 124... 86 92 114 89 158 121 112 98 125 114
211114125471113510883107001107213037106791370511078... 11543106691223814641108991206610672115341115811692
3 148 160 142 111 145 133 120 108 155 133... 152 134 135 134 157 130 146 127 136 147
411095131671146211041109621127712916107071400611380... 11661109541252514957111901224710718117911139611893
5 77 72 58 52 58 54 57 62 69 70... 68 48 53 55 57 77 40 54 79 59
6 95 110 102 93 90 99 88 111 117 105... 81 88 72 80 129 103 110 113 116 81
In [38]:
dim(exprs(my.affy))
  1. 1354896
  2. 24
In [39]:
colnames(pData(my.affy))
  1. 'title'
  2. 'geo_accession'
  3. 'description'
In [40]:
pData(my.affy)
titlegeo_accessiondescription
GSM400051.CEL.gzRWPE1 cells, vehicle, 6h, biological rep1 GSM400051 Gene expression of vehicle treated RWPE1 at 6h.
GSM400052.CEL.gzRWPE1 cells, vehicle, 6h, biological rep2 GSM400052 Gene expression of vehicle treated RWPE1 at 6h.
GSM400053.CEL.gzRWPE1 cells, vehicle, 6h, biological rep3 GSM400053 Gene expression of vehicle treated RWPE1 at 6h.
GSM400054.CEL.gzRWPE1 cells, vehicle, 6h, biological rep4 GSM400054 Gene expression of vehicle treated RWPE1 at 6h.
GSM400055.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep1 GSM400055 Gene expression of 1,25(OH)2D treated RWPE1 at 6h.
GSM400056.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep2 GSM400056 Gene expression of 1,25(OH)2D treated RWPE1 at 6h.
GSM400057.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep3 GSM400057 Gene expression of 1,25(OH)2D treated RWPE1 at 6h.
GSM400058.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep4 GSM400058 Gene expression of 1,25(OH)2D treated RWPE1 at 6h.
GSM400059.CEL.gzRWPE1 cells, vehicle, 24h, biological rep1 GSM400059 Gene expression of vehicle treated RWPE1 at 24h.
GSM400060.CEL.gzRWPE1 cells, vehicle, 24h, biological rep2 GSM400060 Gene expression of vehicle treated RWPE1 at 24h.
GSM400061.CEL.gzRWPE1 cells, vehicle, 24h, biological rep3 GSM400061 Gene expression of vehicle treated RWPE1 at 24h.
GSM400062.CEL.gzRWPE1 cells, vehicle, 24h, biological rep4 GSM400062 Gene expression of vehicle treated RWPE1 at 24h.
GSM400063.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep1GSM400063 Gene expression of 1,25(OH)2D treated RWPE1 at 24h.
GSM400064.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep2GSM400064 Gene expression of 1,25(OH)2D treated RWPE1 at 24h.
GSM400065.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep3GSM400065 Gene expression of 1,25(OH)2D treated RWPE1 at 24h.
GSM400066.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep4GSM400066 Gene expression of 1,25(OH)2D treated RWPE1 at 24h.
GSM400067.CEL.gzRWPE1 cells, vehicle, 48h, biological rep1 GSM400067 Gene expression of vehicle treated RWPE1 at 48h.
GSM400068.CEL.gzRWPE1 cells, vehicle, 48h, biological rep2 GSM400068 Gene expression of vehicle treated RWPE1 at 48h.
GSM400069.CEL.gzRWPE1 cells, vehicle, 48h, biological rep3 GSM400069 Gene expression of vehicle treated RWPE1 at 48h.
GSM400070.CEL.gzRWPE1 cells, vehicle, 48h, biological rep4 GSM400070 Gene expression of vehicle treated RWPE1 at 48h.
GSM400071.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep1GSM400071 Gene expression of 1,25(OH)2D treated RWPE1 at 48h.
GSM400072.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep2GSM400072 Gene expression of 1,25(OH)2D treated RWPE1 at 48h.
GSM400073.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep3GSM400073 Gene expression of 1,25(OH)2D treated RWPE1 at 48h.
GSM400074.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep4GSM400074 Gene expression of 1,25(OH)2D treated RWPE1 at 48h.
In [41]:
pData(my.affy)$title
  1. 'RWPE1 cells, vehicle, 6h, biological rep1'
  2. 'RWPE1 cells, vehicle, 6h, biological rep2'
  3. 'RWPE1 cells, vehicle, 6h, biological rep3'
  4. 'RWPE1 cells, vehicle, 6h, biological rep4'
  5. 'RWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep1'
  6. 'RWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep2'
  7. 'RWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep3'
  8. 'RWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep4'
  9. 'RWPE1 cells, vehicle, 24h, biological rep1'
  10. 'RWPE1 cells, vehicle, 24h, biological rep2'
  11. 'RWPE1 cells, vehicle, 24h, biological rep3'
  12. 'RWPE1 cells, vehicle, 24h, biological rep4'
  13. 'RWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep1'
  14. 'RWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep2'
  15. 'RWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep3'
  16. 'RWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep4'
  17. 'RWPE1 cells, vehicle, 48h, biological rep1'
  18. 'RWPE1 cells, vehicle, 48h, biological rep2'
  19. 'RWPE1 cells, vehicle, 48h, biological rep3'
  20. 'RWPE1 cells, vehicle, 48h, biological rep4'
  21. 'RWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep1'
  22. 'RWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep2'
  23. 'RWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep3'
  24. 'RWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep4'
In [42]:
pData(my.affy)$description
  1. 'Gene expression of vehicle treated RWPE1 at 6h.'
  2. 'Gene expression of vehicle treated RWPE1 at 6h.'
  3. 'Gene expression of vehicle treated RWPE1 at 6h.'
  4. 'Gene expression of vehicle treated RWPE1 at 6h.'
  5. 'Gene expression of 1,25(OH)2D treated RWPE1 at 6h.'
  6. 'Gene expression of 1,25(OH)2D treated RWPE1 at 6h.'
  7. 'Gene expression of 1,25(OH)2D treated RWPE1 at 6h.'
  8. 'Gene expression of 1,25(OH)2D treated RWPE1 at 6h.'
  9. 'Gene expression of vehicle treated RWPE1 at 24h.'
  10. 'Gene expression of vehicle treated RWPE1 at 24h.'
  11. 'Gene expression of vehicle treated RWPE1 at 24h.'
  12. 'Gene expression of vehicle treated RWPE1 at 24h.'
  13. 'Gene expression of 1,25(OH)2D treated RWPE1 at 24h.'
  14. 'Gene expression of 1,25(OH)2D treated RWPE1 at 24h.'
  15. 'Gene expression of 1,25(OH)2D treated RWPE1 at 24h.'
  16. 'Gene expression of 1,25(OH)2D treated RWPE1 at 24h.'
  17. 'Gene expression of vehicle treated RWPE1 at 48h.'
  18. 'Gene expression of vehicle treated RWPE1 at 48h.'
  19. 'Gene expression of vehicle treated RWPE1 at 48h.'
  20. 'Gene expression of vehicle treated RWPE1 at 48h.'
  21. 'Gene expression of 1,25(OH)2D treated RWPE1 at 48h.'
  22. 'Gene expression of 1,25(OH)2D treated RWPE1 at 48h.'
  23. 'Gene expression of 1,25(OH)2D treated RWPE1 at 48h.'
  24. 'Gene expression of 1,25(OH)2D treated RWPE1 at 48h.'
In [43]:
table(pData(my.affy)$description)
Gene expression of 1,25(OH)2D treated RWPE1 at 24h. 
                                                  4 
Gene expression of 1,25(OH)2D treated RWPE1 at 48h. 
                                                  4 
 Gene expression of 1,25(OH)2D treated RWPE1 at 6h. 
                                                  4 
   Gene expression of vehicle treated RWPE1 at 24h. 
                                                  4 
   Gene expression of vehicle treated RWPE1 at 48h. 
                                                  4 
    Gene expression of vehicle treated RWPE1 at 6h. 
                                                  4 
In [44]:
##create shorter descriptive levels and labels
pData(my.affy)$sample.levels <- c(rep("C.06", 4), rep("T.06", 4), rep("C.24", 4), rep("T.24", 4), rep("C.48", 4), rep("T.48", 4))
pData(my.affy)$sample.labels <- c(paste("C.06", 1:4, sep="."), paste("T.06", 1:4, sep="."), paste("C.24", 1:4, sep="."), paste("T.24", 1:4, sep="."), paste("C.48", 1:4, sep="."), paste("T.48", 1:4, sep="."))
table(pData(my.affy)$sample.levels)
C.06 C.24 C.48 T.06 T.24 T.48 
   4    4    4    4    4    4 
In [45]:
table(pData(my.affy)$description, pData(my.affy)$sample.levels)
                                                     
                                                      C.06 C.24 C.48 T.06 T.24
  Gene expression of 1,25(OH)2D treated RWPE1 at 24h.    0    0    0    0    4
  Gene expression of 1,25(OH)2D treated RWPE1 at 48h.    0    0    0    0    0
  Gene expression of 1,25(OH)2D treated RWPE1 at 6h.     0    0    0    4    0
  Gene expression of vehicle treated RWPE1 at 24h.       0    4    0    0    0
  Gene expression of vehicle treated RWPE1 at 48h.       0    0    4    0    0
  Gene expression of vehicle treated RWPE1 at 6h.        4    0    0    0    0
                                                     
                                                      T.48
  Gene expression of 1,25(OH)2D treated RWPE1 at 24h.    0
  Gene expression of 1,25(OH)2D treated RWPE1 at 48h.    4
  Gene expression of 1,25(OH)2D treated RWPE1 at 6h.     0
  Gene expression of vehicle treated RWPE1 at 24h.       0
  Gene expression of vehicle treated RWPE1 at 48h.       0
  Gene expression of vehicle treated RWPE1 at 6h.        0
In [46]:
table(pData(my.affy)$description, pData(my.affy)$sample.labels)
                                                     
                                                      C.06.1 C.06.2 C.06.3
  Gene expression of 1,25(OH)2D treated RWPE1 at 24h.      0      0      0
  Gene expression of 1,25(OH)2D treated RWPE1 at 48h.      0      0      0
  Gene expression of 1,25(OH)2D treated RWPE1 at 6h.       0      0      0
  Gene expression of vehicle treated RWPE1 at 24h.         0      0      0
  Gene expression of vehicle treated RWPE1 at 48h.         0      0      0
  Gene expression of vehicle treated RWPE1 at 6h.          1      1      1
                                                     
                                                      C.06.4 C.24.1 C.24.2
  Gene expression of 1,25(OH)2D treated RWPE1 at 24h.      0      0      0
  Gene expression of 1,25(OH)2D treated RWPE1 at 48h.      0      0      0
  Gene expression of 1,25(OH)2D treated RWPE1 at 6h.       0      0      0
  Gene expression of vehicle treated RWPE1 at 24h.         0      1      1
  Gene expression of vehicle treated RWPE1 at 48h.         0      0      0
  Gene expression of vehicle treated RWPE1 at 6h.          1      0      0
                                                     
                                                      C.24.3 C.24.4 C.48.1
  Gene expression of 1,25(OH)2D treated RWPE1 at 24h.      0      0      0
  Gene expression of 1,25(OH)2D treated RWPE1 at 48h.      0      0      0
  Gene expression of 1,25(OH)2D treated RWPE1 at 6h.       0      0      0
  Gene expression of vehicle treated RWPE1 at 24h.         1      1      0
  Gene expression of vehicle treated RWPE1 at 48h.         0      0      1
  Gene expression of vehicle treated RWPE1 at 6h.          0      0      0
                                                     
                                                      C.48.2 C.48.3 C.48.4
  Gene expression of 1,25(OH)2D treated RWPE1 at 24h.      0      0      0
  Gene expression of 1,25(OH)2D treated RWPE1 at 48h.      0      0      0
  Gene expression of 1,25(OH)2D treated RWPE1 at 6h.       0      0      0
  Gene expression of vehicle treated RWPE1 at 24h.         0      0      0
  Gene expression of vehicle treated RWPE1 at 48h.         1      1      1
  Gene expression of vehicle treated RWPE1 at 6h.          0      0      0
                                                     
                                                      T.06.1 T.06.2 T.06.3
  Gene expression of 1,25(OH)2D treated RWPE1 at 24h.      0      0      0
  Gene expression of 1,25(OH)2D treated RWPE1 at 48h.      0      0      0
  Gene expression of 1,25(OH)2D treated RWPE1 at 6h.       1      1      1
  Gene expression of vehicle treated RWPE1 at 24h.         0      0      0
  Gene expression of vehicle treated RWPE1 at 48h.         0      0      0
  Gene expression of vehicle treated RWPE1 at 6h.          0      0      0
                                                     
                                                      T.06.4 T.24.1 T.24.2
  Gene expression of 1,25(OH)2D treated RWPE1 at 24h.      0      1      1
  Gene expression of 1,25(OH)2D treated RWPE1 at 48h.      0      0      0
  Gene expression of 1,25(OH)2D treated RWPE1 at 6h.       1      0      0
  Gene expression of vehicle treated RWPE1 at 24h.         0      0      0
  Gene expression of vehicle treated RWPE1 at 48h.         0      0      0
  Gene expression of vehicle treated RWPE1 at 6h.          0      0      0
                                                     
                                                      T.24.3 T.24.4 T.48.1
  Gene expression of 1,25(OH)2D treated RWPE1 at 24h.      1      1      0
  Gene expression of 1,25(OH)2D treated RWPE1 at 48h.      0      0      1
  Gene expression of 1,25(OH)2D treated RWPE1 at 6h.       0      0      0
  Gene expression of vehicle treated RWPE1 at 24h.         0      0      0
  Gene expression of vehicle treated RWPE1 at 48h.         0      0      0
  Gene expression of vehicle treated RWPE1 at 6h.          0      0      0
                                                     
                                                      T.48.2 T.48.3 T.48.4
  Gene expression of 1,25(OH)2D treated RWPE1 at 24h.      0      0      0
  Gene expression of 1,25(OH)2D treated RWPE1 at 48h.      1      1      1
  Gene expression of 1,25(OH)2D treated RWPE1 at 6h.       0      0      0
  Gene expression of vehicle treated RWPE1 at 24h.         0      0      0
  Gene expression of vehicle treated RWPE1 at 48h.         0      0      0
  Gene expression of vehicle treated RWPE1 at 6h.          0      0      0
In [47]:
colnames(pData(my.affy))
  1. 'title'
  2. 'geo_accession'
  3. 'description'
  4. 'sample.levels'
  5. 'sample.labels'
In [48]:
pData(my.affy)
titlegeo_accessiondescriptionsample.levelssample.labels
GSM400051.CEL.gzRWPE1 cells, vehicle, 6h, biological rep1 GSM400051 Gene expression of vehicle treated RWPE1 at 6h. C.06 C.06.1
GSM400052.CEL.gzRWPE1 cells, vehicle, 6h, biological rep2 GSM400052 Gene expression of vehicle treated RWPE1 at 6h. C.06 C.06.2
GSM400053.CEL.gzRWPE1 cells, vehicle, 6h, biological rep3 GSM400053 Gene expression of vehicle treated RWPE1 at 6h. C.06 C.06.3
GSM400054.CEL.gzRWPE1 cells, vehicle, 6h, biological rep4 GSM400054 Gene expression of vehicle treated RWPE1 at 6h. C.06 C.06.4
GSM400055.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep1 GSM400055 Gene expression of 1,25(OH)2D treated RWPE1 at 6h. T.06 T.06.1
GSM400056.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep2 GSM400056 Gene expression of 1,25(OH)2D treated RWPE1 at 6h. T.06 T.06.2
GSM400057.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep3 GSM400057 Gene expression of 1,25(OH)2D treated RWPE1 at 6h. T.06 T.06.3
GSM400058.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep4 GSM400058 Gene expression of 1,25(OH)2D treated RWPE1 at 6h. T.06 T.06.4
GSM400059.CEL.gzRWPE1 cells, vehicle, 24h, biological rep1 GSM400059 Gene expression of vehicle treated RWPE1 at 24h. C.24 C.24.1
GSM400060.CEL.gzRWPE1 cells, vehicle, 24h, biological rep2 GSM400060 Gene expression of vehicle treated RWPE1 at 24h. C.24 C.24.2
GSM400061.CEL.gzRWPE1 cells, vehicle, 24h, biological rep3 GSM400061 Gene expression of vehicle treated RWPE1 at 24h. C.24 C.24.3
GSM400062.CEL.gzRWPE1 cells, vehicle, 24h, biological rep4 GSM400062 Gene expression of vehicle treated RWPE1 at 24h. C.24 C.24.4
GSM400063.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep1GSM400063 Gene expression of 1,25(OH)2D treated RWPE1 at 24h. T.24 T.24.1
GSM400064.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep2GSM400064 Gene expression of 1,25(OH)2D treated RWPE1 at 24h. T.24 T.24.2
GSM400065.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep3GSM400065 Gene expression of 1,25(OH)2D treated RWPE1 at 24h. T.24 T.24.3
GSM400066.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep4GSM400066 Gene expression of 1,25(OH)2D treated RWPE1 at 24h. T.24 T.24.4
GSM400067.CEL.gzRWPE1 cells, vehicle, 48h, biological rep1 GSM400067 Gene expression of vehicle treated RWPE1 at 48h. C.48 C.48.1
GSM400068.CEL.gzRWPE1 cells, vehicle, 48h, biological rep2 GSM400068 Gene expression of vehicle treated RWPE1 at 48h. C.48 C.48.2
GSM400069.CEL.gzRWPE1 cells, vehicle, 48h, biological rep3 GSM400069 Gene expression of vehicle treated RWPE1 at 48h. C.48 C.48.3
GSM400070.CEL.gzRWPE1 cells, vehicle, 48h, biological rep4 GSM400070 Gene expression of vehicle treated RWPE1 at 48h. C.48 C.48.4
GSM400071.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep1GSM400071 Gene expression of 1,25(OH)2D treated RWPE1 at 48h. T.48 T.48.1
GSM400072.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep2GSM400072 Gene expression of 1,25(OH)2D treated RWPE1 at 48h. T.48 T.48.2
GSM400073.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep3GSM400073 Gene expression of 1,25(OH)2D treated RWPE1 at 48h. T.48 T.48.3
GSM400074.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep4GSM400074 Gene expression of 1,25(OH)2D treated RWPE1 at 48h. T.48 T.48.4
In [49]:
##quick visual comparison of pData
cbind(as.character(pData(my.affy)$title), pData(my.affy)$sample.levels, pData(my.affy)$sample.labels)
RWPE1 cells, vehicle, 6h, biological rep1 C.06 C.06.1
RWPE1 cells, vehicle, 6h, biological rep2 C.06 C.06.2
RWPE1 cells, vehicle, 6h, biological rep3 C.06 C.06.3
RWPE1 cells, vehicle, 6h, biological rep4 C.06 C.06.4
RWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep1 T.06 T.06.1
RWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep2 T.06 T.06.2
RWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep3 T.06 T.06.3
RWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep4 T.06 T.06.4
RWPE1 cells, vehicle, 24h, biological rep1 C.24 C.24.1
RWPE1 cells, vehicle, 24h, biological rep2 C.24 C.24.2
RWPE1 cells, vehicle, 24h, biological rep3 C.24 C.24.3
RWPE1 cells, vehicle, 24h, biological rep4 C.24 C.24.4
RWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep1T.24 T.24.1
RWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep2T.24 T.24.2
RWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep3T.24 T.24.3
RWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep4T.24 T.24.4
RWPE1 cells, vehicle, 48h, biological rep1 C.48 C.48.1
RWPE1 cells, vehicle, 48h, biological rep2 C.48 C.48.2
RWPE1 cells, vehicle, 48h, biological rep3 C.48 C.48.3
RWPE1 cells, vehicle, 48h, biological rep4 C.48 C.48.4
RWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep1T.48 T.48.1
RWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep2T.48 T.48.2
RWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep3T.48 T.48.3
RWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep4T.48 T.48.4

4.3 Calculating Gene Expression Measurements

Currently, the expression data consists of 1354896 individual probe intensities for each of our 24 samples. We need to combine the individual probe intensities to probeset-level (gene-level) measurements. Each probeset typically consists of 11 - 20 individual probes. We will use the rma() function to combine the individual probe intensities to a probeset intensity. A CDF file is required for this process. The ReadAffy function checks for the presence of the CDF that corresponds to any CEL files that are read, and will download the file from Bioconductor, if necessary.

Typically, microarrays are normalized and background-corrected at this point. For our first pass with rma(), we are going to skip the normalization and background correction to illustrate the need for the process. Note that this expression measure is given to you in log base 2 scale.

In [50]:
##Calculate gene level expression measures
my.rma <- rma(my.affy, normalize=F, background=F)
head(exprs(my.rma))
Calculating Expression
GSM400051.CEL.gzGSM400052.CEL.gzGSM400053.CEL.gzGSM400054.CEL.gzGSM400055.CEL.gzGSM400056.CEL.gzGSM400057.CEL.gzGSM400058.CEL.gzGSM400059.CEL.gzGSM400060.CEL.gz...GSM400065.CEL.gzGSM400066.CEL.gzGSM400067.CEL.gzGSM400068.CEL.gzGSM400069.CEL.gzGSM400070.CEL.gzGSM400071.CEL.gzGSM400072.CEL.gzGSM400073.CEL.gzGSM400074.CEL.gz
1007_s_at11.22387111.25504511.26208711.35309011.05291010.94985311.28504611.19919211.55903011.585237... 11.21483311.18310811.46983511.33878411.57381011.40655111.31656311.59690511.37440111.336891
1053_at10.13493610.111872 9.908616 9.881009 9.874648 9.904876 9.971198 9.83374110.046258 9.900686... 9.973361 9.91331010.07557610.10696810.18088010.06734010.00789210.025962 9.99367110.082639
117_at 7.236489 7.157655 7.099577 7.153844 7.180233 7.122977 7.312232 7.274220 7.062687 7.048363... 7.031753 7.018723 6.879042 6.802174 6.905630 6.818091 6.831463 6.991043 6.916990 6.810282
121_at 9.053278 9.014102 8.866924 8.877431 8.842456 8.884724 8.854481 8.846868 8.773581 8.827094... 8.645635 8.719617 8.641840 8.550230 8.822267 8.555913 8.641865 8.789322 8.649763 8.613888
1255_g_at 6.096554 5.989879 6.017855 5.958883 5.926033 5.964681 6.006705 5.955018 5.978351 6.004034... 5.916336 5.773216 5.832242 5.797502 5.764048 5.823830 5.885693 5.942960 5.935771 5.808373
1294_at 7.534898 7.610652 7.348155 7.493571 7.355708 7.339656 7.340311 7.379971 7.568454 7.464183... 7.191482 7.317896 7.346674 7.361718 7.551624 7.401136 7.152861 7.241561 7.284504 7.206867
In [51]:
dim(exprs(my.rma))
  1. 54675
  2. 24
In [52]:
dim(exprs(my.affy))
  1. 1354896
  2. 24
In [53]:
pData(my.rma)
titlegeo_accessiondescriptionsample.levelssample.labels
GSM400051.CEL.gzRWPE1 cells, vehicle, 6h, biological rep1 GSM400051 Gene expression of vehicle treated RWPE1 at 6h. C.06 C.06.1
GSM400052.CEL.gzRWPE1 cells, vehicle, 6h, biological rep2 GSM400052 Gene expression of vehicle treated RWPE1 at 6h. C.06 C.06.2
GSM400053.CEL.gzRWPE1 cells, vehicle, 6h, biological rep3 GSM400053 Gene expression of vehicle treated RWPE1 at 6h. C.06 C.06.3
GSM400054.CEL.gzRWPE1 cells, vehicle, 6h, biological rep4 GSM400054 Gene expression of vehicle treated RWPE1 at 6h. C.06 C.06.4
GSM400055.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep1 GSM400055 Gene expression of 1,25(OH)2D treated RWPE1 at 6h. T.06 T.06.1
GSM400056.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep2 GSM400056 Gene expression of 1,25(OH)2D treated RWPE1 at 6h. T.06 T.06.2
GSM400057.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep3 GSM400057 Gene expression of 1,25(OH)2D treated RWPE1 at 6h. T.06 T.06.3
GSM400058.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 6h, biological rep4 GSM400058 Gene expression of 1,25(OH)2D treated RWPE1 at 6h. T.06 T.06.4
GSM400059.CEL.gzRWPE1 cells, vehicle, 24h, biological rep1 GSM400059 Gene expression of vehicle treated RWPE1 at 24h. C.24 C.24.1
GSM400060.CEL.gzRWPE1 cells, vehicle, 24h, biological rep2 GSM400060 Gene expression of vehicle treated RWPE1 at 24h. C.24 C.24.2
GSM400061.CEL.gzRWPE1 cells, vehicle, 24h, biological rep3 GSM400061 Gene expression of vehicle treated RWPE1 at 24h. C.24 C.24.3
GSM400062.CEL.gzRWPE1 cells, vehicle, 24h, biological rep4 GSM400062 Gene expression of vehicle treated RWPE1 at 24h. C.24 C.24.4
GSM400063.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep1GSM400063 Gene expression of 1,25(OH)2D treated RWPE1 at 24h. T.24 T.24.1
GSM400064.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep2GSM400064 Gene expression of 1,25(OH)2D treated RWPE1 at 24h. T.24 T.24.2
GSM400065.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep3GSM400065 Gene expression of 1,25(OH)2D treated RWPE1 at 24h. T.24 T.24.3
GSM400066.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 24h, biological rep4GSM400066 Gene expression of 1,25(OH)2D treated RWPE1 at 24h. T.24 T.24.4
GSM400067.CEL.gzRWPE1 cells, vehicle, 48h, biological rep1 GSM400067 Gene expression of vehicle treated RWPE1 at 48h. C.48 C.48.1
GSM400068.CEL.gzRWPE1 cells, vehicle, 48h, biological rep2 GSM400068 Gene expression of vehicle treated RWPE1 at 48h. C.48 C.48.2
GSM400069.CEL.gzRWPE1 cells, vehicle, 48h, biological rep3 GSM400069 Gene expression of vehicle treated RWPE1 at 48h. C.48 C.48.3
GSM400070.CEL.gzRWPE1 cells, vehicle, 48h, biological rep4 GSM400070 Gene expression of vehicle treated RWPE1 at 48h. C.48 C.48.4
GSM400071.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep1GSM400071 Gene expression of 1,25(OH)2D treated RWPE1 at 48h. T.48 T.48.1
GSM400072.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep2GSM400072 Gene expression of 1,25(OH)2D treated RWPE1 at 48h. T.48 T.48.2
GSM400073.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep3GSM400073 Gene expression of 1,25(OH)2D treated RWPE1 at 48h. T.48 T.48.3
GSM400074.CEL.gzRWPE1 cells, 100 nM of 1,25(OH)2D, 48h, biological rep4GSM400074 Gene expression of 1,25(OH)2D treated RWPE1 at 48h. T.48 T.48.4
In [54]:
##make sample.levels a factor making C.06 the reference.
pData(my.rma)$sample.levels <- as.factor(pData(my.rma)$sample.levels)
pData(my.rma)$sample.levels <- relevel(pData(my.rma)$sample.levels, ref="C.06")
levels(pData(my.rma)$sample.levels)
  1. 'C.06'
  2. 'C.24'
  3. 'C.48'
  4. 'T.06'
  5. 'T.24'
  6. 'T.48'
In [55]:
plotDensity(exprs(my.rma))
In [56]:
##Load limma to get plotDensities function. Load RColorBrewer to provide good
##color palettes.
library(limma)
library(RColorBrewer)
##show palettes available with RColorBrewer
display.brewer.all()
Attaching package: 'limma'

The following object is masked from 'package:BiocGenerics':

    plotMA

In [57]:
level.pal <- brewer.pal(6, "Dark2")
level.pal
  1. '#1B9E77'
  2. '#D95F02'
  3. '#7570B3'
  4. '#E7298A'
  5. '#66A61E'
  6. '#E6AB02'
In [58]:
##sample.levels is a factor. unname will change this to integer data
##to easily assign colors to the various levels.
level.cols <- level.pal[unname(pData(my.rma)$sample.levels)]
level.cols
  1. '#1B9E77'
  2. '#1B9E77'
  3. '#1B9E77'
  4. '#1B9E77'
  5. '#E7298A'
  6. '#E7298A'
  7. '#E7298A'
  8. '#E7298A'
  9. '#D95F02'
  10. '#D95F02'
  11. '#D95F02'
  12. '#D95F02'
  13. '#66A61E'
  14. '#66A61E'
  15. '#66A61E'
  16. '#66A61E'
  17. '#7570B3'
  18. '#7570B3'
  19. '#7570B3'
  20. '#7570B3'
  21. '#E6AB02'
  22. '#E6AB02'
  23. '#E6AB02'
  24. '#E6AB02'
In [59]:
##Density plot of rma values show need for normalization
plotDensities(exprs(my.rma), legend=F, col=level.cols, main="Arrays Not Normalized")
legend("topright", legend=levels(pData(my.rma)$sample.levels), fill=level.pal)
In [60]:
pdf(file="results/DensityNoNorm.pdf", w=6, h=6)
plotDensities(exprs(my.rma), legend=F, col=level.cols, main="Arrays Not Normalized")
legend("topright", legend=levels(pData(my.rma)$sample.levels), fill=level.pal)
dev.off()
png: 2
In [61]:
##Boxplot is another way of showing the same thing.
boxplot(exprs(my.rma), las=2, names=pData(my.rma)$sample.labels, outline=F, col=level.cols, main="Arrays Not Normalized")
In [62]:
pdf(file="results/DensityNoNorm.pdf", w=6, h=6)
boxplot(exprs(my.rma), las=2, names=pData(my.rma)$sample.labels, outline=F, col=level.cols, main="Arrays Not Normalized")
dev.off()
png: 2
In [63]:
##Now use normalization and background correction
my.rma <- rma(my.affy, normalize=T, background=T)
plotDensities(exprs(my.rma), legend=F, col=level.cols, main="Arrays Normalized")
Background correcting
Normalizing
Calculating Expression
In [64]:
pdf(file="results/DensityNorm.pdf", w=6, h=6)
plotDensities(exprs(my.rma), legend=F, col=level.cols, main="Arrays Normalized")
dev.off()
png: 2
In [65]:
boxplot(exprs(my.rma), las=2, names=pData(my.rma)$sample.labels, outline=F, col=level.cols, main="Arrays Normalized")
In [66]:
pdf(file="results/DensityNorm.pdf", w=6, h=6)
boxplot(exprs(my.rma), las=2, names=pData(my.rma)$sample.labels, outline=F, col=level.cols, main="Arrays Normalized")
dev.off()
png: 2
In [68]:
##save rma values
write.table(exprs(my.rma), file=paste0("results/",my.gse,"_RMA_Norm.txt"), sep="\t", quote=FALSE)

4.4 Filtering the Data

A HUGE problem in "omics" is the statistical problem of type I statistical errors (false positives) that occur when a researcher conducts many independent statistical tests. There are special statistical methods that have been developed to control for type I and type II (false negative) errors and we'll pause here to consider this issue.

One simple way to reduce the possibility for type I errors is to do fewer comparisons. This can be done by filtering the data. For example, we know that not all genes are expressed in all tissues and many genes will not be expressed in any sample. As a result, DGE analysis, it makes sense to remove probesets for genes that likely not expressed at all. There are several ways to do this. Here, we are going to use a function called "mas5calls" that takes the probe level information in the AffyBatch object and makes qualitative calls of gene expression: P = present, M = marginal, and A = absent. We can then filter our ExpressionSet object with these calls.

Thought experiment.....give the experimental design of our sample dataset, what type of P/A filter makes sense?

In [69]:
my.calls <- mas5calls(my.affy)
head(exprs(my.calls))
Getting probe level data...
Computing p-values
Making P/M/A Calls
GSM400051.CEL.gzGSM400052.CEL.gzGSM400053.CEL.gzGSM400054.CEL.gzGSM400055.CEL.gzGSM400056.CEL.gzGSM400057.CEL.gzGSM400058.CEL.gzGSM400059.CEL.gzGSM400060.CEL.gz...GSM400065.CEL.gzGSM400066.CEL.gzGSM400067.CEL.gzGSM400068.CEL.gzGSM400069.CEL.gzGSM400070.CEL.gzGSM400071.CEL.gzGSM400072.CEL.gzGSM400073.CEL.gzGSM400074.CEL.gz
1007_s_atP P P P P P P P P P ...P P P P P P P P P P
1053_atP P P P P P P P P P ...P P P P P P P P P P
117_atA A A A P A P P A A ...A M P A M A M A A A
121_atP P P P P P P P P P ...P P P P P P P P P P
1255_g_atA A A A A A A A A A ...A A A A A A A A A A
1294_atP P M P A P A P P M ...P P P P P P A A P P
In [70]:
table(exprs(my.calls)[, 1])
    A     M     P 
31208  1582 21885 
In [71]:
##the next expression will work down the rows of the data matrix and 
##return the number of samples in which the gene has been called as present.
present <- apply(exprs(my.calls), 1, function(x)(sum(x == "P")))
head(present)
1007_s_at
24
1053_at
24
117_at
4
121_at
24
1255_g_at
0
1294_at
16
In [72]:
table(present)
present
    0     1     2     3     4     5     6     7     8     9    10    11    12 
21638  3247  1767  1168   897   712   620   549   509   393   435   363   319 
   13    14    15    16    17    18    19    20    21    22    23    24 
  340   338   347   337   338   391   419   424   537   800  1602 16185 
In [73]:
prop.table(table(present)) * 100
present
         0          1          2          3          4          5          6 
39.5756744  5.9387289  3.2318244  2.1362597  1.6406036  1.3022405  1.1339735 
         7          8          9         10         11         12         13 
 1.0041152  0.9309556  0.7187929  0.7956104  0.6639232  0.5834476  0.6218564 
        14         15         16         17         18         19         20 
 0.6181984  0.6346594  0.6163695  0.6181984  0.7151349  0.7663466  0.7754915 
        21         22         23         24 
 0.9821674  1.4631916  2.9300412 29.6021948 

The number of biological replicates for our example study is four. If a gene is expressed in only one treatment then a reasonable minimum number of present calls to accept is four. If we wanted to be very stringent, we could require that the four samples all occur within one treatment group, but we are going to allow this minimum number of present calls to occur in any combination of sample. We will also examine the intensities for these genes above and below this cutoff to confirm that we are simply removing genes with low expression measurements. What other types of cutoffs make sense?

In [74]:
table(present >= 4)
FALSE  TRUE 
27820 26855 
In [75]:
##We are going to discard more than half of the genes on the array.
plotDensities(exprs(my.rma)[present >= 4, ], col=level.cols, legend=F, main="Present >= 4")
In [76]:
plotDensities(exprs(my.rma)[present < 4, ], col=level.cols, legend=F, main="Present < 4")

In the next code section, we will remove the probesets that correspond to genes that are not expressed. At the same time, we are going to remove two other types of probesets: (1) Affymetrix controls that are flagged like this, AFFX and (2) probesets that likely cross-hybridize with multiple genes (Affymetrix flags these probeset with these characters: _x_). In both cases, we can use the function grep() to look for partial matches in the rownames of the expression matrix.

In [77]:
##Clean data
dim(my.rma)
Features
54675
Samples
24
In [78]:
##remove probesets for genes that are not expressed
exprs(my.rma) <- exprs(my.rma)[present >= 4, ]
dim(my.rma)
Features
26855
Samples
24
In [79]:
##remove probesets that can crosshybridize with other genes
exprs(my.rma) <- exprs(my.rma)[-grep("_x_", rownames(exprs(my.rma))), ]
dim(my.rma)
Features
24538
Samples
24
In [80]:
##remove probesets that are Affymetrix controls
exprs(my.rma) <- exprs(my.rma)[-grep("AFFX", rownames(exprs(my.rma))), ]
dim(my.rma)
Features
24503
Samples
24
In [82]:
pData(my.rma)$sample.levels <- as.factor(pData(my.rma)$sample.levels)
pData(my.rma)$sample.levels <- relevel(pData(my.rma)$sample.levels, ref="C.06")
write.table(exprs(my.rma), file=paste0("results/",my.gse,"_RMA_Norm_Filter.txt"), sep="\t", quote=FALSE)

We are now ready to proceed with Differential Gene Expression (DGE) analysis.

5 DGE with Limma

The next sections of code are for DGE. There are many ways to do this but a common method is "limma" (Linear Models for MicroArray). The first section will conduct QC analysis to compare our replicates.

Microarray hybridizations are susceptible to many techncial problems such as poor signal, physical abrasions or inadequate washing. There are tools to create a "false" image of the microarray from the CEL file. Here, we are mainly checking for consistency between biological replicates of the same treatment.

5.1 Additional QC to Assess Replicates

In [83]:
plotMDS(exprs(my.rma), labels=pData(my.rma)$sample.labels, top=500, gene.selection="common", main="MDS Plot to Compare Replicates")
In [84]:
pdf(file="results/MDS_plot.pdf", w=6, h=6)
plotMDS(exprs(my.rma), labels=pData(my.rma)$sample.labels, top=500, gene.selection="common", main="MDS Plot to Compare Replicates")
dev.off()
png: 2

There are some reasons for concern with this data set. All 4 samples for Treated Cells at 6 hours cluster reasonably close together as do the Treated and Control Cells at 48 hours. However, there is evidence of problems with some replicates, e.g. T24.4 is off on it's own.

We can visualize the correlations between the arrays by hierarchical clustering. First, we need to scale the rma values so that highly-expressed gene do not dominate the result. To do this, we are going to produce a matrix of gene-specific z-scores from the rma values. To do this, you must calculate the mean expression and standard deviation for each gene across all samples. The z-score is calculated with the following formula:

z-score = (rma-sample - rma-mean) / rma-sd

Thus, the z-score indicates how many standard deviations an observed rma value is above or below the mean.

In [85]:
cluster.dat <- exprs(my.rma)
gene.mean <- apply(cluster.dat, 1, mean)
gene.sd <- apply(cluster.dat, 1, sd)
cluster.dat <- sweep(cluster.dat, 1, gene.mean, "-")
cluster.dat <- sweep(cluster.dat, 1, gene.sd, "/")
In [86]:
my.dist <- dist(t(cluster.dat), method="euclidean")
my.hclust <- hclust(my.dist, method="average")
my.hclust$labels <- pData(my.rma)$sample.labels
plot(my.hclust, cex=0.75, main="Comparison of Biological Replicates", xlab="Euclidean Distance")

With 24 samples, the dendrogram is a little easier to interpret. T24.4 is still a clear outlier, but we can see many treatments with 3 or even all 4 replicate clustering together. We used the Euclidean distance metric above. We can try the Pearson distance metric which is good for analyzing trends in gene expression.

In [87]:
my.cor <- cor(cluster.dat)
my.dist <- as.dist(1 - my.cor)
my.hclust <- hclust(my.dist, method="average")
my.hclust$labels <- pData(my.rma)$sample.labels
plot(my.hclust, cex=0.75, main="Comparison of Biological Replicates")
In [88]:
pdf(file="results/Cluster_plot.pdf")
plot(my.hclust, cex=0.75, main="Comparison of Biological Replicates", xlab="Pearson Distance")
dev.off()
png: 2

The result of our QC is that there is some concern for this experiment. There are several replicate arrays that do not agree well with their mates. At this point, there is no harm proceeding with DGE analysis using all arrays.

5.2 Specifying a Design for the Experiment

The next step in the analysis is to describe the experiment and analysis for limma in the form of a matrix. The matrix must have one row for every microarray and one column for each of the coefficients or experimental effects that you want to resolve.

What experimental effects should we resolve? For this data, we have one treatment (vitamin D) and three time points. We could specify a design where we tried to estimate the vitamin D effect and time effect separately (as well as any interaction). However, we are going to treat our samples categorically, i.e. we are going to combine the vitamin D treatment and time into a single coefficient. This means we will specify a coefficient for each sample level. Additionally, we are not going to specify a fixed reference to which all samples will be compared. This will allow us to specify pairwise contrasts later.

In [89]:
my.design <- model.matrix(~0 + sample.levels, pData(my.rma))
my.design
sample.levelsC.06sample.levelsC.24sample.levelsC.48sample.levelsT.06sample.levelsT.24sample.levelsT.48
GSM400051.CEL.gz100000
GSM400052.CEL.gz100000
GSM400053.CEL.gz100000
GSM400054.CEL.gz100000
GSM400055.CEL.gz000100
GSM400056.CEL.gz000100
GSM400057.CEL.gz000100
GSM400058.CEL.gz000100
GSM400059.CEL.gz010000
GSM400060.CEL.gz010000
GSM400061.CEL.gz010000
GSM400062.CEL.gz010000
GSM400063.CEL.gz000010
GSM400064.CEL.gz000010
GSM400065.CEL.gz000010
GSM400066.CEL.gz000010
GSM400067.CEL.gz001000
GSM400068.CEL.gz001000
GSM400069.CEL.gz001000
GSM400070.CEL.gz001000
GSM400071.CEL.gz000001
GSM400072.CEL.gz000001
GSM400073.CEL.gz000001
GSM400074.CEL.gz000001
In [90]:
rownames(my.design) <- pData(my.rma)$sample.labels
my.design
sample.levelsC.06sample.levelsC.24sample.levelsC.48sample.levelsT.06sample.levelsT.24sample.levelsT.48
C.06.1100000
C.06.2100000
C.06.3100000
C.06.4100000
T.06.1000100
T.06.2000100
T.06.3000100
T.06.4000100
C.24.1010000
C.24.2010000
C.24.3010000
C.24.4010000
T.24.1000010
T.24.2000010
T.24.3000010
T.24.4000010
C.48.1001000
C.48.2001000
C.48.3001000
C.48.4001000
T.48.1000001
T.48.2000001
T.48.3000001
T.48.4000001
In [91]:
colnames(my.design) <- levels(pData(my.rma)$sample.levels)
my.design
C.06C.24C.48T.06T.24T.48
C.06.1100000
C.06.2100000
C.06.3100000
C.06.4100000
T.06.1000100
T.06.2000100
T.06.3000100
T.06.4000100
C.24.1010000
C.24.2010000
C.24.3010000
C.24.4010000
T.24.1000010
T.24.2000010
T.24.3000010
T.24.4000010
C.48.1001000
C.48.2001000
C.48.3001000
C.48.4001000
T.48.1000001
T.48.2000001
T.48.3000001
T.48.4000001

5.3 Fitting the Coefficients

Now that we have a design matrix, we need to estimate the coefficients. For this design, we will essentially average the replicate arrays for each sample level. In addition, we will calculate standard deviations for each gene, and the average intensity for the genes across all microarrays.

In [92]:
##determine the average effect (coefficient) for each treatment
my.fit <- lmFit(my.rma, my.design)
my.fit
An object of class "MArrayLM"
$coefficients
               C.06      C.24      C.48      T.06      T.24      T.48
1007_s_at 11.052573 11.515919 11.475880 11.007555 11.283394 11.445712
1053_at    9.693914  9.817586 10.113618  9.736016  9.961417 10.055734
117_at     6.040556  5.965762  5.964568  6.298423  6.165481  5.966755
121_at     8.548789  8.534100  8.516500  8.587753  8.602839  8.561991
1294_at    6.607870  6.730427  6.895126  6.515623  6.702186  6.568955
24498 more rows ...

$rank
[1] 6

$assign
[1] 1 1 1 1 1 1

$qr
$qr
       C.06 C.24 C.48 T.06 T.24 T.48
C.06.1 -2.0    0    0  0.0    0    0
C.06.2  0.5   -2    0  0.0    0    0
C.06.3  0.5    0   -2  0.0    0    0
C.06.4  0.5    0    0 -2.0    0    0
T.06.1  0.0    0    0  0.5   -2    0
19 more rows ...

$qraux
[1] 1.5 1.0 1.0 1.0 1.0 1.0

$pivot
[1] 1 2 3 4 5 6

$tol
[1] 1e-07

$rank
[1] 6


$df.residual
[1] 18 18 18 18 18
24498 more elements ...

$sigma
 1007_s_at    1053_at     117_at     121_at    1294_at 
0.06312645 0.13052707 0.12118351 0.14320464 0.18807920 
24498 more elements ...

$cov.coefficients
     C.06 C.24 C.48 T.06 T.24 T.48
C.06 0.25 0.00 0.00 0.00 0.00 0.00
C.24 0.00 0.25 0.00 0.00 0.00 0.00
C.48 0.00 0.00 0.25 0.00 0.00 0.00
T.06 0.00 0.00 0.00 0.25 0.00 0.00
T.24 0.00 0.00 0.00 0.00 0.25 0.00
T.48 0.00 0.00 0.00 0.00 0.00 0.25

$stdev.unscaled
          C.06 C.24 C.48 T.06 T.24 T.48
1007_s_at  0.5  0.5  0.5  0.5  0.5  0.5
1053_at    0.5  0.5  0.5  0.5  0.5  0.5
117_at     0.5  0.5  0.5  0.5  0.5  0.5
121_at     0.5  0.5  0.5  0.5  0.5  0.5
1294_at    0.5  0.5  0.5  0.5  0.5  0.5
24498 more rows ...

$pivot
[1] 1 2 3 4 5 6

$Amean
1007_s_at   1053_at    117_at    121_at   1294_at 
11.296839  9.896381  6.066924  8.558662  6.670031 
24498 more elements ...

$method
[1] "ls"

$design
       C.06 C.24 C.48 T.06 T.24 T.48
C.06.1    1    0    0    0    0    0
C.06.2    1    0    0    0    0    0
C.06.3    1    0    0    0    0    0
C.06.4    1    0    0    0    0    0
T.06.1    0    0    0    1    0    0
19 more rows ...
In [93]:
write.table(my.fit$coefficients, file=paste0("results/",my.gse,"_Limma_Coeff.txt"), sep="\t", quote=F)

5.4 Making Comparisons (Contrasts) Between Samples

Now we are ready to tell limma which pairwise contrasts that we want to make. With six coefficients, we have 15 possible pairwise combinations that do not include recipricol comparisons or self comparisons. However, do all of these 15 makes sense? For example, do we really need to compare T.48 to C.24? Similar to the design matrix, limma needs a contrast matrix that specifies how the contrasts related to the coefficients. For this experiment, we are going to contrast treated and control at each time point. Additionally, we are going to do contrasts between time points within treated or control. This leaves nine contrasts to specify.

We are going to use the capability of R to work down lists to process each of these contrasts with a single expression. To do this, the contrast.fit function is nested inside an sapply.

In [94]:
##specify the contrast of interest using the levels from the design matrix
my.contrasts <- makeContrasts(C24vC06 = C.24 - C.06, C48vC06 = C.48 - C.06, C48vC24= C.48 - C.24, T24vT06 = T.24 - T.06, T48vT06 = T.48 - T.06, T48vT24 = T.48 - T.24, T06vC06 = T.06 - C.06, T24vC24 = T.24 - C.24, T48vC48 = T.48 - C.48, levels = my.design)
my.contrasts
C24vC06C48vC06C48vC24T24vT06T48vT06T48vT24T06vC06T24vC24T48vC48
C.06-1-1 0 0 0 0-1 0 0
C.24 1 0-1 0 0 0 0-1 0
C.48 0 1 1 0 0 0 0 0-1
T.06 0 0 0-1-1 0 1 0 0
T.24 0 0 0 1 0-1 0 1 0
T.48 0 0 0 0 1 1 0 0 1
In [95]:
contrast.fits <- sapply(colnames(my.contrasts), function(x)(contrasts.fit(my.fit, contrasts=my.contrasts[, x])))
length(contrast.fits)
9
In [96]:
names(contrast.fits)
  1. 'C24vC06'
  2. 'C48vC06'
  3. 'C48vC24'
  4. 'T24vT06'
  5. 'T48vT06'
  6. 'T48vT24'
  7. 'T06vC06'
  8. 'T24vC24'
  9. 'T48vC48'
In [97]:
contrast.fits["C24vC06"]
$C24vC06
An object of class "MArrayLM"
$coefficients
  1007_s_at     1053_at      117_at      121_at     1294_at 
 0.46334611  0.12367214 -0.07479366 -0.01468879  0.12255676 
24498 more rows ...

$rank
[1] 6

$assign
[1] 1 1 1 1 1 1

$qr
$qr
       C.06 C.24 C.48 T.06 T.24 T.48
C.06.1 -2.0    0    0  0.0    0    0
C.06.2  0.5   -2    0  0.0    0    0
C.06.3  0.5    0   -2  0.0    0    0
C.06.4  0.5    0    0 -2.0    0    0
T.06.1  0.0    0    0  0.5   -2    0
19 more rows ...

$qraux
[1] 1.5 1.0 1.0 1.0 1.0 1.0

$pivot
[1] 1 2 3 4 5 6

$tol
[1] 1e-07

$rank
[1] 6


$df.residual
[1] 18 18 18 18 18
24498 more elements ...

$sigma
 1007_s_at    1053_at     117_at     121_at    1294_at 
0.06312645 0.13052707 0.12118351 0.14320464 0.18807920 
24498 more elements ...

$cov.coefficients
     [,1]
[1,]  0.5

$stdev.unscaled
1007_s_at   1053_at    117_at    121_at   1294_at 
0.7071068 0.7071068 0.7071068 0.7071068 0.7071068 
24498 more rows ...

$Amean
1007_s_at   1053_at    117_at    121_at   1294_at 
11.296839  9.896381  6.066924  8.558662  6.670031 
24498 more elements ...

$method
[1] "ls"

$design
       C.06 C.24 C.48 T.06 T.24 T.48
C.06.1    1    0    0    0    0    0
C.06.2    1    0    0    0    0    0
C.06.3    1    0    0    0    0    0
C.06.4    1    0    0    0    0    0
T.06.1    0    0    0    1    0    0
19 more rows ...

$contrasts
     [,1]
C.06   -1
C.24    1
C.48    0
T.06    0
T.24    0
T.48    0

5.5 Performing the Statistical Tests

Finally, we are ready to do our statictical comparisons. In theory, you could do 24503 individual T-tests to check for diffential expression for any of our contrasts. However, how would you control for multiple testing and control your type I error. Limma uses Bayesian statistics to minimize this problem. The details are beyond the scope of this course, but essentially, limma borrows information across all genes to perform a modified T-test. The eBayes function performs the tests, and there are several parameters (arguments) that can be changed. We are going to change only the proportion of genes that we expect to be differentially expressed. The default is 1% which we are going to change to 10%. Later, you can try altering some of these arguments to see how it effects the DGE analysis.

We are also going to summarize the results of the statistical test with two additional functions. topTable will adjust the p-values and return the top genes that meet the cutoffs that you supply as arguments. To document the analysis, I always return the results for all genes and not sorted. If you are just exploring your data, feel free to limit the number and sort as you desire.

Finally, decideTests will make calls for DEGs by adjusting the p-values and applying a logFC cutoff similar to topTable. For our current design, we don't have any flexibility in the method. Each contrast.eb has only one coefficient so there is no way to combine p-values across multiple contrasts.

In [98]:
contrast.ebs <- lapply(contrast.fits, function(x)(eBayes(x, proportion=0.1, trend=FALSE, robust=FALSE)))
contrast.tts <- lapply(contrast.ebs, function(x)(topTable(x, adjust="BH", number=length(x$coefficients), sort.by="none")))
contrast.tests <- lapply(contrast.ebs, function(x)(decideTests(x, method="separate", adjust.method="BH", p.value=0.05, lfc=0)))
In [100]:
tests.mat <- do.call(cbind, contrast.tests)
colnames(tests.mat) <- names(contrast.tests)
write.table(tests.mat, file=paste0("results/",my.gse,"_GeneCalls.txt"), sep="\t", quote=F)
In [101]:
names(contrast.ebs)
  1. 'C24vC06'
  2. 'C48vC06'
  3. 'C48vC24'
  4. 'T24vT06'
  5. 'T48vT06'
  6. 'T48vT24'
  7. 'T06vC06'
  8. 'T24vC24'
  9. 'T48vC48'
In [102]:
contrast.ebs["C24vC06"]
$C24vC06
An object of class "MArrayLM"
$coefficients
  1007_s_at     1053_at      117_at      121_at     1294_at 
 0.46334611  0.12367214 -0.07479366 -0.01468879  0.12255676 
24498 more rows ...

$rank
[1] 6

$assign
[1] 1 1 1 1 1 1

$qr
$qr
       C.06 C.24 C.48 T.06 T.24 T.48
C.06.1 -2.0    0    0  0.0    0    0
C.06.2  0.5   -2    0  0.0    0    0
C.06.3  0.5    0   -2  0.0    0    0
C.06.4  0.5    0    0 -2.0    0    0
T.06.1  0.0    0    0  0.5   -2    0
19 more rows ...

$qraux
[1] 1.5 1.0 1.0 1.0 1.0 1.0

$pivot
[1] 1 2 3 4 5 6

$tol
[1] 1e-07

$rank
[1] 6


$df.residual
[1] 18 18 18 18 18
24498 more elements ...

$sigma
 1007_s_at    1053_at     117_at     121_at    1294_at 
0.06312645 0.13052707 0.12118351 0.14320464 0.18807920 
24498 more elements ...

$cov.coefficients
     [,1]
[1,]  0.5

$stdev.unscaled
1007_s_at   1053_at    117_at    121_at   1294_at 
0.7071068 0.7071068 0.7071068 0.7071068 0.7071068 
24498 more rows ...

$Amean
1007_s_at   1053_at    117_at    121_at   1294_at 
11.296839  9.896381  6.066924  8.558662  6.670031 
24498 more elements ...

$method
[1] "ls"

$design
       C.06 C.24 C.48 T.06 T.24 T.48
C.06.1    1    0    0    0    0    0
C.06.2    1    0    0    0    0    0
C.06.3    1    0    0    0    0    0
C.06.4    1    0    0    0    0    0
T.06.1    0    0    0    1    0    0
19 more rows ...

$contrasts
     [,1]
C.06   -1
C.24    1
C.48    0
T.06    0
T.24    0
T.48    0

$df.prior
[1] 2.964189

$s2.prior
[1] 0.02802146

$var.prior
[1] 8.247079

$proportion
[1] 0.1

$s2.post
  1007_s_at     1053_at      117_at      121_at     1294_at 
0.007383542 0.018590396 0.016571063 0.021569979 0.034334220 
24498 more elements ...

$t
 1007_s_at    1053_at     117_at     121_at    1294_at 
 7.6258479  1.2827516 -0.8216836 -0.1414412  0.9353813 
24498 more rows ...

$df.total
[1] 20.96419 20.96419 20.96419 20.96419 20.96419
24498 more elements ...

$p.value
   1007_s_at      1053_at       117_at       121_at      1294_at 
1.779320e-07 2.135808e-01 4.205040e-01 8.888717e-01 3.602423e-01 
24498 more rows ...

$lods
1007_s_at   1053_at    117_at    121_at   1294_at 
 9.341037 -2.847505 -3.300248 -3.618282 -3.205296 
24498 more rows ...

$F
[1] 58.15355691  1.64545159  0.67516397  0.02000562  0.87493812
24498 more elements ...

$F.p.value
[1] 1.779320e-07 2.135808e-01 4.205040e-01 8.888717e-01 3.602423e-01
24498 more elements ...

In [103]:
head(contrast.tts[["C24vC06"]])
logFCAveExprtP.Valueadj.P.ValB
1007_s_at 0.46334611 11.296839 7.6258479 1.779320e-074.738987e-05 9.341037
1053_at 0.12367214 9.896381 1.2827516 2.135808e-015.613052e-01-2.847505
117_at-0.07479366 6.066924 -0.8216836 4.205040e-017.550092e-01-3.300248
121_at-0.01468879 8.558662 -0.1414412 8.888717e-019.709035e-01-3.618282
1294_at 0.12255676 6.670031 0.9353813 3.602423e-017.106528e-01-3.205296
1316_at-0.04052189 5.626551 -0.4534642 6.548735e-018.856241e-01-3.527119
In [104]:
head(contrast.tests)[["C24vC06"]]
1007_s_at1
1053_at0
117_at0
121_at0
1294_at0
1316_at0
1438_at0
1487_at0
1494_f_at0
1552256_a_at0
1552257_a_at0
1552263_at0
1552264_a_at0
1552272_a_at0
1552274_at0
1552275_s_at0
1552276_a_at0
1552277_a_at0
1552278_a_at0
1552279_a_at0
1552281_at0
1552283_s_at0
1552286_at0
1552287_s_at0
1552291_at0
1552295_a_at0
1552299_at0
1552301_a_at0
1552304_at0
1552306_at0
......
24474 0
24475 1
24476 0
24477 0
24478 0
24479 0
24480 0
24481 0
24482 0
24483 0
24484 0
24485 0
24486 0
24487-1
24488 0
24489 0
24490 0
24491 0
24492 0
24493 0
24494 0
24495 0
24496-1
24497 0
24498 0
24499 0
24500 0
24501 0
24502 0
24503 0

5.6 Assessing the Results

There are several ways to visualize the statistical results from the DGE analysis. The function plotMA will produce a scatter plot of average log2 expression on the x-axis vs. log2 fold change on the y-axis. The average log2 expression is the value for ALL arrays, not only the arrays used to calculate the log2 expression.

In [105]:
library(RColorBrewer)
ma.cols <- brewer.pal(11, "RdBu")[c(2, 10)]
my.contrast <- "C24vC06"
In [106]:
##Note: may need to use the form limma::plotMA to get the plotMA function from limma.  This is usually not necessary
plotMA(contrast.ebs[[my.contrast]], status=contrast.tests[[1]], hl.pch=20, hl.col=ma.cols, main=paste(my.gse, my.contrast, sep=" "))

Another similar visualization is the volcano plot which plot logFC on the x-axis and p-value (-log10) on the y-axis. It is necessary to use the -log10 of the p-value so that small p-values will be plotted as large y-values. Limma has a volcanoplot function, but it plots log odds rather than p-values. To highlight DEGs, we will replot the values for DEGs using a different symbol and specific colors

In [107]:
ma.cols <- c(brewer.pal(11, "RdBu")[10], "grey50", brewer.pal(11, "RdBu")[2])
my.cols <- ma.cols[as.factor(contrast.tests[[my.contrast]][, 1])]
table(my.cols)
my.cols
#2166AC #B2182B  grey50 
    803     978   22722 
In [108]:
plot(contrast.tts[[my.contrast]]$logFC, -log10(contrast.tts[[my.contrast]]$adj.P.Val), col=my.cols, pch=19, main=paste(my.gse, my.contrast, sep=" "), xlab="log2 Fold Change", ylab="-log10 Adjusted P-Value")

Finally, let's revisit the MDS plots using only the genes that are differentially-expressed. The first step is to extract the names of the probesets from our tables, and use these names to subset the expression data in the original Expression Set object, i.e. "my.rma". This time, let's make a plot that is potentially publication quality.

In [109]:
up.probes <- rownames(contrast.tests[[my.contrast]])[contrast.tests[[my.contrast]][, 1] == 1]
plotMDS(exprs(my.rma)[up.probes, ], pch=rep(21:24, 6), bg=level.cols, labels=NULL, top=length(up.probes), main=paste(my.gse, "MDS Plot to Compare Up-Regulated Genes"), cex=1.5, cex.main=0.9, sub=my.contrast)
legend1 <- c("array1", "array2", "array3", "array4")
legend2 <- c("C06", "C24", "C48", "T06", "T24", "T48")
legend("top", horiz=T, bty="n", legend=legend1, pch=c(21, 22, 23, 24), cex=0.8)
legend("bottom", horiz=T, bty="n", legend=legend2, fill=level.pal, cex=0.8)

This looks considerably better than our first MDS plot, and not just in terms of aesthetics. You should see that all replicated arrays for each treatment tend to cluster together. Further, treated and control samples are separated along the y-axis, and time points are separated along the x-axis. Should the plots be similar if we examine the down-regulated genes or all DEGs combined? What if we look at the genes that were unchanged?

In [110]:
down.probes <- rownames(contrast.tests[[my.contrast]])[contrast.tests[[my.contrast]][, 1] == -1]
plotMDS(exprs(my.rma)[down.probes, ], pch=rep(21:24, 6), bg=level.cols, labels=NULL, top=length(down.probes), main=paste(my.gse, "MDS Plot to Compare Down-Regulated Genes"), cex=1.5, cex.main=0.9, sub=my.contrast)
legend1 <- c("array1", "array2", "array3", "array4")
legend2 <- c("C06", "C24", "C48", "T06", "T24", "T48")
legend("top", horiz=T, bty="n", legend=legend1, pch=c(21, 22, 23, 24), cex=0.8)
legend("bottom", horiz=T, bty="n", legend=legend2, fill=level.pal, cex=0.8)
In [111]:
deg.probes <- union(up.probes, down.probes)
plotMDS(exprs(my.rma)[deg.probes, ], pch=rep(21:24, 6), bg=level.cols, labels=NULL, top=length(deg.probes), main=paste(my.gse, "MDS Plot to Compare DEGs"), cex=1.5, cex.main=0.9, sub=my.contrast)
legend1 <- c("array1", "array2", "array3", "array4")
legend2 <- c("C06", "C24", "C48", "T06", "T24", "T48")
legend("top", horiz=T, bty="n", legend=legend1, pch=c(21, 22, 23, 24), cex=0.8)
legend("bottom", horiz=T, bty="n", legend=legend2, fill=level.pal, cex=0.8)
In [112]:
not.deg.probes <- setdiff(rownames(exprs(my.rma)), deg.probes)
plotMDS(exprs(my.rma)[not.deg.probes, ], pch=rep(21:24, 6), bg=level.cols, labels=NULL, top=1000, main=paste(my.gse, "MDS Plot to Compare Unchanged Genes"), cex=1.5, cex.main=0.9, sub=my.contrast)
legend1 <- c("array1", "array2", "array3", "array4")
legend2 <- c("C06", "C24", "C48", "T06", "T24", "T48")
legend("top", horiz=T, bty="n", legend=legend1, pch=c(21, 22, 23, 24), cex=0.8)
legend("bottom", horiz=T, bty="n", legend=legend2, fill=level.pal, cex=0.8)

What do you make of this final plot? Is this reasonable? At some point, produce the plots for the other contrasts. In addition, we stopped saving PDFs of the plots. Which plots would you save as PDFs to document the QC? Use this script as a guide to make these PDFs. Spend some time reading the reading the help documents for the plotting function to help you label axes, etc.

5.7 Gene Annotations for Microarray Probesets

We have the DGE results for our probesets, but how do we convert these to genes? Many web tools can do this for you. For example, DAVID can take a list of Affy probeset IDs and convert them to various other identifiers. Here, we are going to use the Bioconductor database for the Affy Hgu133Plus2 array.

In [113]:
library (hgu133plus2.db)
columns(hgu133plus2.db)
Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: IRanges
Loading required package: S4Vectors

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: org.Hs.eg.db


  1. 'ACCNUM'
  2. 'ALIAS'
  3. 'ENSEMBL'
  4. 'ENSEMBLPROT'
  5. 'ENSEMBLTRANS'
  6. 'ENTREZID'
  7. 'ENZYME'
  8. 'EVIDENCE'
  9. 'EVIDENCEALL'
  10. 'GENENAME'
  11. 'GO'
  12. 'GOALL'
  13. 'IPI'
  14. 'MAP'
  15. 'OMIM'
  16. 'ONTOLOGY'
  17. 'ONTOLOGYALL'
  18. 'PATH'
  19. 'PFAM'
  20. 'PMID'
  21. 'PROBEID'
  22. 'PROSITE'
  23. 'REFSEQ'
  24. 'SYMBOL'
  25. 'UCSCKG'
  26. 'UNIGENE'
  27. 'UNIPROT'
In [114]:
keytypes(hgu133plus2.db)
  1. 'ACCNUM'
  2. 'ALIAS'
  3. 'ENSEMBL'
  4. 'ENSEMBLPROT'
  5. 'ENSEMBLTRANS'
  6. 'ENTREZID'
  7. 'ENZYME'
  8. 'EVIDENCE'
  9. 'EVIDENCEALL'
  10. 'GENENAME'
  11. 'GO'
  12. 'GOALL'
  13. 'IPI'
  14. 'MAP'
  15. 'OMIM'
  16. 'ONTOLOGY'
  17. 'ONTOLOGYALL'
  18. 'PATH'
  19. 'PFAM'
  20. 'PMID'
  21. 'PROBEID'
  22. 'PROSITE'
  23. 'REFSEQ'
  24. 'SYMBOL'
  25. 'UCSCKG'
  26. 'UNIGENE'
  27. 'UNIPROT'
In [115]:
gene.data <- select(hgu133plus2.db, keys=rownames(contrast.tts[["C24vC06"]]), keytype="PROBEID", columns=c("ENTREZID", "GENENAME", "SYMBOL"))
head(gene.data)
'select()' returned 1:many mapping between keys and columns
PROBEIDENTREZIDGENENAMESYMBOL
1007_s_at 780 discoidin domain receptor tyrosine kinase 1 DDR1
1007_s_at 100616237 microRNA 4640 MIR4640
1053_at 5982 replication factor C subunit 2 RFC2
117_at 3310 heat shock protein family A (Hsp70) member 6HSPA6
121_at 7849 paired box 8 PAX8
1294_at 7318 ubiquitin like modifier activating enzyme 7 UBA7
In [116]:
dim(gene.data)
  1. 26083
  2. 4
In [117]:
dim(contrast.tts[["C24vC06"]])
  1. 24503
  2. 6

How did we end up with more genes than probesets? There are several possible causes. First, some probesets map to multiple genes. Second, maybe some genes map to multiple symbols.

In [118]:
summary(duplicated(gene.data$PROBEID))
   Mode   FALSE    TRUE    NA's 
logical   24503    1580       0 
In [119]:
summary(duplicated(gene.data$SYMBOL))
   Mode   FALSE    TRUE    NA's 
logical   13562   12521       0 
In [120]:
summary(duplicated(gene.data$ENTREZID))
   Mode   FALSE    TRUE    NA's 
logical   13562   12521       0 
In [121]:
table(duplicated(gene.data$SYMBOL), duplicated(gene.data$PROBEID), useNA="always", dnn=list("gene", "probe"))
       probe
gene    FALSE  TRUE  <NA>
  FALSE 12653   909     0
  TRUE  11850   671     0
  <NA>      0     0     0
In [122]:
table(duplicated(gene.data$ENTREZID), duplicated(gene.data$PROBEID))
       
        FALSE  TRUE
  FALSE 12653   909
  TRUE  11850   671
In [123]:
table(duplicated(gene.data$ENTREZID), duplicated(gene.data$SYMBOL))
       
        FALSE  TRUE
  FALSE 13562     0
  TRUE      0 12521

The problem seems to be that some probesets map to multipe genes and some genes map to multiple probesets. It is not an issue of multiple SYMBOLS for the same ENTREZID. Try including ALIAS in your query to see how many rows you get back.

5.8 Saving Your Analysis

To ensure clarity and reproducibility, you should save your results with clear documentation. One piece of documentation that is rarely found is the EXACT computer code that was used to process your data. There is no need for this. ALWAYS, save the specific R script that you used in an analysis. It can help you in may ways. What experimental and annotation data did you use. Versions of software, etc.

For a limma analysis, we recommend that you save the complete output from topTable combined with the decideTests output. You may also want to save your RMA values before and after filtering the data. You can also save the fitted coefficients from lmFit.

Remember to limit the number of variables in gene.data. Use only ENTREZID, GENENAME and SYMBOL before you merge the gene data and the results.

In [124]:
my.results <- lapply(names(contrast.tts), function(x)(cbind(contrast.tts[[x]], contrast.tests[[x]])))
names(my.results) <- names(contrast.tts)
for(i in 1:length(my.results)){
  my.results[[i]]$PROBEID <- rownames(my.results[[i]])
  colnames(my.results[[i]])[7] <- "test"
}
In [129]:
my.results <- lapply(my.results, function(x)(merge(gene.data, x, by="PROBEID")))
head(my.results[[1]])
for(i in 1:length(my.results)){
  write.table(my.results[[i]], file=paste0("results/",my.gse,"_", names(my.results)[i], "_Limma_Out.txt"), row.names=FALSE, sep="\t", quote=FALSE)
}
PROBEIDENTREZID.xGENENAME.xSYMBOL.xENTREZID.yGENENAME.ySYMBOL.ylogFCAveExprtP.Valueadj.P.ValBtest
1007_s_at 780 discoidin domain receptor tyrosine kinase 1 DDR1 780 discoidin domain receptor tyrosine kinase 1 DDR1 0.46334611 11.296839 7.6258479 1.779320e-07 4.738987e-05 9.341037 1
1007_s_at 780 discoidin domain receptor tyrosine kinase 1 DDR1 100616237 microRNA 4640 MIR4640 0.46334611 11.296839 7.6258479 1.779320e-07 4.738987e-05 9.341037 1
1007_s_at 100616237 microRNA 4640 MIR4640 780 discoidin domain receptor tyrosine kinase 1 DDR1 0.46334611 11.296839 7.6258479 1.779320e-07 4.738987e-05 9.341037 1
1007_s_at 100616237 microRNA 4640 MIR4640 100616237 microRNA 4640 MIR4640 0.46334611 11.296839 7.6258479 1.779320e-07 4.738987e-05 9.341037 1
1053_at 5982 replication factor C subunit 2 RFC2 5982 replication factor C subunit 2 RFC2 0.12367214 9.896381 1.2827516 2.135808e-01 5.613052e-01 -2.847505 0
117_at 3310 heat shock protein family A (Hsp70) member 6HSPA6 3310 heat shock protein family A (Hsp70) member 6HSPA6 -0.07479366 6.066924 -0.8216836 4.205040e-01 7.550092e-01 -3.300248 0

Now, we have the results from our DGE analysis saved and ready for the next step--making biological sense of the results by visualizing the data to detect patterns or comparing our list to lists of functionally-related genes.

In [ ]:
save.image(paste0("Introduction_to_Microarray_Analysis_",my.gse,".RData"))

6 SessionInfo

In [127]:
sessionInfo();
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] hgu133plus2.db_3.2.3  org.Hs.eg.db_3.3.0    AnnotationDbi_1.34.4 
 [4] IRanges_2.6.1         S4Vectors_0.10.3      RColorBrewer_1.1-2   
 [7] limma_3.28.21         hgu133plus2cdf_2.18.0 affy_1.50.0          
[10] GEOquery_2.38.4       Biobase_2.32.0        BiocGenerics_0.18.0  

loaded via a namespace (and not attached):
 [1] BiocInstaller_1.22.3  bitops_1.0-6          tools_3.3.1          
 [4] zlibbioc_1.18.0       digest_0.6.10         uuid_0.1-2           
 [7] jsonlite_1.1          evaluate_0.9          RSQLite_1.0.0        
[10] preprocessCore_1.34.0 IRdisplay_0.4.4       DBI_0.5-1            
[13] IRkernel_0.7          repr_0.9              httr_1.2.1           
[16] stringr_1.1.0         R6_2.2.0              XML_3.98-1.4         
[19] pbdZMQ_0.2-4          magrittr_1.5          stringi_1.1.2        
[22] RCurl_1.95-4.8        crayon_1.3.2          affyio_1.42.0        
In [ ]:

In [ ]:

In [ ]: