Microarray image technology makes it possible to measure the expression of thousands of genes simultaneously. It has become the standard tool for investigating fundamental biological functions and is widely used in academic and industrial laboratories, producing vast quantities of image data. Experiments are expensive, and the data may need to be re-processed in the future because the statistical modeling is still evolving. For these reasons, the full image data are always kept, resulting in immense storage requirements. This calls for compression schemes that take into account the statistical inference that is to follow, or for a new definition of "irrelevance" of image features based on the loss of statistical information rather than visual distortion.

In this talk I present a microarray image compression scheme with a multi-level (lossless or lossy) coded data structure which facilitates statistical analysis and data transmission. As components, it uses adaptive segmentation, predictive coding, and wavelet transforms.
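
To make the idea of a multi-level coded structure concrete, here is a minimal Python sketch. It is not the scheme presented in the talk: it assumes a simple uniform quantizer and a previous-pixel predictor, and the function names and the `step` parameter are hypothetical. The first layer alone gives a lossy reconstruction; adding the second layer restores the original intensities exactly.

```python
def encode_two_layer(pixels, step):
    """Split non-negative integer intensities into a coarse layer and a refinement layer.

    Assumption: uniform quantization with bin width `step`, plus a
    previous-value predictor on the coarse layer (illustrative only).
    """
    if not pixels:
        return [], []
    coarse = [p // step for p in pixels]
    # Predictive coding: keep the first coarse value, then differences
    # between each coarse value and its predecessor.
    deltas = [coarse[0]] + [coarse[i] - coarse[i - 1] for i in range(1, len(coarse))]
    # Refinement layer: what the quantizer discarded.
    refinement = [p % step for p in pixels]
    return deltas, refinement


def decode_lossy(deltas, step):
    """Reconstruct from the coarse layer only, using bin midpoints."""
    out, total = [], 0
    for d in deltas:
        total += d  # undo the prediction
        out.append(total * step + step // 2)
    return out


def decode_lossless(deltas, refinement, step):
    """Reconstruct the original intensities from both layers."""
    out, total = [], 0
    for d, r in zip(deltas, refinement):
        total += d
        out.append(total * step + r)
    return out


pixels = [1200, 1210, 1195, 40000, 40100]
deltas, refinement = encode_two_layer(pixels, step=16)
approx = decode_lossy(deltas, step=16)
exact = decode_lossless(deltas, refinement, step=16)
```

A receiver that only needs a quick look can stop after the first layer, while an analyst who needs the full data can request the refinement layer as well. In this sketch the lossy reconstruction is off by at most half a bin width.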

The high noise levels of microarray image data suggest the use of lossy compression. However, lossy compression necessarily leads to a loss of statistical information, which may affect future statistical modeling and inference. I address the question of optimal statistical estimation based on lossily compressed data and present a new upper bound on the minimum achievable loss of estimation efficiency due to compression. This is an interesting information-theoretic result in the field of multiterminal data compression, and it can be used to evaluate the performance of practical compression schemes applied to microarray images.

Time permitting, I will briefly discuss Rissanen's minimum description length (MDL) principle, a statistical modeling principle based on data compression or coding, and its use for gene subset selection from microarray data in the classification of types of leukemia.
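
As a rough illustration of how a description-length criterion can rank genes, the Python sketch below compares two ways of encoding one gene's expression values: with a single Gaussian, or with one mean per class. It is an assumption-laden toy, not the method discussed in the talk: it uses a two-part code length with a shared variance and a (k/2) log n parameter cost, drops constants common to both models, and all function names are hypothetical.

```python
import math


def gaussian_code_length(residuals, n_params):
    """Approximate two-part code length (in nats), up to a shared constant.

    Assumption: Gaussian noise with one variance, and a parameter cost of
    (n_params / 2) * log(n).
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    sigma2 = max(rss / n, 1e-12)  # guard against log(0)
    return 0.5 * n * math.log(sigma2) + 0.5 * n_params * math.log(n)


def gene_is_informative(values, labels):
    """Return True if class-specific means give a shorter description."""
    overall_mean = sum(values) / len(values)
    single = gaussian_code_length(
        [v - overall_mean for v in values], n_params=2  # mean + variance
    )

    # Group the values by class label and compute one mean per class.
    groups = {}
    for v, c in zip(values, labels):
        groups.setdefault(c, []).append(v)
    means = {c: sum(vs) / len(vs) for c, vs in groups.items()}
    by_class = gaussian_code_length(
        [v - means[c] for v, c in zip(values, labels)],
        n_params=len(groups) + 1,  # one mean per class + variance
    )
    return by_class < single


labels = [0, 0, 0, 0, 1, 1, 1, 1]
separating = gene_is_informative([1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1], labels)
flat = gene_is_informative([1.0, 1.2, 0.9, 1.1, 1.0, 1.2, 0.9, 1.1], labels)
```

A gene whose expression differs clearly between the classes pays for the extra parameter through a much shorter encoding of the residuals, so it is kept. A gene with the same distribution in both classes is not, because the extra parameter buys nothing.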