Deep Neural Nets, Scalable Computing and Finance

Organizer and Chair: Kiseop Lee, Associate Professor of Statistics, Department of Statistics, Purdue University

Speakers

  • Colm O'Cinneide, Head of Portfolio Construction, QS Investors LLC
  • Xiao Wang, Professor of Statistics, Department of Statistics, Purdue University
  • Faming Liang, Professor of Statistics, Department of Statistics, Purdue University
  • Kylie Ariel Bemis, Future Faculty Fellow, College of Computer and Information Science, Northeastern University

Schedule

Friday, June 8, 10:00 a.m.-12:00 p.m. in STEW 202

Time | Speaker | Title
10:00-10:30 a.m. | Colm O'Cinneide | Three theorems on risk contributions

Abstract: The idea of a "contribution to risk" was introduced into finance by Fischer Black and Robert Litterman when they worked together at Goldman Sachs in the 1980s and early 1990s. The concept has been widely adopted as the basis of risk decomposition and attribution in investment management. This talk concerns three basic facts about risk contributions that are apparently not well known. The first theorem concerns a form of duality identified in Grinold (2011), which may be described as follows. When we view a portfolio decomposition as a coordinate representation of the portfolio with respect to a given vector-space basis, there is a natural dual basis with respect to which there is an alternative decomposition, referred to here as the dual decomposition. The dual decomposition gives the same contributions to risk as the original decomposition. The first theorem gives necessary and sufficient conditions for a change of basis to preserve risk contributions, and shows that all such changes of basis can be explained in terms of dual decompositions. The second theorem explores the sensitivity of portfolio risk to a risk regime change and indicates that large risk contributions and large risks of the components of a decomposition may be harbingers of high sensitivity. This provides a motivation for the practice of reporting both the risk contributions and the risks of the components in a decomposition. The third theorem provides necessary and sufficient conditions for two sets of numbers to be the risk contributions and risks of some portfolio decomposition. Such risk contributions and risks can always be realized in a space of portfolios of dimension no greater than 3, a fact that places limits on what can be gleaned from knowledge of risk contributions and risks alone.

This work is to appear in Quantitative Finance and is available online at https://www.tandfonline.com/eprint/BS5MhkiVRZUcc2F9t6fY/full. An earlier version is on SSRN at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2955663.
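As a concrete illustration of the risk-contribution arithmetic these results are built on, the following R sketch (not taken from the paper; the covariance matrix and decomposition are made up for illustration) computes Euler-style contributions to volatility risk for a two-component portfolio decomposition and checks that they sum to the total risk.

```r
# Illustrative sketch (not from the talk): Euler risk contributions in R.
# A portfolio w is decomposed into components W[, k] with rowSums(W) == w.
# With volatility risk sigma(w) = sqrt(t(w) %*% Sigma %*% w), the contribution
# of component k is t(W[, k]) %*% Sigma %*% w / sigma(w).
set.seed(1)
p <- 4                                   # number of assets (illustrative)
A <- matrix(rnorm(p * p), p, p)
Sigma <- crossprod(A)                    # assumed covariance matrix
W <- cbind(c(0.3, 0.2, 0, 0),            # component 1 of the decomposition
           c(0, 0, 0.4, 0.1))            # component 2
w <- rowSums(W)                          # the full portfolio

sigma_w <- sqrt(drop(t(w) %*% Sigma %*% w))
risk_contrib <- drop(t(W) %*% Sigma %*% w) / sigma_w   # contributions to risk
risk_of_components <- sqrt(diag(t(W) %*% Sigma %*% W)) # risks of the components

print(risk_contrib)
print(risk_of_components)
all.equal(sum(risk_contrib), sigma_w)    # contributions sum to total risk
```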

10:30-11:00 a.m. | Xiao Wang | Weight Normalized Deep Neural Networks

Abstract: Analysis of big data demands computer-aided or even automated model building, and such data are extremely difficult to analyze with traditional statistical models. Deep learning has proved successful for a variety of challenging problems such as AlphaGo, driverless cars, and image classification. Theoretical understanding of deep learning, however, remains limited, which hampers its further development. In this talk, we study the capacity and generalization properties of deep neural networks (DNNs) under different scenarios of weight normalization. We establish upper bounds on the Rademacher complexities of this family; in particular, an L_{1,q} normalization gives architecture-independent capacity control. For the regression problem, we provide both a generalization bound and an approximation bound, and show that both generalization and approximation errors can be controlled by the L_{1,\infty} weight normalization without relying on the network width and depth. This is joint work with Yixi Xu.
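As a rough illustration of the kind of quantity an L_{1,\infty} weight normalization controls, the R sketch below computes a per-layer norm for a small fully connected network. The grouping convention used here (an L1 norm over each unit's incoming weights, followed by a maximum over units) is an assumption made for illustration and may not match the paper's exact definition.

```r
# Illustrative sketch: an L_{1,infinity}-style norm of a layer's weight matrix,
# taken here (as an assumption) to be the maximum over output units of the
# L1 norm of that unit's incoming weights.
l1_inf_norm <- function(W) max(rowSums(abs(W)))

set.seed(2)
layers <- list(
  W1 = matrix(rnorm(20 * 10), nrow = 20, ncol = 10),  # 10 inputs -> 20 units
  W2 = matrix(rnorm(5 * 20),  nrow = 5,  ncol = 20)   # 20 units  -> 5 outputs
)

# One number per layer; rescaling each layer so this equals 1 is the kind of
# width- and depth-independent capacity control discussed in the talk.
sapply(layers, l1_inf_norm)
```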
11:00-11:30 a.m. | Faming Liang | Markov Neighborhood Regression for High-Dimensional Inference

Abstract: We propose an innovative method for constructing p-values and confidence intervals in high-dimensional regression. Unlike existing methods such as the desparsified Lasso, ridge projection, and multi sample splitting, which work on the original high-dimensional problem, the proposed method reduces the original high-dimensional inference problem to a series of low-dimensional inference problems by making use of conditional independence relations between different predictors. The proposed method has been tested on high-dimensional linear, logistic, and Cox regression, and the numerical results indicate that it significantly outperforms the existing methods. The idea of using conditional independence relations for dimension reduction is general and can potentially be extended to other high-dimensional or big data problems as well.
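The flavor of the reduction can be sketched in a few lines of R: for a target predictor, pick a small conditioning set standing in for its Markov neighborhood and run an ordinary low-dimensional regression, from which standard p-values and confidence intervals are read off. The correlation-screening rule below is a crude stand-in for the method's actual neighborhood construction.

```r
# Illustrative sketch (base R only): inference for one coefficient by
# conditioning on a small estimated neighborhood of the target predictor.
set.seed(3)
n <- 200; p <- 500
X <- matrix(rnorm(n * p), n, p)
beta <- c(1.5, -2, rep(0, p - 2))
y <- drop(X %*% beta) + rnorm(n)

j <- 1                                   # target predictor
# Crude neighborhood: the predictors most correlated with X_j (a stand-in
# for a proper conditional-independence / Markov neighborhood estimate).
cors <- abs(cor(X, X[, j]))
cors[j] <- -Inf                          # exclude the target itself
nbhd <- order(cors, decreasing = TRUE)[1:10]

fit <- lm(y ~ X[, c(j, nbhd)])           # low-dimensional subset regression
summary(fit)$coefficients[2, ]           # estimate, SE, t value, p-value for X_j
confint(fit)[2, ]                        # confidence interval for beta_j
```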
11:30 a.m.-12:00 p.m. | Kylie Bemis | Scalable R computing with big data-on-disk for bioinformatics and beyond

Abstract: A common challenge in bioinformatics is the proliferation of large, heterogeneous datasets stored across many disjoint files and in specialized file formats. Frequently, such datasets exceed computer memory, either individually or in aggregate. Such data are not only big but also complex, and they pose a major computational challenge to a statistician attempting to develop domain-specific statistical methods in R.

Traditional solutions for this problem have been either to convert the data into a file format compatible with R packages built for this purpose (e.g., bigmemory or rhdf5) or to incorporate another technology designed for big data such as Spark or Hadoop. Conversion comes with time and storage costs, and potential loss of information, impacting reproducibility. Rewriting with another technology also costs time and means that the user can no longer easily leverage the rich resources from over 10,000 existing R packages.

We present the R package matter, which attempts to solve these issues by providing a flexible interface to data on disk without requiring conversion, and which allows aggregation of arbitrarily many files into a single R data structure addressable as an on-disk matrix, array, or data frame. This is achieved by a flexible data representation that abstracts the on-disk layout of the data away from the in-memory structure presented to and accessible from R. In a comparison with similar packages such as bigmemory and ff on larger-than-memory datasets, matter performed similarly or better in terms of speed, while typically using less memory.
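A minimal usage sketch, assuming the Bioconductor matter package and its matter_mat() constructor (exact arguments and available methods may differ across matter versions): the data are written to a file-backed structure once and then addressed from R as if they were an ordinary matrix.

```r
# Minimal usage sketch; assumes Bioconductor's 'matter' package is installed.
# Constructor arguments and methods may vary between matter versions.
library(matter)

x <- matrix(rnorm(1000 * 100), nrow = 1000, ncol = 100)

# Create a file-backed matter matrix from the in-memory data; afterwards the
# values live on disk and are read in chunks rather than held in memory.
m <- matter_mat(x)

dim(m)            # behaves like an ordinary matrix
m[1:5, 1:3]       # subsetting reads only the requested chunk from disk
colMeans(m)       # summaries computed without loading the full matrix
```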

To demonstrate the utility of matter, we consider mass spectrometry (MS) imaging as a case study. MS is a backbone technology for molecular biology, and MS imaging allows investigation of the spatial distribution of molecular analytes in a sample. The resulting high-dimensional biological imaging datasets are both large and complex.

MS imaging has rapidly adopted imzML as a common open-source file format for sharing data. However, a single imzML file may be very large (tens of gigabytes) for high-resolution experiments, and a single experiment may include data from dozens of files. We have integrated matter with Cardinal, our R package for statistical analysis of MS imaging experiments. We demonstrate how matter allows Cardinal to work with larger-than-memory MS imaging experiments by showing results for principal components analysis (PCA) on all MS imaging datasets in a public GigaScience repository, ranging up to 42 GB in size, performed on a laptop with 16 GB of memory.
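A hedged sketch of the workflow described above, assuming Cardinal's readMSIData() reader and PCA() method (function names and arguments may differ between Cardinal releases); the imzML path is a placeholder.

```r
# Workflow sketch only; assumes the Bioconductor 'Cardinal' package, whose
# API may differ by version. The file path below is a placeholder.
library(Cardinal)

# Reading an imzML file yields an MS imaging experiment backed by matter,
# so the spectra stay on disk rather than being loaded into memory.
mse <- readMSIData("experiment.imzML")

# Principal components analysis on the larger-than-memory dataset.
pca <- PCA(mse, ncomp = 2)
image(pca)        # spatial maps of the leading principal components
```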
