William S. Cleveland

Shanti S. Gupta Distinguished Professor of Statistics

Courtesy Professor of Computer Science

Haas Building 222

250 N. University St., Lafayette, IN 47907

wsc AT purdue DOT edu


PDF of Web Page Content + Publications


Background and Past Research


William S. Cleveland has been the Shanti S. Gupta Distinguished Professor of Statistics and Courtesy Professor of Computer Science at Purdue University since 2004. Prior to this, he was a Distinguished Member of Technical Staff in the Statistics Research Department at Bell Labs, Murray Hill; for 12 years he was the Department Head.


Cleveland received an A.B. in Mathematics from Princeton University; his senior thesis adviser was William Feller. He received his Ph.D. in Statistics from Yale University; his Ph.D. thesis adviser was Leonard Jimmie Savage.

Awards and Honors

In 1996 Cleveland was chosen national Statistician of the Year by the Chicago Chapter of the American Statistical Association. In 2002 he was selected as a Highly Cited Researcher by the American Society for Information Science and Technology in the newly formed mathematics category. He has twice won the Wilcoxon Prize and once won the Youden Prize from Technometrics. He is a Fellow of the American Statistical Association, the Institute of Mathematical Statistics, and the American Association for the Advancement of Science, and is an Elected Member of the International Statistical Institute.

Data Science

In a talk at the 1999 meeting of the International Statistical Institute and in a 2001 paper, number [25] in the list of publications in the above PDF, Cleveland defined data science as the term is used today. The term had been used before, but with different meanings; see the Wikipedia Web page Data Science. The paper was republished in 2014 [1] together with a discussion and with another paper about D&R with Tessera [2], described next, which requires work in all technical areas of data science.

The technical areas of data science are those that have an impact on how a data analyst analyzes data: (1) Statistical theory; (2) Statistical models; (3) Statistical and machine-learning methods; (4) Algorithms for statistical and machine-learning methods, and optimization; (5) Computational systems for data analysis; (6) Live analyses of data, where results are judged by the findings, not the methodology and systems that were used.

The implication for an academic department is that it is not necessary for each individual to do research in all areas. Rather, collectively, the department needs to have research in all areas. There must be an exchange of knowledge so that all department members have at least a basic understanding of all areas.

Areas of Research

Cleveland's areas of research have been in statistics, machine learning, data visualization, data analysis for multidisciplinary studies, and high performance computing for deep data analysis.

Data Analysis Projects

Cleveland has been involved in many projects requiring the analysis and modeling of very diverse datasets from many fields, including computer networking, healthcare engineering, telecommunications, homeland security, environmental monitoring, public opinion polling, cyber security, and visual perception. Since circa 2008, many of the analyzed datasets have been big and have required analytic methods of high computational complexity.

Widely-Used Methods and Their Publication

In the course of this work in data analysis, Cleveland has developed many new analytic methods and new computer systems for data analysis that are used throughout the worldwide technical community. He has published 129 papers and 3 books on this work. See the PDF above for a chronological list. For citations to the publications, see the Web page Google Citations.

Data Visualization

In data visualization, Cleveland has written two books, co-authored a third book and a user's manual, and was the Editor of two books and a special issue of the Journal of the American Statistical Association. He is the founder of the Graphics Section of the American Statistical Association, which means he led the group that successfully petitioned the ASA board of directors for approval.

His two books on data visualization have been reviewed in many journals from a wide variety of disciplines. The Elements of Graphing Data was selected for the Library of Science Book Club. J. Lodge reviewed it in Atmospheric Environment and wrote: "certain kinds of tendency toward bad graphics could be cured if as many authors as possible would not just read, but, in the words of the Anglican Prayer Book, `learn, mark, and inwardly digest' this volume." B. Gunter reviewed Visualizing Data in Technometrics and wrote: "This is a terrific book --- in my opinion, a path-breaking book. Get it. Read it. Practice what it preaches. You will improve the quality of your data analysis."

Cleveland and colleagues developed trellis display, a powerful framework for data visualization. It has been used by a large, worldwide community of data analysts as a result of its implementation in the two software systems based on the S language: the commercial S-Plus system and the open-source R system.

Current Research

The Term "Big Data" Misses Badly

The widely used term "big data" suggests that the computational challenge of analyzing a dataset comes from its size. But for data analysis, computational performance depends heavily not just on size but on the computational complexity of the analytic routines used in the analysis. Data small in size can be a big challenge, too. Furthermore, the hardware power available to the data analyst is an important factor.

High Performance Computing for Deep Data Analysis

Cleveland's current research is in High Performance Computing for Deep Data Analysis (HPC-DDA). The goal is to enable both deep analysis and HPC.

HPC means computations are feasible and practical for wide ranges of dataset size, computational complexity, and hardware power. Deep analysis means analyzing data at their finest granularity, not just summary statistics. Deep analysis also means that the analyst can apply any of the thousands of methods of statistics, machine learning, and data visualization.

The goal is achieved by work on the Divide & Recombine (D&R) statistical approach to analysis, and on Tessera, the D&R software implementation that makes programming D&R easy. The work ranges from statistical theory to cluster design, covering all of the areas of data science; furthermore, work in the different areas is highly integrated, one area affecting another. Integrated work in data science is necessary to succeed.

Divide and Recombine (D&R)

Cleveland and colleagues have been developing D&R since 2009. In D&R, the data are divided into subsets (div), analytic methods are applied to each subset independently without communication between subsets (ana), and the subsets' outputs for each method are recombined (rec). Research in statistical theory seeks division methods and recombination methods to optimize the statistical accuracy of D&R results. While D&R is a statistical approach, its goal is HPC. Most of the D&R computation is embarrassingly parallel, the simplest and fastest parallel computation.
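The div-ana-rec steps above can be sketched in a few lines. This is a minimal illustration of the D&R pattern, not the Tessera implementation: the dataset, the subset statistic (a mean), and the weighted-average recombination are all hypothetical choices made for the example. The ana step runs in an embarrassingly parallel fashion, with no communication between subsets.

```python
# Minimal D&R sketch (hypothetical data and statistic, not Tessera code).
from multiprocessing import Pool
from statistics import mean

def divide(data, n_subsets):
    """div: split the data into roughly equal subsets."""
    return [data[i::n_subsets] for i in range(n_subsets)]

def analyze(subset):
    """ana: apply an analytic method to one subset, independently,
    with no communication between subsets."""
    return (len(subset), mean(subset))

def recombine(outputs):
    """rec: recombine subset outputs; here, a weighted average
    of the per-subset estimates."""
    total = sum(n for n, _ in outputs)
    return sum(n * est for n, est in outputs) / total

if __name__ == "__main__":
    data = list(range(1, 101))           # toy dataset: 1, 2, ..., 100
    subsets = divide(data, 4)
    with Pool(4) as pool:                # embarrassingly parallel step
        outputs = pool.map(analyze, subsets)
    print(recombine(outputs))            # 50.5, the mean of 1..100
```

For a mean, the weighted-average recombination recovers the all-data answer exactly; for more complex methods, choosing division and recombination procedures so that the D&R result is statistically close to the hypothetical all-data result is precisely the theoretical research question described above.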


Tessera D&R software runs on a cluster. The front end has R and the Tessera datadr R package, which makes D&R programming easy. The back end, the Hadoop distributed file system and parallel compute engine, executes the datadr/R commands: (div), (ana), and (rec). In between, the R package RHIPE (R and Hadoop Integrated Programming Environment) provides communication between datadr and Hadoop.

PhD Students

Cleveland is currently the advisor for 6 students; since joining Purdue in 2004, he has advised 18 students. The students have made major contributions to research in D&R with Tessera.

More Information

For more information see the Web page tessera.io and papers [2-3], [7], and [12] in the above PDF.