Recommended Reading List (Textbooks)

Many people have asked me in person for pointers to good books for ramping up in the field. I’ve casually passed lists around quite often, so I thought I’d share one here.

Below I will begin compiling a list of books (though some may simply be manuscripts from professors) that are well known, widely read, and/or frequently cited, to help Ph.D. students get a grip on the noteworthy theories and practices. I will update this list frequently, so please feel free to come back often. (I’ve temporarily stripped the proper citation format here; it will be edited in later.)

“Bird’s-eye-view” textbooks:

Pattern Recognition and Machine Learning. Christopher Bishop.

Machine Learning: A Probabilistic Perspective. Kevin P. Murphy.

Advanced Data Analysis from an Elementary Point of View. Cosma Rohilla Shalizi.

Subject-focused textbooks / manuscripts:

Graphical Models

Graphical Models, Exponential Families, and Variational Inference. Martin J. Wainwright, Michael I. Jordan.

An Introduction to Conditional Random Fields. Charles Sutton, Andrew McCallum.

Discrete Models

Categorical Data Analysis. Alan Agresti.


Optimization

Introductory Lectures on Convex Optimization. Yurii Nesterov.

Convex Optimization. Stephen Boyd, Lieven Vandenberghe.

Deep Learning

Deep Learning. Ian Goodfellow, Yoshua Bengio, Aaron Courville.

Probability Theory / Measure Theory

Introduction to Probability Models. Sheldon M. Ross.

Measure Theory and Fine Properties of Functions. Lawrence Craig Evans, Ronald F. Gariepy.

Probability Essentials. Jean Jacod, Philip Protter.

Probabilistic Symmetries and Invariance Principles. Olav Kallenberg.

Stochastic Processes / Stochastic Differential Equations

Poisson Processes. J. F. C. Kingman.

Stochastic Methods. Crispin Gardiner.

An Introduction to Stochastic Differential Equations. Lawrence Craig Evans.

Stochastic Differential Equations: An Introduction with Applications. Bernt Øksendal.

Determinantal Point Processes for Machine Learning. Alex Kulesza, Ben Taskar.

Gaussian Processes for Machine Learning. Carl Edward Rasmussen, Christopher K. I. Williams.

Optimal Transport

Computational Optimal Transport. Gabriel Peyré, Marco Cuturi.

Linear Algebra


Real Analysis


Complex Analysis


Functional Analysis

Ordinary / Partial Differential Equations

Partial Differential Equations. Lawrence Craig Evans.

Differential Geometry

A Comprehensive Introduction to Differential Geometry. Volume One, Two and Three. Michael Spivak.

Statistical Inference

Statistical Inference. George Casella, Roger L. Berger.

Testing Statistical Hypotheses. Erich L. Lehmann, Joseph P. Romano.

Semiparametric Theory and Missing Data. Anastasios A. Tsiatis.

Computer Age Statistical Inference: Algorithms, Evidence and Data Science. Bradley Efron, Trevor Hastie.

The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani, Jerome Friedman.

Bayesian Statistics

Bayesian Data Analysis. Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin.

Bayesian Approximate Inference

Handbook of Markov Chain Monte Carlo. Steve Brooks, Andrew Gelman, Galin L. Jones, Xiao-Li Meng.

Reinforcement Learning

Reinforcement Learning: An Introduction. Richard S. Sutton, Andrew G. Barto.

Causal Inference

Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Guido W. Imbens, Donald B. Rubin.

Causality: Models, Reasoning and Inference. Judea Pearl.

Counterfactuals and Causal Inference: Methods and Principles for Social Research. Stephen L. Morgan, Christopher Winship.

Causal Inference. Miguel A. Hernán, James M. Robins.

Elements of Causal Inference: Foundations and Learning Algorithms. Jonas Peters, Dominik Janzing, Bernhard Schölkopf.

Information Retrieval

Introduction to Information Retrieval. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze.

Data Mining



Econometrics

Mostly Harmless Econometrics: An Empiricist’s Companion. Joshua D. Angrist, Jörn-Steffen Pischke.

Mathematical & Computational Finance


Quantum Physics / Chemistry


Algebraic & Computational Game Theory



Michael Jordan’s Reading List

From a very old post of mine:

[UPDATE] Additional books from Mike Jordan’s AMA:

  1. A. Tsybakov’s book “Introduction to Nonparametric Estimation” as a very readable source for the tools for obtaining lower bounds on estimators
  2. Y. Nesterov’s very readable “Introductory Lectures on Convex Optimization” as a way to start to understand lower bounds in optimization
  3. A. van der Vaart’s “Asymptotic Statistics”, a book that we often teach from at Berkeley, as a book that shows how many ideas in inference (M estimation—which includes maximum likelihood and empirical risk minimization—the bootstrap, semiparametrics, etc) repose on top of empirical process theory
  4. B. Efron’s “Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction”, as a thought-provoking book

I saw this awesome list of books that Prof. Jordan recommended for going into machine learning in depth (though VERY deep and research-oriented):

Mike Jordan at Berkeley sent me his list on what people should learn for ML. The list is definitely on the more rigorous side (i.e. aimed more at researchers than practitioners), but going through these books (along with the requisite programming experience) is a useful, if not painful, exercise.

I personally think that everyone in machine learning should be (completely) familiar with essentially all of the material in the following intermediate-level statistics book:

1.) Casella, G. and Berger, R.L. (2001). “Statistical Inference” Duxbury Press.

For a slightly more advanced book that’s quite clear on mathematical techniques, the following book is quite good:

2.) Ferguson, T. (1996). “A Course in Large Sample Theory” Chapman & Hall/CRC.

You’ll need to learn something about asymptotics at some point, and a good starting place is:

3.) Lehmann, E. (2004). “Elements of Large-Sample Theory” Springer.

Those are all frequentist books. You should also read something Bayesian:

4.) Gelman, A. et al. (2003). “Bayesian Data Analysis” Chapman & Hall/CRC.

and you should start to read about Bayesian computation:

5.) Robert, C. and Casella, G. (2005). “Monte Carlo Statistical Methods” Springer.

On the probability front, a good intermediate text is:

6.) Grimmett, G. and Stirzaker, D. (2001). “Probability and Random Processes” Oxford.

At a more advanced level, a very good text is the following:

7.) Pollard, D. (2001). “A User’s Guide to Measure Theoretic Probability” Cambridge.

The standard advanced textbook is Durrett, R. (2005). “Probability: Theory and Examples” Duxbury.

Machine learning research also reposes on optimization theory. A good starting book on linear optimization that will prepare you for convex optimization:

8.) Bertsimas, D. and Tsitsiklis, J. (1997). “Introduction to Linear Optimization” Athena.

And then you can graduate to:

9.) Boyd, S. and Vandenberghe, L. (2004). “Convex Optimization” Cambridge.

Getting a full understanding of algorithmic linear algebra is also important. At some point you should feel familiar with most of the material in

10.) Golub, G., and Van Loan, C. (1996). “Matrix Computations” Johns Hopkins.

It’s good to know some information theory. The classic is:

11.) Cover, T. and Thomas, J. “Elements of Information Theory” Wiley.

Finally, if you want to start to learn some more abstract math, you might want to start to learn some functional analysis (if you haven’t already). Functional analysis is essentially linear algebra in infinite dimensions, and it’s necessary for kernel methods, for nonparametric Bayesian methods, and for various other topics. Here’s a book that I find very readable:

12.) Kreyszig, E. (1989). “Introductory Functional Analysis with Applications” Wiley.

Merging the Bayesian tools with NNs

This is a copy of a post I had way back:

My training at Columbia was quite a condensed crash course in Bayesian statistics. This included, but was not limited to, the typical probabilistic graphical model thinking, generative model construction, hierarchical extensions to all sorts of tools, stochastic processes (and often the measure theory entangled within) and their modern nonparametric (as well as semiparametric) applications. Well, and inference + sampling, of course. Only a small crowd in academia at Columbia (back then) seemed to really care about neural nets, despite their widespread industrial success, so I was a bit underexposed to them until I read a lot about word embeddings lately.

Knowing the work of many famous Bayesian ML folks, like Yee Whye Teh, Mike Jordan, Max Welling and D.P. Kingma, I find the hybridization and integration of probabilistic generative models with neural nets more appealing than ever. Many of the fancy tools and smart tricks (e.g. variational inference, Gaussian processes, etc.) have found adapted versions in the neural net community, which has sparked quite a wave of work on scaling inference and on more robust modeling to enhance NNs.

OpenAI’s blog post on generative modeling certainly got me hooked. Unlike much of the work on supervised NNs, they focused a lot on unsupervised models, sometimes viewing the problem from a different perspective (e.g. Generative Adversarial Networks), which is quite appealing. I may consider moving into this field in the future.

Now, in hindsight, there is a lot more to sit down and write about. I will be offering some short discussions on hybrid models very soon.

Recommendations on Columbia Courses for Machine Learning & Statistics

For years I’ve kept an extremely long list of resources, online and offline and in various forms, for machine learning, statistics, programming, video game production, and more. Since many people have asked me what to take at Columbia to advance their careers in data science, let me begin with one tiny fraction of that list: great Columbia courses that teach you the theory, practice, implementation, and thought process of doing things the pro way. If you’re starting at Columbia with some basic knowledge of statistics and probability, this is the road-to-pro list I’ve come up with (most of these I’ve taken or audited; some are listed with their old course numbers, from before they switched to the GR ones):

Ph.D. level courses:

  • Bayesian Data Analysis (Andrew Gelman, STAT G6103) (Applied Bayesian statistics, social studies, hierarchical regression models, PPC, Bayesian inference, Stan, etc.)
  • Computational Statistics (Liam Paninski, STAT G6104)
  • Gaussian Process & Kernel Methods (John Cunningham, STAT G8325) (RKHS, kernels, advanced approximate Bayesian inference, GP, etc.)
  • Bayesian Nonparametrics (John Paisley, ELEN 9801) (Beta process, Poisson process, DP, IBP, etc.)
  • Statistical Communication (Andrew Gelman)
  • Causal Inference (Jose Zubizarreta, DROM B9124)
  • Foundations of Graphical Models (David Blei, STAT G6509)
  • Applied Causality (David Blei, STAT GR8101)
  • Probabilistic Models with Discrete Data (David Blei, COMS 6998)
  • Probability Theory I (Marcel Nutz, STAT GR6301) (Probability, measure, expectations, LLN, CLT, etc.)
  • Probability Theory II (Peter Orbanz, STAT G6106) (Topology, filtrations, measure theory, martingales, etc.)
  • Probability Theory III (Marcel Nutz, STAT GR6303) (semimartingales, stochastic processes, Wiener process, SDEs, etc.)
  • Statistical Inference II (Sumit Mukherjee, STAT GR6202) (Statistical testing, nonparametric inference, etc.)
  • Statistical Inference III (Zhiliang Ying, STAT GR6203) (Semiparametric inference, etc.)
  • Bayesian Nonparametrics (Peter Orbanz, STAT GR8201) (DP, IBP, GP, PP, random measures, approximate inference)
  • Neural Networks and Deep Learning (Aurel Lazar, COMS 6998)
  • Seminar in Theoretical Statistics (STAT GR9201)
  • High Dimensional Data Analysis (Aleksandr Aravkin & Aurelie Lozano, COMS 6998)
  • Optimization I (Donald Goldfarb, IEOR 6613) (LP, convex optimization, Newton methods, quasi-Newton methods, etc.)
  • Optimization II (Clifford Stein, IEOR 6614) (Graph theory)

non-Ph.D. level courses:

  • Natural Language Processing (Michael Collins, COMS W4705)
  • Design and Analysis of Sample Surveys (Andrew Gelman, POLS GU4764)
  • Advanced Machine Learning (Tony Jebara, COMS W4772)
  • Advanced Machine Learning (Daniel Hsu, COMS W4772)
  • Nonlinear Optimization (Donald Goldfarb, IEOR E4009)

You might now wonder:

  • Where are the distributed systems courses?
  • Where are the MapReduce courses?
  • Where are the SaaS courses?
  • Where are the low-latency courses?
  • Where are the data visualization courses?
  • Where are the data mining courses?
  • Where are the mathematical finance courses?
  • Where are the data structure courses?
  • Where are the algorithm courses?
  • Where are the learning theory courses?

My answer: Although those topics play prime roles in modern implementations of machine learning systems, it’s really difficult to draw the right (and better) extrapolations from data, and easy to sink and get buried in the implementation of real-world application systems. The courses in my list concentrate on deepening your insight through the underlying history of the mathematical and statistical developments that produced these tools, and on the limitations the tools have. Courses that answer the questions above, on the other hand, can easily be found on the roadmaps of Columbia’s own data science master’s degree. Those roadmaps are rather standard and prepare you more for industry as an engineer specializing in data science than for complicated real-world analysis of non-trivial problems.

Moved to WordPress!


I wanted to temporarily go without a personal server, and this turned out to be one of the best solutions. Setting up webpages that render LaTeX in HTML takes some struggle (not hard, but I am too lazy right now). I am seriously considering KaTeX + other webpage packages to fire up a highly personalized site some time soon, but more on that later.

Let’s try this:

$$p(x \mid y) = \frac{p(x, y)}{p(y)}$$

It works!
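
As a quick numerical sanity check of the identity above, here is a minimal sketch in Python. The joint table `joint` and the helper names `marginal_y` and `conditional_x_given_y` are made up for illustration and do not come from any library; exact fractions are used so the arithmetic is easy to verify by hand.

```python
from fractions import Fraction

# Hypothetical joint distribution p(x, y) over x in {0, 1} and y in {0, 1}.
# The numbers are illustrative only; they sum to 1.
joint = {
    (0, 0): Fraction(1, 10),
    (0, 1): Fraction(3, 10),
    (1, 0): Fraction(2, 10),
    (1, 1): Fraction(4, 10),
}


def marginal_y(joint, y):
    """p(y) = sum over x of p(x, y)."""
    return sum(p for (_, y_val), p in joint.items() if y_val == y)


def conditional_x_given_y(joint, x, y):
    """p(x | y) = p(x, y) / p(y), defined only when p(y) > 0."""
    p_y = marginal_y(joint, y)
    if p_y == 0:
        raise ValueError("p(y) is zero, so p(x | y) is undefined")
    return joint.get((x, y), Fraction(0)) / p_y


if __name__ == "__main__":
    for x in (0, 1):
        print(f"p(x={x} | y=1) =", conditional_x_given_y(joint, x, 1))
```

With this table, p(y=1) = 3/10 + 4/10 = 7/10, so the two conditionals given y=1 are 3/7 and 4/7, and they sum to one as they should.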

A few topics to be discussed here soon (details, order and schedule TBD):

  • Gumbel softmax trick / Concrete distribution
  • Hamiltonian Monte Carlo (HMC, NUTS, stochastic gradient HMC)
  • Variational inference (VI, SVI, BBVI, ADVI, RSVI, OPVI, VGP, Hamiltonian VI, etc.)
  • Generative adversarial networks (GAN, InfoGAN, Conditional GAN, Wasserstein GAN, DCGAN, ALI, BiGAN, LS-GAN, etc.)
  • Causal inference (potential outcomes, SUTVA, instrumental variables, propensity scores, causal graphs, Bayesian, etc.)
  • Expectation maximization (EM, stochastic EM, Monte Carlo EM, etc.)
  • Attention models (DRAW, One-shot)
  • Recurrent neural nets (RNN, LSTM, GRU)
  • Resampling methods (parametric bootstrap, nonparametric bootstrap, jackknife estimator)
  • Tensor probabilistic models (Tucker decomposition, HOSVD, PARAFAC)
  • Sequential Monte Carlo
  • Nonparametric Bayesian models (CRP, HDP, IBP, GP, stick breaking construction, complete random measures, etc.)
  • Variational autoencoders (VAE, SVAE)
  • Proximal gradient methods
  • Reinforcement learning (model-based, model-free, Q-learning, Monte Carlo tree search, TD(λ), SARSA, etc.)
  • Dimensionality reduction (PCA, robust PCA, probabilistic PCA, ICA, etc.)
  • Linear regression
  • Logistic regression
  • ML/Statistics tricks