Michael Jordan’s Reading List

From a very old post of mine:

[UPDATE] Additional books from Mike Jordan’s AMA:

  1. A. Tsybakov’s book “Introduction to Nonparametric Estimation” as a very readable source for the tools for obtaining lower bounds on estimators
  2. Y. Nesterov’s very readable “Introductory Lectures on Convex Optimization” as a way to start to understand lower bounds in optimization
  3. A. van der Vaart’s “Asymptotic Statistics”, a book that we often teach from at Berkeley, as a book that shows how many ideas in inference (M-estimation, which includes maximum likelihood and empirical risk minimization; the bootstrap; semiparametrics; etc.) repose on top of empirical process theory
  4. B. Efron’s “Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction”, as a thought-provoking book

I saw this awesome list of books that Prof. Jordan recommended for going into machine learning in depth (though it is VERY deep and research-oriented):


Mike Jordan at Berkeley sent me his list of what people should learn for ML. The list is definitely on the more rigorous side (i.e., aimed more at researchers than practitioners), but going through these books (along with the requisite programming experience) is a useful, if not painful, exercise.

I personally think that everyone in machine learning should be (completely) familiar with essentially all of the material in the following intermediate-level statistics book:

1.) Casella, G. and Berger, R.L. (2001). “Statistical Inference” Duxbury Press.

For a slightly more advanced book that’s quite clear on mathematical techniques, the following book is quite good:

2.) Ferguson, T. (1996). “A Course in Large Sample Theory” Chapman & Hall/CRC.

You’ll need to learn something about asymptotics at some point, and a good starting place is:

3.) Lehmann, E. (2004). “Elements of Large-Sample Theory” Springer.

Those are all frequentist books. You should also read something Bayesian:

4.) Gelman, A. et al. (2003). “Bayesian Data Analysis” Chapman & Hall/CRC.

and you should start to read about Bayesian computation:

5.) Robert, C. and Casella, G. (2005). “Monte Carlo Statistical Methods” Springer.

On the probability front, a good intermediate text is:

6.) Grimmett, G. and Stirzaker, D. (2001). “Probability and Random Processes” Oxford.

At a more advanced level, a very good text is the following:

7.) Pollard, D. (2001). “A User’s Guide to Measure Theoretic Probability” Cambridge.

The standard advanced textbook is Durrett, R. (2005). “Probability: Theory and Examples” Duxbury.

Machine learning research also reposes on optimization theory. A good starting book on linear optimization that will prepare you for convex optimization:

8.) Bertsimas, D. and Tsitsiklis, J. (1997). “Introduction to Linear Optimization” Athena.

And then you can graduate to:

9.) Boyd, S. and Vandenberghe, L. (2004). “Convex Optimization” Cambridge.

Getting a full understanding of algorithmic linear algebra is also important. At some point you should feel familiar with most of the material in

10.) Golub, G., and Van Loan, C. (1996). “Matrix Computations” Johns Hopkins.

It’s good to know some information theory. The classic is:

11.) Cover, T. and Thomas, J. “Elements of Information Theory” Wiley.

Finally, if you want to start to learn some more abstract math, you might want to start to learn some functional analysis (if you haven’t already). Functional analysis is essentially linear algebra in infinite dimensions, and it’s necessary for kernel methods, for nonparametric Bayesian methods, and for various other topics. Here’s a book that I find very readable:

12.) Kreyszig, E. (1989). “Introductory Functional Analysis with Applications” Wiley.
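
To make the kernel-methods connection above concrete: by the representer theorem, the “linear algebra in infinite dimensions” collapses back to ordinary finite linear algebra on the Gram matrix of the training points. Below is a minimal sketch of kernel ridge regression with an RBF kernel, using only NumPy; the toy data, bandwidth, and regularization strength are illustrative assumptions of mine, not anything from Jordan’s list.

```python
import numpy as np

def rbf_kernel(X1, X2, bandwidth=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def kernel_ridge_fit(X, y, lam=1e-2, bandwidth=1.0):
    """Solve (K + lam * n * I) alpha = y. The representer theorem says the
    infinite-dimensional RKHS solution is a finite linear combination of
    kernel functions centered at the training points."""
    n = X.shape[0]
    K = rbf_kernel(X, X, bandwidth)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def kernel_ridge_predict(X_train, alpha, X_test, bandwidth=1.0):
    """f(x) = sum_i alpha_i * k(x_i, x)."""
    return rbf_kernel(X_test, X_train, bandwidth) @ alpha

# Toy usage: fit a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = kernel_ridge_fit(X, y)
y_hat = kernel_ridge_predict(X, alpha, X)
print("train MSE:", float(np.mean((y - y_hat) ** 2)))
```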

Merging Bayesian Tools with NNs

This is a copy of a post I wrote a while back:

My training at Columbia was a condensed crash course in Bayesian statistics. It included, but was not limited to, the usual probabilistic graphical model thinking, generative model constructions, hierarchical extensions of all sorts of tools, and stochastic processes (with the measure theory often entangled within them) together with their modern nonparametric (and semi-parametric) applications. And, of course, inference and sampling. Only a small crowd in academia at Columbia (at the time) seemed to really care about neural nets, despite their widespread industrial success, so I had relatively little exposure to them until I started reading a lot about word embeddings recently.

Knowing the work of many famous Bayesian ML folks, like Yee Whye Teh, Mike Jordan, Max Welling, and D. P. Kingma, the integration of probabilistic generative models with neural nets is more appealing than ever. Many of the fancy tools and smart tricks (e.g., variational inference, Gaussian processes) have been adopted by the neural net community, which has sparked waves of work on scaling inference and on more robust modeling to enhance NNs.
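
As one concrete example of that adoption, here is a minimal sketch of the reparameterization trick behind Kingma and Welling’s variational autoencoder: a Gaussian variational posterior is sampled as mu + sigma * eps with eps ~ N(0, 1), so a Monte Carlo estimate of the ELBO stays differentiable in the variational parameters and can be optimized by backpropagation. PyTorch and the tiny conjugate toy model below are my own illustrative assumptions, not anything from the original post.

```python
import torch

# Toy model: prior z ~ N(0, 1), likelihood x | z ~ N(z, 1), one observed x.
x = torch.tensor(1.5)

# Variational parameters of q(z | x) = N(mu, sigma^2); log_sigma keeps sigma > 0.
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(200):
    opt.zero_grad()
    sigma = log_sigma.exp()

    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, 1),
    # so each sample is a differentiable function of (mu, log_sigma).
    eps = torch.randn(64)                # 64 Monte Carlo samples
    z = mu + sigma * eps

    # Monte Carlo estimate of the negative ELBO:
    #   -E_q[ log p(x | z) + log p(z) - log q(z) ]   (all up to constants)
    log_lik = -0.5 * (x - z) ** 2        # log p(x | z)
    log_prior = -0.5 * z ** 2            # log p(z)
    log_q = -0.5 * eps ** 2 - log_sigma  # log q(z) with z = mu + sigma * eps
    neg_elbo = -(log_lik + log_prior - log_q).mean()

    neg_elbo.backward()
    opt.step()

# The exact posterior here is N(x/2, 1/2), so mu should approach 0.75
# and sigma should approach sqrt(0.5) ~ 0.71.
print("posterior mean ~", mu.item(), "posterior std ~", log_sigma.exp().item())
```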

OpenAI’s blog post on generative modeling certainly got me hooked. Unlike much supervised NN work, it focuses heavily on unsupervised models, sometimes recasting the problem from a different perspective (e.g., Generative Adversarial Networks), which is quite appealing. I may consider moving into this field in the future.

https://openai.com/blog/generative-models/
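
For reference, the “different perspective” in the GAN formulation of Goodfellow et al. is a two-player game rather than a likelihood maximization: a generator G and a discriminator D are trained against each other via the minimax objective

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big],$$

so the generator improves by fooling a learned critic instead of by maximizing an explicit likelihood.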

Now, in hindsight, there is a lot more to sit down and write. I will be offering some short discussions of hybrid models very soon.