Data science is an interdisciplinary field that draws on methods and techniques from areas such as statistics, machine learning, and Bayesian inference. They all aim to generate specific insights from data. In this article, we list some excellent data science books that cover the wide variety of topics under Data Science.
This book gives an overview of Data Science. Data Science is a very large umbrella term, and this book is good for anyone trying to get their feet wet in the field for the first time. Read it to understand what Data Science is, what its common tasks and algorithms are, and to pick up some general tips and tricks.
Foundations of Data Science is a treatise on selected fields that form the basis of Data Science, like Linear Algebra, LDA, Markov Chains, Machine Learning basics, and statistics. The ideal readers for the book are beginner data scientists who want to strengthen their mathematical and theoretical grasp of the field.
Based on Stanford courses CS246 and CS345A, the book teaches the techniques needed for Data Mining on large datasets. A very common problem for a data scientist is performing simple numerical tasks (which you could do with a small program) on a very large dataset. MMDS (Mining of Massive Datasets) addresses exactly that. On top of that, topics like Dimensionality Reduction and Recommendation Systems show you how linear algebra and distance metrics are applied in the real world. An absolute must-read for all Data Scientists.
Python Data Science Handbook teaches the application of various Data Science concepts in Python. It is probably the best book for learning Data Science in Python (its only equivalent is Wes McKinney’s mouse book), and it is also free to read on GitHub, so you can learn without spending any money.
6. Think Stats
Think Stats teaches readers the basics of statistics: readers apply statistical concepts and distributions to real-world datasets and learn more about the data through its mathematical characteristics. Probably one of the best books to get started with if you want to learn statistics with Python.
7. Think Bayes
Bayesian Statistics works somewhat differently from classical (frequentist) statistics. Its explicit treatment of uncertainty and its way of fitting distributions to observed data make Bayesian methods well suited to learning from real-world datasets. Prof. Downey’s extremely cool “learn by programming it in Python” style makes the book a treat for those getting started with Bayesian Methods.
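To make the idea of a Bayesian update concrete, here is a minimal sketch in plain Python. The two-bowl cookie scenario, the function name `bayes_update`, and the numbers are illustrative assumptions and are not taken from the book itself.

```python
def bayes_update(prior, likelihood):
    """Return the posterior over hypotheses.

    prior and likelihood are dicts keyed by hypothesis name;
    likelihood[h] is P(observed data | h).
    """
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


# Hypothetical setup: bowl_1 holds 30 vanilla and 10 chocolate cookies,
# bowl_2 holds 20 of each. A bowl is picked at random and a vanilla
# cookie is drawn from it.
prior = {"bowl_1": 0.5, "bowl_2": 0.5}
likelihood_vanilla = {"bowl_1": 30 / 40, "bowl_2": 20 / 40}

posterior = bayes_update(prior, likelihood_vanilla)
print(posterior)
```

After seeing the vanilla cookie, the belief shifts toward the bowl that contains proportionally more vanilla cookies.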
This book teaches applied Linear Algebra in real-world systems. The applications involve circuits, signal processing, communications, and control systems. Course notes from previous years by Professor Boyd are also available.
Convex Optimization is what many Machine Learning algorithms (and almost all Deep Learning algorithms) use in the background to arrive at the optimal set of parameters.
Metaheuristics are fast, probabilistic search methods for tasks that you would otherwise have to solve by writing a Brute Force search. For small data, Brute Force approaches take less effort to implement, but they quickly become infeasible as the amount of data grows. This book is probably the best introduction to metaheuristic methods like Genetic Algorithms, Hill Climbing, Co-Evolution, and (basic) Reinforcement Learning.
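As a rough illustration of how an optimizer arrives at a set of parameters, the sketch below runs plain gradient descent on a one-dimensional convex function. The loss f(w) = (w - 4)^2, the learning rate, and the function names are assumptions chosen for the example; real models optimize many parameters at once.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    w = start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


def loss_gradient(w):
    # Gradient of the illustrative convex loss f(w) = (w - 4)^2
    return 2.0 * (w - 4.0)


w_opt = gradient_descent(loss_gradient, start=0.0)
print(w_opt)  # approaches the minimizer w = 4
```

Because the loss is convex, every step moves toward the single global minimum, which is what makes this family of methods attractive.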
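Hill Climbing is the simplest of the methods named above and can be sketched in a few lines. The score function, step size, and iteration count below are illustrative assumptions, not something prescribed by the book.

```python
import random


def hill_climb(score, start, step=0.1, iterations=5000, seed=0):
    """Stochastic hill climbing: keep a random nearby candidate only if it scores higher."""
    rng = random.Random(seed)
    best, best_score = start, score(start)
    for _ in range(iterations):
        candidate = best + rng.uniform(-step, step)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best


def score(x):
    # Illustrative objective with a single peak at x = 3
    return -(x - 3.0) ** 2


best_x = hill_climb(score, start=0.0)
print(best_x)
```

On an objective with several peaks, this simple version can get stuck on a local one, which is the motivation for the more elaborate metaheuristics.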
11. Machine Learning in Python: Main Developments and Technology Trends in Data Science, Machine Learning, and Artificial Intelligence
A good overview of Python tools in Data Science. A very good document for a senior Python developer wanting to get into Data Science, or for someone moving from R to Python for Data Science. Overall, if you want to understand what Python can do for Data Science, you should read this article.
Applied Data Science by Langmore and Krasner is a book that takes a very practical approach to teaching Data Science. Starting with Git and basic Python, the book goes on to build up the fundamentals of various algorithms that are used frequently in the field of Data Science.
13. Bandit Book
With more and more data being accumulated, decision making is no longer a function of intuition but a function of collected data. From choosing the right color of a buy button on an e-commerce website to drug testing and financial portfolio decisions, bandit algorithms are used everywhere. A very good book to acquaint oneself with “banditry”!
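For a feel of what a bandit algorithm does, here is a minimal epsilon-greedy sketch for the buy-button scenario. The three conversion rates, the value of epsilon, and the function name are hypothetical and picked only to make the behavior visible; the book covers far more refined strategies.

```python
import random


def epsilon_greedy(true_rates, steps=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy agent on Bernoulli arms.

    true_rates are the hidden success probabilities of each arm; the
    agent only ever sees the 0/1 rewards it collects.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore a random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean reward for this arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


# Hypothetical conversion rates for three button colors
counts, estimates = epsilon_greedy([0.2, 0.5, 0.8])
print(counts, estimates)
```

Most pulls end up on the arm with the highest estimated reward, while the small exploration rate keeps the other estimates from going stale.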
A book that teaches you to code many numerical algorithms in Python. An excellent resource if you want to learn how mathematical programs are implemented or want to learn Python with interesting problem statements.
A book by Efron and the legendary Hastie on how Statistical Inference (both frequentist and Bayesian) should be done in modern times, using the computational power available today rather than the pen-and-paper approach most other books take. This is a must-read for anyone (beginner or experienced) who intends to use Statistics in real life.
“Correlation is not causation” is a phrase that Data Scientists use a lot. But how do you separate the two? This book provides answers by describing causal inference techniques for Data Scientists. You will need a good grounding in probability to read it, so it is not for total beginners.
Optimal Transport is the mathematics of optimally assigning the mass of one distribution to another. This is probably one of the few fields in Data Science that has won more than one Fields Medal (the highest honor in Mathematics). The mathematical concepts are used in many Machine Learning and Deep Learning algorithms as distance metrics and for solving assignment problems.
18. Algebra, Topology, Differential Calculus and Optimization Theory for Computer Science and Machine Learning
The book aims to teach various mathematical fields required in Computer Science and Machine Learning. Quite mathematical, and a good resource for those who want to come into Data Science from math-heavy fields.
Data Mining, as you might have seen in the more famous MMDS book mentioned earlier, is about doing computations efficiently on a large dataset. Brute force methods can do these computations and might work well on small datasets, but they can take a very long time to run on large ones. A good introductory and reference book for Data Mining.
Looks into various aspects of Data Science, including programming in Python, Causality, Tables, Visualization, and basic statistics. It comes from an introductory course at UC Berkeley, so it is a good resource for beginners.
As the name suggests, the book gives a mathematical treatment of Data Science concepts like Convex Optimization and Dimensionality Reduction. It is recommended if you like Mathematics or are specifically looking to learn the Maths behind these concepts.
Information theory is one of the four mathematical theories you will find in Data Science along with Linear Algebra, Convex Optimization, and Statistics. This is a good tutorial to understand the theory. The good thing is that the tutorial is accessible to beginners.
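The central quantity of information theory, Shannon entropy, fits in a couple of lines. The sketch below is a generic illustration with made-up coin probabilities and is not drawn from the tutorial.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    # Outcomes with zero probability contribute nothing to the sum
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


fair_coin = entropy_bits([0.5, 0.5])
biased_coin = entropy_bits([0.9, 0.1])
print(fair_coin, biased_coin)
```

The fair coin carries a full bit of uncertainty per flip, while the biased coin carries less because its outcome is easier to predict.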
My favorite Linear Algebra book out of the many I mention in this list. It is accessible to beginners and has a very applied feel, so the reader does not get lost in a lot of mathematical concepts.
Many people believe it to be the best beginner Linear Algebra resource available after Strang’s Bible. It is also very applied (with programming exercises in Sage, which is basically Python), but more for beginners than practitioners.
This book feels like my college linear algebra book (which was loved by many students who studied engineering with me). I get a bit lost when there is too much Math and too few applications, but many readers would enjoy the elegance of such books.
This book combines Linear Algebra with Optimization algorithms. Again, a more math-oriented book for people who like the style.
I found it really good; it teaches by showing you multiple solved problems. It has less rigor than the earlier books and more learning-by-showing. A good refresher for people who have not touched Linear Algebra for a long time.
Not everyone will need to read this book, as it deals with probabilistic algorithms for solving Linear Algebra problems. It is useful if you work with large matrices and vectors, where simple algorithms will not work.
A very different way to look at Linear Algebra. If you find Linear Algebra cool, you should try visualizing problems in this new way.
Another free book for college-level Linear Algebra. Good for beginners. It also comes with homework problems if you want to practice.
As the name suggests, the tutorial helps you understand the Matrix Calculus you require for Deep Learning.
Optimizing parameters is required in problems across Engineering fields. While Convex Optimization is used in many Deep Learning algorithms, knowing about other techniques like Linear Programming and the Simplex method broadens one’s horizons.
33. Scipy Lecture Notes
If you are going to work in Data Science, you will need to learn the scientific Python stack. Probably the best all-in-one tutorial for learning NumPy, SciPy, scikit-learn, scikit-image, and all the other libraries you need.
This huge tutorial by the Pandas development team helps you learn and understand the library. Pandas is a must-learn library if you are working in Data Science. There is no escape.
Kalman Filters and other Bayesian Filters are useful when working with noisy data that arrives over time and can be fitted to a model whose parameters must be deduced. These models do two things: they deduce the parameters and they model the noise. Though the most commonly used examples involve location data, similar filters can work well in forecasting too. (Also available on GitHub)
We have looked at multiple Statistical Inference books before this, but this one is written especially with Data Scientists in mind. If you are a Data Scientist trying to get a quick handle on statistical inference, this is your book.
A detailed book teaching you the Mathematics needed to make sense of most of the Machine Learning algorithms out there. Beginner-friendly.
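The predict-and-update cycle behind a Kalman filter can be sketched for the simplest possible case: estimating a single constant level from noisy readings. The constant-state model, the simulated measurements, and every parameter value below are assumptions made for illustration; practical filters track multi-dimensional states such as position and velocity.

```python
import random


def kalman_1d(measurements, measurement_var, process_var=0.0,
              initial_estimate=0.0, initial_var=1000.0):
    """One-dimensional Kalman filter for a (nearly) constant hidden level."""
    estimate, variance = initial_estimate, initial_var
    history = []
    for z in measurements:
        # Predict: the level is assumed constant, so only uncertainty grows
        variance += process_var
        # Update: blend prediction and measurement according to the gain
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)
        variance *= (1.0 - gain)
        history.append(estimate)
    return history, variance


# Simulated noisy readings around a hypothetical true level of 10
rng = random.Random(0)
true_level = 10.0
noise_sd = 2.0
measurements = [true_level + rng.gauss(0.0, noise_sd) for _ in range(200)]

estimates, final_variance = kalman_1d(measurements, measurement_var=noise_sd ** 2)
print(estimates[-1], final_variance)
```

The gain starts close to 1, so early measurements move the estimate a lot, and shrinks as the filter becomes more certain.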
38. Seeing Theory
A book that makes learning probability easy by using interactive visualizations.
A book introducing you to the study of statistics. Beginners who have never learned statistics should start here.
40. Open Statistics
Combination of a book and video lectures introducing readers to Statistics.
A general introduction to different concepts of Data Science. This includes causal models, regression models, factor models and so on. The sample programs are in R.
A book explaining how to optimize databases for fast querying. It covers the various data models used in real-world systems.
Multi-Armed Bandits are algorithms that make decisions over time under uncertainty. This book is an introductory treatise on multi-armed bandits.
Lectures on quantitative economics, with code in your favorite programming language: Python or Julia.
A statistician learning Julia or (somewhat less probable) a Julia programmer learning statistics? Try this book.
Information Theory and Inference are generally treated separately, but the late Prof. MacKay’s book tackles both subjects together.
A not-too-technical tutorial on probabilistic decision making.
This is not really a book on Linear Algebra, but rather a few cool applications of Linear Algebra compiled into a book.
Genetic Algorithms are tools that all Data Scientists need to use at some point in their careers. This tutorial helps beginners understand how Genetic Algorithms work.
If you are working on queuing or other operations research problems, Julia is a programming language you might like a lot. Its programs are as easily readable as Python and run blazingly fast.
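The basic loop of a Genetic Algorithm (selection, crossover, mutation) can be sketched on the toy OneMax problem, where the goal is a bit string of all ones. The problem choice, the population settings, and the function name are illustrative assumptions and are not taken from the tutorial.

```python
import random


def genetic_onemax(n_bits=20, pop_size=30, generations=60,
                   mutation_rate=0.02, seed=0):
    """Evolve bit strings toward all ones; fitness is the number of 1 bits."""
    rng = random.Random(seed)
    fitness = sum
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best_history.append(fitness(population[0]))
        next_gen = [population[0][:]]          # elitism: keep the best as-is
        parents = population[: pop_size // 2]  # truncation selection
        while len(next_gen) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)     # single-point crossover
            child = mother[:cut] + father[cut:]
            # Flip each bit independently with a small probability
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    return best, best_history


best, best_history = genetic_onemax()
print(sum(best), best_history[:5])
```

Keeping the best individual unchanged in each generation guarantees that the best fitness never decreases, at the cost of slightly less diversity.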
Original. Reposted with permission.