By Brock Taute, Data and Systems at Odyssey Energy Solutions.
Figure by author.
If you have finally decided to take the path from Excel-copy-and-paste to reproducible data science, then you will need to know the best route to take. The good news is that there is an abundance of free resources to get you there and awesome online communities to help you along the way. The bad news is that it can get overwhelming to pick which resources to take advantage of. This here is a no-nonsense guide that you can follow without regret, so you can spend less time worrying about the trail and more time trekking it. It’s based on the lessons I learned when I went from a renewable energy project engineer who had never taken a statistics class to the head of a major data platform.
YOU ARE HERE
At the trailhead for this journey, you can find an army of educated individuals doing data analysis by necessity, not passion. They were trained to be engineers and business analysts, and they picked up the easiest software possible to run their equations. Spreadsheets are beautiful because they are so visual. You can literally see and metaphorically feel the numbers through every step of the equations. Building a master spreadsheet is an intimate process. (Don’t believe me? Try criticizing somebody’s color scheme when they show you their new template.) However, spreadsheets quickly reach their limits once you move from prototype to full-scale data analysis. Once you encounter a circular reference that takes a full day to fix, start planning your restroom breaks around when you need to open large files, or spend a week trying to recreate an analysis that somebody else completed, it’s time to move on. For the first steps of your journey, you want to remove the manual steps from your spreadsheets, speed them up, and make the formulas easier to track. You want to start programming.
FIRST TREK: Picking A Programming Language
You are now faced with the most important decision of your life. Most aspiring data scientists never make it past this massive obstacle: what programming language are you going to learn first? To save you some anxiety, you should know that there truly are no wrong answers; it’s like choosing between a puppy and a new car (or a cat and a motorcycle for some of you out there). While there are plenty of programming languages to choose from, there are only two I would recommend at this point on the trail: Python and R.
You could spend months reading articles about which one is better, but they all end up saying the same thing. So, save yourself the trouble and let this be the last time you dwell on the topic. This guide is no-nonsense, remember? I’ll give it to you straight.
If you are going to collaborate with anybody, and they already made this choice, then choose the same language. Life is easier that way. Assuming you are blazing this trail for your team (which is awesome), I would probably recommend picking R. The language was designed specifically to make the lives of non-programmers easier, and the learning community is incredible. More importantly, the RStudio IDE (Integrated Development Environment; the place where you will edit your code) makes getting started a lot easier than it is with Python. (It will feel like using Matlab, for anybody who used that in college.) That said, Python is the more popular language among software engineers and is used a bit more in the “real world” once you start building machine learning applications.
I personally started teaching myself Python for normal computer programming purposes and got caught up on a lot of frustrating stuff (like the darned PATH variable), which made progress slow at the start. When I started learning Data Science, I switched to R and really enjoyed the experience. More recently, I dug into the data science packages for Python and now flip back and forth between the two frequently (something that is surprisingly easy).
If you want to do this topic a bit more justice, read this article, which dives into more detail. Then, pick one and get started.
SECOND TREK: Basic Stats and Tidy Data
Once you’ve picked a language, you need to pick an IDE and learning material. If you chose R, use RStudio and read “R for Data Science” (often abbreviated R4DS) by Garrett Grolemund and Hadley Wickham. If you chose Python, download JupyterLab (via Anaconda) and read the “Python Data Science Handbook” by Jake VanderPlas. Both books are available free online.
Either book will take you from complete novice to beyond-spreadsheet capabilities, enabling you to tackle a wide range of projects. So, do it. Take a data analysis process that was really frustrating you (maybe it was having to copy data from a bunch of CSVs into one template, maybe it was a process that required a bunch of spreadsheets and copying/pasting data between them, etc.) and write an R/Python script to do it for you. When you hit roadblocks, reach out to the community for support.
The biggest step that propelled me forward was understanding the concept of tidy data. For that reason, I recommend reading the “Tidy Data” paper by Hadley Wickham and putting its principles to use in your code.
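As a quick illustration of the idea (my own toy example, not one from the paper), here is a hypothetical “messy” table reshaped into tidy form with pandas, so that each row holds exactly one observation; the site and column names are invented:

```python
import pandas as pd

# "Messy" layout: one row per site, one column per month's energy yield.
wide = pd.DataFrame({
    "site":    ["Solar A", "Solar B"],
    "jan_kwh": [1200, 950],
    "feb_kwh": [1100, 990],
})

# Tidy layout: one row per (site, month) observation.
tidy = wide.melt(id_vars="site", var_name="month", value_name="kwh")
print(tidy)
```

Once data is tidy, grouping, filtering, and plotting all become one-liners, which is exactly why the concept is such an accelerator.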
Also, by far the most fun part of any data analysis is creating awesome visualizations. Make sure you spend a lot of time playing with your plots. They are how you will convince other people that your code is better than their spreadsheets.
Lastly, since you are now doing deeper data analysis, it is worth building a firm grounding in statistics. I recommend “The Art of Statistics” by David Spiegelhalter. It’s a non-textbook that walks through the mindset behind the math of statistics, which is more applicable for somebody coding than a deep dive into the math itself.
You made it! With these very simple steps, you can now call yourself a Data Analyst. You can now do everything you could do in Excel, and then some. Analyzing data is considerably faster, you automate the boring stuff, and you have way more fun making plots. For many, this is as far as you’ll want to go. However, the next couple of steps will look awfully enticing. If you thought you were having fun before, wait until you make your first dashboard.
THIRD TREK: Dashboards
Check out the Shiny R Gallery (https://shiny.rstudio.com/gallery/). Or, for you Pythonistas, look at the Dash Enterprise App Gallery (https://dash-gallery.plotly.host/Portal/). These are dashboards: a place where you can combine all the results from your data analyses into one location, so that your business leaders can marvel at the work you put in and make educated decisions driven by data. (Pretty nice catchphrase, right?) Taken a step further, dashboards can be web apps that let other members of your team run your code through a GUI (Graphical User Interface). Is there software that your team currently uses that drives you nuts? You could recreate it, tailor-made to do exactly what you want and nothing else, seriously reducing the number of clicks it takes. And the output of this program could be a beautiful PDF report.
Long story short, dashboards are dope. You want to master making these. Start by taking one of your analyses and turning its results into a dashboard, and then build on from there. Use the Shiny package for R and the Dash package for Python. There is plenty of documentation to help you out, including the book “Mastering Shiny” by Hadley Wickham, but unlike the basic data science books, I don’t necessarily recommend working all the way through them. Just get coding and use them to help you when you aren’t sure how to do something. Again, the learning community is your friend.
FOURTH TREK: Packages, GitHub, Open-source, Environments
Now that your colleagues are swooning over your dashboards and envious of your automated scripts, you will need to start collaborating. At first, you will probably share your code with somebody to run on their computer through email or filesharing. Similarly, with each new analysis you start, you will likely copy your last analysis and start changing pieces here and there to morph it to the new data. This is how everybody starts, but it quickly gets messy. Plus, you would like a better way to keep track of changes to the code and to let other people edit it jointly. To handle all of this, you will want to turn your code into a package, which you host on GitHub. Then, everybody has access to the code, and you can even make it open source, allowing you to collaborate with the world.
The best resources for learning how to do this are “R Packages” by Hadley Wickham and the official Python Packaging documentation (https://packaging.python.org/overview/). GitHub’s guides are also great resources for learning how to use their platform (https://guides.github.com/).
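For Python, the heart of a modern package is a `pyproject.toml` file at the root of your repository. A minimal sketch looks something like this (the project name, description, and dependency list are placeholders):

```toml
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "yield-tools"            # placeholder package name
version = "0.1.0"
description = "Helpers for our energy yield analyses"
dependencies = ["pandas"]
```

With that file in place, `pip install -e .` installs the package locally in editable mode, so every script and notebook on your machine can import it while you keep developing.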
The first time you build an app for your team, managing local environments will cause you a lot of frustration. By this, I mean that everybody’s computers will have different files installed and nuances within their operating systems that force code running in their “environment” to behave differently from yours. It’s a very messy thing to understand and heavy into computer science instead of basic data science. I avoided learning about environment management as long as I could, but once I did, it made my life so much easier. Whether proactively or by necessity, you will need to learn this yourself. I never found a great resource for learning about this, so I made one here that I recommend you read.
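As one common approach (assuming your team uses conda), an `environment.yml` file pins the environment so a colleague can recreate it with `conda env create -f environment.yml`; the name and version numbers below are just examples:

```yaml
# environment.yml — a minimal, shareable environment spec
name: yield-analysis
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pandas=2.1
  - jupyterlab
```

The R-side equivalent of this idea is the renv package, which records your library versions in a lockfile.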
You’ve now taken your data analysis skills to the next level. You can contribute to open-source code, and you now have the skills necessary to troubleshoot your colleagues’ problems. You can lead a team of effective data analysts. With all this power, you are looking for ways to really drive business value, so the c-suite can’t ignore your work any longer.
FIFTH TREK: Advanced Stats and Machine Learning
If you want to really start driving value for your company, you need to move past simple linear regression and calculating averages. You need to start digging into advanced statistics and (buzz-word-alert!) Machine Learning. This trek is one of the steeper ones. It’s possible to blindly try open-source machine learning models, but that’s a little bit like playing with fire. You should really understand what you are doing, or else the computer will try to subvert your motives into something maniacal. I’m not saying you have to understand all of the math going into each model, but you should get comfortable with what the math is trying to accomplish. You also want to start gleaning bigger inferences from your data, recognizing patterns that you missed with the untrained eye. You should learn even more of the nuances of statistics to make sure you are coming to conclusions responsibly. This truly is the section about great power requiring great responsibility. Learn these tools well, and you can do good.
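To make “understand what the math is trying to accomplish” concrete, here is a from-scratch logistic regression trained by gradient descent on synthetic data. This is a teaching sketch (in practice you would reach for a library), but writing it once demystifies what the fitting step actually does:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification: label is 1 when the two features sum to a positive number.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)   # weights
b = 0.0           # intercept
lr = 0.1          # learning rate

for _ in range(500):
    p = sigmoid(X @ w + b)            # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)   # gradient of the log loss w.r.t. the weights
    grad_b = np.mean(p - y)
    w -= lr * grad_w                  # step downhill
    b -= lr * grad_b

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
```

Every fancier model follows the same loop: predict, measure the error, nudge the parameters in the direction that reduces it.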
The best Machine Learning resource out there is the Machine Learning course on Coursera, taught by Stanford professor and Machine Learning celebrity Andrew Ng. You can find the homework assignments hosted on GitHub in Python and R instead of Octave (the programming language he uses in the course). Another fantastic course to follow this one is the MIT Intro To Deep Learning (6.S191) class. It’s an MIT course made publicly available each year after it concludes, and it uses Python and a package called TensorFlow. (Note: Deep Learning is a type of Machine Learning, which is a type of Artificial Intelligence. Which term you use is partially a matter of whom you want to impress.)
A great upper-level statistics course on Coursera is “Statistical Inference” (uses R) from Johns Hopkins or “Inferential Statistical Analysis with Python” from Michigan.
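One inference idea worth meeting early is the bootstrap: estimate the uncertainty of a statistic by resampling your own data with replacement. Here is a small standard-library sketch (the sample values are made up):

```python
import random
import statistics

random.seed(42)

# A small sample of, say, daily energy yields (made-up numbers).
sample = [10.2, 9.8, 11.1, 10.5, 9.9, 10.8, 10.1, 10.4, 9.7, 10.6]

# Bootstrap: resample with replacement many times and collect the means.
boot_means = sorted(
    statistics.mean(random.choices(sample, k=len(sample)))
    for _ in range(5000)
)

# A 95% confidence interval from the 2.5th and 97.5th percentiles.
ci_low = boot_means[int(0.025 * 5000)]
ci_high = boot_means[int(0.975 * 5000)]
```

The payoff is a defensible interval around your estimate instead of a bare average, which is exactly the habit those inference courses drill in.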
SIXTH TREK: Cloud Computing, Data Pipelines
At some point, locally hosting all your processes no longer makes sense. This could be due to the amount of computing power needed, the need to aggregate data from lots of sources into one location, or the need for a continuously running application and not a one-time analysis. When this is the case, you will turn to cloud computing, which means you must find a way to get the data into that cloud as well. At this point, you are likely transitioning from data analyst/scientist to data engineer. There is a lot to this, and most of it is cloud-hosting-provider specific. To avoid this route, you could use something like RStudio Cloud, which will do all of the messy stuff for you. Otherwise, you will need to brush up on a lot of computer science concepts, like partitioning, replication, and networking.
For a deeper cloud services introduction, I wrote this article. Some other helpful resources are the Google Cloud Labs (or similar material for Amazon, Microsoft, etc.) and the book “Designing Data Intensive Applications” by Martin Kleppmann.
For anybody looking to utilize the processing power of the cloud but not looking to actually host an application, you should definitely check out Google Colab Notebooks. These let you run Jupyter notebooks in the cloud instead of on your own computer, without any fancy setup. It’s also ideal for sharing code without having to deal with local environment issues.
You’ve hit the mecca of data science. Some titles you could apply for are Production Data Scientist or Machine Learning Engineer. You now have the skillset needed to work for Big Tech, but you also have subject matter expertise in your industry that sets you apart from standard data scientists, which makes your talents quite attractive. Use this to clearly bring value to your company so you can be appreciated for what you are worth.
THE GREAT BEYOND
Original. Reposted with permission.
Bio: Brock Taute is an engineer and data scientist working in the Renewable Energy Industry.