By Jesse Anderson, Managing Director, Big Data Institute.
When companies buy into the notion that they only need data scientists for big data projects, the data science team is the most affected. The data scientists bear the brunt of the absence of the two other teams.
Let’s talk about how data scientists are affected, the other missing teams, and how data scientists can start advocating for the missing teams.
Effects on Data Scientists
When the data science team is the only data team, the data scientists are expected to be jacks of all trades and masters of each one. Too many jobs mean that the data scientists spend most of their time on peripheral work and little time on the actual data science.
The root of the problem comes from the company’s and data science team’s misunderstanding of what data teams are. They’ve come to believe that they only need data scientists.
The effect on the data scientists themselves is detrimental. I’ve observed that data scientists will quit within 3-6 months of spending so much time on everything but data science. Leaving creates a lose-lose situation: the data scientist has to find a new position, and the company has to start all over again with the recruitment and onboarding process. It’s a situation that could have been entirely avoided by having all of the data teams present.
What Are Data Teams?
Being successful with big data projects is more of a team sport. It takes people with varied skills, and those skills don’t live within a single person or team. It takes all three data teams to be successful. These teams are data engineering, operations, and data science.
Each one of these teams is equally important to the success or failure of projects. The teams should have high-bandwidth connections to each other. The relationship should be symbiotic instead of adversarial.
The Right Data Engineering
Sometimes, the company has all of the data teams but in name only. The name-only teams create a perception that the data engineers aren’t really needed. Let’s discuss when this happens.
The Wrong Data Engineers
The title of data engineer can include two very different skillsets. Companies may not realize this difference and still hire the wrong data engineers.
One definition of a data engineer is a SQL-focused person. This person often comes from a GUI-based ETL program, DBA, or data warehouse background. The other definition of data engineer, and the one I’m referring to, is a software engineer who has specialized in big data. This person comes from a strong software engineering background.
As you can see, the two definitions describe very different skillsets. It’s a distinction that HR or management may not have understood. A team made up solely of SQL-focused data engineers will be of little value to data scientists and may even impede them. A team made up of data engineers with software engineering backgrounds will be of great help and can take on the complex software engineering tasks that data scientists lack the skills for.
A Poor Data Engineering Track Record
Another common issue between data engineering and data science can be the perception of a poor track record. The data engineering or IT department is seen as the place where good projects go to die. As a result, the data scientists will do everything in their power to keep their projects out of the hands of the data engineers.
Data scientists often complain about data engineers over-engineering solutions. In the data scientists’ eyes, the data engineers are putting too many processes or practices in place. The data engineers should work in concert with the data scientists to find a happy medium between too much process and not enough progress.
Poor track records can be attributed to the data engineering team being made up of proto-data engineers. These data engineers lack the experience and knowledge to make the right technical calls and create progress. There could be some lack of growth on both sides that needs to be addressed.
We tend to focus on data science’s technical aspects, such as choosing the right models or technologies. We focus less on the organizational issues that can make us underperform or outright fail. In an organization with only data scientists, it’s up to them to advocate for the organizational changes to get data engineering and operations teams.
It can take some honesty with ourselves and our team to admit we aren’t the best team for everything. Starting with this honest look inwards, we can begin to see where we need the most help. It could be realizing that we chose the wrong tool for the job or that our processing is taking far too long. We could recognize that we’re tired of being on call 24/7 for our model or that the model’s reliability is so low the business can’t use it anymore. The overall realization is that we lack the core competencies to improve or solve these issues. It isn’t a realization that we’re not smart; it’s that one person or team can’t be expected to do it all.
After our honest look, we can make a cogent argument to management for why we need the other teams. We’ll be able to give concrete examples of where the investment will pay dividends. For instance, I’ve seen that data scientists are 80% less efficient than data engineers at software engineering tasks. By merely adding data engineers, the entire data science team could become far more productive.
Data scientists’ work doesn’t end once management is convinced that they need data engineering and operations. There will be a growth period and more work. You will encounter unknown unknowns and technical debt that the new teams start to uncover. It will take effort on all sides to communicate and form symbiotic relationships. The effort is well worth it, and the investment will pay great dividends.
If you are looking to avoid being an archetype, start a new team, or fix an existing team, I invite you to read my latest book Data Teams. It covers the right ways to run data teams.
Bio: Jesse Anderson is a Data Engineer, Creative Engineer and Managing Director of Big Data Institute. He works with companies ranging from startups to Fortune 100 companies on Big Data. This includes training on cutting-edge technologies like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught over 30,000 people the skills to become data engineers.
He is widely regarded as an expert in the field and for his novel teaching practices. Jesse is published on Apress, O’Reilly, and Pragmatic Programmers. He has been covered in prestigious publications such as The Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired.