Sarah Aerni is Director of Data Science and Engineering at Salesforce Einstein, where she leads teams building artificial intelligence-powered applications across the Salesforce platform. She is passionate about agile data science and creates working environments that foster diversity and inclusion.

Sarah spends her spare time traveling and volunteering for organizations such as Women in Machine Learning. She holds a PhD in Biomedical Informatics from Stanford University, where she focused on research at the intersection of machine learning and the biology of cancer, aging, and development. We caught up with her for a discussion on data science and machine learning.

In one word, what makes a good data scientist?

Above all, good data scientists are relentless. What I mean by that is they understand how to explore data with a sense of curiosity. When you’re building a model, if you do not interrogate the model, you miss a lot of nuance. I’ll give you an example: Even if your model seems accurate, you need to question why it is so accurate.

I recently heard an example where a team built a model that predicted whether a lung in an X-ray was diseased. The model performed well on test data but then did poorly on new data. It turned out that the model was not using features of the lung at all, but instead the pen markings left by the radiologists. These annotations are only there when an expert has already detected the disease - meaning the radiologist had already answered the question. Interrogating the model further would have revealed this. The moral of the story is to question everything, even if you like the answer.
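One simple way to interrogate a model's inputs, in the spirit of the X-ray story, is to check whether any single feature predicts the label almost perfectly on its own. The sketch below is only an illustration on synthetic data: the feature names `lung_opacity` and `pen_marking`, the helper functions, and the 0.95 threshold are all assumptions made for the example, not details of the model described above.

```python
import random


def single_feature_accuracy(rows, labels, feature):
    """Accuracy of predicting the label from one binary feature alone.

    Takes the better of "predict the feature value" and "predict its inverse".
    """
    agree = sum(1 for row, y in zip(rows, labels) if row[feature] == y)
    acc = agree / len(labels)
    return max(acc, 1 - acc)


def flag_suspicious_features(rows, labels, threshold=0.95):
    """Return features that predict the label suspiciously well by themselves."""
    return sorted(
        f for f in rows[0]
        if single_feature_accuracy(rows, labels, f) >= threshold
    )


def make_synthetic_xrays(n, seed=0):
    """Hypothetical data: a noisy lung signal plus a leaked annotation."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        diseased = rng.randint(0, 1)
        # The genuine signal agrees with the diagnosis only about 70% of the time.
        lung = diseased if rng.random() < 0.7 else 1 - diseased
        # The pen marking exists exactly when an expert already found disease.
        rows.append({"lung_opacity": lung, "pen_marking": diseased})
        labels.append(diseased)
    return rows, labels


rows, labels = make_synthetic_xrays(1000)
suspicious = flag_suspicious_features(rows, labels)
print("Features to investigate:", suspicious)
```

A feature flagged this way is not necessarily a leak, but it is a prompt to ask where the feature comes from and whether it would exist at prediction time.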

How did you become a data scientist?

My journey in data science started in grad school, where I was doing a PhD in biomedical informatics (I spent many days studying worms and cancer). Early in my career, I was always involved in the early iterations of everything, and I quickly realized I wanted to have impact at a massive scale. That’s what Salesforce offers me: putting things into production for 150,000-plus customers, democratizing AI, and taking it from working in small, very precise instances to something that every Salesforce customer can deploy and leverage.

You’ve had an interest in agile data science. What is that?

There is a lot of talk right now about building out trusted AI. Businesses want a model that is perfectly accurate in perpetuity, which will never happen. Data will and should change over time, and you have to refresh your models and stay agile. Data scientists are resistant to the concept of agility for a few reasons. One is an experience I had myself: trying to lift and drop in the existing agile methodology for software development without understanding the needs of data scientists. What we’ve learned is that we need to provide a platform to support data scientists so they can iterate and experiment.

The second point of resistance simply comes from a lack of communication. Data scientists and engineers sometimes do not fully understand each other. It turns out that data scientists need a lot of common tools, like monitoring and alerting, but we need them adapted to fit our purposes. At Salesforce, we have hybrid teams to support this adaptation.
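As a rough illustration of what monitoring adapted to data science can mean, the sketch below compares a recent window of a feature against a reference window and flags when the data has shifted enough to consider refreshing the model. The function names, the sample values, and the 0.5 threshold are assumptions chosen for the example; this is not a description of Salesforce's tooling.

```python
from statistics import mean, stdev


def drift_score(reference, recent):
    """Absolute shift in the mean, in units of the reference standard deviation."""
    ref_sd = stdev(reference)
    shift = abs(mean(recent) - mean(reference))
    if ref_sd == 0:
        # A constant reference window: any change at all counts as drift.
        return 0.0 if shift == 0 else float("inf")
    return shift / ref_sd


def needs_refresh(reference, recent, threshold=0.5):
    """True when the recent data has drifted past the (assumed) alert threshold."""
    return drift_score(reference, recent) > threshold


# Hypothetical feature values seen at training time and in two later windows.
reference = [10, 12, 11, 9, 10, 11, 12, 10]
stable = [11, 10, 12, 10]
shifted = [15, 16, 14, 17]

print("stable window needs refresh:", needs_refresh(reference, stable))
print("shifted window needs refresh:", needs_refresh(reference, shifted))
```

A mean-shift check like this is deliberately crude; the point is that an alert tied to the data, rather than to server health, is what tells a team when a model is due for a refresh.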

Automated machine learning. What does that mean?

This is a super important term for the entire industry to start thinking about. Automated machine learning is really about democratization. Traditionally, if a company wanted to start deploying machine learning in production, it would likely spin up a team and start building from scratch. If it then built another model, it would likely spin up another team and build another platform. You see a pattern forming here.

At Salesforce, with 150,000 customers, we can’t follow that paradigm - nor should anyone. Ask yourself: how do we leverage all teams and what they have built, including their data transformations, data cleansing, and evaluation metrics? How can we take things that we’ve all done multiple times and automate them?

What are some of the most exciting challenges in AI today?

Democratization is exciting - putting tools in the hands of business experts. Historically, we've taken an approach of developing unicorns. A unicorn is somebody who enjoys working with data, likes the engineering challenges that come with deploying and iterating on models in production, and on top of that is expected to be an expert in a part of the business.

Consider the question: how can a data scientist understand how to transform a business better than someone who has decades of experience in running, say, healthy customer service organizations? We have to focus on creating balanced teams. We need to leverage everyone’s expertise to transform businesses. I believe the teams and tools that leverage diverse sets of skills and points of view will be the most successful.