The Environmental Impact of Advancements in Data Science

August 16, 2022

Scott Nuernberger, SVP, Measurement Practice Leader

I remember a few years ago I got it in my head that I wanted to buy a bigger house.  My wife said, “Why should we get a bigger house? Everyone in the family just hangs out in the kitchen; we don’t use the space we have.”  Followed by, “If we get a bigger house, we are just going to fill up the extra space with junk anyway.”  I looked at our basement, which was filled with ‘junk’, and realized she was right. I recently started thinking about how people often expand to fill whatever space they have.  We buy more clothes until our closets are full, we buy more frozen food to fill the larger freezer we just bought, we spend the extra money after receiving a promotion, and we fill our bigger houses with more junk.

I’ve been a data scientist for over 20 years, and throughout that time I’ve been amazed at the pace of advancement driven by increased processing power.  Consider a statistic from OpenAI: between 2012 and 2018, the amount of computation used in the largest AI training runs increased 300,000 times.  Compare that to the roughly 7-times increase Moore’s law would predict over the same period.  This dramatic expansion in computation used for AI has certainly opened the door to algorithms that weren’t possible in the past, and to improved accuracy over traditional methods.  This is a good thing.  But that expansion also invites the human tendency to ‘fill our bigger house with junk’. There are some typical examples of leveraging increased computation power when it isn’t necessary.  One is fitting a predictive model where, instead of using a single approach (XGBoost, for example), we run 10 different approaches, including computationally intensive deep learning approaches, and then pick the best one (which is often XGBoost). Another is training an algorithm on hundreds of millions of records when a random sample of a million records would give essentially the same answer.  My observation has been that data scientists do this because it saves them a few minutes of time and might produce improved performance (although often minuscule and not meaningful).  But running all the data or trying ten different approaches still finishes in a reasonable amount of time, so why not?
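To illustrate the sampling point with a quick sketch (synthetic data; the 3% response rate and 50,000-record sample size are just illustrative numbers): a modest random sample recovers essentially the same estimate as the full 10 million records.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 10-million-record dataset: a binary response at a ~3% base rate.
full = rng.random(10_000_000) < 0.03

# Random 50,000-record sample, as in the sampling approach described above.
sample = rng.choice(full, size=50_000, replace=False)

# The two estimates agree to within a fraction of a percent.
print(f"full data rate: {full.mean():.4f}")
print(f"sample rate:    {sample.mean():.4f}")
```

The same reasoning applies to model fitting: a statistic (or a model) estimated from a well-drawn random sample is usually indistinguishable from one estimated on the full data, at a tiny fraction of the compute.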

The ‘why not’ comes down to energy usage, which leads to the environmental impact of creating that energy.  The limits on energy usage that used to be in place aren’t there anymore.  Consider my history as a data scientist working in marketing analytics.  20 years ago, I would build classification algorithms using AI techniques.  The algorithm to fit that model ran on my local desktop.  If I had 10 million records, that desktop simply wouldn’t be able to fit the model, so I became an expert at sampling those 10 million down to 50,000 records, which would give essentially the same answer as running all 10 million.  A few years later technology advanced, and our team leveraged a centralized cluster of 3 servers.  In this setup the 10-million-record classification algorithm would be possible, but I still wouldn’t use all the data, because if I did it would consume all the power of the server cluster for a day and the rest of the data science team would be out of luck.  In both scenarios there was a hard limit that forced me to choose the most efficient approach.  That focus on efficiency also happened to save energy.  But that isn’t why I did it; I did it because I had to, because there were hard limits in place.  The move to the cloud for computational resources and the use of highly distributed algorithms has removed those limits.  Now I can run that 10-million-record classification algorithm in the cloud, which will spin up 10 servers to finish the job in 30 minutes.  Want it done in 10 minutes?  Just spin up 30 servers instead of 10.  And the best part is that what I do doesn’t impact the resources my teammate has access to, because she can run her own algorithms and spin up her own set of 10, 30, or 100 servers.  Data scientists today are using more and more energy to do our work partly because there are no limits; we don’t see an impact from using more data or trying ten different approaches.
Only if fitting our model doesn’t run in a reasonable amount of time do we think about sampling down the size of the data.  The core problem comes back to the statistic from OpenAI: computation for AI increased 300,000 times over 6 years, whereas Moore’s law would predict a 7-times increase over the same period.  If we assume the energy consumed per calculation improved in line with Moore’s law, that would mean energy consumption increased 300,000/7 ≈ 42,857 times over a 6-year period! Clearly data scientists are starting to consume a meaningful amount of energy, which can have a material impact on carbon emissions.  Given the impact of carbon emissions on climate change, data scientists need to start being conscious of how much energy they consume and think about ways to conserve it.
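The arithmetic behind that back-of-the-envelope number:

```python
compute_growth = 300_000  # OpenAI estimate: compute in largest AI training runs, 2012-2018
moore_growth = 7          # increase Moore's law would predict over the same ~6 years

# If energy per calculation improved only at the Moore's-law rate,
# total energy consumed grew by the ratio of the two.
energy_growth = compute_growth / moore_growth
print(f"implied energy growth: {energy_growth:,.0f}x")  # implied energy growth: 42,857x
```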

Let’s take a more specific example.  I recently had a conversation with a data scientist about this topic.  He had run an algorithm that took an hour to finish, on 78 million records, in a cloud-based environment that leveraged 15 servers to complete the task.  After discussing the objective of the algorithm, we concluded that the data could have been heavily sampled down to approximately 500,000 records, which we estimated could run on his laptop in 30 minutes.  Using average energy consumption figures for servers and laptops, in this particular example 5.45kWh of energy could have been saved.  To put that into meaningful terms, 5.45kWh is equivalent to the average energy consumption of a typical US household over 4.5 hours.  Or, put another way, 5.45kWh is equivalent to releasing 5.2 pounds of CO2 into the atmosphere.
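A sketch of the arithmetic behind this estimate, using the wattages from the footnote at the end of the post. The grid emission factor and household consumption rate here are round-number assumptions of mine, not figures from the post’s sources:

```python
# Rough average power draws (from the footnote at the end of the post).
SERVER_WATTS = 365
LAPTOP_WATTS = 50

# Assumed conversion factors (round numbers, not from the post's sources):
LB_CO2_PER_KWH = 0.95          # approximate US grid emission factor
HOUSEHOLD_KWH_PER_HOUR = 1.21  # ~10,600 kWh/year for an average US home

cloud_kwh = SERVER_WATTS * 15 * 1.0 / 1000  # 15 servers running for 1 hour
laptop_kwh = LAPTOP_WATTS * 1 * 0.5 / 1000  # one laptop running for 30 minutes
saved_kwh = cloud_kwh - laptop_kwh

print(f"saved: {saved_kwh:.2f} kWh")                                  # 5.45 kWh
print(f"= {saved_kwh * LB_CO2_PER_KWH:.1f} lb CO2")                   # ~5.2 lb
print(f"= {saved_kwh / HOUSEHOLD_KWH_PER_HOUR:.1f} household-hours")  # ~4.5 hours
```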

The previous calculation showed a single data scientist could have saved the equivalent of 5.2 pounds of CO2 emissions with a single hour-long job.  The Bureau of Labor Statistics estimates there are 105,908 data scientists in the US alone.  If every one of them ran a job like my example during each working hour of the year (roughly 2,080 hours), that would equate to about 1.2 billion kWh = 1.2TWh.  For perspective, 1.2TWh is equivalent to roughly 1.2 billion pounds of CO2, or enough electricity to power 100,000 US homes for an entire year.
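Scaled up, the national figure works out roughly like this. The one-job-per-working-hour rate and the 2,080 working hours per year are my assumptions to reproduce the 1.2TWh figure:

```python
US_DATA_SCIENTISTS = 105_908   # BLS estimate cited above
SAVED_KWH_PER_JOB = 5.45       # from the single-job example
WORK_HOURS_PER_YEAR = 8 * 260  # 2,080 hours (assumption)

total_kwh = US_DATA_SCIENTISTS * SAVED_KWH_PER_JOB * WORK_HOURS_PER_YEAR
total_twh = total_kwh / 1e9
print(f"{total_twh:.1f} TWh per year")  # 1.2 TWh per year

# At ~10,600 kWh/year per average US home (assumed), this is
# roughly the "100,000 homes" figure from the post.
print(f"~{total_kwh / 10_600:,.0f} US homes for a year")
```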

Of course, now Doc Brown from Back to the Future, pulling at his hair and shouting about 1.21 gigawatts, is stuck in my head.  In this case 1.2TWh would be equivalent to 1,000 lightning bolts’ worth of power sustained for an entire hour.  Doc Brown would surely pull all his hair out in our case.  All of these are very rough back-of-the-envelope calculations, just to make the point that data scientists today have enormous computation power at their fingertips.  With that power needs to come responsible consideration and use of energy, not only to save cost but also to reduce the negative impact on our environment.  Cloud computing makes that environmental impact invisible, but it is still there.  You can use whatever analogy sticks with you, whether that be powering 100,000 homes, thinking about 1.2 billion pounds of CO2, or poor Doc Brown pulling out his hair.  For me, I think I will stick to not buying a larger house just to fill it up with junk.




365W/server * 1Hr * 15 servers = 5.48kWh

50W/laptop * 0.5Hr * 1 laptop = 0.025kWh

5.48kWh – 0.025kWh ≈ 5.45kWh difference