Common Pitfalls When Building Smart Data Products That Leverage Machine Learning

Maximo Gurmendez
April 7, 2021

Last year, Montevideo Labs participated in Canada’s MLOps: Production and Engineering World Conference, where we shared our experience helping top tech companies build smart data products. In particular, we focused on common pitfalls that arise when companies first start productizing ML-based data products.

What is a smart data product? For starters, it’s a product. It must serve a purpose for users and must meet high quality standards. We’re assuming that you’re creating a system that makes decisions based on data by leveraging ML. That, in essence, is a smart data product for us. Let’s take a look at a few common pitfalls we’ve witnessed when building such products, especially when starting from a prototype. Let me first say that many of these may seem obvious, yet I’m sure that many readers have experienced at least one of them.

#1: Assume that today’s data assumptions will hold in the future

Oftentimes a data scientist presents a new strategy, through a notebook or a presentation, about how a particular model or data method might help the business in some way. It might forecast better, prevent users from leaving the company, pick the right price, etc. Whatever those decisions are, they typically start with a prototype, and some dataset is used to validate the methodology. When we build a product, we must cater for “variable change”: we should challenge the assumptions that hold at the moment the methodology is designed. It’s common to take those algorithms to production blindly and forget about the underlying assumptions that made the data model valid or useful in the first place.
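
As a minimal sketch of what challenging those assumptions can look like in code, the snippet below encodes a few hypothetical assumptions (the column names, thresholds, and data path are all made up for illustration) as explicit checks that run against every new batch of data:

```python
# Minimal sketch: turn the assumptions a prototype relied on into explicit
# checks that run against every new batch of data. Column names, thresholds,
# and the data path are hypothetical.
import pandas as pd

def validate_assumptions(df: pd.DataFrame) -> list:
    """Return a list of violated assumptions for a new data batch."""
    violations = []

    # Assumption 1: prices were always positive in the prototype's dataset.
    if (df["price"] <= 0).any():
        violations.append("non-positive prices found")

    # Assumption 2: the churn label is binary.
    if not set(df["churned"].unique()).issubset({0, 1}):
        violations.append("churn label is no longer binary")

    # Assumption 3: a key feature has not drifted far from the reference
    # distribution used when the methodology was designed.
    reference_mean, tolerance = 42.0, 10.0  # hypothetical values
    if abs(df["basket_size"].mean() - reference_mean) > tolerance:
        violations.append("basket_size distribution drifted")

    return violations

# Fail fast (or alert) instead of silently training or serving on data
# that breaks the assumptions the original prototype was built on.
batch = pd.read_parquet("s3://bucket/daily_batch.parquet")  # hypothetical path
problems = validate_assumptions(batch)
if problems:
    raise ValueError(f"Data assumptions violated: {problems}")
```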

#2: Let data scientists and engineers do their own thing independently

Typically, engineers focus on the stability of the system and the use of design patterns to make sure that modules are testable, cohesive, and decoupled. Engineers are interested in performance from the point of view of low latency, high throughput, no concurrency issues, and making sure the cloud or infrastructure is used in the most efficient way. Data scientists, on the other hand, focus more on iterative experimentation and successful prototypes; if the prototypes are good, that’s usually a very good outcome for them. They are interested in performance too, but mainly through the lens of model accuracy or business KPIs. Many times these objectives conflict. For example, it’s hard to iterate quickly in production for an A/B test if we need to ensure that the models being served are truly thread safe, don’t have memory leaks, don’t cause garbage collection issues, and so on. At the same time, chances are that the models that perform better in terms of accuracy require a large memory footprint and are prone to creating other performance problems (typically latency).

The best setup, in our view, is to have teams in which engineers and data scientists collaborate. If both profiles feel a sense of collaborative accomplishment and ownership, chances are you will create exceptional products, because you’ve worked together to explore all the tradeoffs that make sense for the business. It’s easy to assign blame when the culture is not right: when an issue happens, data scientists may say the model worked perfectly on the data, while engineers will say that things might have worked in an experiment, but live data is showing otherwise. Team spirit is key. And it’s possible.

#3: Metrics are outdated (or wrong)

We need metrics on our applications to ensure that our models are still behaving the right way. Oftentimes we might be under the illusion that our models are behaving the way we intended, when in fact we are looking at the wrong metric. Good metrics are tied to business KPIs and are sensitive to the most relevant ways in which the model can go wrong.

We may be tempted to use metrics we (data scientists) understand very well: ROC AUC, precision, recall, etc. However, ultimately we want our models to improve the user experience. Good live metrics can show that we’re doing the right thing and making the right final decisions. A/B test metrics are very useful because they prove that our ML is working better than a good, well-understood alternative strategy. UAT is also important, as it helps uncover the unintuitive or confusing aspects of our data products.

In the past we’ve faced a number of issues with very good models that don’t play well as part of a UI. For example, a sales prediction model that shows a drop in sales when expanding to a larger region might be accurate based on test (and possibly noisy) data, but it will be incorrect, or at least unintuitive, to the user. Adding a unit test that checks for isotonic behavior can help catch this issue and also prevent regressions.
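
A minimal sketch of such a test is shown below; the `load_model` helper, the model path, and the feature layout are hypothetical stand-ins for the real product code:

```python
# Sketch of a regression test for isotonic (monotonic) behaviour:
# expanding the region should never predict fewer sales.
def predict_sales(model, region_radius_km: float) -> float:
    """Hypothetical helper: build features for a region of a given size and predict."""
    features = {"region_radius_km": region_radius_km,
                "population": 50_000 * region_radius_km}
    return model.predict(features)

def test_sales_monotonic_in_region_size():
    model = load_model("models/sales_forecaster/v12")  # hypothetical loader from the product code
    radii = [5, 10, 20, 40, 80]
    predictions = [predict_sales(model, r) for r in radii]
    # A larger region must never be predicted to sell less than a smaller one.
    for smaller, larger in zip(predictions, predictions[1:]):
        assert larger >= smaller, f"predicted sales dropped when expanding the region: {predictions}"
```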

#4: Underestimate the importance of data dependencies

The models we build are based on data, and that data is the result of several upstream transformations and ETL jobs. Those transformations typically undergo constant modifications and improvements. We need to make sure that none of those upstream modifications break the assumptions present in our training jobs, and that as new data becomes available, new models can leverage it without breaking the system in production. For this reason, datasets or streams need to be versioned and decoupled, so that our model can read from data that carries the right set of assumptions. We can augment the data separately, version it accordingly, and later deploy new models that leverage this new data, while models using the previous data version keep running unaffected.
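
As a rough sketch of what version-pinned data dependencies can look like, the snippet below assumes a hypothetical data lake layout where each feature snapshot is published as an immutable version:

```python
# Minimal sketch of version-pinned data dependencies: each model release
# declares which dataset version it was trained and validated against, so
# upstream ETL changes land as a new version instead of silently changing
# the inputs. Paths and version numbers are hypothetical.
import pandas as pd

DATA_ROOT = "s3://analytics-lake/features"  # hypothetical location

def load_features(version: str) -> pd.DataFrame:
    """Read an immutable, versioned snapshot of the feature set."""
    return pd.read_parquet(f"{DATA_ROOT}/v{version}/features.parquet")

# The model currently in production keeps reading the version it was validated on...
prod_features = load_features("3")

# ...while a candidate model is trained and evaluated against the new version.
candidate_features = load_features("4")
```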

#5: Not incorporating data science artifacts into the CI/CD chain

Imagine the data scientist works in a notebook and, through that notebook, pushes the model to production. That’s very agile, and many times it works. However, we’ve effectively broken the continuous integration and continuous delivery chain. There’s a reason why we have integration tests; there’s a reason why we invest in all this infrastructure. It’s about quality. Our code needs to be reliable and aware of the effects of data science artifacts at a holistic level.
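
One way to bring model artifacts back into that chain is to gate them with the same kind of tests as any other code change. The sketch below is only illustrative; the loader, artifact paths, metric names, and thresholds are all hypothetical:

```python
# Sketch of model-artifact tests wired into CI, so a model exported from a
# notebook still has to pass the same gate as any other code change.
import json

def test_model_artifact_contract():
    model = load_model("artifacts/churn_model.pkl")        # hypothetical loader
    with open("tests/fixtures/sample_requests.json") as f:
        requests = json.load(f)
    for request in requests:
        score = model.predict_proba(request)["churn"]      # hypothetical interface
        # The serving layer assumes a probability; enforce that contract here.
        assert 0.0 <= score <= 1.0

def test_model_meets_minimum_offline_quality():
    with open("artifacts/eval_metrics.json") as f:          # written by the training job
        metrics = json.load(f)
    assert metrics["auc"] >= 0.75  # hypothetical quality bar agreed with the business
```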

#6: Not investing in local integration/debugging

Engineers and data scientists invest a lot of time troubleshooting when things go wrong, so any upfront investment we make in this regard will likely result in better overall productivity. Though it’s not always possible to have a complex, large system running on your laptop, if we decouple the modules the right way, achieving some sort of end-to-end local integration is typically very useful. Even if the system runs at a very reduced scale, it can help surface plumbing issues and shed light on inconsistencies.
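
As an illustration, a reduced-scale local run might look roughly like the sketch below, where the stage functions (`build_features`, `train_model`, `score`) are hypothetical placeholders for the real pipeline modules:

```python
# Sketch of a reduced-scale, end-to-end local run: the same pipeline stages as
# production, wired together over a tiny sample so plumbing and schema
# mismatches surface on a laptop.
import pandas as pd

def run_pipeline_locally(sample_path: str = "tests/fixtures/tiny_events.csv"):
    raw = pd.read_csv(sample_path)             # a few hundred rows, not terabytes
    features = build_features(raw)             # hypothetical: same code path as the production job
    model = train_model(features, epochs=1)    # hypothetical: drastically reduced training budget
    predictions = score(model, features)       # hypothetical scoring step
    # Plumbing check: schemas line up and nothing silently drops rows.
    assert len(predictions) == len(features)
    return predictions

if __name__ == "__main__":
    run_pipeline_locally()
```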

#7: Not having proper environmental/experimental hooks

If modules are well designed, then any change in behavior that is determined by the environment they run in should be parametrized. Otherwise it becomes very hard to ensure our products can run on our laptops and in staging environments as well as in production. Additionally, good software products provide hooks and configuration knobs that allow researchers to test new ideas in production, such as injecting small models or running A/B tests, with much lower risk.
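
A minimal sketch of such hooks is shown below, assuming a hypothetical configuration layout and environment variable; the point is that environment-specific behavior and experiment traffic come from configuration, not from the code itself:

```python
# Sketch of environment and experiment hooks. The config layout, environment
# variable, and model URIs are hypothetical.
import os
import zlib

CONFIG = {
    "local":      {"model_uri": "file://./models/latest", "experiment_traffic": 0.0},
    "staging":    {"model_uri": "s3://models/staging",    "experiment_traffic": 0.5},
    "production": {"model_uri": "s3://models/prod",       "experiment_traffic": 0.05},
}

def get_config() -> dict:
    """Pick settings based on where the process runs (laptop, staging, production)."""
    return CONFIG[os.environ.get("APP_ENV", "local")]

def choose_model(user_id: str, config: dict, experimental_model, default_model):
    """Hook that lets researchers inject a candidate model for a small, configurable traffic slice."""
    bucket = zlib.crc32(user_id.encode()) % 100 / 100  # deterministic bucketing in [0, 1)
    return experimental_model if bucket < config["experiment_traffic"] else default_model
```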

#8: Same users are subject to all A/B tests

This pitfall may seem obvious, yet it has happened. We need to be careful when running A/B tests: we shouldn’t just hash the users and make a given set of users always be the guinea pigs, and therefore unhappy users. Furthermore, we need to be careful not to run too many overlapping experiments and draw incorrect conclusions due to the effect of one experiment on another.
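
One common way to avoid the same-guinea-pigs problem is to salt the hash with the experiment name, so each experiment gets a different but stable slice of users. A minimal sketch (with hypothetical experiment names) might look like this:

```python
# Sketch of per-experiment bucketing: salting the hash with the experiment name
# means a different, stable slice of users lands in each test, instead of the
# same users being in the treatment group for every experiment.
import hashlib

def assignment(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # stable value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

# The same user gets independent assignments across experiments:
print(assignment("user-42", "new-ranker"))        # e.g. "treatment"
print(assignment("user-42", "price-elasticity"))  # may well be "control"
```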

#9: Launch a product without proving the value through a prototype

We sometimes get so enthused about the potential benefits of AI or ML for a problem, and our confidence is so strong, that we go ahead and start engineering the product right away without first building a prototype to validate that the method works. This might be OK at times, but normally we learn a lot from prototypes. We learn the unknowns, the things we didn’t think about before.

#10: Don’t let the prototype be the spec

Engineers often need to reverse engineer the requirements from data science models in notebooks. By design, prototypes may skip details that aren’t important for a proof of concept but may be very relevant when the product is launched in the wild. The spec should ideally reflect the main algorithms, rationale, and principles behind the methodology.

We hope the above issues trigger the right conversations when first attempting to productize data products based on ML. If you need help productizing ML, you can always contact us at mlabs-info@bled360.com!

To learn more about what we do at Montevideo Labs visit our website: www.blend360.com.
