February 23, 2021
Bo Chipman, SVP & Client Partner
This is the second in a series on the making of Kick-Ass Data Scientists (KADs). I have worked with dozens of starting analysts and followed their careers over decades. These posts are based on a single, simple observation from that experience: technical knowledge does not a KAD make. The best analysts master a set of softer skills and acquire business knowledge that enables them to deliver analytical solutions that create business impact.
The last post, How to Be a Kick-Ass Data Scientist: Steps 1-3, addressed how KADs embrace their Inner Anthropologist before touching data. Great analysts interview their clients to understand the business problem. They listen. They repeat and confirm what they have heard. They ask questions and challenge assumptions to reduce the problem to its essence. Then, when they get data, KADs use another skillset. They become Journalists.
Key steps in this process are:
1. Review the Facts
2. Write the Story
3. Verify the Facts
4. Believe your Eyes
No reputable journalist starts an article without collecting the facts. Similarly, no data scientist should begin a project without looking at the data. In this case, I literally mean LOOK AT THE DATA. These are the facts an analyst uses to do her job. Open the file in a text editor. Scroll right. Scroll down. It sounds simple, and it is. However, junior analysts sometimes skip this essential step.
I have solved many problems for junior analysts by looking at the data with them. It is amazing what you may see. Common problems easily identified include:
• Truncated records
• Invalid characters
• Inappropriate line breaks
• Numbers where characters should be (and vice versa)
• Timestamps in dates
• Dates in European formats
• Zip codes missing leading zeros
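A few lines of Python can turn this eyeball check into a repeatable habit. The sketch below is illustrative only (the sample data, column names, and five-digit zip rule are all assumptions); it flags two classic problems from the list: truncated records and zip codes that lost their leading zeros.

```python
import csv
import io

# A tiny hypothetical extract with two planted problems:
# a truncated record and a zip code that lost its leading zero.
raw = """cust_id,name,zip
1001,Alice,02134
1002,Bob,2134
1003,Carol"""

rows = list(csv.reader(io.StringIO(raw)))
header, data = rows[0], rows[1:]
width = len(header)

# Truncated records: fewer fields than the header promises.
truncated = [r for r in data if len(r) != width]

# Zip codes read as numbers drop leading zeros (02134 -> 2134).
zip_col = header.index("zip")
short_zips = [r[zip_col] for r in data
              if len(r) == width and len(r[zip_col]) < 5]

print(f"truncated records: {len(truncated)}")  # 1
print(f"suspicious zips: {short_zips}")        # ['2134']
```

None of this replaces opening the file and scrolling; it simply records what your eyes found so the check runs every time the file is re-pulled.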
Here is a story you may recognize. An analyst researching call complaints cannot identify any inbound callers in the database based on phone numbers provided by Customer Care. This is surprising. While phone coverage is not complete, most customers have at least one phone number in the database. We sit down to review the situation, beginning with the query. Everything seems fine so we look at the customer data. As soon as we look at the data, the problem is obvious. The database stores non-numeric characters in the phone number string, while the analyst’s query uses numbers only. Problem solved, albeit after some wasted time.
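A defensive habit that prevents this class of bug: normalize both sides to digits before matching. A minimal Python sketch (the phone values here are made up for illustration):

```python
import re

def digits_only(phone):
    """Strip formatting so '(312) 555-0142' and '3125550142' match."""
    return re.sub(r"\D", "", phone or "")

# The Customer Care ticket vs. what the database actually stores.
care_number = "3125550142"
db_number = "(312) 555-0142"

print(digits_only(care_number) == digits_only(db_number))  # True
```

The fix is trivial; the point is that it only occurs to you after you have looked at the raw values on both sides of the join.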
You probably have your own examples. While my example is trivial, it illustrates the value of simple observation. KADs avoid problems by using their eyes. This is the first defense against a catastrophic error.
Once the facts are assembled, a journalist writes the story. I actively encourage analysts to tell a story with the data before beginning the analysis. Based on the business problem and the data received, analysts should be able to explain how the pieces fit together. You might call this an analysis plan.
For example, grocers often want to determine which sale product to put on the endcap of each aisle. Market Basket Analysis is a simple approach to making this decision. The goal is to put sale items close to other items shoppers typically buy together. This analysis requires pulling together data on real purchases. Typical inputs include:
• Order Header: High level information about the basket, i.e., sale date, sale amount, discount amount, tax amount, total amount, etc.
• Order Items: Information about particular SKUs, i.e., quantity, cost, discounts, etc.
• Product Data: SKU, description, product category, volume, color, etc.
• Customer Data: Loyalty card number, customer segment, zip code, etc.
The analyst must connect these files to complete the analysis. After looking at the data, she should be able to tell a story about how the data fit together and what results should be expected. Examples of things that should be apparent are:
• Order Header and Order Items will join on some field (e.g., Order_ID)
• Order Items and Product Data will join on some field (e.g., SKU)
• Orders must have at least one item
• Orders may have multiple items
• The sum of item level revenue should be reflected in a subtotal field in the Order Header
• Loyalty card and credit card are the only plausible links to customer data
• Some orders will not link to customer data because some consumers pay cash without scanning a loyalty card
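The expectations above translate almost line for line into executable checks. Here is a minimal pandas sketch with toy data (all table and column names are assumptions for illustration, not an actual grocer's schema):

```python
import pandas as pd

# Toy versions of the hypothetical inputs.
header = pd.DataFrame({"order_id": [1, 2], "subtotal": [5.00, 2.50]})
items = pd.DataFrame({
    "order_id": [1, 1, 2],
    "sku": ["A1", "B2", "A1"],
    "revenue": [3.00, 2.00, 2.50],
})
products = pd.DataFrame({"sku": ["A1", "B2"],
                         "category": ["dairy", "bakery"]})

# Order Header and Order Items join on order_id;
# every order should have at least one item.
joined = header.merge(items, on="order_id", how="left", indicator=True)
assert (joined["_merge"] == "both").all()

# Order Items and Product Data join on SKU.
detail = items.merge(products, on="sku", how="left")
assert detail["category"].notna().all()

# Item-level revenue should roll up to the header subtotal.
rollup = items.groupby("order_id")["revenue"].sum()
assert (rollup == header.set_index("order_id")["subtotal"]).all()
```

Writing the story first means each of these assertions exists before the first real join runs, so a botched pull fails immediately rather than three steps downstream.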
These statements are self-evident for an experienced analyst, but they may not be obvious to a junior analyst, particularly if your data are more exotic than this example. We have found that asking analysts to explain how they plan to assemble the data and what they expect to find is very useful. Benefits include:
• Managers quickly find out if the analyst has a plan. Often, they don’t.
• Gaps in the data or understanding are identified early. Sometimes this leads to more education for the analyst. Sometimes a file needs to be pulled again.
• The analyst thinks through the basic data processing before coding, creating more clarity in the code.
The analyst thinks through basic QC before coding by explicitly articulating expected results. This activity is second nature to KADs. For KADs-in-development, it is a valuable training exercise with benefits for managers. If an analyst cannot tell a story about the data, they will probably struggle to execute the analysis.
After telling the story, the analyst should be ready to dig into data. It is time for a basic exploratory analysis. The very first part of this process should be verifying basic counts and the stories documented in the prior step. Check counts are a must. Analysts should also find published reports to validate the basic counts, derived variables and trends identified in the exploratory analysis. Clients (or end-user business partners) should be an integral part of this process. They typically know the basic business, and they should have some reaction to the exploratory data analyses. Clients aren’t always right, but their perspective should be sought and assessed.
This step can save countless hours of rework. Things go wrong in data pulls. Bad things happen.
• Records are lost in file transfer
• Samples are not random
• Joins are botched
• Filters are misapplied
Assume the worst. KADs do, and their efficiency and effectiveness are high as a result.
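One way to assume the worst in code is to assert expected counts at every step of the pipeline, so a lost record or misapplied filter fails loudly instead of silently. A Python sketch (the row counts and tolerances are illustrative assumptions):

```python
def checked(step_name, rows, expected_min, expected_max):
    """Fail loudly the moment a pull, join, or filter drops the wrong rows."""
    n = len(rows)
    assert expected_min <= n <= expected_max, (
        f"{step_name}: got {n} rows, expected {expected_min}-{expected_max}"
    )
    return rows

# Hypothetical pipeline: counts verified at every step, not just at the end.
raw = [{"id": i, "region": "east" if i % 2 else "west"}
       for i in range(1000)]
raw = checked("load", raw, 900, 1100)        # did the file transfer lose records?
east = checked("filter east",
               [r for r in raw if r["region"] == "east"],
               400, 600)                      # was the filter applied correctly?
```

The expected ranges come from the story told in the prior step; when an assertion fires, either the data or the story is wrong, and both outcomes are worth knowing early.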
Intuition and common sense play a role in the process. If something seems wrong, it probably is. Believe your eyes. Another anecdote illustrates what happens when analysts do not believe the obvious.
I recently reviewed a forecasting model that was performing below standard. As we dug into performance, we identified several periods of poor model performance, which led to an examination of the historical time series. A cursory review of the data revealed three periods that were off trend and patently unreasonable. When asked about this, the analyst agreed the data seemed surprising, but they did not want to exclude any information for fear of biasing the models. Unfortunately, that fear led them to incorporate data that undermined model performance. Further examination and a discussion with the database team uncovered errors in historical data loads that caused the outliers.
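A simple robust screen would have flagged those periods for review. The sketch below uses a median/MAD rule (my choice of technique, not the team's actual method; the series and threshold are illustrative) to surface off-trend points for investigation rather than silently deleting them:

```python
import statistics

# Hypothetical monthly volumes with three implausible spikes from bad loads.
series = [100, 103, 98, 101, 400, 99, 102, 97, 380, 100, 104, 420]

# Median and MAD are robust to the very outliers we are trying to find;
# a mean/stdev rule would be dragged toward the bad points.
med = statistics.median(series)
mad = statistics.median(abs(v - med) for v in series)

# Flag periods far off the median for a conversation with the database team.
flagged = [(i, v) for i, v in enumerate(series) if abs(v - med) > 5 * mad]
print(flagged)  # [(4, 400), (8, 380), (11, 420)]
```

Flagging is not excluding: the output is a list of questions to ask, which is exactly how the bad historical loads were eventually found.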
KADs actively question data that does not meet the eyeball test. They bring a healthy skepticism to all data, which serves them well. Junior analysts need to develop this capability. Actively looking for problems is a good place to start.
Knowledge of algorithms is only one of many skills successful data scientists must master. In most real-world applications, this knowledge is arguably one of the least important factors in project success. In the last blog, we discussed why understanding the business problem and its operational context is generally far more important. In this blog, we argue that fluency with the data is equally critical for success.
The best way to develop that fluency is by telling a story with the data. Key steps in this process are:
1. Review the Facts
2. Write the Story
3. Verify the Facts
4. Believe Your Eyes
Analysts who master these skills are on the road to KAD (not cad) status.