Before You Delete That Data Point…
Is it just bad data or something spectacular? A practical guide to spotting outliers and knowing when they deserve a second look
In statistics, an outlier is a data point that differs significantly from other observations. If you’ve been working with data for any length of time, it’s likely (statistically 😛) that you’ve run into your fair share of outliers.
The problem is, the way we identify, handle, and explain outliers depends a lot on the context and situation at hand. Sometimes, outliers are no more than erroneous datapoints caused by bad measurement. Other times, outliers are true data points that can tip us off to something important like a broken system, a new trend, or an opportunity you haven’t seen before.
Today, I’ll give you a framework for identifying outliers, deciding when to exclude them from your analysis, and understanding what these rogue data points are telling us.
🏊‍♀️ Example: True Outlier or Measurement Error?
Let’s say you’re analyzing a dataset of NCAA women’s 500-yard freestyle swim times.
Here’s what you find:
Average time: 4:45.00
1st Quartile (fastest 25%): 4:40.00
3rd Quartile (slowest 25%): 4:50.00
Interquartile Range (IQR): 0:10.00 (the difference between the 3rd and 1st quartiles, Q3 − Q1)
Outlier fences: below 4:25.00 (Q1 − 1.5×IQR) or above 5:05.00 (Q3 + 1.5×IQR); most swimmers fall well inside these bounds
Then you see Katie Ledecky’s time: 4:24.06!
It’s clearly an outlier. But here’s the thing: it’s a real outlier. A performance like that reflects the emergence of a swimming prodigy, not a mistake in the data.
It tells a story. It points to a new paradigm for how fast NCAA women’s swimming can be. If you removed that datapoint just because it was “too far from the average,” you’d erase the very insight that makes the dataset interesting.
Now let’s say another time pops up: 2:36.04, logged by a swimmer who historically finishes around 4:50.
Unless this swimmer suddenly discovered a wormhole in Lane 3, this is almost certainly a measurement error: a mis-timed race, a stopwatch fumble, or a fat-fingered input by whoever reported the times.
Both times are extreme, but only one is trustworthy. And that’s why it’s so important to use both statistical methods and domain knowledge/situational context when assessing outliers. Numbers can flag anomalies, but the right context can tell you what to do with them.
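To make this concrete, here’s a minimal sketch of the 1.5×IQR check applied to the swim example. The quartiles come straight from the summary stats above, converted to seconds; the helper name `is_outlier` is mine, not from any library.

```python
# Quartiles from the summary stats above, converted to seconds:
# Q1 = 4:40.00 -> 280.0 s, Q3 = 4:50.00 -> 290.0 s
q1, q3 = 280.0, 290.0
iqr = q3 - q1              # 10.0 s
low = q1 - 1.5 * iqr       # 265.0 s (4:25.00)
high = q3 + 1.5 * iqr      # 305.0 s (5:05.00)

def is_outlier(seconds):
    """True if a time falls outside the 1.5 * IQR fences."""
    return seconds < low or seconds > high

ledecky = 4 * 60 + 24.06   # 264.06 s -- real, remarkable
bad_entry = 2 * 60 + 36.04 # 156.04 s -- almost certainly a typo
typical = 4 * 60 + 45.00   # 285.00 s -- right in the pack
```

Both extreme times get flagged by the exact same rule; the statistics can’t tell Ledecky from a stopwatch fumble. That’s the part only context can do.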
Why Outliers Matter
Identifying and handling outliers appropriately matters because they can have a huge impact on your analysis:
They skew your averages.
They can exaggerate trends or hide real ones.
They might cause you to make the wrong decision.
But… they can also be the key to uncovering valuable insights:
A signal that something new or important is happening.
The first sign of a new user segment you haven’t seen before.
The warning that you’ve found the next Katie Ledecky.
The trick is to figure out whether an outlier is noise or a clue.
How to Identify Outliers
Here’s how I like to tackle it:
🔎 Visuals First
Data visualizations can be really useful for quickly identifying outliers. In a scatterplot or box plot, outliers tend to stick out like a sore thumb.
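For example, a box plot draws suspect points as individual “fliers” beyond the whiskers. The sketch below uses made-up swim times in seconds (including the bogus 2:36.04 entry from earlier); the filename is just an illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Made-up times in seconds; 156.04 is the implausible 2:36.04 entry
times = [275, 278, 280, 283, 285, 287, 290, 292, 295, 156.04]

fig, ax = plt.subplots()
box = ax.boxplot(times)  # whiskers default to 1.5 * IQR; beyond = fliers
ax.set_ylabel("500-yd freestyle time (s)")
fig.savefig("swim_times_boxplot.png")

# The points matplotlib drew as fliers are your outlier candidates
fliers = box["fliers"][0].get_ydata()
```

One glance at the saved plot shows the bad entry sitting far below the whisker, no math required.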
🔎 Simple Stats
There are a whole bunch of statistical methods you can use to identify outliers including standard deviation, interquartile range (IQR), or Z-scores. One of the most common and simple is done using interquartile range.
To identify outliers using the Interquartile Range (IQR), first calculate the IQR by subtracting the 1st quartile (Q1) from the 3rd quartile (Q3). Any data point below Q1 − 1.5×IQR or above Q3 + 1.5×IQR is typically considered an outlier. This method helps flag values that are unusually far from the middle 50% of your data.
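That procedure is a few lines of code with the standard library alone. Here’s a small sketch (the function name and test data are mine):

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

flagged = iqr_outliers([10, 12, 11, 13, 12, 11, 14, 12, 13, 50])
```

Note that `k=1.5` is a convention, not a law; widening it to 3.0 flags only the most extreme points.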
🔎 Context is King
Statistics will help you identify anomalies, but only your domain knowledge will tell you if it matters.
Example: You're analyzing pizza delivery times to see if Domino's is hitting its 30-minutes-or-less promise. You spot a few deliveries logged at 3 seconds. Unless Domino’s R&D has secretly cracked the code on teleportation, you’ve probably got some bad data on your hands.
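A domain rule like that often translates into a plausibility floor rather than a statistical test. A minimal sketch, with made-up delivery times in seconds and a threshold that is a judgment call, not a statistic:

```python
# Domain knowledge: no real pizza delivery takes under ~2 minutes
MIN_PLAUSIBLE_SECONDS = 120

deliveries = [1480, 1720, 3, 1590, 2, 1655]  # made-up times in seconds

suspect = [t for t in deliveries if t < MIN_PLAUSIBLE_SECONDS]
clean = [t for t in deliveries if t >= MIN_PLAUSIBLE_SECONDS]
```

The 3-second “deliveries” would sail past some purely statistical checks on a noisy enough dataset; the domain rule catches them instantly.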
When (and How) to Remove Outliers
🛑 When to remove them:
Data entry errors (typos, corrupted logs).
Measurement errors (faulty sensors, broken scripts).
Context mismatch (some Men’s NCAA swim times incorrectly ended up in your dataset).
✅ When to keep them:
If they’re real and relevant—like a sudden spike in traffic from a viral campaign.
If they hint at a new customer segment or behavior.
📝 How to remove (or handle) them:
Flag them, but don’t delete right away—document what you’re doing!
Use alternative methods (like median instead of mean) if you can’t remove them but still want a fair picture.
Always sanity-check: does removing them change the story in a way that makes sense?
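The “median instead of mean” and “flag, don’t delete” advice can be sketched in a few lines, again with made-up delivery times in seconds:

```python
import statistics

times = [1480, 1720, 1590, 1655, 3]  # one bogus 3-second "delivery"

mean_all = statistics.mean(times)    # dragged way down by the bad point
median_all = statistics.median(times)  # barely moves

# Flag rather than delete: keep a record of what was excluded and why
flagged = [(t, "implausibly fast") for t in times if t < 120]
kept = [t for t in times if t >= 120]
mean_kept = statistics.mean(kept)
```

Comparing `mean_all` against `median_all` and `mean_kept` is exactly the sanity check above: does removing the point change the story in a way that makes sense?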
Pro tip: For any given analysis, pick a detection method and stick with it. Don’t cherry-pick rules just to get cleaner charts. Inconsistent handling invites bias and makes your process hard to reproduce.
⚠️ Common Pitfalls
Making the chart look “prettier” by deleting outliers. Clean visuals are great—but not if they hide the truth.
Assuming outliers are always errors. Sometimes they’re the most interesting part of the story.
Cherry-picking data by excluding data points that don’t fit your narrative.
Not explaining or documenting why you removed them (or didn’t). Consistency and repeatability are important, so be transparent about how you handle outliers.
(That 2:36.04 swim time didn’t delete itself…)
Recommended Resource
This is NOT a stats textbook, but it is one of my favorite reads: Outliers: The Story of Success by Malcolm Gladwell. He explains what makes certain people achieve extraordinary success, from elite athletes to tech founders, arguing that outliers aren’t just born; they’re shaped by a mix of talent, timing, and opportunity.
💭 Closing Thought
It’s tempting to toss out outliers and move on. After all, they’re messy, confusing, and not pretty when visualized. But as we’ve seen, those strange little data points might just be trying to tell you something important.
Hopefully, you’re walking away with a few good questions to ask the next time something looks off, and a better sense of when an outlier is a problem… or a hidden insight waiting to be explored.
Outliers are a great excuse to slow down, get curious, and make sure the story your data is telling is actually the one you want to tell.
Thanks for reading and see you next week!