Data Visualization
I avoided taking visualization seriously for longer than I'd like to admit. For years, I'd plt.plot() something, squint at it, and move on. If the line went down, the model was learning. If it went up, something was broken. That was the extent of my visual vocabulary. Then I joined a team where the senior ML engineer would look at my figures and say, "This is lying to you," and I wouldn't even understand how. The discomfort of not knowing what makes a plot honest — or dishonest — finally grew too great for me. Here is that dive.
Data visualization is the practice of encoding data into visual form — positions, lengths, colors, shapes — so that human perception can extract patterns that raw numbers hide. In the Python ecosystem, this centers on matplotlib (created by John Hunter in 2003, modeled after MATLAB's plotting), with seaborn, Plotly, and Altair layered on top. The field itself draws on decades of perceptual psychology and statistical graphics theory, most notably Edward Tufte's work from the 1980s.
Before we start, a heads-up. We're going to get into matplotlib's internal architecture, color perception, the psychology of misleading charts, and production deployment patterns. You don't need to know any of it beforehand. We'll add the concepts we need one at a time, with explanation.
This isn't a short journey, but I hope you'll be glad you came.
Why we visualize (it's not about pretty pictures)
The anatomy of a plot — Figure, Axes, and Artists
The state machine trap
Our running example — predicting house prices
Choosing the right chart (the actual skill)
Seaborn — statistical visualization without the suffering
Rest Stop
The color problem nobody talks about
Interactive visualization — Plotly and when you actually need it
Production patterns — viz that doesn't break at 3am
Visualization as communication (and deception)
Wrap-up
Resources
Why We Visualize
You visualize data for exactly two reasons, and confusing them is the source of most bad charts. The first reason is to understand the data yourself — exploration. The second is to communicate what you found to someone else — presentation. These are fundamentally different activities, and the tools and standards for each are different.
Think of it like the difference between sketching on a napkin and painting a mural. The napkin sketch is for you. It can be messy, unlabeled, three overlapping plots crammed together. You're looking for surprises — is this distribution bimodal? Are these two features correlated? Is there a cluster here that shouldn't be? Speed matters. Polish doesn't.
The mural is for everyone else. Now every axis needs a label. Every color needs a legend. The title needs to tell a story, not describe a variable. And critically, the chart needs to be honest — it shouldn't exaggerate, hide, or mislead. We'll come back to what "honest" means in the visualization context later, because it's more subtle than you'd expect.
I'll be honest — I spent the first two years of my career treating every plot as a napkin sketch, even the ones going into presentations. The feedback was always the same: "What am I looking at?" That question haunts me to this day. It's the sign that the visualization failed at its job.
The Anatomy of a Plot — Figure, Axes, and Artists
Before we make anything, we need to understand what matplotlib actually builds when you create a plot. This matters because every confusing matplotlib error you'll ever encounter traces back to not understanding these three layers.
Imagine a physical artist's studio. The Figure is the canvas — the blank rectangular surface that everything lives on. You can have one canvas or many, but each one is a self-contained world. The Axes is a picture frame nailed to that canvas — it defines a rectangular region where data gets drawn, complete with its own coordinate system, tick marks, and labels. A single Figure can hold multiple Axes, the same way a single canvas can hold multiple framed pictures.
And everything you actually see — every line, every dot, every piece of text, every tick mark — is an Artist. That's matplotlib's term for any visual element. A Line2D object is an Artist. A Text object is an Artist. Even the Axes itself is an Artist, because it has a visible border. The Figure is an Artist too. It's Artists all the way down.
Here's what that hierarchy looks like in practice:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8, 5)) # the canvas
ax = fig.add_subplot(111) # a picture frame on the canvas
line, = ax.plot([1, 2, 3], [1, 4, 9]) # an Artist (Line2D) inside the frame
print(type(fig)) # matplotlib.figure.Figure
print(type(ax)) # a subclass of matplotlib.axes.Axes (exact name varies by version)
print(type(line)) # matplotlib.lines.Line2D
When matplotlib renders this to your screen or a file, it walks this tree top-down. The Figure tells each Axes to draw itself. Each Axes tells each of its Artists to draw themselves. The actual pixel-pushing happens through a Renderer (the part that knows how to turn abstract shapes into pixels) and a Canvas (the surface those pixels land on — your screen, a PNG file, a PDF). Different backends supply different Canvas-Renderer pairs, which is why the same code can produce a window on your desktop, a PNG on a server, or a vector PDF for a paper.
I'm belaboring this because the Figure-Axes-Artist distinction is the skeleton key to matplotlib. Once you see it, things that seemed arbitrary start making sense. Why is there a fig.suptitle() AND an ax.set_title()? Because they operate on different levels of the hierarchy — one labels the canvas, the other labels a frame. Why does fig.savefig() exist on the Figure? Because saving is a canvas-level operation.
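You can verify the hierarchy directly by asking each level for its children — a quick sketch:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so this runs anywhere
import matplotlib.pyplot as plt
from matplotlib.artist import Artist

fig, ax = plt.subplots()
line, = ax.plot([1, 2, 3], [1, 4, 9])

# The Figure's children include the Axes (plus the Figure's own background patch).
print(any(child is ax for child in fig.get_children()))

# The Axes' children include the Line2D we just drew, plus spines, ticks, text.
print(any(child is line for child in ax.get_children()))

# And every one of those children is an Artist — it really is Artists all the way down.
print(all(isinstance(child, Artist) for child in ax.get_children()))

plt.close(fig)
```

All three print `True`: the tree that the renderer walks is the same tree you can walk yourself.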
We'll return to this canvas-and-frame analogy throughout, because it keeps paying off.
The State Machine Trap
Matplotlib offers two interfaces, and one of them will betray you.
The first interface is the one most tutorials teach — the pyplot state machine. You call plt.plot(), plt.title(), plt.xlabel(), and matplotlib maintains a hidden "current figure" and "current axes" behind the scenes. It works like a drawing program where there's always one active layer, and every command operates on that layer.
plt.plot([1, 2, 3], [1, 4, 9])
plt.title("My Plot")
plt.xlabel("x")
plt.show()
This is fine for a single, one-off plot. The moment you need two plots, or a function that creates a plot, or a loop that generates figures, the hidden state becomes a trap. Which figure is "current"? Which axes? You'll get plots drawn on the wrong figure, titles applied to the wrong subplot, and no error message to explain what happened.
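Here's a minimal sketch of the trap in action — two figures, and the title silently lands on whichever one pyplot considers "current":

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs anywhere
import matplotlib.pyplot as plt

plt.figure()                              # figure 1 becomes "current"
plt.plot([1, 2, 3])
plt.figure()                              # figure 2 silently becomes "current"
plt.plot([3, 2, 1])
plt.title("Which plot is this titling?")  # answer: figure 2, not figure 1

fig1, fig2 = (plt.figure(num) for num in plt.get_fignums())
print(fig1.axes[0].get_title())  # empty — figure 1 got nothing
print(fig2.axes[0].get_title())  # the title went to figure 2
plt.close("all")
```

No error, no warning — just a title on the wrong figure. That silence is the trap.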
The second interface is the object-oriented one. You create a Figure and Axes explicitly, and every operation is called on the specific object you mean.
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])
ax.set_title("My Plot")
ax.set_xlabel("x")
fig.savefig("my_plot.png")
plt.close(fig)
Same result, but now there's no ambiguity. The ax object IS the specific picture frame you're drawing on. You can pass it to a function, return it, store it in a list. It's an object with a clear identity, not a side-channel to some hidden global state.
Going back to our studio analogy: the pyplot style is like yelling commands into a room and hoping the right assistant hears you. The OO style is like handing instructions directly to the person doing the work. One scales. The other doesn't.
For the rest of this section, we use the OO interface exclusively. If you take nothing else from this section, take this: fig, ax = plt.subplots() is the one line that separates code that works from code that works until it doesn't.
Our Running Example — Predicting House Prices
We need a concrete scenario to ground everything that follows. Imagine we're building a model to predict house prices from a dataset of 500 homes. The dataset has five features: square footage, number of bedrooms, year built, distance to city center, and the target — sale price. We've loaded this into a pandas DataFrame called df.
Our first instinct is to understand what we're working with. Not by staring at rows of numbers — by visualizing.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(df["sqft"], df["price"], alpha=0.5, s=15)
ax.set_xlabel("Square Footage")
ax.set_ylabel("Sale Price ($)")
ax.set_title("House Prices vs. Size")
plt.close(fig)
A scatter plot. Square footage on the x-axis, price on the y-axis. Each dot is a house. We can immediately see something no summary statistic would tell us — there's a cluster of expensive small homes near the city center, and a few suspiciously cheap large homes that might be data errors. In two seconds of looking, we've learned more than five minutes of df.describe() would reveal.
That's the power of visualization. Not the chart itself — the speed of perception. Your visual cortex can process spatial patterns in milliseconds that your verbal brain takes minutes to work through. Visualization is outsourcing computation to the fastest processor you have: your eyes.
We'll keep coming back to this house price dataset as we explore different chart types and tools.
Choosing the Right Chart
Here's the actual skill of data visualization, and it has nothing to do with code. It's about asking the right question first, and then picking the chart that answers it.
When we stare at our house price data, we're not thinking "I want a histogram." We're thinking "How is price distributed? Is it normal? Is it skewed?" The chart type follows from the question. It sounds obvious, but I watch people reach for scatter plots by reflex, the way a carpenter reaches for a hammer. Not everything is a nail.
Back to our houses. We want to understand price distribution. A histogram chops the price range into bins and counts how many houses fall into each. It answers "what's common and what's rare?"
fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(df["price"], bins=30, edgecolor="white", alpha=0.8)
ax.set_xlabel("Sale Price ($)")
ax.set_ylabel("Count")
ax.set_title("Distribution of House Prices")
plt.close(fig)
Now we see it's right-skewed — most houses cluster in the $200k–$400k range, with a long tail of expensive outliers. That skew matters. It tells us we might want to log-transform the target before feeding it to a linear model.
What about the relationship between every pair of features? We could make scatter plots one at a time, but with five features that's ten combinations. Instead, the question becomes "are any features strongly correlated?" and the right answer is a correlation heatmap.
import seaborn as sns
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm",
            center=0, ax=ax)
ax.set_title("Feature Correlations")
plt.close(fig)
The heatmap gives us a matrix of pairwise correlations. Warm colors for positive correlations, cool for negative, white for zero. At a glance, we see that square footage and price are strongly correlated (0.85), but year built and price are weakly correlated (0.12). If two features are highly correlated with each other but not the target, one of them might be redundant.
What if we want to see how price differs between, say, houses with 2, 3, and 4 bedrooms? Now we need a box plot — it shows the median, quartiles, and outliers for a numeric variable split by a categorical one.
fig, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(data=df, x="bedrooms", y="price", ax=ax)
ax.set_title("Price Distribution by Bedroom Count")
plt.close(fig)
The box plot reveals something the average would hide: 3-bedroom homes have the widest price spread, suggesting that bedroom count alone is a weak predictor once you're in the 3-bedroom category. The outlier dots above the whiskers? Those are the houses we might want to investigate individually.
Here's the mental framework. The question determines the chart:
| Your Question | Chart Type | Why This One |
|---|---|---|
| How is a single variable distributed? | Histogram / KDE | Shows shape, skew, modes, outliers |
| How do two numeric variables relate? | Scatter plot | Reveals correlation, clusters, outliers |
| How does a numeric variable differ across groups? | Box plot / Violin | Shows median, spread, and outliers per group |
| Which features are correlated? | Heatmap | Matrix view of all pairwise correlations |
| How does something change over time or epochs? | Line plot | Emphasizes trends and changes in order |
| How do categories compare in magnitude? | Bar chart | Position on a common axis; humans compare lengths well |
| What do all pairwise relationships look like? | Pair plot | Grid of scatter plots and histograms for every combination |
That's seven chart types. They cover roughly 95% of what you'll need in an ML workflow. The remaining 5% — radar charts, treemaps, Sankey diagrams — are niche enough that you'll know when you need them, and they're not worth memorizing ahead of time.
The limitation of this framework is that it only helps you pick a chart type. It says nothing about whether the chart is honest. A histogram with 5 bins tells a completely different story than one with 50 bins, and both are "correct." We'll confront that later.
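That bin-count sensitivity is easy to demonstrate — same data, two stories. A sketch with synthetic bimodal prices:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Bimodal data: two price clusters that a coarse histogram can blur together.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(250_000, 30_000, 300),
                       rng.normal(450_000, 30_000, 200)])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
counts5, _, _ = axes[0].hist(data, bins=5)    # coarse: modes smeared
counts50, _, _ = axes[1].hist(data, bins=50)  # fine: bimodality unmistakable
axes[0].set_title("5 bins")
axes[1].set_title("50 bins")
plt.close(fig)
```

Both histograms are "correct" — same data, same total count — which is exactly why the choice deserves a deliberate look rather than the default.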
Seaborn — Statistical Visualization Without the Suffering
By now you might be wondering: if matplotlib can do everything, why does seaborn exist? Because matplotlib is like a workshop full of power tools — you can build anything, but making a simple bookshelf takes fifty operations. Seaborn is the bookshelf kit. It knows what you probably want, and it gives it to you in one function call.
Seaborn is built directly on matplotlib. Every seaborn plot is a matplotlib Figure with matplotlib Axes and matplotlib Artists. Seaborn doesn't replace the workshop — it provides pre-assembled components that you can still customize with the raw tools when needed.
What seaborn adds is three things. First, it understands pandas DataFrames natively — you pass column names as strings instead of extracting arrays. Second, it computes statistics on the fly — confidence intervals on bar plots, kernel density estimates, regression lines. Third, its defaults look good. Not "let me spend an hour tweaking" good, but "presentable in a meeting without embarrassment" good.
Let's revisit our house price data with seaborn. The single most useful exploration tool is the pair plot:
import seaborn as sns
sns.pairplot(df, hue="bedrooms", height=2.5)
plt.close("all")
One line. That produces a grid of scatter plots for every pair of features, with histograms on the diagonal, colored by bedroom count. It's the fastest way to get a holistic view of a dataset. I use it on the first day of every project, without exception.
The hue parameter is the key insight — it maps a categorical variable to color, turning a regular scatter plot into one that reveals class structure. In our case, if 4-bedroom houses cluster differently from 2-bedroom houses in the sqft-price space, the pair plot makes that visible immediately.
Seaborn also has a concept borrowed from the grammar of graphics: the FacetGrid. Instead of cramming everything into one plot with color coding, FacetGrid creates a grid of separate panels — one per category. Each panel shows the same type of plot but for a different subset of the data.
g = sns.FacetGrid(df, col="bedrooms", col_wrap=3, height=4)
g.map_dataframe(sns.scatterplot, x="sqft", y="price")
g.set_axis_labels("Square Footage", "Sale Price ($)")
plt.close("all")
This gives us one scatter plot per bedroom count, side by side. The patterns within each category are visible without the visual clutter of overlapping colors. I still have to look up the FacetGrid API roughly once a month — the syntax for map_dataframe versus map trips me up every time. That's fine. The API is consistent enough that the documentation gets you unstuck in thirty seconds.
Under the hood, seaborn's axes-level functions return the matplotlib Axes they drew on, and its figure-level functions (like pairplot) return a grid object that contains Axes. So when seaborn doesn't give you exactly what you want, you can always reach through to the matplotlib layer:
fig, ax = plt.subplots(figsize=(8, 5))
sns.histplot(df, x="price", kde=True, ax=ax)
ax.axvline(df["price"].median(), color="red", linestyle="--",
           label=f"Median: ${df['price'].median():,.0f}")
ax.legend()
plt.close(fig)
Seaborn draws the histogram and KDE curve. Then we reach into the matplotlib Axes to add a vertical median line — something seaborn doesn't do on its own. The two libraries compose naturally because they share the same underlying objects. The canvas-and-frame model holds: seaborn paints a picture, and we're adding a brushstroke to the same frame.
The limitation of seaborn is performance. pairplot on a DataFrame with 100,000 rows and 20 features will take minutes and produce a nearly unreadable grid. For large datasets, you need to sample first, or switch to aggregated views. Seaborn optimizes for clarity on moderate data, not for scalability.
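The sampling workaround is one line, and a fixed random_state keeps the resulting plot reproducible. A sketch with a large synthetic frame:

```python
import numpy as np
import pandas as pd

# A 100k-row frame standing in for real data.
rng = np.random.default_rng(3)
big = pd.DataFrame(rng.normal(size=(100_000, 5)),
                   columns=["a", "b", "c", "d", "e"])

# Sample before plotting; seeded so re-runs produce the identical figure.
sample = big.sample(n=5_000, random_state=0)
print(sample.shape)  # (5000, 5)
# sns.pairplot(sample) would now finish in seconds instead of minutes.
```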
Rest Stop
If you've made it this far, congratulations. You now have a working mental model of data visualization in Python: matplotlib provides the Figure-Axes-Artist architecture, the OO interface gives you explicit control, seaborn adds statistical smarts and sane defaults, and the chart type follows from the question you're asking. That's enough to do effective EDA on any dataset and produce plots that won't embarrass you in a meeting.
It doesn't tell the complete story, though. We haven't talked about color — and color is where most plots silently lie to you. We haven't talked about interactive visualization, which becomes essential the moment your audience isn't a Jupyter notebook. And we haven't talked about production — what happens when your plot isn't a one-off exploration but a recurring artifact that needs to work at 3am without human intervention.
The short version is: use viridis for colormaps, Plotly when someone else needs to explore your data, and plt.close(fig) in every loop. There. You're 80% of the way there.
But if the discomfort of not knowing what's underneath is nagging at you, read on.
The Color Problem Nobody Talks About
I'll be honest — the first time I understood perceptually uniform colormaps, I realized that every heatmap I'd made up to that point was subtly lying to me. Not because the data was wrong, but because the colors were.
The problem is this. When you map numbers to colors, you need equal steps in the data to look like equal steps in color. If going from 0.1 to 0.2 looks like a tiny change but going from 0.7 to 0.8 looks like a dramatic shift, your eyes will perceive patterns in the data that aren't there.
The old default colormap in matplotlib was called jet — that rainbow spectrum from blue through green and yellow to red. Jet is perceptually non-uniform. There are bands where the perceived color change accelerates (around yellow-green) and others where it stalls (around cyan). It's like using a ruler with unevenly spaced markings. You can still measure things, but your measurements will be systematically biased toward noticing changes in some ranges and missing them in others.
Viridis — a colormap specifically designed so that equal numerical steps produce equal perceptual steps in color — was introduced in 2015, and matplotlib made it the default in version 2.0. The change wasn't cosmetic. It was a correction for a systematic source of visual error.
Viridis also solves a second problem: colorblind accessibility. About 8% of men and 0.5% of women have some form of color vision deficiency, most commonly red-green. Jet fails these users entirely — large portions of the rainbow merge into indistinguishable muddy bands. Viridis, along with its siblings plasma, inferno, magma, and cividis, was designed to remain distinguishable under every common form of color vision deficiency.
The practical rule: for any sequential data (values going from low to high), use viridis or its siblings. For diverging data (values going in both directions from a center, like correlation coefficients from -1 to +1), use coolwarm or RdBu centered at zero. For categorical data, use distinct hues — seaborn's default palette handles this well.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Sequential: viridis (perceptually uniform)
sns.heatmap(df.corr().abs(), cmap="viridis", annot=True,
            fmt=".2f", ax=axes[0])
axes[0].set_title("Absolute Correlations (viridis)")
# Diverging: coolwarm centered at zero
sns.heatmap(df.corr(), cmap="coolwarm", center=0, annot=True,
            fmt=".2f", ax=axes[1])
axes[1].set_title("Correlations (coolwarm, centered)")
fig.tight_layout()
plt.close(fig)
Back to our ruler analogy: viridis is a ruler with evenly spaced markings. Jet is a ruler made by someone who was in a hurry and eyeballed the spacing. Both "work," but one gives you trustworthy measurements.
Interactive Visualization — Plotly and When You Actually Need It
Everything we've made so far is static. An image. A PNG. It cannot be interrogated. If someone looks at our scatter plot and wants to know "which specific house is that outlier at $1.2M?", they can't click on it to find out. For your own EDA in a notebook, this is fine — you're the one who made the plot, and you can re-run it with a filter. But the moment your audience is a product manager, a client, or a cross-functional team, static plots hit a wall.
Plotly renders charts as interactive HTML. Hover over a point to see its values. Zoom into a region. Click legend items to toggle series on and off. Pan, select, export. The chart lives in the browser, which means it can be embedded in notebooks, dashboards, or standalone HTML files that you email to someone who doesn't have Python installed.
import plotly.express as px
fig = px.scatter(df, x="sqft", y="price", color="bedrooms",
                 hover_data=["year_built", "distance_center"],
                 title="House Prices: Interactive Explorer")
fig.write_html("house_prices_interactive.html")
That hover_data parameter is the key. It attaches extra columns to each point's tooltip. Now the product manager can hover over the outlier and see "Oh, it's a 1920s home 0.2 miles from downtown — that's a historical property, not a data error." That single interaction replaces a back-and-forth email thread.
Plotly Express (plotly.express) is the high-level API — it mirrors seaborn's philosophy of one-function-one-plot. For full control, there's plotly.graph_objects, which is Plotly's equivalent of matplotlib's OO interface. And for building full web applications with dropdowns, sliders, and dynamic data loading, there's Dash — a framework built on Plotly that turns Python into a dashboard-building language.
When should you use Plotly instead of matplotlib? When the audience is not you. When the person looking at the chart needs to ask their own questions about the data, rather than just seeing the answers you pre-selected. When the chart is going into a web page, a Slack thread, or a client report. For your own EDA and for publication figures, matplotlib and seaborn remain the right tools.
A quick note on Altair, another option worth knowing about. Altair takes a declarative approach — you describe what you want to see ("map sqft to x, price to y, color by bedrooms") and it figures out the rendering. It's built on the Vega-Lite specification, which means charts are defined as JSON and can be rendered anywhere that supports Vega. The mental model is different from matplotlib's imperative "draw this line, then this label" approach. If the grammar-of-graphics style clicks for you, Altair is worth exploring. I'm still developing my intuition for when to reach for it over Plotly.
Production Patterns — Viz That Doesn't Break at 3am
Visualization in a notebook is one thing. Visualization in a production pipeline — generating figures automatically, logging them to experiment trackers, running on a server with no display — is a different discipline entirely.
The first production lesson everyone learns the hard way: memory leaks. Each matplotlib Figure object allocates memory. In a notebook, plt.show() renders and then the garbage collector can clean up. In a script or a training loop, if you create figures without closing them, each one hangs around. Generate a thousand figures — one per training epoch, say — and your process will bloat until it crashes. The fix is always closing what you open:
for epoch in range(1000):
    fig, ax = plt.subplots()
    try:
        ax.plot(train_losses[:epoch+1])
        ax.set_title(f"Training Loss — Epoch {epoch}")
        fig.savefig(f"plots/loss_epoch_{epoch:04d}.png",
                    dpi=100, bbox_inches="tight")
    finally:
        plt.close(fig)
The try/finally ensures the figure is closed even if the plotting code throws an error. The bbox_inches="tight" prevents axis labels from getting clipped at the edges — a problem so universal that it should be the default, but it isn't. As for output formats: use PDF or SVG for papers (vector graphics, infinite zoom) and PNG for dashboards and Slack (rasterized, universally supported).
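The format choice is just the extension you pass to savefig — a small sketch writing the same figure as both raster and vector, into a temporary directory:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend so this runs anywhere
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])

outdir = tempfile.mkdtemp()
fig.savefig(os.path.join(outdir, "fig.png"), dpi=100, bbox_inches="tight")  # raster
fig.savefig(os.path.join(outdir, "fig.svg"), bbox_inches="tight")           # vector
plt.close(fig)

print(sorted(os.listdir(outdir)))  # ['fig.png', 'fig.svg']
```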
On headless servers — your CI/CD pipeline, a cloud training job, a Docker container — there's no display for matplotlib to draw to. You need the Agg backend, which renders to an in-memory buffer without needing a screen:
import matplotlib
matplotlib.use("Agg") # must be before importing pyplot
import matplotlib.pyplot as plt
The second production lesson: log your figures to experiment trackers. If you're using Weights & Biases, MLflow, or TensorBoard, don't take screenshots. Log programmatically.
# Weights & Biases
import wandb
wandb.log({"confusion_matrix": wandb.Image(fig)})
# MLflow
import mlflow
mlflow.log_figure(fig, "confusion_matrix.png")
# TensorBoard (PyTorch)
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
writer.add_figure("confusion_matrix", fig, global_step=epoch)
Six months from now, someone will ask "can you regenerate that chart from the experiment we ran in March?" If it's logged, you point them to the run. If it's a screenshot in a Slack thread, you spend a day recreating it from memory. Manual screenshots don't survive the first team rotation.
The third production lesson: reproducibility. Any plot that involves randomness — t-SNE, UMAP, sampled data, shuffled train/val splits — needs a random seed set before the operation. And the code that produced each figure should be versioned alongside the figure itself. The plot is the output. The code is the recipe. You need both.
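A minimal sketch of the seeding discipline — same seed, same sample, identical figure input every run (sample_for_plot is a hypothetical helper, not from any library):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(100)})

def sample_for_plot(frame, seed=42):
    """Seed every source of randomness the figure depends on."""
    np.random.seed(seed)  # for libraries that read the global NumPy state
    return frame.sample(n=10, random_state=seed)  # explicit seed for pandas

run1 = sample_for_plot(df)
run2 = sample_for_plot(df)
print(run1.equals(run2))  # True — the "same" plot really is the same
```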
Visualization as Communication (and Deception)
Edward Tufte, the godfather of data visualization theory, introduced a concept called the data-ink ratio: the proportion of ink on the page that represents actual data, versus ink spent on decoration. His argument was that you should maximize data-ink and eliminate everything else — heavy gridlines, 3D effects, ornate borders, background images. He called the decorative excess chartjunk.
Tufte's principle is more relevant now than ever, because the default output of most visualization tools includes a fair amount of chartjunk. Matplotlib's default grid, for instance, is heavier than it needs to be. Seaborn's sns.set_theme() improves this considerably — it's one of the rare cases where a single function call genuinely makes your output more honest.
But dishonest charts aren't always about decoration. The more insidious forms are structural. A truncated y-axis — starting at 95% instead of 0% — makes a 2% improvement look like a 40% jump. Cherry-picked time ranges can turn a flat trend into a hockey stick. Dual y-axes make two unrelated metrics look correlated because the scales can be independently adjusted to create any visual relationship you want. I've seen all of these in ML papers and production dashboards.
The truncated axis problem is particularly common in ML. When you're reporting accuracy improvements from 96.2% to 97.1%, starting the y-axis at 0% makes the improvement invisible. Starting it at 95% makes it dramatic. Neither is "wrong" in an absolute sense, but the second is misleading if the audience doesn't notice the axis. The honest approach: show the full axis AND a zoomed inset, or explicitly annotate the scale.
For our house price model, imagine we plot predicted vs. actual prices. If we only show the range where our model performs well and crop out the tail where it falls apart, the chart looks great. A senior engineer — or a sharp interviewer — will ask: "What happens outside this range?" If you don't have an answer — or worse, if you cropped it deliberately — that's the end of the conversation.
The rule I try to follow: every chart should be hard to misread even by someone who glances at it for five seconds. If understanding the chart requires reading fine print about axis ranges, log scales, or cherry-picked subsets, it has failed as a communication tool.
Wrap-Up
If you're still with me, thank you. I hope it was worth it.
We started with why visualization exists — not for pretty pictures, but for outsourcing pattern recognition to your visual cortex. We opened up matplotlib's architecture and found Figure, Axes, and Artists — a canvas, frames, and the things painted on them. We escaped the pyplot state machine trap and committed to the OO interface. We used a house price dataset to discover that the chart type follows from the question, not the other way around. Seaborn gave us statistical smarts in one-line calls. We learned that color can lie — that jet distorts and viridis corrects. Plotly gave us interactivity for audiences who need to ask their own questions. Production patterns taught us to close our figures, log to experiment trackers, and make reproducibility non-negotiable. And Tufte reminded us that the best chart is the one that's hard to misread.
My hope is that the next time you reach for plt.plot() by reflex, you'll pause and ask: what question am I answering, who's going to see this, and is this chart honest? Armed with that mental model, you'll make visualizations that don't need explaining — because they explain themselves.
Resources
Edward Tufte, The Visual Display of Quantitative Information — the O.G. treatise on honest charts. Dense but unforgettable. Read it once and you'll never unsee chartjunk again.
The matplotlib Artist tutorial — the official explanation of the Figure-Axes-Artist hierarchy. Wildly helpful once you know what you're looking at.
The seaborn tutorial — the best documentation in the Python data science ecosystem. Every example is a real dataset with real insights.
Plotly Python documentation — especially the Plotly Express section. Insightful examples of interactive charts that actually solve problems.
Nathaniel Smith and Stéfan van der Walt, "A Better Default Colormap for Matplotlib" — the talk and paper that gave us viridis. A masterclass in how perceptual science applies to everyday tools.
Alberto Cairo, How Charts Lie — a sharp, accessible book on visualization deception. Great for building the instinct to question every chart you see, including your own.