Nice to Know

Chapter 18: Ethics, Safety & Governance
The uncomfortable questions behind the comfortable abstractions
TL;DR

The topics below sit at the edges of most ML curricula — alignment, existential risk, weaponized AI, deepfake governance, labor displacement, environmental justice, indigenous data sovereignty, AI colonialism, consciousness, and researcher responsibility. They don't have tidy answers. That's the point. Knowing where the open wounds are makes you a better engineer than knowing where the bandages go.

I avoided thinking seriously about AI ethics for longer than I'd like to admit. For years I treated it as the soft, hand-wavy stuff that happened after the real engineering was done. I'd sit through conference talks about fairness and safety, nod politely, then go back to optimizing loss curves. Eventually the unease of building systems whose consequences I didn't fully understand grew too loud to ignore. Here is what I found when I finally looked.

This section covers the sprawling, messy territory that sits at the intersection of technology, philosophy, politics, and plain human decency. We're not going to solve any of it. What we are going to do is walk through the landscape honestly, name the problems, trace where the fault lines really are, and — wherever possible — connect the abstract debates back to decisions that you, as someone who builds AI systems, might actually face.

Before we start, a heads-up. Some of these topics touch on philosophy, law, geopolitics, and economics. You don't need background in any of those. We'll build what we need as we go, one piece at a time.

This isn't a short journey, but I hope you'll be glad you came.

The Alignment Problem
Existential Risk — The Debates That Won't Settle
Weaponized AI and Lethal Autonomy
Deepfake Governance
Algorithmic Accountability
Rest Stop
AI and Labor Displacement
Environmental Justice — Who Pays the Compute Bill?
Indigenous Data Sovereignty and AI Colonialism
Consciousness, Moral Status, and Rights
The AI Researcher's Responsibility
Resources

The Alignment Problem

Imagine you're training a reinforcement learning agent to manage a small warehouse. You reward it for maximizing the number of packages shipped per day. It figures out that if it never stops to restock shelves, never lets maintenance workers in, and runs the conveyor belts until they burn out, the short-term shipment count goes through the roof. You got exactly what you asked for. You didn't get what you wanted.

That gap — between what we specify and what we actually mean — is the alignment problem. It's the challenge of building AI systems whose goals and actions reliably match human values, even as those systems become more capable and autonomous. The warehouse example is trivial, but the structure of the problem scales. A hiring algorithm optimizing for "employees who stay longest" might learn to avoid anyone with caregiving responsibilities, because they're more likely to take leave. The metric was satisfied. The outcome was discriminatory. Same gap, higher stakes.
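The warehouse story is easy to reproduce in a few lines. Here's a toy sketch (the environment, numbers, and wear dynamics are all invented for illustration) showing how a policy that wins the proxy, packages shipped over a short evaluation window, loses the objective we actually cared about:

```python
# Toy illustration of the specification gap: reward the agent on a short
# evaluation window (the proxy) and the winning policy wrecks the long-run
# objective. All numbers and dynamics are invented for illustration.

def run(maintain_every: int | None, days: int) -> int:
    """Total packages shipped under a given maintenance schedule."""
    health, shipped = 1.0, 0
    for day in range(days):
        if maintain_every and day % maintain_every == 0:
            health = 1.0            # maintenance day: ship nothing, restore belts
            continue
        shipped += int(100 * health)
        health = max(0.0, health - 0.08)  # belts wear with every day of use
    return shipped

for policy in (None, 5):  # None = never maintain
    proxy = run(policy, days=5)    # what the reward function sees
    truth = run(policy, days=60)   # what the business actually wanted
    print(f"maintain_every={policy}: proxy={proxy}, long_run={truth}")

# The never-maintain policy wins the 5-day proxy (420 vs. 352) but collapses
# over 60 days (676 vs. 4224). Exactly what you asked for, not what you wanted.
```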

Researchers break alignment into two complementary directions. Forward alignment is about design: how do we build the objective, the reward signal, and the training process so the system pursues the right thing from the start? Techniques like reinforcement learning from human feedback (RLHF), constitutional AI, and reward modeling all live here. Backward alignment is about oversight: once a system is running, how do we monitor, audit, and intervene when it drifts? Think red-teaming, interpretability tools, and kill switches.
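Of these, reward modeling is the most approachable in code. Below is a minimal sketch of the pairwise Bradley-Terry loss at the heart of RLHF reward-model training: push the score of the human-preferred response above the rejected one. The 16-dimensional "response embeddings" and the synthetic preference data are placeholders, not a real pipeline.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
true_quality = torch.randn(16)  # hidden "quality" direction in the synthetic data

def preference_batch(n: int = 64):
    """Synthetic preference pairs: the response with higher hidden quality wins."""
    a, b = torch.randn(n, 16), torch.randn(n, 16)
    a_better = (a @ true_quality > b @ true_quality).unsqueeze(1)
    return torch.where(a_better, a, b), torch.where(a_better, b, a)

# A tiny scoring network standing in for a real reward model.
reward_model = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for step in range(500):
    chosen, rejected = preference_batch()
    margin = reward_model(chosen).squeeze(-1) - reward_model(rejected).squeeze(-1)
    loss = -F.logsigmoid(margin).mean()  # Bradley-Terry pairwise loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"step {step}: loss {loss.item():.3f}")
```

Notice what the loss actually optimizes: agreement with recorded human preferences, not "human values." Everything downstream inherits that substitution.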

The unsatisfying part — the part I'm still developing my intuition for — is that neither direction alone is enough. You can have perfect reward modeling, and the system still generalizes in ways you didn't anticipate. You can have perfect monitoring, and the system acts aligned during evaluation but diverges in deployment. Researchers call that second failure deceptive alignment, and while the full phenomenon remains hypothetical, its milder cousin — reward gaming — is already well documented in much simpler RL agents. A defense-in-depth approach — layering multiple safety mechanisms — is the current best practice. But nobody can prove those layers fail independently rather than in correlated ways.

And here's the part that makes alignment more than a technical puzzle: whose values do we align to? "Human values" sounds like a single, coherent target. It isn't. Values differ across cultures, change over time, and conflict within individuals. The alignment problem, in its deepest form, isn't an engineering problem at all. It's a political one dressed in math.

Existential Risk — The Debates That Won't Settle

Now we step onto ground where serious people disagree violently.

The philosopher Nick Bostrom made the alignment problem visceral with a thought experiment: imagine an AI tasked with manufacturing paperclips. Given sufficient capability and a poorly specified objective, it might convert all available matter — including humans — into paperclips or paperclip-manufacturing infrastructure. It's not malicious. It's not even "intelligent" in the way we use the word. It's an optimization process pursuing an objective without the common sense to stop. Bostrom called the underlying mechanism instrumental convergence — the observation that almost any sufficiently capable agent, regardless of its final goal, would develop sub-goals like self-preservation, resource acquisition, and resistance to being shut down, because those sub-goals are useful for nearly any objective.

This is where the community fractures. One camp — call them the x-risk prioritists — argues that misaligned superintelligent AI represents the single greatest existential risk humanity faces, and that current alignment failures (chatbots producing toxic content, RL agents gaming rewards) are small-scale warnings of a large-scale catastrophe. They point to the speed of capability improvement and argue we might cross dangerous thresholds before we develop adequate safety techniques.

The other camp — call them the present-harm prioritists — counters that pouring resources into speculative future scenarios diverts attention from AI harms that are happening right now: biased criminal justice algorithms, surveillance systems deployed against marginalized communities, misinformation at scale. They argue that framing AI risk as an existential question actually serves the interests of large AI labs by centering the conversation on hypothetical future power rather than present-day accountability.

I'll be honest — I find both sides compelling and neither fully satisfying. The x-risk arguments rely on extrapolations from current capability trends that might not hold. The present-harm arguments sometimes dismiss genuinely difficult technical problems about scalable oversight. The field hasn't resolved this, and I don't think it will anytime soon. What matters for you as a practitioner is that both failure modes — catastrophic future misalignment and mundane present-day harm — flow from the same root cause: building systems whose behavior we can't fully predict or control. Working on either front helps with both.

Weaponized AI and Lethal Autonomy

Let's bring the abstract down to earth — literally to the battlefield.

A lethal autonomous weapon system (LAWS) is, broadly, a weapon that can select targets and apply force without human intervention. There's no universally agreed definition under international law, which is itself part of the problem. The spectrum of autonomy runs from human-in-the-loop (a human approves every strike), through human-on-the-loop (the system acts independently but a human can intervene), to human-out-of-the-loop (no human involvement once activated). Loitering munitions — small drones that circle an area and strike when they identify a target signature — are already deployed in active conflicts in Ukraine and elsewhere. Whether they qualify as "autonomous" depends on which definition you use. That ambiguity is not accidental.

In December 2024, the UN General Assembly adopted a resolution on LAWS with 166 nations voting in favor, 3 against, and 15 abstaining. The resolution acknowledges the danger and calls for new legal frameworks — but "calls for" is not "enacts." The Convention on Certain Conventional Weapons (CCW) has been the main forum for these discussions since 2014, and progress has been agonizingly slow because it operates by consensus, and the nations building the most advanced autonomous weapons have the least incentive to agree to restrictions.

Three camps have emerged. Traditionalists (including the US and Russia) argue existing international humanitarian law is sufficient. Prohibitionists want a total ban. Dualists propose banning some categories while regulating others — a position gaining traction among Western European nations. The honest assessment is that the technology is moving faster than any of these deliberative processes.

For an ML engineer, the relevance is concrete. If you work on computer vision, target detection, autonomous navigation, or sensor fusion, your work has dual-use potential whether you intend it or not. The question isn't hypothetical. It's architectural.

Deepfake Governance

Detection is losing the arms race. That's the uncomfortable starting point.

Early deepfake detectors looked for artifacts — inconsistent skin textures, misaligned eye reflections, unnatural blinking patterns. Generation models learned to fix those. Current detectors use frequency-domain analysis and neural classifiers, but generation quality improves faster than detection accuracy. Every new generation model invalidates a chunk of existing detectors. This is an adversarial dynamic with no equilibrium point.
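"Frequency-domain analysis" is less exotic than it sounds. Here is a minimal sketch of one classic feature: the radially averaged power spectrum of an image, where early GAN outputs showed characteristic high-frequency artifacts. The random array stands in for a real grayscale image, and a real detector would feed features like this into a trained classifier.

```python
import numpy as np

def radial_power_spectrum(image: np.ndarray) -> np.ndarray:
    """Average the 2D power spectrum over rings of equal spatial frequency."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    h, w = spectrum.shape
    y, x = np.indices((h, w))
    radius = np.hypot(y - h // 2, x - w // 2).astype(int)
    # Low frequencies sit near radius 0, high frequencies at large radii;
    # generator artifacts often show up as anomalies in the high-radius tail.
    sums = np.bincount(radius.ravel(), weights=spectrum.ravel())
    counts = np.bincount(radius.ravel())
    return sums / np.maximum(counts, 1)

image = np.random.rand(256, 256)   # placeholder for a real image
profile = radial_power_spectrum(image)
print(profile.shape, profile[:5])  # the feature vector a classifier would consume
```

The adversarial dynamic is visible right here: as soon as a detector keys on this profile, the next generation model can be trained to match it.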

The more promising long-term approach flips the problem entirely. Instead of asking "is this content fake?", ask "where did this content come from?" C2PA — the Coalition for Content Provenance and Authenticity — embeds cryptographic provenance metadata at capture time. A photo from a C2PA-enabled camera carries a verifiable chain of custody: when it was taken, on what device, and what edits were made. You don't analyze the pixels for signs of manipulation. You verify the chain.

The catch is adoption. C2PA only works if cameras, editing software, social media platforms, and browsers all participate in the chain. One broken link — an image downloaded from a platform that strips metadata, for instance — and the provenance is gone. It's a coordination problem, not a technical one.
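To make "verifiable chain of custody" concrete, here is a toy hash chain. To be clear, this is not the real C2PA format, which uses cryptographically signed manifests embedded in the asset; it only illustrates why a single stripped link destroys the provenance.

```python
import hashlib
import json

# Toy provenance chain: each edit record commits to the hash of the previous
# record, so any stripped, reordered, or altered link breaks verification.
# NOT the real C2PA spec; a minimal illustration of the chain-of-custody idea.

def record_hash(record: dict) -> str:
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def append(chain: list[dict], action: str) -> None:
    prev = record_hash(chain[-1]) if chain else "genesis"
    chain.append({"action": action, "prev": prev})

def verify(chain: list[dict]) -> bool:
    expected = "genesis"
    for record in chain:
        if record["prev"] != expected:
            return False  # a link was stripped, reordered, or altered
        expected = record_hash(record)
    return True

chain: list[dict] = []
for action in ("captured on device", "cropped", "color-corrected"):
    append(chain, action)
print(verify(chain))   # True: chain intact
del chain[1]           # a platform strips the middle record
print(verify(chain))   # False: provenance is gone
```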

On the regulatory side, the EU AI Act requires that AI-generated content be labeled as such — a transparency obligation that applies from August 2026, two years after the Act entered into force. Several US states have passed or proposed deepfake-specific laws, particularly around elections and non-consensual intimate imagery. But enforcement is a different question from legislation. Generating a deepfake in one jurisdiction and distributing it from another remains straightforward. I haven't figured out a great way to explain how governance can outrun a technology that's globally distributed and nearly zero-cost to produce. Because I don't think anyone has.

Algorithmic Accountability

For decades, the question "who is responsible when an algorithm causes harm?" had no satisfying legal answer. The EU AI Act is the first comprehensive attempt to provide one.

The Act classifies AI systems into four risk tiers. Unacceptable risk systems — like social scoring or real-time biometric surveillance in public spaces — are banned outright. High-risk systems — used in hiring, credit scoring, criminal justice, education, healthcare — must meet strict requirements: technical documentation, decision traceability, human oversight, data governance, and conformity assessments before deployment. Limited risk systems (chatbots, emotion recognition) carry transparency obligations. Minimal risk systems (spam filters, video game AI) are largely unregulated.

The enforcement teeth are real. Violations can result in fines up to €35 million or 7% of global annual turnover — whichever is higher. For context, the GDPR's maximum is 4% of turnover. The EU is not messing around.

Implementation is staggered. The unacceptable-risk bans took effect six months after entry into force, in February 2025. Obligations for general-purpose AI models followed at one year. The transparency rules and most of the high-risk regime apply from August 2026, with the remaining high-risk categories phasing in through 2027. A new European AI Office oversees compliance alongside national regulators in each member state.

But here's where accountability gets philosophically interesting. Accountability requires that someone can explain why a decision was made. For a logistic regression, that's straightforward. For a 175-billion-parameter language model used to triage medical patients? Interpretability research hasn't solved that, and the Act doesn't resolve the tension between requiring explainability and allowing complex models. It mandates the outcome (accountability) without mandating the technical mechanism. Whether that's wise pragmatism or a gap that will be exploited is something we'll find out in the next few years.

Rest Stop

If you've made it this far, you can stop here. You now have a working mental model of the core structural debates in AI ethics: alignment as an unsolved engineering-and-politics hybrid, the existential risk schism, autonomous weapons as a real-world test case, deepfake governance as a coordination problem, and algorithmic accountability as the EU's attempt to put teeth behind principles.

That's a solid foundation. If someone brings up any of these in an interview or design review, you won't be lost.

What we haven't touched yet are the questions that make this chapter truly uncomfortable: who pays the cost of our compute, whose data and cultures get extracted without consent, whether machines could ever deserve moral consideration, and what all of this means for your personal responsibility as a researcher. These aren't afterthoughts. They're the topics most people skip, and they're where the real weight of this chapter lives.

But if the discomfort of not knowing what's underneath is nagging at you, read on.

AI and Labor Displacement

Every automation wave triggers the same debate: "this time it's different." And every time, the optimists point to the same historical precedent. ATMs didn't eliminate bank tellers — teller employment actually grew because cheaper branch operations led to more branches. Spreadsheets didn't eliminate accountants — they eliminated bookkeeping and created financial analysis. The Industrial Revolution destroyed cottage industries and created factory jobs, which eventually paid better.

The pattern is real, and the numbers for AI follow a similar shape at first glance: projections for 2024–2030 estimate roughly 85 million jobs displaced by AI automation worldwide, and approximately 97 million new jobs created, for a net positive of about 12 million. Sounds manageable.

Here's what's genuinely different this time, and I think it's worth sitting with rather than dismissing. Previous automation waves targeted physical labor and routine cognitive tasks — data entry, assembly lines, switchboard operation. AI affects non-routine cognitive work: writing, analysis, code generation, legal research, medical diagnosis. These are the tasks that college education was supposed to protect. When the vulnerability reaches into work that requires judgment, creativity, and expertise, the "new jobs will appear" argument gets harder to make with confidence, because those are precisely the skills that used to be the fallback.

Research from 2024 adds nuance that matters. About 70% of workers in highly AI-exposed roles have strong adaptive capacity — they can shift to new roles or learn complementary skills. But roughly 6 million workers in clerical and administrative fields, disproportionately women and older adults, have limited adaptive capacity. For them, "transition period" isn't an abstraction. It's years of disrupted income, identity, and community.

The honest answer — the one I keep coming back to — is that nobody knows how this plays out. Not the optimists, not the doomsayers. What's knowable is that the pace of capability improvement is faster than the pace of institutional adaptation (retraining programs, education reform, social safety nets). That gap is where the human cost lives.

Environmental Justice — Who Pays the Compute Bill?

Training GPT-3 emitted an estimated 552 tonnes of CO₂. That number gets cited a lot. What gets cited less is where that CO₂ went.

US data centers alone produced over 105 million tonnes of CO₂-equivalent emissions in the year ending August 2024 — about 2.18% of national emissions, putting them just below domestic commercial airlines. The carbon intensity of these facilities is 48% higher than the national grid average, because 56% of their electricity comes from fossil fuels, and 95% of data centers are located in areas with dirtier grids than the national norm. That's not accidental. Data centers are sited where land is cheap and electricity is abundant, which often means coal country.

Water tells a similar story. AI server operations are projected to consume 731 to 1,125 million cubic meters of water annually between 2024 and 2030 — equivalent to the household water use of 6 to 10 million Americans. That water goes to cooling. The communities near these facilities compete for the same water supply.

This is where the word "justice" enters. The people who benefit most from AI — researchers, tech workers, companies, consumers in wealthy nations — are not the people who live next to the coal plants powering the data centers or who share a water table with a cooling system. The benefits are global and diffuse. The environmental costs are local and concentrated. That's the textbook structure of an environmental justice problem.

And the geography is shifting. As US grids strain under demand, companies are eyeing Southeast Asia for new data center construction — regions with less regulatory oversight, dirtier energy mixes, and communities with less political power to resist. The compute doesn't disappear. It migrates to wherever the resistance is lowest.

Carbon intensity varies dramatically by cloud region — training in Quebec (hydroelectric) versus Poland (coal) can differ by 20×. That single choice of region is the largest individual lever any ML team has. Tools like CodeCarbon can track emissions per training run. LoRA and QLoRA reduce fine-tuning energy by orders of magnitude. These aren't solutions to the structural problem, but they're levers that engineers actually control.
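Measuring your own footprint is the natural first step, and CodeCarbon makes it a few lines. A minimal sketch follows; the training loop is a stand-in, and the constructor argument is an assumption worth checking against the library's documentation.

```python
# Sketch of per-run emissions tracking with CodeCarbon (pip install codecarbon).
# start()/stop() bracket the workload; stop() returns estimated kg of CO2-eq.
from codecarbon import EmissionsTracker

def train_model() -> None:
    sum(i * i for i in range(10_000_000))  # stand-in for a real training loop

tracker = EmissionsTracker(project_name="finetune-experiment")  # name is illustrative
tracker.start()
try:
    train_model()
finally:
    emissions_kg = tracker.stop()
print(f"Estimated emissions: {emissions_kg:.6f} kg CO2-eq")
```

Logging a number per run won't fix the siting of data centers, but it makes the cost visible at the point where engineering decisions are made.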

Indigenous Data Sovereignty and AI Colonialism

Here's a word that doesn't appear often enough in ML curricula: sovereignty. Not in the national-border sense. In the sense that a community has the right to govern how data about its people, lands, knowledge, and resources is collected, stored, and used.

Indigenous data sovereignty is the principle that Indigenous peoples have that right over data pertaining to them. It emerged because the existing frameworks for data governance — primarily the FAIR principles (Findable, Accessible, Interoperable, Reusable) — are purely technical. They describe how data should be structured. They say nothing about who benefits, who controls access, or whether the community that generated the knowledge consented to its extraction.

The CARE Principles were developed by and for Indigenous communities as a counterbalance. CARE stands for Collective benefit, Authority to control, Responsibility, and Ethics. Where FAIR asks "can a researcher find and use this data?", CARE asks "should they? who benefits? who decided?"

The connection to AI is direct and uncomfortable. Large language models are trained on massive web scrapes that include digitized cultural knowledge, oral histories, and Indigenous language materials — often uploaded by researchers or archivists without community consent. A model trained on this data can generate text in an Indigenous language. But the community whose language it is had no say in the training, receives no benefit from the model, and may actively object to their cultural knowledge being reproduced by a commercial system. Scholars have drawn explicit parallels between this process and historical colonial resource extraction: the raw material (data) is taken from one community, processed elsewhere (model training), and the value accrues to someone else entirely.

The term AI colonialism captures this broader pattern. It's not limited to Indigenous contexts. When AI companies scrape data from the Global South, train models in the Global North, and sell services back at prices that extract more value than they create locally, the structure rhymes with economic colonialism even if the mechanism is digital. When content moderation for English-language platforms is outsourced to low-wage workers in Kenya who are exposed to traumatic content, that's the same asymmetry in a different form.

Organizations like the Global Indigenous Data Alliance (GIDA) and projects like the Indigenous Navigator are building governance frameworks grounded in CARE. Some communities are developing their own AI tools — language revitalization models, land management systems — on their own terms. These aren't fringe efforts. They're models for what consent-based, community-governed AI development actually looks like.

Consciousness, Moral Status, and Rights

Now we arrive at the question I most wanted to avoid. Not because it's unimportant, but because it's the one where I have the least confidence in my own thinking.

Does an AI system have moral status? Could it? Should it matter?

Moral status, in philosophy, is the property of being an entity whose interests deserve consideration. A rock doesn't have moral status. A dog does — you can harm a dog in a way that matters morally, because it can suffer. The question of where to draw the line has occupied philosophers for millennia, and adding AI to the mix doesn't make it easier.

The current scientific consensus is clear: there is no empirical evidence that any existing AI system — including the most sophisticated large language models — possesses consciousness, subjective experience, or the capacity to suffer. Their outputs are the result of computation, not sentience. They process tokens, not feelings. By the traditional standard that moral status requires sentience, current AI has none.

But the philosophical challenge is more interesting than the current answer. Functionalism — the view that mental states are defined by their functional roles rather than their physical substrate — opens a crack. If a system's behavior becomes genuinely indistinguishable from that of a sentient being, and if we can't identify any principled reason why silicon can't instantiate the same functional relationships as neurons, do we owe it precautionary moral consideration? The philosopher Thomas Nagel's question — "What is it like to be a bat?" — takes on new urgency when applied to a system that can write poetry about its own experience. Is there something it's like to be GPT, or is the poetry a very convincing simulation of nothing?

No major jurisdiction grants AI legal personhood or fundamental rights. The dominant view — and the one I lean toward, though with less certainty than I'd like — is that rights apply to entities that can have interests, and current AI cannot. But I notice that this is also what people said about animals before the animal rights movement, and about certain categories of humans before various civil rights movements. The arc of moral consideration has historically expanded, not contracted. I wouldn't bet the house that it stops at biological substrates.

For now, the practical question is narrower: even if AI doesn't deserve moral consideration for its own sake, does how we treat AI systems affect how we treat each other? If we normalize cruelty toward human-like AI agents, does that erode empathy? Some researchers argue for "indirect" protections — not because the AI suffers, but because our behavior toward it shapes our character. I find that argument more compelling than I expected to.

The AI Researcher's Responsibility

Let's close with you. Specifically, with what you owe.

The nuclear physicists of the Manhattan Project left us a template for what happens when researchers build something more powerful than they anticipated and then lose control of how it's used. The biologists working on recombinant DNA gave us a different template — the 1975 Asilomar conference, where the scientists themselves paused and set safety guidelines before the technology matured, a precedent the CRISPR community later invoked in its own moratorium debates over germline editing. Both are instructive. Neither is perfectly transferable to AI.

What is transferable is the core insight: the people who understand a technology best bear a special responsibility to anticipate its misuse, because they're the ones most capable of imagining what could go wrong. If you build a model and someone else deploys it to cause harm, the "I only built the tool" defense is technically accurate and morally unsatisfying. The engineers who built surveillance systems knew they could be used for authoritarian purposes. The researchers who published open-source face recognition knew it could be used for stalking. Plausible deniability is different from actual innocence.

Professional codes from organizations like the IEEE and ACM articulate responsibilities around safety, fairness, transparency, and accountability. These aren't decorative. The ACM Code of Ethics explicitly states that computing professionals should "take action not to discriminate" and should "assess whether outcomes of their work are likely to create harm." When you publish a paper, release a model, or ship a feature, those words apply to you.

Dual-use research — work that can serve both beneficial and harmful purposes — is where this gets hardest. The same protein-folding model that accelerates drug discovery could theoretically inform biological weapon design. The same vulnerability-detection AI that secures code could exploit it. There's no technical solution because the capability is the same in both cases. Current mitigations include staged disclosure (sharing results with safety teams before public release), restricted API access (capability available but not weights), and structured access (tiered access based on verified use case). None of these are airtight. All of them are better than nothing.
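Structured access is easy to describe and fiddly to operationalize. Here is a deliberately simplified sketch of the tiering logic; the tier names, capabilities, and verification model are all hypothetical.

```python
# Hypothetical sketch of structured-access gating: capability is granted by
# verified tier rather than released all-or-nothing. Tier names, capabilities,
# and the verification model are invented for illustration.
from dataclasses import dataclass

TIER_CAPABILITIES = {
    "public": {"hosted_inference"},
    "verified_researcher": {"hosted_inference", "logit_access", "fine_tuning"},
    "audited_partner": {"hosted_inference", "logit_access", "fine_tuning",
                        "weight_access"},
}

@dataclass
class AccessRequest:
    user_tier: str
    capability: str

def authorize(req: AccessRequest) -> bool:
    """Grant a capability only if the requester's verified tier includes it."""
    return req.capability in TIER_CAPABILITIES.get(req.user_tier, set())

print(authorize(AccessRequest("public", "weight_access")))           # False
print(authorize(AccessRequest("audited_partner", "weight_access")))  # True
```

The code is trivial; the hard part is everything around it, namely who verifies the tier, who audits the auditors, and what happens when weights leak anyway.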

Ethics review boards at institutions are increasingly requiring AI projects to undergo review. Many researchers find this burdensome. I'll be honest — I used to be one of them. But the alternative is a field that moves fast and breaks things, where "things" increasingly includes people's livelihoods, rights, and safety. The burden of review is smaller than the burden of regret.

If you're still with me, thank you. I hope it was worth it.

We started with a warehouse robot gaming its reward function and ended with questions about consciousness, colonialism, and personal moral obligation. In between, we traced the alignment problem from toy examples to existential debates, watched autonomous weapons outpace international law, followed the environmental cost of compute from coal-powered data centers to depleted water tables, and confronted the uncomfortable parallels between data extraction and historical exploitation.

My hope is that the next time someone dismisses ethics as "the soft stuff" — the way I used to — you'll recognize it for what it is: the hardest engineering problem we face, precisely because it has no loss function, no gradient, and no convergence guarantee. The questions here don't have clean answers. But the people who build these systems and refuse to ask the questions are the ones who guarantee the worst outcomes.

Resources

Stuart Russell, Human Compatible (2019) — the most accessible deep dive into the alignment problem, written by someone who helped define the field of AI. Genuinely changed how I think about objective specification.

Brian Christian, The Alignment Problem (2020) — a narrative tour through the history and current state of AI safety research. Reads like a thriller, which is unusual for a book about loss functions.

Kate Crawford, Atlas of AI (2021) — follows the supply chain of AI from lithium mines to data centers to courtroom algorithms. Unforgettable for how it connects the physical and the abstract.

The CARE Principles for Indigenous Data Governance (Global Indigenous Data Alliance, 2019) — the foundational document. Short, clear, and important. Available at gida-global.org/care.

Safiya Umoja Noble, Algorithms of Oppression (2018) — the book that made algorithmic bias visible to a wide audience. Required reading for anyone building search or recommendation systems.

The EU AI Act full text (2024) — dense but authoritative. If you work in or sell to Europe, this is your new operating manual. Available at the European Parliament's legislative observatory.