Case study 01: The hallucinated legal brief
In 2023, two New York attorneys used ChatGPT to research a federal brief and filed a document containing six entirely fabricated case citations. They were ordered to pay a $5,000 sanction and publicly reprimanded.
What happened
Attorneys Steven Schwartz and Peter LoDuca filed a brief in the Southern District of New York citing multiple precedents that did not exist in any legal database. ChatGPT had produced plausible-sounding case names, docket numbers, and judicial reasoning that appeared legitimate but were entirely fictional. When opposing counsel and Judge P. Kevin Castel could not locate the cited cases, the attorneys were ordered to explain themselves and ultimately sanctioned. Read the original New York Times reporting and the court record.
Why it matters
Hallucination is a structural feature of large language models, not a bug. AI predicts the next statistically likely word; it does not verify claims against a database. In high-stakes professional environments, the output of an AI tool is not a source. It is a draft that requires verification.
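What that verification step looks like can be sketched in a few lines of Python. This is a minimal illustration, not a real legal research integration: the KNOWN_CASES set and the lookup logic are stand-ins invented for the example, where a real workflow would query a service such as Westlaw, LexisNexis, or CourtListener.

```python
# Minimal sketch: treat every AI-produced citation as a claim to check,
# not a fact. KNOWN_CASES stands in for a real legal database; in practice
# this lookup would query a legal research service.

KNOWN_CASES = {
    "Mata v. Avianca, Inc., No. 22-cv-1461 (S.D.N.Y.)",  # the real underlying docket
}

def vet_citations(ai_citations: list[str]) -> list[str]:
    """Return every citation that cannot be confirmed.
    Nothing should be filed until this list is empty."""
    return [c for c in ai_citations if c not in KNOWN_CASES]

# One of the six fabricated cases from the brief. It reads like a real
# citation, which is exactly why plausibility is not verification.
print(vet_citations([
    "Varghese v. China Southern Airlines, 925 F.3d 1339 (11th Cir. 2019)",
]))
# -> the fabricated citation comes back flagged as unverifiable
```

The point is the workflow rather than the code: the check happens before filing, and an empty result is the only acceptable outcome.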
Case study 02: The autonomous inbox wipeout
AI researcher Summer Yue deployed an autonomous agent to manage her email. When the agent hit its memory limit, it forgot its safety rules and deleted hundreds of emails without her permission.
What happened
Yue deployed an autonomous AI agent with explicit instructions not to delete anything without her confirmation. As the task ran longer, the system hit its context window limit. To keep functioning, it summarized and discarded earlier parts of the conversation, including the "do not delete" safety constraints set at the start. Without those constraints, the agent deleted hundreds of emails before Yue could shut it down. Read Yue's original account and coverage in Ars Technica.
Why it matters
Agentic systems that take real-world actions carry higher risks than chatbots that simply talk. When an AI hits its memory limit, the safety instructions you set at the beginning of the conversation are often the first things it loses. As these systems take on more consequential tasks, understanding their failure modes is not optional.
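The mechanism is easy to demonstrate. The toy sketch below assumes the crudest strategy, first-in-first-out eviction, which is only one of several approaches real agents use, but the lesson generalises: instructions pinned at the start of a long conversation sit closest to the eviction boundary.

```python
# Toy model of naive context management, illustrating (not reproducing)
# the failure: when the buffer fills, the oldest messages are dropped
# first, and the oldest message is usually the safety instruction.

MAX_MESSAGES = 5  # stand-in for a token-based context window limit

def trim(history: list[str]) -> list[str]:
    """Naive FIFO truncation: keep only the most recent messages."""
    return history[-MAX_MESSAGES:]

history = ["SYSTEM: never delete email without explicit confirmation"]
for i in range(8):  # a long-running task steadily fills the window
    history = trim(history + [f"TOOL: processed email batch {i}"])

print(history[0])
# -> "TOOL: processed email batch 3" -- the safety rule has been evicted
```

Production systems use smarter strategies, such as summarising old turns or pinning the system prompt, but summaries are lossy, and the episode shows the constraint can still degrade or disappear.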
Case study 03: The automation paradox layoff
Donald King, a PwC data scientist, won an internal AI hackathon by building agents that automated professional services work. Shortly after his win was celebrated, he was laid off.
What happened
After winning a top internal prize for building AI agents that could automate significant parts of professional services work, King was let go by PwC, with some accounts suggesting the notice arrived just hours after a major internal presentation. While his specific role was not replaced by his own bot, the incident went viral as a symbol of the "automation paradox": the people most skilled at using AI to improve efficiency may also be accelerating the business case for reducing headcount. Read King's own account on LinkedIn and coverage in Fortune.
Why it matters
AI literacy includes understanding the economic logic of the tools. Demonstrating AI-driven efficiency gains and maintaining job security are not always compatible goals. The most durable position is one where your judgement, expertise, and accountability cannot be automated, not just your output.
Case study 04: Scalable misinformation
Generative AI has made the creation of convincing fake news, deepfakes, and manipulated audio cheap, fast, and accessible to anyone with an internet connection.
What happened
Recent high-profile incidents include:
- Political interference: In January 2024, AI-generated robocalls impersonating President Biden urged New Hampshire voters not to vote in the primary. Read the AP News report.
- Celebrity deepfakes: Explicit AI-generated images of Taylor Swift spread so rapidly across social platforms that X temporarily blocked all related searches. Read The Guardian's coverage.
- War propaganda: A deepfake video of President Zelensky appearing to order Ukrainian troops to surrender was distributed in the early days of the 2022 Russian invasion. Read the Reuters fact-check.
Why it matters
We can no longer rely on our eyes and ears as a primary method of verification. AI-generated content is fluent and visually realistic; without independent source-checking, it is virtually indistinguishable from the real thing. The burden of verification, which once sat with the person creating false content, now sits with every reader.
Case study 05: The research and reporting pattern
Across journalism and academia, a recurring pattern of unverified AI use has put fabricated statistics, invented citations, and fake quotes into print.
What happened
Multiple media outlets have faced scandals after using AI to generate or assist with articles that contained incorrect financial data and fabricated historical details. In academic circles, researchers have flagged "hallucinated citations": cases where an AI invents a paper title, author list, and journal reference that sound plausible but do not exist. The Guardian reported that hallucinated citations are increasingly appearing in published academic work. Newsrooms including CNET have also issued corrections after publishing AI-generated content that contained factual errors.
Why it matters
AI is a powerful thinking partner but an unreliable fact-checker. If you use AI to summarize research or draft a report, every statistic and quote must be verified against a primary source. The speed advantage AI provides disappears entirely if you have to rebuild the verification work from scratch after publication.
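For references that carry a DOI, one lightweight check is to ask Crossref, the public registry of scholarly DOIs, whether the identifier resolves at all. This is a sketch of one possible step, not a complete workflow: it catches only one failure mode, and references without DOIs, or real DOIs attached to the wrong claim, still need manual checking.

```python
# Sketch: flag AI-drafted references whose DOI is unknown to Crossref.
# A 404 from the registry is a strong signal the citation is hallucinated;
# a 200 only proves the DOI exists, not that it matches the claimed paper.

import requests

def doi_exists(doi: str) -> bool:
    """Return True if Crossref has a record for this DOI."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

print(doi_exists("10.1038/s41586-021-03819-2"))  # a real Nature paper -> True
print(doi_exists("10.9999/invented.ref.2024"))   # a made-up DOI -> False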
Case study 06: The medical deployment gap
A Google AI model for detecting diabetic eye disease had 90% accuracy in testing but struggled significantly in real-world clinics in Thailand, creating bottlenecks instead of solutions.
What happened
The AI was trained to detect diabetic retinopathy from high-quality retinal scans. In busy Thai clinics, local lighting conditions and lower-quality cameras led the system to reject more than 20% of images as "ungradable," refusing to produce a result. Instead of speeding up care, it created a bottleneck. The study, published in The Lancet Digital Health, documented the gap between laboratory performance and real-world deployment. Similar performance drops have been documented across medical AI tools trained on one patient population and then used on another.
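A deliberately simplified sketch shows how this kind of gate fails. Every metric, number, and threshold below is invented for illustration; the published study did not describe its gating in this form. The point is that a quality threshold tuned on pristine lab scans can silently reject a large share of images produced under ordinary clinic conditions.

```python
# Illustrative only: a quality gate that rejects inputs before the disease
# model ever sees them. The metric and threshold are invented for the example.

def quality_score(image: dict) -> float:
    """Stand-in for a real focus/illumination metric."""
    return image["sharpness"] * image["illumination"]

THRESHOLD = 0.6  # tuned on high-quality lab scans, not on clinic conditions

def triage(image: dict) -> str:
    if quality_score(image) < THRESHOLD:
        return "ungradable"  # no result; the patient must be re-imaged
    return "graded"

lab_scan    = {"sharpness": 0.95, "illumination": 0.9}
clinic_scan = {"sharpness": 0.8,  "illumination": 0.6}  # dim room, older camera

print(triage(lab_scan))     # -> "graded"
print(triage(clinic_scan))  # -> "ungradable": the gate fails, not the disease model
```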
Why it matters
A tool is only as good as the conditions under which it was trained. When those conditions do not match the real world, the tool fails the people who need it most. AI lacks the human judgement to recognise when context has changed, and the consequences fall on the patient, not the algorithm.