Lesson 1.4

1.4: Data

8 minutes


Data is the pillar where most programs quietly fail, because nobody feels the failure until a specific thing goes wrong. You can survive for years with no data inventory. You can survive without tested backups until the Tuesday you can’t. The pillar’s job is to keep you from finding out the hard way.

What lives here

  • Classification. Which data is regulated (PII, PHI, cardholder data, source code), which is sensitive-but-not-regulated (financial projections, board decks, customer lists), and which is public. Most orgs have three classes. Two classes is fine. One class is a problem.
  • Encryption at rest and in transit. Your S3 buckets, RDS instances, laptop disks — all encrypted. Your API traffic — all TLS. This is table stakes; the failure mode is the one bucket that isn’t.
  • Backups. The backups that exist in your cloud provider’s console. Distinct from:
  • Tested backups. The backups you’ve actually restored from in the last 90 days.
  • Retention and deletion. How long you keep things, how you know when to delete them, whether deletion actually deletes.
  • DLP. Data loss prevention — the controls that notice when regulated data is leaving your environment.

What typically goes wrong

Nobody knows where customer PII lives. You think it’s in one database. It’s in that database, plus three BI exports, plus a handful of Google Sheets Sales uses for forecasting, plus the vendor your CS team uses for surveys. When the regulator asks “where does this data live?” you cannot answer without a week of archaeology.

Backups exist but aren’t tested. You have a backup policy. You have a backup retention period. Nobody has run a restore in two years. The backup you will eventually need is corrupt, or encrypted with a key only the former ops lead had, or it’s the wrong database, or it restores to a version of the schema from 2022.

DLP that produces 10,000 false positives a day. You turned on Google Workspace DLP rules six months ago. The alert queue has 47,000 unread items. Everyone ignores it. When the one real exfiltration happens, it’s in the queue.

Classification that exists on paper. The policy says “data is classified as Public, Internal, or Restricted.” No system enforces it. No document carries a label. When you ask an engineer where the Restricted data is, they stare at you.

Unmanaged copies. The production database gets copied to staging. Staging is less locked down. A developer pulls a local copy to debug an issue. The local copy sits on their laptop for six months. Multiply by 40 developers.

What mature orgs do differently

Data inventory tied to business systems. You maintain a list of systems (from the Vendors pillar) and, for each one, what kinds of data live there. When a new system is added, the inventory is updated. When a system is deprecated, the data is accounted for. This is table stakes for any privacy program.
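A data inventory can be as simple as a map from systems to the data classes they hold, queryable in both directions. A minimal sketch in Python, with entirely hypothetical system names (not from the lesson):

```python
# A minimal data inventory sketch: each system from the vendor list maps
# to the data classes it holds. All system names here are hypothetical.
INVENTORY = {
    "prod-postgres":  {"PII", "financial"},
    "looker-exports": {"PII"},
    "survey-vendor":  {"PII"},
    "public-website": {"public"},
}

def systems_holding(data_class: str) -> list[str]:
    """Answer the regulator's question: where does this data live?"""
    return sorted(s for s, classes in INVENTORY.items() if data_class in classes)

print(systems_holding("PII"))
# With this inventory, the answer takes seconds, not a week of archaeology.
```

The point is not the tooling; a spreadsheet with the same two columns works. The point is that the lookup exists before the regulator asks.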

Quarterly backup restore tests. A real restore from a real backup to a sandbox environment, with a simple success criterion: can we query the most recent production state? If the answer is no, the backup is theater. Quarterly is a floor. Monthly is better.
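The success criterion above ("can we query the most recent production state?") can be made mechanical. A sketch of the freshness check, assuming daily backups and a 26-hour staleness window (both assumptions, not universal rules):

```python
from datetime import datetime, timedelta, timezone

# After restoring the latest backup to a sandbox, query the newest row's
# timestamp and check it is recent enough. 26 hours assumes daily backups
# plus some slack; tune to your backup cadence.
MAX_STALENESS = timedelta(hours=26)

def restore_is_fresh(newest_row_ts, now=None):
    """True if the restored copy reflects recent production state."""
    now = now or datetime.now(timezone.utc)
    return (now - newest_row_ts) <= MAX_STALENESS

now = datetime(2024, 4, 1, 12, 0, tzinfo=timezone.utc)
print(restore_is_fresh(datetime(2024, 4, 1, 2, 0, tzinfo=timezone.utc), now))  # this morning's data: passes
print(restore_is_fresh(datetime(2024, 2, 1, tzinfo=timezone.utc), now))        # two-month-old data: fails
```

A restore that completes but fails this check is the "wrong database / 2022 schema" failure mode caught in a sandbox instead of an incident.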

Classification-driven controls. Restricted data can only live in N approved systems. Sensitive data has a broader but still bounded footprint. Public is public. The rule isn’t “DLP everywhere.” It’s “here’s where regulated data is allowed; anything outside those boundaries is an incident.”
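The boundary rule ("regulated data is allowed in N approved systems; anything outside is an incident") is a set-membership check. A sketch, with hypothetical system names and class labels:

```python
# "Restricted data can only live in N approved systems." Anything observed
# outside its allowed set is an incident. All names are hypothetical.
ALLOWED = {
    "restricted": {"prod-postgres", "vault"},
    "sensitive":  {"prod-postgres", "vault", "warehouse"},
}

def violations(observed):
    """Return (system, class) pairs where data sits outside its boundary."""
    out = []
    for system, classes in observed.items():
        for cls in classes:
            if cls in ALLOWED and system not in ALLOWED[cls]:
                out.append((system, cls))
    return sorted(out)

# Restricted data found in staging is a violation; in prod-postgres it is not.
print(violations({"prod-postgres": {"restricted"}, "staging-db": {"restricted"}}))
```

Note what this does not require: DLP on every endpoint. It requires knowing the allowed set and comparing observations against it.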

Deletion as an actual process. When a customer leaves, their data is deleted on a schedule, and the deletion is confirmed. When an employee’s Slack history has hit its retention limit, it’s removed from backups too — not just from the live workspace.
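"Deleted on a schedule, and the deletion is confirmed" implies something enumerates what is overdue. A sketch of that enumeration step, with a hypothetical 90-day retention limit and hypothetical record fields:

```python
from datetime import datetime, timedelta, timezone

# Retention as an enforced schedule: flag records past the limit so the
# deletion job can run and be confirmed. The 90-day limit and the record
# shape are hypothetical, not from the lesson.
RETENTION = timedelta(days=90)

def overdue_ids(records, now):
    """Return ids of records older than the retention limit."""
    return [r["id"] for r in records if now - r["created"] > RETENTION]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
records = [
    {"id": "msg-1", "created": datetime(2024, 1, 1, tzinfo=timezone.utc)},  # ~150 days old
    {"id": "msg-2", "created": datetime(2024, 5, 1, tzinfo=timezone.utc)},  # within retention
]
print(overdue_ids(records, now))
```

The same sweep has to run against backups, or the live workspace is clean while the backups quietly violate the retention policy.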

Anchor: Samsung, April 2023

Between March 30 and April 19, 2023, Samsung semiconductor engineers pasted sensitive source code and internal meeting transcripts into ChatGPT on three separate occasions. In one case, an engineer pasted proprietary code they wanted debugged; in another, a database query they wanted optimized; in a third, a recording of an executive meeting they wanted summarized into notes.

In each case, the data left Samsung’s perimeter and entered OpenAI’s retention. Samsung banned generative AI tools internally within weeks.

The coverage treated this as an AI governance story. It wasn’t. It was a data governance story wearing an AI costume. The same problem would apply if an engineer pasted source code into a free Pastebin, an unmanaged Google Doc, a personal email, or a support ticket with an external vendor. The problem is that regulated content left the boundary, and there was no control — technical or process — that noticed or prevented it.

The fix is not “ban AI.” Samsung tried that; the ban held for about a year before the company concluded it needed to build its own internal LLM. The fix is to decide, deliberately, where regulated content is allowed to go, and to enforce that through a combination of technical controls (sanctioned AI tools with data-retention guarantees, DLP rules, browser isolation) and clear policy. Data governance is not a project you finish. It’s a boundary you maintain as new tools show up — and new tools will keep showing up.