When good databases go bad

Dealing with database corruption

Ants Aasma, Cybertec

Introductions

About me

  • Currently I work for Cybertec, helping others run PostgreSQL.
    • High-availability setups
    • Automated deployment
    • Making PostgreSQL run FAST.
    • Fixing up databases that have gone pear-shaped.
  • Previously a software architect at the Estonian Ministry of the Interior.
  • ants@cybertec.at

The problem statement

  • PostgreSQL is a very reliable piece of software
  • But sometimes Bad Stuff just happens
  • Being prepared will make the difference.

    ERROR: invalid page in block 572319 of relation base/123456/234567
    ERROR: invalid memory alloc request size 18446744073709551614

Overview of the talk

  • What can go wrong
  • What to do when corruption does happen
  • How to recover
  • Keeping out of trouble

What can go wrong?

Hardware failures

  • Broken hardware means broken databases.
  • Lots of stuff can break your server, but for data integrity we can make do with a simplified view.
    • CPU
    • RAM
    • Network
    • Storage
    • Power supply

CPU

  • Rarely the issue
  • Mostly causes crashes when it is

RAM

  • Failures are rather frequent.
  • Occasional single-bit errors ("cosmic rays") are rarer than broken chips (e.g. a row line stuck at 1).
  • Can silently corrupt large amounts of data.
  • ECC mostly fixes it.

Network

  • TCP/IP checksums by and large take care of the data integrity issues.
  • Use SSL if you need protection from nefarious agents or for a belt-and-suspenders approach.
  • More about availability than integrity.

Storage

  • Modern drives are on the edge of not working.
    • Raw media bit error rate is about 1:1000.
    • An average 4 kB block read therefore sees around 32 bit errors.
  • SSDs are not any better.
  • Error correction codes mostly fix it under the covers by using 100 bytes of ECC per 4k block.
    • Except when they don't.

Bad blocks

  • Outside of specialized data recovery shops, that data is gone.
  • Get your database off that drive ASAP, before further damage occurs!

Silent data corruption

"Of the total sample of 1.53 million disks, 3855 disks developed checksum mismatches – 3088 of the 358,000 nearline disks (0.86%) and 767 of the 1.17 million enterprise class disks (0.065%)." - An Analysis of Data Corruption in the Storage Stack

  • Misdirected writes.
  • Writes replaced with zeroes.
  • Torn writes.
  • Bit errors and garbage data are pretty rare.
  • When damage does occur, there is likely to be a lot of it.

Power supply

  • In theory losing power will not cause data loss.
  • In practice...
    • Some drives, especially cheap SSDs, lie about fsync
      • If pg_test_fsync numbers are too good to be true, they probably are (see the sketch below).
    • PostgreSQL crash recovery does not get as much testing as normal operation.
    • Spectacular power supply failures can fry your whole RAID set at once.
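
A rough sanity check of flush behaviour, using the pg_test_fsync tool that ships with PostgreSQL (the path and duration below are only examples). A 7200 rpm disk can physically complete only about 120 flushes per second, so tens of thousands of flushes per second without a non-volatile write cache deserve suspicion.

    # Run on the same volume as the data directory; path and duration are examples.
    pg_test_fsync -f /var/lib/postgresql/testfile -s 5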

Software failures

  • Filesystem, storage driver, hypervisor, etc.
  • PostgreSQL
  • Backup and replication tools
  • Firmware and other magic hidden in your storage system

Operator failures

  • Incorrect (or missing) backup practices
  • Accidental DROP TABLE
  • Overzealous optimization
    • Do NOT touch fsync and full_page_writes
  • Multiple servers accessing a single data dir.
  • "Why are all these logs are taking up disk space? They're not even readable."

What to do when stuff has hit the fan?

Keep calm

  • It's easy to cause more issues by making rash decisions.
  • Try not to burn any bridges.

Keep notes

  • During times of stress it's easy to forget stuff and overlook things.
  • Write down all error messages, steps you have tried, etc.
  • It's extremely helpful when you need to get external help.

First, do no harm

  • Take a physical backup before you make things worse by trying to fix them.
    • Preferably an offline backup.
    • If an offline backup is not possible, verify that the backup you take is actually usable before continuing.
  • Arrange for WALs to be kept around.
  • Turn off any automated failover software.
  • That way, if you do screw up, you can try again.
  • In most cases disable application access while you are working.
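
A minimal sketch of the "backup before surgery" step, assuming you can afford a short outage (paths are examples):

    # Stop the server and take a verbatim copy of the data directory.
    pg_ctl -D "$PGDATA" stop -m fast
    rsync -a "$PGDATA"/ /backup/pgdata-before-surgery/
    # Keep any archived WAL as well, and verify the copy before proceeding.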

Find and fix the root cause

  • Every case of corruption is different.
  • Figure out what is causing the problem and try to eliminate it.
  • Establish that things are sane from the bottom up.
    • Is the hardware running ok?
    • Any interesting error messages in system logs?
    • Is the data readable?

      find "$PGDATA" -type f -print0 | xargs -0 md5sum
    • Any recent crashes?
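
For the hardware and system-log questions above, something along these lines is a reasonable start on Linux (smartctl requires smartmontools; the device name is an example):

    # Kernel, storage and memory complaints since boot.
    dmesg | grep -iE 'error|fault|i/o' | tail -n 50
    # SMART counters that commonly indicate a dying drive.
    smartctl -a /dev/sda | grep -iE 'reallocated|pending|crc'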

Establish a plan

  • You need to figure out what steps to take to get back to a correctly running database.
  • What is the last usable backup?
  • Do you have WAL to do PITR from that point?
  • Is it more important to get the database up quickly, or to retain all the data?
  • Can you fix it in an automated fashion, or do you need manual fixups?

How to recover

Index corruption

  • Complex data structures with tricky locking, so the most likely place for bugs.
  • Errors will manifest as strange query output or infinite loops.
    • Broken indexes on system catalogs can result in very strange errors.
  • Can be fixed by rebuilding the index using REINDEX (see the sketch below).
    • For unique indexes you may get constraint violations. These will need manual fixing.
  • Ensure you are using the latest version and your hardware is working correctly.
    • If everything else is working correctly, report a bug.
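
A rough sketch of the REINDEX route; the object names (mydb, my_table, my_index, key_col) are placeholders, and REINDEX takes locks that block writes to the table, so plan accordingly.

    psql mydb -c "REINDEX INDEX my_index"      # a single suspect index
    psql mydb -c "REINDEX TABLE my_table"      # everything on one table
    psql mydb -c "REINDEX SYSTEM mydb"         # system catalog indexes (superuser)
    # If a unique index now refuses to build, find the duplicates first:
    psql mydb -c "SELECT key_col, count(*) FROM my_table GROUP BY key_col HAVING count(*) > 1"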

Table corruption

  • Most likely you will lose data.
  • Can take many forms, depending on where the bogus data is.
    • Garbage in the page header usually results in the whole page becoming unreadable.
    • Garbage in the tuple header (bogus xmin/xmax) causes clog/multixact misses.
    • Garbage in the row itself can be silently returned, cause memory allocation failures, or even crash the backend.
    • Badly linked row versions are also possible.
  • Dump and restore is the best bet (a sketch follows below).
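
When going the dump-and-restore route, dumping table by table keeps one damaged table from sinking the whole backup; a minimal sketch with made-up names:

    # Dump intact tables normally; the damaged table needs the bisection approach described next.
    pg_dump mydb -Fc -t my_schema.good_table -f good_table.dump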

Handling pg_dump failure

  • If dump fails with errors, bisect around the broken data.
  • I have a little tool that extracts everything readable:
    https://github.com/ants/pg-recovery-tools/blob/master/trycopy.py
  • Tries to copy out a table in recursively smaller batches until it identifies the page and line pointer numbers that are causing errors or crashes, and logs them for detailed study.
    • pageinspect can provide some useful information (see the sketch below).
    • Taking a quick look with a hex editor may also be useful.
  • CTID scans for PostgreSQL would make this tool a lot faster...
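
A sketch of taking that closer look with pageinspect and a raw hex dump; the table name is made up and the block number is the one from the error message at the start of the talk.

    psql mydb -c "CREATE EXTENSION IF NOT EXISTS pageinspect"
    psql mydb -c "SELECT * FROM page_header(get_raw_page('my_table', 572319))"
    psql mydb -c "SELECT lp, lp_len, t_xmin, t_xmax, t_ctid FROM heap_page_items(get_raw_page('my_table', 572319))"
    # Raw 8 kB page for a hex editor; relations are split into 1 GB segment files,
    # so block 572319 lives in segment .4 at offset 572319 - 4*131072 = 48031.
    dd if="$PGDATA"/base/123456/234567.4 bs=8192 skip=48031 count=1 | xxd | less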

Clog corruption

  • Invalid clog data leads to duplicate or missing rows.
  • Missing CLOG pages means you can't access rows that reference them.
    • More likely to be caused by heap corruption and bogus XIDs.
    • To rescue data you can fake the file by filling it with 0x55 (everything committed); see the sketch below. Expect manual cleanup of the extracted data.
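
A sketch of faking a clog segment, with the server stopped; the segment name is just an example (the directory is pg_xact on newer releases). 0x55 is binary 01010101, i.e. every 2-bit transaction status set to "committed".

    # Each clog segment is 256 kB; fill a missing or broken one with 0x55 bytes.
    dd if=/dev/zero bs=256k count=1 | tr '\0' '\125' > "$PGDATA"/pg_clog/0123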

pg_resetxlog

  • pg_resetxlog WILL break your database.
  • Use it when the database does not come up and you want to extract at least some data.
  • A dump/restore is required after using it (rough sequence sketched below).
  • I have a little tool that allows you to replace error-causing records with empty ones:
    https://github.com/ants/pg-recovery-tools/blob/master/xlogfilter.py
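
A rough last-resort sequence (paths are examples); treat everything extracted afterwards as suspect:

    pg_resetxlog -f "$PGDATA"            # forcibly reset WAL and control data
    pg_ctl -D "$PGDATA" start
    pg_dumpall > /backup/rescued.sql     # salvage what you can
    # Then initdb a fresh cluster, restore the dump and sanity-check the data.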

How to keep yourself out of trouble?

Keep up to date on updates

  • New versions fix bugs.
    • You are more likely to have an issue due to not installing a bugfix release than installing it.
  • Recently had a client with corruption on a PostgreSQL 7.4.1 instance.
  • Upgrade your kernel too.

Backups

  • Take regularly scheduled backups.
  • Keep as many as you can afford. Have a retention schedule.
    • x daily backups, y weekly backups, z monthly backups
  • Test restoring from a backup.
  • Use WAL archiving for PITR capability.
    • pg_receivexlog is very useful
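
A minimal sketch of a scheduled base backup plus continuous WAL streaming; hosts, users and paths are examples:

    pg_basebackup -h db-host -U replication -D /backup/base/$(date +%F) -X fetch
    # Keep this running (e.g. under a service manager) for PITR-capable WAL archives:
    pg_receivexlog -h db-host -U replication -D /backup/wal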

Don't use flaky hardware

  • Not using ECC memory is just asking for trouble.

    "For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year." - DRAM Errors in the Wild: A Large-Scale Field Study

  • Consumer level storage devices play fast and loose with your data.
  • Overclocking your database server is a bad idea.

Checksums

  • WAL is always checksummed.
  • You can turn on data checksums at initdb time.
    • Provides early detection, if you do regular scrubbing.
  • Clog has no checksums :(
  • BTRFS or ZFS will also give you checksums.
    • Huge throughput hit (50%), even on SSDs.
    • BTRFS has horrible behaviors under heavy write loads.
    • If your workload is light, the nice features may still be worth it.
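
Data checksums have to be selected when the cluster is created; a minimal sketch (the data directory path is an example):

    initdb --data-checksums -D /var/lib/postgresql/data
    # An existing cluster will tell you whether they are enabled:
    psql -c "SHOW data_checksums"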

Redundancy

  • RAID1 and RAID5 can save you from the occasional bad block or failed drive.
  • Having a streaming replica will handle storage failures AND can also save your skin from OS level failures.
    • Redundant Array of Inexpensive Servers
  • PostgreSQL bugs are likely to cause corruption that gets replicated to standbys.

Logical backups

  • Run logical backups on a regular schedule.
  • If you can't afford the throughput hit and the long-held snapshot on the master server, use a hot standby.
    • hot_standby_feedback = off
    • max_standby_streaming_delay / max_standby_archive_delay = -1
    • You probably don't want to use a high-availability standby for this.
  • Logical backups also verify that everything in the database is still readable (see the sketch below).
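
A minimal sketch of running the logical backup against a dedicated standby, which doubles as a full read-through of the data (host and paths are examples):

    pg_dump -h reporting-standby -Fc -f /backup/logical/mydb-$(date +%F).dump mydb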

Putting it all together

Recipe for retaining your data

  • Have backups available.
  • Keep on top of bug fixes.
  • Scrub your data.
  • When things go south, keep calm, keep notes, and have a plan.

That is all

Questions?