When good databases go bad

Dealing with database corruption

Ants Aasma, Cybertec

Introductions

About me

  • Currently I work for Cybertec, helping others run PostgreSQL.
    • High-availability setups
    • Automated deployment
    • Making PostgreSQL run FAST.
    • Fixing up databases that have gone pear-shaped.
  • Previously a software architect at the Estonian Ministry of the Interior.
  • ants@cybertec.at

The problem statement

  • PostgreSQL is a very reliable piece of software
  • But sometimes Bad Stuff just happens
  • Being prepared will make the difference.

    ERROR: invalid page in block 572319 of relation base/123456/234567
    ERROR: invalid memory alloc request size 18446744073709551614

Overview of the talk

  • What can go wrong
  • What to do when corruption does happen
  • How to recover
  • Keeping out of trouble

What can go wrong?

Hardware failures

  • Broken hardware means broken databases.
  • Lots of stuff can break your server, but for data integrity we can make do with a simplified view.
    • CPU
    • RAM
    • Network
    • Storage
    • Power supply

CPU

  • Rarely the issue
  • Mostly causes crashes when it is

RAM

  • Failures are rather frequent.
  • Occasional single-bit errors ("cosmic rays") are rarer than broken chips (e.g. a row line stuck at 1).
  • Can silently corrupt large amounts of data.
  • ECC mostly fixes it.

Network

  • TCP/IP checksums by and large take care of the data integrity issues.
  • Use SSL if you need protection from nefarious agents or for a belt-and-suspenders approach.
  • More about availability than integrity.

Storage

  • Modern drives are on the edge of not working.
    • Raw media bit error rate is about 1:1000.
    • An average 4 kB block read therefore sees around 32 bit errors.
  • SSDs are not any better.
  • Error correction codes mostly fix it under the covers by using 100 bytes of ECC per 4k block.
    • Except when they don't.

Bad blocks

  • Outside of specialized data recovery shops, that data is gone.
  • Get your database off that drive ASAP, before further damage occurs!

Silent data corruption

"Of the total sample of 1.53 million disks, 3855 disks developed checksum mismatches – 3088 of the 358,000 nearline disks (0.86%) and 767 of the 1.17 million enterprise class disks (0.065%)." - An Analysis of Data Corruption in the Storage Stack

  • Misdirected writes.
  • Writes replaced with zeroes.
  • Torn writes.
  • Bit errors and garbage data are pretty rare.
  • When damage does occur, there is likely to be a lot of it.

Power supply

  • In theory losing power will not cause data loss.
  • In practice...
    • Some drives, especially cheap SSDs, lie about fsync
      • If pg_test_fsync numbers are too good to be true, they probably are (see the sketch below).
    • PostgreSQL crash recovery does not get as much testing as normal operation.
    • Spectacular power supply failures can fry your whole RAID set at once.
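
A rough sanity check of flush behaviour, using the pg_test_fsync tool that ships with PostgreSQL (the path and duration below are only examples). A 7200 rpm disk can physically complete only about 120 flushes per second, so tens of thousands of flushes per second without a non-volatile write cache deserve suspicion.

    # Run on the same volume as the data directory; path and duration are examples.
    pg_test_fsync -f /var/lib/postgresql/testfile -s 5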

Software failures

  • Filesystem, storage driver, hypervisor, etc.
  • PostgreSQL
  • Backup and replication tools
  • Firmware and other magic hidden in your storage system

Operator failures

  • Incorrect (or missing) backup practices
  • Accidental DROP TABLE
  • Overzealous optimization
    • Do NOT touch fsync and full_page_writes
  • Multiple servers accessing a single data dir.
  • "Why are all these logs are taking up disk space? They're not even readable."

What to do when stuff has hit the fan?

Keep calm

  • It's easy to cause more issues by making rash decisions.
  • Try not to burn any bridges.

Keep notes

  • During times of stress it's easy to forget stuff and overlook things.
  • Write down all error messages, steps you have tried, etc.
  • It's extremely helpful when you need to get external help.

First, do no harm

  • Take a physical backup before you make things worse by trying to fix them.
    • Preferably an offline backup.
    • If an offline backup is not possible, verify that the backup you take is actually usable before continuing.
  • Arrange for WALs to be kept around.
  • Turn off any automated failover software.
  • That way, if you do screw up, you can try again.
  • In most cases disable application access while you are working.
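
A minimal sketch of the "backup before surgery" step, assuming you can afford a short outage (paths are examples):

    # Stop the server and take a verbatim copy of the data directory.
    pg_ctl -D "$PGDATA" stop -m fast
    rsync -a "$PGDATA"/ /backup/pgdata-before-surgery/
    # Keep any archived WAL as well, and verify the copy before proceeding.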

Find and fix the root cause

  • Every case of corruption is different.
  • Figure out what is causing the problem and try to eliminate it.
  • Establish that things are sane from the bottom up.
    • Is the hardware running ok?
    • Any interesting error messages in system logs?
    • Is the data readable?

      find "$PGDATA" -type f -print0 | xargs -0 md5sum
    • Any recent crashes?
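
For the hardware and system-log questions above, something along these lines is a reasonable start on Linux (smartctl requires smartmontools; the device name is an example):

    # Kernel, storage and memory complaints since boot.
    dmesg | grep -iE 'error|fault|i/o' | tail -n 50
    # SMART counters that commonly indicate a dying drive.
    smartctl -a /dev/sda | grep -iE 'reallocated|pending|crc'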

Establish a plan

  • You need to figure out what steps to take to get back to a correctly running database.
  • What is the last usable backup?
  • Do you have WAL to do PITR from that point?
  • Is it more important to get the database up quickly, or to retain all the data?
  • Can you fix it in an automated fashion, or do you need manual fixups?

How to recover

Index corruption

  • Complex data structures with tricky locking, so the most likely place for bugs.
  • Errors will manifest as strange query output or infinite loops.
    • Broken indexes on system catalogs can result in very strange errors.
  • Can be fixed by rebuilding the index using REINDEX (see the sketch below).
    • For unique indexes you may get constraint violations. These will need manual fixing.
  • Ensure you are using the latest version and your hardware is working correctly.
    • If everything else is working correctly, report a bug.
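
A rough sketch of the REINDEX route; the object names (mydb, my_table, my_index, key_col) are placeholders, and REINDEX takes locks that block writes to the table, so plan accordingly.

    psql mydb -c "REINDEX INDEX my_index"      # a single suspect index
    psql mydb -c "REINDEX TABLE my_table"      # everything on one table
    psql mydb -c "REINDEX SYSTEM mydb"         # system catalog indexes (superuser)
    # If a unique index now refuses to build, find the duplicates first:
    psql mydb -c "SELECT key_col, count(*) FROM my_table GROUP BY key_col HAVING count(*) > 1"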

Table corruption

  • Most likely you will lose data.
  • Can take many forms, depending on where the bogus data is.
    • Garbage in the page header usually results in the whole page becoming unreadable.
    • Garbage in the tuple header (bogus xmin/xmax) causes clog/multixact misses.
    • Garbage in the row itself can be silently returned, cause memory allocation failures, or even crash the backend.
    • Badly linked row versions are also possible.
  • Dump and restore is the best bet (a sketch follows below).
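
When going the dump-and-restore route, dumping table by table keeps one damaged table from sinking the whole backup; a minimal sketch with made-up names:

    # Dump intact tables normally; the damaged table needs the bisection approach described next.
    pg_dump mydb -Fc -t my_schema.good_table -f good_table.dump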

Handling pg_dump failure

  • If dump fails with errors, bisect around the broken data.
  • I have a little tool that extracts everything readable:
    https://github.com/ants/pg-recovery-tools/blob/master/trycopy.py
  • Tries to copy out a table in recursively smaller batches until it identifies the page and line pointer numbers that are causing errors or crashes, and logs them for detailed study.
    • pageinspect can provide some useful information (see the sketch below).
    • Taking a quick look with a hex editor may also be useful.
  • CTID scans for PostgreSQL would make this tool a lot faster...
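
A sketch of taking that closer look with pageinspect and a raw hex dump; the table name is made up and the block number is the one from the error message at the start of the talk.

    psql mydb -c "CREATE EXTENSION IF NOT EXISTS pageinspect"
    psql mydb -c "SELECT * FROM page_header(get_raw_page('my_table', 572319))"
    psql mydb -c "SELECT lp, lp_len, t_xmin, t_xmax, t_ctid FROM heap_page_items(get_raw_page('my_table', 572319))"
    # Raw 8 kB page for a hex editor; relations are split into 1 GB segment files,
    # so block 572319 lives in segment .4 at offset 572319 - 4*131072 = 48031.
    dd if="$PGDATA"/base/123456/234567.4 bs=8192 skip=48031 count=1 | xxd | less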

Clog corruption

  • Invalid clog data leads to duplicate or missing rows.
  • Missing CLOG pages means you can't access rows that reference them.
    • More likely to be caused by heap corruption and bogus XIDs.
    • To rescue data you can fake the file by filling it with 0x55 (everything committed); see the sketch below. Expect manual cleanup of the extracted data.
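
A sketch of faking a clog segment, with the server stopped; the segment name is just an example (the directory is pg_xact on newer releases). 0x55 is binary 01010101, i.e. every 2-bit transaction status set to "committed".

    # Each clog segment is 256 kB; fill a missing or broken one with 0x55 bytes.
    dd if=/dev/zero bs=256k count=1 | tr '\0' '\125' > "$PGDATA"/pg_clog/0123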

pg_resetxlog

  • pg_resetxlog WILL break your database.
  • Use it when the database does not come up and you want to extract at least some data.
  • A dump/restore is required after using it (rough sequence sketched below).
  • I have a little tool that allows you to replace error-causing records with empty ones:
    https://github.com/ants/pg-recovery-tools/blob/master/xlogfilter.py
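
A rough last-resort sequence (paths are examples); treat everything extracted afterwards as suspect:

    pg_resetxlog -f "$PGDATA"            # forcibly reset WAL and control data
    pg_ctl -D "$PGDATA" start
    pg_dumpall > /backup/rescued.sql     # salvage what you can
    # Then initdb a fresh cluster, restore the dump and sanity-check the data.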

How to keep yourself out of trouble?

Keep up to date on updates

  • New versions fix bugs.
    • You are more likely to have an issue due to not installing a bugfix release than installing it.
  • Recently had a client with corruption on a PostgreSQL 7.4.1 instance.
  • Upgrade your kernel too.

Backups

  • Take regularly scheduled backups.
  • Keep as many as you can afford. Have a retention schedule.
    • x daily backups, y weekly backups, z monthly backups
  • Test restoring from a backup.
  • Use WAL archiving for PITR capability.
    • pg_receivexlog is very useful
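
A minimal sketch of a scheduled base backup plus continuous WAL streaming; hosts, users and paths are examples:

    pg_basebackup -h db-host -U replication -D /backup/base/$(date +%F) -X fetch
    # Keep this running (e.g. under a service manager) for PITR-capable WAL archives:
    pg_receivexlog -h db-host -U replication -D /backup/wal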

Don't use flaky hardware

  • Not using ECC memory is just asking for trouble.

    "For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year." - DRAM Errors in the Wild: A Large-Scale Field Study

  • Consumer level storage devices play fast and loose with your data.
  • Overclocking your database server is a bad idea.

Checksums

  • WAL is always checksummed.
  • You can turn on data checksums at initdb time.
    • Provides early detection, if you do regular scrubbing.
  • Clog has no checksums :(
  • BTRFS or ZFS will also give you checksums.
    • Huge throughput hit (50%), even on SSDs.
    • BTRFS has horrible behaviors under heavy write loads.
    • If your workload is light, the nice features may still be worth it.
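
Data checksums have to be selected when the cluster is created; a minimal sketch (the data directory path is an example):

    initdb --data-checksums -D /var/lib/postgresql/data
    # An existing cluster will tell you whether they are enabled:
    psql -c "SHOW data_checksums"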

Redundancy

  • RAID1 and RAID5 can save you from the occasional bad block or failed drive.
  • Having a streaming replica will handle storage failures AND can also save your skin from OS level failures.
    • Redundant Array of Inexpensive Servers
  • PostgreSQL bugs are likely to cause corruption that gets replicated to standbys.

Logical backups

  • Run logical backups on a regular schedule.
  • If you can't afford the throughput hit and the long-held snapshot on the master server, use a hot standby.
    • hot_standby_feedback = off
    • max_standby_streaming_delay / max_standby_archive_delay = -1
    • You probably don't want to use a high-availability standby for this.
  • Logical backups also verify that everything in the database is still readable (see the sketch below).
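
A minimal sketch of running the logical backup against a dedicated standby, which doubles as a full read-through of the data (host and paths are examples):

    pg_dump -h reporting-standby -Fc -f /backup/logical/mydb-$(date +%F).dump mydb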

Putting it all together

Recipe for retaining your data

  • Have backups available.
  • Keep on top of bug fixes.
  • Scrub your data.
  • When things go south, keep calm, keep notes, and have a plan.

That is all

Questions?