How ZFS pool imports failed with checksum errors after a power outage and the scrub+import strategy that recovered datasets

Unexpected system failures like power outages can wreak havoc on any file system, but their impact is particularly challenging when it comes to advanced storage platforms such as ZFS. ZFS, known for its robustness and end-to-end data integrity, can still present difficulties in recovery scenarios—especially when pools become corrupted or unimportable. Recently, a critical incident demonstrated how checksum errors prevented pool imports after a sudden power loss, but a precise strategy involving a pool scrub and a safe import sequence led to full recovery of the datasets.

TLDR: A power outage caused a ZFS storage pool to become unimportable due to persistent checksum errors. Initial import attempts failed with critical errors, giving the impression of possible widespread data corruption. However, by using a strategy that involved a read-only import followed by a meticulous scrub process and carefully planned full import, the pool was successfully restored without data loss. This article explains the recovery steps in detail and outlines best practices for ZFS disaster scenarios.

Background: The Power Outage and Initial Failure

The affected environment was a production-grade server running ZFS on a RAID-Z2 pool across eight 8TB enterprise SATA drives. The server lost power unexpectedly when its UPS battery failed, cutting power to the drives mid-write. On reboot, early attempts to import the tank pool with the standard zpool import command produced alarming messages like:


cannot import 'tank': one or more devices is currently unavailable
cannot verify metadata checksums for dataset 'tank'

The verbose output revealed that checksum errors were preventing the pool from being recognized as clean. While ZFS’s self-healing is effective under normal circumstances, this incident left the system in a state where even fundamental metadata could not be trusted.
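
The exact diagnostic commands are not reproduced in the original logs; a typical first pass at this stage, before attempting anything that writes to the pool, is to let zpool list what it can see (the device directory below is only an example):

zpool import                      # scan the default device paths and list importable pools with their device state
zpool import -d /dev/disk/by-id   # example: restrict the scan to persistent device names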

Identifying the Errors

Using the zdb tool, several block pointers showed checksum mismatches, primarily pointing to corrupted indirect blocks rather than actual file contents. This led to the hypothesis that transient write activity during the power loss had left some parts of metadata in an inconsistent state, while file data was still likely intact.
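
The specific zdb invocations are not quoted in the write-up; a representative read-only inspection of the still-exported pool might look like the following (all of these only read from the devices):

zdb -e -C tank      # reconstruct and print the pool configuration from the device labels
zdb -e -u tank      # show the active uberblock (transaction group and timestamp at the moment of power loss)
zdb -e -bcc tank    # walk the block-pointer tree and verify checksums; slow, but pinpoints damaged blocks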

Key Symptoms:

  • Checksum mismatch errors during import
  • Failure to import even in readonly or -F rollback modes
  • zpool status -v returning incomplete or misleading results due to unimported pool state
  • Some disks reporting outdated or missing labels (see the label check below)
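
For the label symptom in particular, the on-disk labels can be dumped directly; each member device carries four label copies, and they should agree on the pool GUID and show a recent transaction group (the device path below is illustrative):

zdb -l /dev/disk/by-id/ata-EXAMPLE-8TB-part1   # print labels 0-3 from one pool member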

The First Strategy: Read-Only Import

To mitigate further risk, a read-only import was attempted using the following command:

zpool import -o readonly=on tank

This import succeeded, confirming that most of the pool structure was intact. More importantly, it allowed inspection with zpool status and zfs list without modifying any on-disk data. At this point, most datasets were visible, but deeper reads from some of them returned input/output (I/O) errors caused by the unresolved checksum failures.
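
The inspection commands themselves are not listed in the original account; a typical non-destructive pass once the read-only import succeeds would be along these lines:

zpool status -v tank                                  # device tree, per-vdev error counters, and any flagged files
zfs list -r -o name,used,referenced,mountpoint tank   # confirm all datasets are visible and space accounting is sane
zpool get readonly tank                               # confirm the pool itself is imported read-only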

Running a Scrub While Read-Only

A scrub operation in ZFS reads every allocated block and verifies its checksum, repairing errors from redundancy where it can. However, a scrub cannot be started on a pool that is imported read-only. The team therefore took the following cautious approach:

  1. Unmounted all filesystems manually using zfs unmount -a
  2. Exported the pool cleanly to ensure no further metadata operations
  3. Re-imported the pool read-write, first running the import with the -nFX options as a dry run to confirm that a transaction rollback was viable before committing to the writable import (see the command sketch below)
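
A hedged reconstruction of that sequence, based on the steps described above (the final line shows the plain writable import; the -FX recovery flags would only be added if the dry run indicated a rewind was actually required):

zfs unmount -a           # detach all mounted ZFS filesystems
zpool export tank        # release the read-only import cleanly
zpool import -nFX tank   # dry run: report whether a rewind to an earlier transaction group would succeed, without changing anything
zpool import tank        # the actual writable import; add -FX only if the dry run showed a rewind was needed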

The critical moment came next. With the pool now writable and seemingly stable, zpool scrub tank was initiated. Over the next 13 hours, the system scanned all blocks, and surprisingly, several hundred blocks were automatically repaired using parity data from RAID-Z2 redundancy.
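
Scrub progress and repair activity can be watched from another terminal while it runs; the exact output format varies between OpenZFS releases:

zpool scrub tank        # start the full verification pass
zpool status -v tank    # shows scan progress, estimated completion time, and bytes repaired so far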

Recovering Full Access to Datasets

Following a successful scrub, the import was redone with a standard command without read-only or rollback modifiers:

zpool import tank

This time, no checksum errors appeared. All datasets came online, and user data was accessible with no apparent loss. File checksums matched previously recorded hashes where those were known, and access logs correlated with the outage window were confirmed intact.
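
The hash comparison mentioned above presumes that checksums had been recorded before the outage; a minimal sketch of that kind of spot check, with the manifest path and mountpoint purely hypothetical:

cd /tank/projects                               # hypothetical dataset mountpoint
sha256sum -c /root/manifests/projects.sha256    # compare current file contents against the previously saved manifest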

Lessons Learned

This incident highlighted both the strengths and vulnerabilities of ZFS in handling real-world hardware failures. While the checksum mechanism preserved data integrity, the inability to import due to metadata errors posed a significant challenge. The recovery reaffirmed the importance of having multiple layered strategies.

Key Recovery Lessons:

  • Always try a read-only import first to assess the pool’s state without risk
  • Use zdb and other read-only inspection tools before touching on-disk state
  • Leverage RAID-Z parity via a scrub to recover damaged metadata when possible
  • Never rush into a rollback import (-F/-X) unless you are certain that discarding the most recent transactions is safe
  • Use a UPS, and ensure its batteries are tested and replaced on schedule

Best Practices Going Forward

Based on this experience, several action items were implemented to prevent recurrence and improve resilience:

  • The scrub schedule was shortened from monthly to weekly (every 7 days)
  • smartmontools was upgraded to detect pre-failure disk symptoms proactively
  • A new offsite backup strategy was developed using zfs send|recv to a remote incremental mirror (a sketch follows this list)
  • Administrators now monitor zpool status output weekly, even in stable systems
  • Review procedures enforce safe shutdown policies before hardware maintenance
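
A sketch of how the weekly scrub and the offsite send/receive mirror might be wired together; the schedule, host name, and backup pool layout below are illustrative rather than taken from the actual deployment:

# root crontab entries (illustrative)
0 3 * * 0  /sbin/zpool scrub tank                                     # weekly scrub, Sunday 03:00
0 8 * * 1  /sbin/zpool status -x | mail -s "weekly zpool check" root  # Monday report; -x prints only pools with problems

# incremental replication to a remote mirror (snapshot names are placeholders)
zfs snapshot -r tank@weekly-new
zfs send -R -i tank@weekly-prev tank@weekly-new | ssh backup-host zfs recv -Fdu backup/tank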

Final Thoughts

ZFS continues to be one of the most resilient filesystems available, but as this case shows, it is not immune to edge cases arising from hardware instability or poorly managed infrastructure. The solution lay not in blind recovery attempts, but in patient, tool-assisted analysis and deliberate, staged maintenance operations. Every ZFS administrator should become familiar with deep recovery scenarios and plan for low-level recovery situations before they happen.

With redundancy, careful logging, and a rational diagnostic approach, even unimportable pools can be brought back to life.