When a disk in your ZFS pool fails, the pool enters a DEGRADED state. ZFS can recover from this through resilvering: after you replace the failed disk with a new one, ZFS rebuilds the missing data onto it from the remaining redundant copies. This article explains how to replace a failed disk and complete the resilvering process in a Proxmox ZFS pool.
To confirm the failure, check the pool status. The state line should read DEGRADED:
zpool status
Example output:
pool: rpool
state: DEGRADED
status: One or more devices could not be used because the label is missing or invalid. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
From the zpool status output, note the ID of the failed disk. Because the device is missing, ZFS may list it by its numeric GUID instead of its name, for example:
5446107257933431427 UNAVAIL 0 0 0
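If the pool has many disks, you can pull the failed entries out of the status output with a short filter. The sketch below assumes the usual zpool status layout, where the second column of each device line is its state; failed_vdevs is a hypothetical helper name, not a ZFS command.

```shell
#!/bin/sh
# failed_vdevs: hypothetical helper (not a ZFS command).
# Reads `zpool status` output on stdin and prints the first column
# (device name or numeric GUID) of every line whose second column,
# the device state, is UNAVAIL, FAULTED or REMOVED.
failed_vdevs() {
    awk '$2 ~ /^(UNAVAIL|FAULTED|REMOVED)$/ { print $1 }'
}
```

For the example pool, zpool status rpool | failed_vdevs would print 5446107257933431427. The parent mirror, which is only DEGRADED, is not listed.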
Physically replace the failed disk with a new one. Verify that the new disk is recognized by the system:
ls -l /dev/disk/by-id/
Find the new disk’s ID (e.g., ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T0809124).
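If several disks are attached, a rough way to spot the new one is to list whole-disk IDs (skipping the -partN entries) that do not appear anywhere in the current zpool status output. This is a hypothetical heuristic based on plain string matching: alias links such as wwn- entries for disks already in the pool will also be listed, so confirm the model and serial number before using the result.

```shell
#!/bin/sh
# new_disk_candidates: hypothetical helper (not part of ZFS or Proxmox).
# stdin: names from /dev/disk/by-id/, one per line
# $1:    a file containing the output of `zpool status`
# Prints whole-disk IDs that are not mentioned in that output.
new_disk_candidates() {
    status_file=$1
    # Drop partition links such as ...-part1 or ...-part3
    grep -v -- '-part[0-9][0-9]*$' |
    while IFS= read -r id; do
        # -F: fixed-string match, -q: only set the exit status
        if ! grep -qF -- "$id" "$status_file"; then
            printf '%s\n' "$id"
        fi
    done
}
```

Typical use (the temporary file name is arbitrary):
zpool status > /tmp/zpool-status.txt
ls /dev/disk/by-id/ | new_disk_candidates /tmp/zpool-status.txt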
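If the new disk contains an existing filesystem or partition table, zpool replace may refuse to use it. Erase its existing signatures (this destroys any data on the disk, so double-check the ID first):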
wipefs -a /dev/disk/by-id/<new_disk_id>
Replace <new_disk_id> with the appropriate disk ID.
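Because wipefs -a erases every signature on the target, it is worth checking that the ID you typed does not belong to a disk that is still in the pool. The sketch below is a hypothetical safety wrapper, not a standard tool, and it only does a plain substring match against zpool status output. If you want to see what would be removed first, running wipefs without -a lists the signatures without erasing anything.

```shell
#!/bin/sh
# in_pool: hypothetical safety check (not part of ZFS).
# Reads `zpool status` output on stdin and succeeds if the disk ID
# given as $1 appears anywhere in it (plain substring match).
in_pool() {
    grep -qF -- "$1"
}

# guarded_wipe: hypothetical wrapper around `wipefs -a`.
# Refuses to wipe a disk whose ID is referenced by an imported pool.
guarded_wipe() {
    if zpool status | in_pool "$1"; then
        echo "refusing: $1 is referenced by zpool status" >&2
        return 1
    fi
    wipefs -a "/dev/disk/by-id/$1"   # erase all signatures
}
```

Call it as guarded_wipe <new_disk_id> after defining both functions in your shell.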
Use the zpool replace command to replace the failed disk with the new disk:
zpool replace rpool <failed_disk_id> /dev/disk/by-id/<new_disk_id>
Example:
zpool replace rpool 5446107257933431427 /dev/disk/by-id/ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T0809124
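To avoid copying the long IDs by hand, the replace command can be assembled from the status output. replace_cmd below is a hypothetical helper: it assumes exactly one device is UNAVAIL, FAULTED or REMOVED, and it only prints the command so you can review it before running it.

```shell
#!/bin/sh
# replace_cmd: hypothetical helper (not part of ZFS).
# $1: pool name, $2: new disk ID from /dev/disk/by-id/
# stdin: `zpool status` output for that pool.
# Prints the matching `zpool replace` command when exactly one
# device is UNAVAIL, FAULTED or REMOVED; otherwise fails.
replace_cmd() {
    pool=$1
    new_id=$2
    failed=$(awk '$2 ~ /^(UNAVAIL|FAULTED|REMOVED)$/ { print $1 }')
    # Count non-empty lines; an empty result counts as 0
    count=$(printf '%s' "$failed" | grep -c .)
    if [ "$count" -ne 1 ]; then
        echo "expected exactly one failed device, found $count" >&2
        return 1
    fi
    printf 'zpool replace %s %s /dev/disk/by-id/%s\n' "$pool" "$failed" "$new_id"
}
```

For the example pool, zpool status rpool | replace_cmd rpool ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T0809124 prints the same zpool replace command as the example above.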
After replacing the disk, ZFS will automatically start rebuilding the data (resilvering). Monitor the progress using:
zpool status
Example output:
scan: resilver in progress since Sat Dec 21 21:32:03 2024
10.9G / 10.9G scanned, 56.6M / 5.37G issued at 14.2M/s
21.2M resilvered, 1.03% done, 00:06:24 to go
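How long resilvering takes depends mainly on how much data is stored on the affected vdev, since ZFS copies only allocated blocks, and on disk speed and system load. The time estimate in the scan line is updated as the resilver progresses.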
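If you only want the progress figure, for example in a monitoring script, you can extract it from the scan line. resilver_percent is a hypothetical helper that assumes the "N% done" wording shown in the example output; the format may differ between ZFS versions.

```shell
#!/bin/sh
# resilver_percent: hypothetical helper (not part of ZFS).
# Reads `zpool status` output on stdin and prints the number that
# precedes "% done" on the scan progress line, if there is one.
resilver_percent() {
    sed -n 's/.* \([0-9.]*\)% done.*/\1/p'
}
```

With the example output above, zpool status rpool | resilver_percent prints 1.03. On OpenZFS 2.0 and later, zpool wait -t resilver rpool blocks until the resilver finishes, which is simpler than polling in scripts.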
Once the resilvering process is complete, verify that the pool has returned to a healthy state:
zpool status
All disks should be listed as ONLINE. Example output:
pool: rpool
state: ONLINE
scan: resilvered 5.37G in 0h7m with 0 errors on Sat Dec 21 21:39:01 2024
config:
  NAME                                                STATE     READ WRITE CKSUM
  rpool                                               ONLINE       0     0     0
    mirror-0                                          ONLINE       0     0     0
      ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T1080266-part3  ONLINE       0     0     0
      ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T1101796-part3  ONLINE       0     0     0
    mirror-1                                          ONLINE       0     0     0
      ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T1401427-part3  ONLINE       0     0     0
      ata-WDC_WD30EFRX-68AX9N0_WD-WCC1T0809124        ONLINE       0     0     0
errors: No known data errors
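For automated checks after the replacement, the same verification can be scripted. pool_is_healthy is a hypothetical helper that succeeds only when the pool state is ONLINE and the errors line reports no known data errors; it assumes the zpool status output of a single pool on stdin.

```shell
#!/bin/sh
# pool_is_healthy: hypothetical check (not part of ZFS).
# Reads `zpool status` output for one pool on stdin and exits 0 only if
# the "state:" line says ONLINE and the "errors:" line reports
# "No known data errors"; otherwise exits 1.
pool_is_healthy() {
    awk '
        $1 == "state:"  { state = $2 }
        $1 == "errors:" { errors = $0 }
        END { exit !(state == "ONLINE" && errors ~ /No known data errors/) }
    '
}
```

For example: zpool status rpool | pool_is_healthy && echo healthy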
If you encounter issues:
If zpool replace fails, check the new disk for existing partitions or data and clear them using wipefs.
Check the kernel log for ZFS-related messages:
dmesg | grep -i zfs
Resilvering lets ZFS recover from a disk failure without data loss as long as the pool still has enough redundancy, for example the remaining disk in the affected mirror. By following these steps, you can replace a failed disk and return your ZFS pool to a healthy state. Checking pool health regularly, for example with zpool status -x, and running periodic scrubs helps you spot failing disks before redundancy is lost.