how to recover from pool error?

azqn841
New here
Posts: 6
Joined: Mon Feb 26, 2018 11:53 pm

how to recover from pool error?

Post by azqn841 »

I am getting a pool error message on reboot (the system was healthy, but then there was a power failure, a file system check on the system volume that may not have finished, and a firmware upgrade to the latest version).

The GUI comes up blank but I am able to SSH.

I could sacrifice the system volume, but there is another volume, which passed a health check, that I would like to be able to access again.

How should I proceed with recovery?

Which command-line commands would be useful to check the status, see which volume failed, etc.?

QNAP TS-653A Firmware from 20180528.

Thanks!
storageman
Ask me anything
Posts: 5506
Joined: Thu Sep 22, 2011 10:57 pm

Re: how to recover from pool error?

Post by storageman »

The file system check runs automatically after a power failure.
Wait and see whether the check finishes.
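
If you want to keep an eye on it from SSH while you wait, a couple of read-only commands (just a suggestion, nothing here writes to the disks) are:

ps | grep fsck        # shows whether an e2fsck/fsck process is still running
cat /proc/mdstat      # shows the state of the md RAID arrays
md_checker            # QNAP's RAID superblock summary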
azqn841
New here
Posts: 6
Joined: Mon Feb 26, 2018 11:53 pm

Re: how to recover from pool error?

Post by azqn841 »

It doesn't seem to finish... It just says "Error Message: Pool Error".
dgiessler
First post
Posts: 1
Joined: Tue Jul 24, 2018 1:32 am

Re: how to recover from pool error?

Post by dgiessler »

Also seeing this error.
Recently replaced a failed drive in a RAID5 set.
I had shut down the NAS (for 2 weeks) and replaced the drive while it was powered down.
The RAID set rebuilt (apparently) but the storage volume remained unmounted.
Performed a RAID scrub, and got the "storage pool" error.
I am finding no documentation on the QNAP site. Can anyone point us at anything more generic to help out?

Thanks!
storageman
Ask me anything
Posts: 5506
Joined: Thu Sep 22, 2011 10:57 pm

Re: how to recover from pool error?

Post by storageman »

You don't replace disks when powered down. Read the documentation.
Post "md_checker" results from Putty/SSH
azqn841
New here
Posts: 6
Joined: Mon Feb 26, 2018 11:53 pm

Re: how to recover from pool error?

Post by azqn841 »

Welcome to MD superblock checker (v1.4) - have a nice day~

Scanning system...

HAL firmware detected!
Scanning Enclosure 0...

RAID metadata found!
UUID: 9f1a70c6:64cea2ea:67200f96:c7545d20
Level: raid5
Devices: 4
Name: md1
Chunk Size: 512K
md Version: 1.0
Creation Time: Jan 3 21:14:29 2018
Status: OFFLINE
===============================================================================
Disk | Device | # | Status | Last Update Time | Events | Array State
===============================================================================
1 /dev/sda3 0 Active Jul 24 21:32:19 2018 3119 AAAA
2 /dev/sdb3 1 Active Jul 24 21:32:19 2018 3119 AAAA
3 /dev/sdc3 2 Active Jul 24 21:32:19 2018 3119 AAAA
4 /dev/sdd3 3 Active Jul 24 21:32:19 2018 3119 AAAA
===============================================================================
azqn841
New here
Posts: 6
Joined: Mon Feb 26, 2018 11:53 pm

Re: how to recover from pool error?

Post by azqn841 »

From output below:
Error info :
/dev/md1 : need to be recovered.
What's the procedure for this recovery?
Thanks!


qcli_storage
Enclosure Port Sys_Name Size Type RAID RAID_Type Pool TMeta VolType VolName
NAS_HOST 1 /dev/sda 9.10 TB data /dev/md1(X) RAID 5 1(X) 64 GB flexible --
NAS_HOST 2 /dev/sdb 9.10 TB data /dev/md1(X) RAID 5 1(X) 64 GB flexible --
NAS_HOST 3 /dev/sdc 9.10 TB data /dev/md1(X) RAID 5 1(X) 64 GB flexible --
NAS_HOST 4 /dev/sdd 9.10 TB data /dev/md1(X) RAID 5 1(X) 64 GB flexible --

Error info :
/dev/md1 : need to be recovered.
md13 mount failed!
[/etc] # qcli_storage -d
Enclosure Port Sys_Name Type Size Alias Signature Partitions Model
NAS_HOST 1 /dev/sda HDD:data 9.10 TB -- QNAP FLEX 5 Seagate ST10000VN0004-1ZD101
NAS_HOST 2 /dev/sdb HDD:data 9.10 TB -- QNAP FLEX 5 Seagate ST10000VN0004-1ZD101
NAS_HOST 3 /dev/sdc HDD:data 9.10 TB -- QNAP FLEX 5 Seagate ST10000VN0004-1ZD101
NAS_HOST 4 /dev/sdd HDD:data 9.10 TB -- QNAP FLEX 5 Seagate ST10000VN0004-1ZD101

====
[/etc] # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md322 : active raid1 sdd5[3](S) sdc5[2](S) sdb5[1] sda5[0]
7235136 blocks super 1.0 [2/2] [UU]
bitmap: 0/1 pages [0KB], 65536KB chunk

md256 : active raid1 sdd2[3](S) sdc2[2](S) sdb2[1] sda2[0]
530112 blocks super 1.0 [2/2] [UU]
bitmap: 0/1 pages [0KB], 65536KB chunk

md13 : active raid1 sda4[0] sdd4[3] sdc4[2] sdb4[1]
458880 blocks super 1.0 [32/4] [UUUU____________________________]
bitmap: 1/1 pages [4KB], 65536KB chunk

md9 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
530048 blocks super 1.0 [32/4] [UUUU____________________________]
bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>


[/etc] # uname -a
Linux NAS1AD1E5 4.2.8 #1 SMP Mon May 28 01:36:04 CST 2018 x86_64 GNU/Linux
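
(I notice md1, the data array, doesn't appear in /proc/mdstat above, so it looks like it was never assembled at boot. Would a plain, non-destructive assemble be a sensible first step, something like:

mdadm --assemble /dev/md1 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3

I'm only guessing here, so I'd rather ask before running anything.)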
storageman
Ask me anything
Posts: 5506
Joined: Thu Sep 22, 2011 10:57 pm

Re: how to recover from pool error?

Post by storageman »

I believe you have a pending rebuild state.

Try this
"/etc/init.d/init_lvm.sh"

If that doesn't work, try

"mdadm -CfR --assume-clean /dev/md1 -l 5 -n 4 -c 512 -e 1.0 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3"
azqn841
New here
Posts: 6
Joined: Mon Feb 26, 2018 11:53 pm

Re: how to recover from pool error?

Post by azqn841 »

Thanks for the recommendation!

init_lvm.sh brings the RAID online, but I can't seem to see/mount anything and no UI comes up; after a reboot the RAID is offline again...
(same thing after the mdadm command)

init_lvm.sh:
Found duplicate PV d3a9ZLgCrxu7sPqeIywzceisAQXa3LQO: using /dev/drbd1 not /dev/md1
Using duplicate PV /dev/drbd1 from subsystem DRBD, ignoring /dev/md1
LV Status NOT available
sh: /sys/block/dm-4/dm/pool/tier/relocation_rate: Permission denied
kill: you need to specify whom to kill
kill: you need to specify whom to kill
Done

md_checker:
RAID metadata found!
UUID: 9f1a70c6:64cea2ea:67200f96:c7545d20
Level: raid5
Devices: 4
Name: md1
Chunk Size: 512K
md Version: 1.0
Creation Time: Jan 3 21:14:29 2018
Status: ONLINE (md1) [UUUU]
===============================================================================
Disk | Device | # | Status | Last Update Time | Events | Array State
===============================================================================
1 /dev/sda3 0 Active Jul 25 20:00:17 2018 3119 AAAA
2 /dev/sdb3 1 Active Jul 25 20:00:17 2018 3119 AAAA
3 /dev/sdc3 2 Active Jul 25 20:00:17 2018 3119 AAAA
4 /dev/sdd3 3 Active Jul 25 20:00:17 2018 3119 AAAA
===============================================================================

cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md1 : active raid5 sda3[0] sdd3[3] sdc3[2] sdb3[1]
29269449216 blocks super 1.0 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]

md322 : active raid1 sdd5[3](S) sdc5[2](S) sdb5[1] sda5[0]
7235136 blocks super 1.0 [2/2] [UU]
bitmap: 0/1 pages [0KB], 65536KB chunk

md256 : active raid1 sdd2[3](S) sdc2[2](S) sdb2[1] sda2[0]
530112 blocks super 1.0 [2/2] [UU]
bitmap: 0/1 pages [0KB], 65536KB chunk

md13 : active raid1 sda4[0] sdd4[3] sdc4[2] sdb4[1]
458880 blocks super 1.0 [32/4] [UUUU____________________________]
bitmap: 1/1 pages [4KB], 65536KB chunk

md9 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
530048 blocks super 1.0 [32/4] [UUUU____________________________]
bitmap: 1/1 pages [4KB], 65536KB chunk

qcli_storage:
Enclosure Port Sys_Name Size Type RAID RAID_Type Pool TMeta VolType VolName
NAS_HOST 1 /dev/sda 9.10 TB data /dev/md1 RAID 5,512 1 64 GB flexible DataVol1,DataVol2
NAS_HOST 2 /dev/sdb 9.10 TB data /dev/md1 RAID 5,512 1 64 GB flexible DataVol1,DataVol2
NAS_HOST 3 /dev/sdc 9.10 TB data /dev/md1 RAID 5,512 1 64 GB flexible DataVol1,DataVol2
NAS_HOST 4 /dev/sdd 9.10 TB data /dev/md1 RAID 5,512 1 64 GB flexible DataVol1,DataVol2
md13 mount failed!
storageman
Ask me anything
Posts: 5506
Joined: Thu Sep 22, 2011 10:57 pm

Re: how to recover from pool error?

Post by storageman »

Are you sure? It says it's online here.
Mousetick
Experience counts
Posts: 1081
Joined: Thu Aug 24, 2017 10:28 pm

Re: how to recover from pool error?

Post by Mousetick »

There is this error, 'md13 mount failed!', at the end of the qcli_storage output posted by azqn841. I'm no expert, but I don't think it's supposed to be there. Sorry I can't provide any help; I'm just pointing it out in case you missed it, storageman.
storageman
Ask me anything
Posts: 5506
Joined: Thu Sep 22, 2011 10:57 pm

Re: how to recover from pool error?

Post by storageman »

I saw the error. Don't worry about md13, it's not important (it's a work area).
So, to repeat the question: is the pool online now?
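
One quick way to answer that from SSH (just a suggestion) is to check whether LVM can see the pool and its volumes:

pvs    # physical volumes (should list the md device)
vgs    # volume groups
lvs    # logical volumes, including the thin pool

If the volume group and its thin pool/volumes show up there, the pool itself is assembled and the problem is further up, at the filesystem/mount layer.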
azqn841
New here
Posts: 6
Joined: Mon Feb 26, 2018 11:53 pm

Re: how to recover from pool error?

Post by azqn841 »

It's online but can't be mounted... (and the online state doesn't survive a reboot).
I talked to a support engineer on a ticket and they'll get back to me after escalating to the developers.

Some of the error messages in the support session were:

Check of pool vg1/tp1 failed (status:1). Manual repair required!

superblock is corrupt
bad checksum in superblock

Thanks!
storageman
Ask me anything
Posts: 5506
Joined: Thu Sep 22, 2011 10:57 pm

Re: how to recover from pool error?

Post by storageman »

Did you try running the reassemble, or just the LVM restart?

Try dumping the superblock info:
"dumpe2fs_64 -h /dev/mapper/cachedev1"
patricepm
Getting the hang of things
Posts: 65
Joined: Mon Jul 03, 2017 9:29 am

Re: how to recover from pool error?

Post by patricepm »

Hi all,

I'm joining in on this discussion...
I have a RAID10 with 4 disks. The problem I had... my nephew :( I lost sight of him for a second and he managed to pull disks 3 and 4... I know, believe it or not...
Anyway, I plugged them back in and the rebuild began, until this afternoon, when the status light blinked red, the web interface was no longer accessible, and all drives were quiet for a long time. So I powered down the NAS by holding the power button. This morning I had access to the NAS again and saw that all my data was still there. Now I see some familiar folders/files and shortcuts to my shares, but the shares themselves aren't there... I suppose the rebuild got interrupted for some reason.

md_checker reports:

[/etc] # md_checker

Welcome to MD superblock checker (v1.4) - have a nice day~

Scanning system...

HAL firmware detected!
Scanning Enclosure 0...

RAID metadata found!
UUID: 34f18f98:c15319a0:661e1906:44e77c10
Level: raid10
Devices: 4
Name: md1
Chunk Size: 512K
md Version: 1.0
Creation Time: Jul 19 12:08:54 2018
Status: OFFLINE
===============================================================================
Disk | Device | # | Status | Last Update Time | Events | Array State
===============================================================================
3 /dev/sda3 0 Active Jul 31 16:37:42 2018 10038728 AARA
4 /dev/sdb3 1 Active Jul 31 16:37:42 2018 10038728 AARA
5 /dev/sde3 2 Rebuild Jul 31 16:37:42 2018 10038728 AARA
6 /dev/sdd3 3 Rebuild Jul 31 16:37:42 2018 10038728 AARA
===============================================================================

And the init_lvm.sh:

[/etc] # /etc/init.d/init_lvm.sh
Changing old config name...
Reinitialing...
Detect disk(8, 0)...
dev_count ++ = 0Detect disk(8, 16)...
dev_count ++ = 1Detect disk(8, 32)...
ignore non-root enclosure disk(8, 32).
Detect disk(8, 48)...
dev_count ++ = 2Detect disk(8, 64)...
dev_count ++ = 3Detect disk(8, 0)...
Detect disk(8, 16)...
Detect disk(8, 32)...
ignore non-root enclosure disk(8, 32).
Detect disk(8, 48)...
Detect disk(8, 64)...
sys_startup_p2:got called count = -1
Done

After this command I checked again in Storage Manager, and it says that there's no volume...


Before the above two commands, I also tried mdadm directly.
First:

[/] # mdadm --assemble /dev/md0 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3
mdadm: no RAID superblock on /dev/sdc3
mdadm: /dev/sdc3 has no superblock - assembly aborted


Then I found on the internet that I could leave sdc3 out, so:

[/] # mdadm --assemble /dev/md0 /dev/sda3 /dev/sdb3 /dev/sdd3 --verbose
mdadm: looking for devices for /dev/md0
mdadm: failed to get exclusive lock on mapfile - continue anyway...
mdadm: /dev/sda3 is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdb3 is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdd3 is identified as a member of /dev/md0, slot 3.
mdadm: added /dev/sdb3 to /dev/md0 as 1
mdadm: no uptodate device for slot 4 of /dev/md0
mdadm: added /dev/sdd3 to /dev/md0 as 3
mdadm: added /dev/sda3 to /dev/md0 as 0
mdadm: /dev/md0 assembled from 3 drives - need 5 to start (use --run to insist). <===== This is very strange to me because I only have 4 drives in my setup.
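
(My guess is that the "need 5 to start" comes from whatever device count the member superblocks now record, which I suppose I could check with:

mdadm --examine /dev/sda3 | grep -i 'raid devices'

but that is just a guess on my part.)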


If anybody has any suggestions on how to handle this, that would be great! I'm not an expert, but I do my best.
Thanks in advance for all your answers and help!
==================================================
QNAP TVS-473
- 4x WD Red Pro 6TB (RAID 10)
- 2x WD SA510 1TB (RAID 1)
- 2x Samsung 970 Evo plus m.2 1TB (RAID 1)
- 40GB Memory
- Firmware: QTS 5.1.1.2491