Cache missing - Cache Recovery with a Faulty SSD

micattack
Starting out
Posts: 14
Joined: Sat Mar 06, 2021 12:25 am

Post by micattack »

Hello everyone,

I need some assistance with my QNAP TS-351, which has been experiencing issues due to a faulty M.2 SSD. I'd appreciate any help on how to recover my data and get my system back online without losing anything. Here are the details:

System Configuration:

QNAP TS-351 NAS
2 x Transcend 256GB NVMe SSDs configured as RAID1 cache (raid group 2)
3 x Seagate HDD configured as RAID5 (raid group 1)

Issue:
A few days ago, I started getting errors indicating that M.2 SSD 1 was having problems. The TS-351 then became unresponsive, and after a reboot it stalled at a "Starting..." message on the HDMI output without any further progress.

Temporary Solution:
I removed the faulty M.2 SSD, which allowed the system to boot up again. However, the RAID group for the SSD cache is now offline, which also affects the otherwise healthy HDD RAID group. The remaining M.2 NVMe SSD is acting as a hot spare (only SSD 1 failed; SSD 2 is fine at 72%).

Attempted Recovery:
When I clicked on the "cache missing" status, a cache-volume restore pop-up appeared.
I selected "OK" with just the one remaining SSD, even though the message says all disks need to be present.
Understandably, after displaying "restoring" for a while, it showed "could not be restored."

In the settings of the cache, I saw an option to remove the cache completely, but it warned me of potential data loss.

Question:
Is there a way to switch the cache to read-only mode or a similar approach to avoid data loss until I can acquire a new second SSD? Shouldn't the RAID1 configuration be able to function with just one disk? I'm looking for a safe method to recover my data and continue using my NAS without compromising the integrity of my files.

Errors:
Storage & Snapshots RAID Group [Storage & Snapshots] SSD cache RAID group "2" is inactive.
Storage & Snapshots RAID Group [Storage & Snapshots] Failed to recover RAID group 2. Storage pool: 256.
Storage & Snapshots Storage Pool [Storage & Snapshots] Failed to activate metadata for thin storage pool 1.

Earlier:
Disabling SSD Swap partition on Host: M.2 SSD 2 because the disk's partition is degraded. Do not remove the disk until this operation has finished.
Hardware Status Drives [Hardware Status] "Host: M.2 SSD 2": Disconnected.
Storage & Snapshots Volume [Storage & Snapshots] Finished hot-removing disk "Host: M.2 SSD 2".
Successfully disabled SSD Swap partition on Host: M.2 SSD 2.


If anyone has experienced a similar issue or has any suggestions on how to proceed, your guidance would be greatly appreciated. Please let me know if you need any additional information to better understand the situation.

Thank you in advance for your help.

Best regards,
MicAttAck
Last edited by micattack on Sat Mar 25, 2023 2:52 am, edited 1 time in total.
--
QLocker survivor; backup enthusiast
TS-351 with 5. + something FW (always up2date)
Celeron J1800/8GB RAM
RAID-5 (2x 256GB Transcend TS256GMTE110S + 3x 6TB Seagate ST6000VN001)
dolbyman
Guru
Posts: 35223
Joined: Sat Feb 12, 2011 2:11 am
Location: Vancouver BC , Canada

Post by dolbyman »

Several people have reported cache issues in the past. The only way to access the data (if it was read-only cache) was to contact QNAP via a ticket.

Recovery without QNAP would involve a full wipe and a restore from your backups (your signature says you are a backup enthusiast, so no problemo).

Post by micattack »

Thanks @dolbyman, but that doesn't completely answer my question. I might be missing something here.

1) A RAID 1 should work with just one disk.
2) If it doesn't work, I feel my only option is to wait for the replacement SSD to arrive and then restore the RAID1 cache group.
3) Only if that approach doesn't work for me would I open a ticket.

I must say I am a bit disappointed in the built-in tools. I would have expected a "flush cache" option, or at least some indicator of the cache status (is there still data in the cache that hasn't been written to the HDDs?). I feel this is a basic feature, as errors will happen. That's why we are using RAIDs in the first place, isn't it?

Looking at the firmware release notes, there is a "Replace Disk with Spare" feature (https://www.qnap.com/en/release-notes/q ... al-failure), so handling an actual failure should also have been thought of?

Post by dolbyman »

1) Yes, but apparently it failed anyway (took the wrong disk out?).
2) If your cache group is already broken, there is nothing to restore.
3) Best to open it right away.

As long as the cache still works, you can disable and remove it. If the cache breaks, it's too late to flush it.
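
To illustrate why timing matters here, below is a minimal toy model of a generic write-back (read-write) cache. It is only a sketch of the general idea and not QNAP's actual implementation; the class `WriteBackCache` and all of its names are made up for illustration. Writes are acknowledged once they land in the cache, and anything still marked dirty when the cache goes offline has never reached the HDDs.

```python
class WriteBackCache:
    """Toy model of a write-back cache in front of slower storage (hypothetical)."""

    def __init__(self, backing):
        self.backing = backing   # dict standing in for the HDD RAID group
        self.blocks = {}         # block contents held in the cache
        self.dirty = set()       # blocks not yet written to the backing store
        self.healthy = True      # False once the cache device is lost

    def write(self, block, data):
        # A write is acknowledged as soon as it lands in the cache.
        self.blocks[block] = data
        self.dirty.add(block)

    def flush(self):
        # Flushing only works while the cache device is still readable.
        if not self.healthy:
            raise OSError("cache is offline; dirty blocks cannot be flushed")
        for block in sorted(self.dirty):
            self.backing[block] = self.blocks[block]
        self.dirty.clear()

    def fail(self):
        # Simulate losing the cache device(s).
        self.healthy = False


hdd = {0: "old"}
cache = WriteBackCache(hdd)
cache.write(0, "new")
cache.write(1, "extra")
cache.flush()                # while the cache works, a flush makes the HDDs current
cache.write(2, "unflushed")
cache.fail()
print(2 in hdd)              # shows whether the last write ever reached the HDDs
```

In this model, removing a healthy cache is safe because `flush()` can run first. Once `fail()` has happened, the dirty set is stranded, which is the situation described above.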

Post by micattack »

I have opened a ticket now.

I have a replacement SSD (exact same model as before), but I could not get the RAID group (which shows up as having only one drive in it) to resync, nor could I add the new (empty) replacement SSD to it.
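
For anyone who wants to check the member count outside the GUI: assuming the NAS uses Linux software RAID and exposes its state in `/proc/mdstat` over SSH (an assumption on my part, not something confirmed in this thread), a degraded array shows fewer active members than configured, e.g. `[2/1] [_U]`. The sketch below parses that format. The `SAMPLE` text is a made-up example, not output from my TS-351, and the device and array names in it are hypothetical.

```python
import re

# Hypothetical /proc/mdstat content, for illustration only.
SAMPLE = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md2 : active raid1 nvme1n1p3[1]
      240000000 blocks super 1.0 [2/1] [_U]

md1 : active raid5 sda3[0] sdc3[2] sdb3[1]
      11700000000 blocks super 1.0 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array_name: (configured, active)} for arrays missing members."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status line carries "[configured/active] [U_...]".
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            configured, active = int(status.group(1)), int(status.group(2))
            if active < configured:
                result[current] = (configured, active)
            current = None
    return result


print(degraded_arrays(SAMPLE))
```

On a real system the input would come from reading `/proc/mdstat` instead of `SAMPLE`. This only reports the state; it does not change anything on the arrays.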

As the working disk was the hot spare (SSD 2), I also tried switching the two disks. The system log actually noticed that the two disks were swapped, but I still had no luck getting the cache up and running.

What I am missing now is some documentation on the cache behaviour. I do have backups, and there wasn't much new data written. So my question is: when is newly written data moved from the cache RAID group over to the HDDs? If that happens periodically (and not only after the 256GB in the cache have been filled), would it be safe to completely remove and delete the cache RAID group anyway?

The plan would then be to get the HDD RAID5 up and running again (without cache), check what has changed since the last backup (2:00 am the day before the crash), and then, when everything looks good, re-create the RAID1 for the cache as a completely new (empty) RAID group and add it back in as cache.
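
For the "check the changes since the last backup" step, something like the following sketch could list the files that were modified after the backup time, so they can be inspected by hand. The share path and the cutoff date are placeholders, and `changed_since` is just a helper name I made up. Note that it can only see what is actually on the HDDs; data that existed only in the cache would not show up.

```python
import os
from datetime import datetime


def changed_since(root, cutoff):
    """Yield (path, modified_time) for files under root modified after cutoff."""
    cutoff_ts = cutoff.timestamp()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # skip files that vanish or cannot be read
            if mtime > cutoff_ts:
                yield path, datetime.fromtimestamp(mtime)


if __name__ == "__main__":
    # Placeholder values: adjust the share path and the backup time.
    last_backup = datetime(2023, 1, 1, 2, 0)
    for path, mtime in changed_since("/share/Public", last_backup):
        print(mtime.isoformat(sep=" ", timespec="minutes"), path)
```

The cutoff is interpreted in the local time zone of the machine running the script.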

I am simply afraid of breaking something with the cache, so I haven't done anything yet.