Hello! I recently had an NVMe disk go into read-only mode, but only under the installed OS, as if I didn't have the privileges to write to it. This happened after an unsuccessful attempt to create a virtual machine with virt-manager.
Using a live USB, I was able to format the disk, create partitions and copy files to it.
The SMART test says PASSED, but I don't know what the errors at the end mean...
What happened to the disk? Is it dying, or was this just some OS-related thing? I can write to it now, so the drive itself didn't go into read-only mode.
I'm confused.
[liveuser@eos-2024.06.25 ~]$ sudo smartctl -a /dev/nvme1n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.9.6-arch1-1] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number:                       KINGSTON SNV2S500G
Serial Number:                      50026B7282DB8CBD
Firmware Version:                   SBI02102
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x0026b7
Controller ID:                      1
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            0026b7 282db8cbd5
Local Time is:                      Mon Sep 9 10:11:11 2024 CEST
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0016):   Format Frmw_DL Self_Test
Optional NVM Commands (0x009f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Verify
Log Page Attributes (0x12):         Cmd_Eff_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         64 Pages
Warning Comp. Temp. Threshold:      83 Celsius
Critical Comp. Temp. Threshold:     90 Celsius
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     5.00W       -        -    0  0  0  0        0       0
 1 +     3.50W       -        -    1  1  1  1        0     200
 2 +     2.50W       -        -    2  2  2  2        0    1000
 3 -     1.50W       -        -    3  3  3  3     5000    5000
 4 -     1.50W       -        -    4  4  4  4    20000   70000
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        41 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    4%
Data Units Read:                    41,790,029 [21.3 TB]
Data Units Written:                 54,619,808 [27.9 TB]
Host Read Commands:                 304,394,960
Host Write Commands:                864,809,806
Controller Busy Time:               33,318
Power Cycles:                       1,755
Power On Hours:                     8,843
Unsafe Shutdowns:                   19
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning Comp. Temperature Time:     0
Critical Comp. Temperature Time:    0
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
Read Self-test Log failed: Invalid Field in Command (0x002)
I haven't figured this out completely yet, hard to find the info and hard to interpret the info (for my limited brain at least), but let's go from this:
Error Information (NVMe Log 0x01, 16 of 256 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 129 0 0xa013 0x8004 0x000 0 0 -
At this point I am interested in the status code. I found this: https://github.com/linux-nvme/nvme-cli/issues/800, where someone asks about status code 0xc502 and "birkelund" (NVMe Software Engineer) explains how to decode it:
If you are asking how that error code is encoded in 0xC502, then it's 0xC502 >> 1 to get rid of the Phase Tag. That leaves us with 0x6281. Then apply a mask of 0x7ff to extract the lower 11 bits (3 for the Status Code Type and 8 for the Status Code), ending up with 0x281. 0x2xx are "Media and Data Integrity Errors" and the 0x81 status code is "Unrecovered Read Error".
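In more human language, the one-bit right shift is an integer division by 2 (0xC502 / 2 = 0x6281). Applying the mask 0x7ff then keeps the lowest 11 bits, which in practice means the three rightmost hex digits with the top bit of the third digit cleared (0x6281 -> 0x281). I think this makes decoding a tad easier.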
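The shift-and-mask steps can be checked with a few lines of Python. This is only a sketch of the arithmetic described above: the function name split_status is made up for illustration, and the field layout (3 bits of Status Code Type above 8 bits of Status Code) is taken from the quoted explanation.

```python
def split_status(raw):
    """Split a status value as printed in the error log into (SCT, SC)."""
    shifted = raw >> 1        # drop the Phase Tag in bit 0 (same as raw // 2)
    field = shifted & 0x7FF   # keep the lower 11 bits
    sct = (field >> 8) & 0x7  # upper 3 bits: Status Code Type
    sc = field & 0xFF         # lower 8 bits: Status Code
    return sct, sc


sct, sc = split_status(0xC502)
# prints: 0x6281 0x281 2 0x81
print(hex(0xC502 >> 1), hex((0xC502 >> 1) & 0x7FF), sct, hex(sc))
```

Splitting the result into two separate numbers avoids having to read the type and the code off the hex digits by eye.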
To see how he gets Status Code Type and Status Code I did some more research:
Lookup for code types:
NVME_SCT_GENERIC = 0x0,
NVME_SCT_COMMAND_SPECIFIC = 0x1,
NVME_SCT_MEDIA_ERROR = 0x2,
/* 0x3-0x6 - reserved */
NVME_SCT_VENDOR_SPECIFIC = 0x7,
Lookup for "Media and Data Integrity Errors":
NVME_SC_WRITE_FAULTS = 0x80,
NVME_SC_UNRECOVERED_READ_ERROR = 0x81,
NVME_SC_GUARD_CHECK_ERROR = 0x82,
NVME_SC_APPLICATION_TAG_CHECK_ERROR = 0x83,
NVME_SC_REFERENCE_TAG_CHECK_ERROR = 0x84,
NVME_SC_COMPARE_FAILURE = 0x85,
NVME_SC_ACCESS_DENIED = 0x86,
So we can now see where he gets:
0x2xx are "Media and Data Integrity Errors" and the 0x81 status code is "Unrecovered Read Error"
If I apply the same method to status 0x8004, I get something like:
Shift right: shifting the value 0x8004 by one bit to the right (0x8004 >> 1) gives 0x4002. Then the masking: applying a mask of 0x7ff extracts the lower 11 bits of the value 0x4002, yielding 0x002.
So the leading 0 gives us the NVME_SCT_GENERIC Status Code Type (see the table above), which takes us to the Generic Status Codes lookup:
NVME_SC_SUCCESS = 0x00,
NVME_SC_INVALID_OPCODE = 0x01,
NVME_SC_INVALID_FIELD = 0x02,
NVME_SC_COMMAND_ID_CONFLICT = 0x03,
NVME_SC_DATA_TRANSFER_ERROR = 0x04,
NVME_SC_ABORTED_POWER_LOSS = 0x05,
NVME_SC_INTERNAL_DEVICE_ERROR = 0x06,
NVME_SC_ABORTED_BY_REQUEST = 0x07,
NVME_SC_ABORTED_SQ_DELETION = 0x08,
NVME_SC_ABORTED_FAILED_FUSED = 0x09,
NVME_SC_ABORTED_MISSING_FUSED = 0x0a,
NVME_SC_INVALID_NAMESPACE_OR_FORMAT = 0x0b,
NVME_SC_COMMAND_SEQUENCE_ERROR = 0x0c,
NVME_SC_LBA_OUT_OF_RANGE = 0x80,
NVME_SC_CAPACITY_EXCEEDED = 0x81,
NVME_SC_NAMESPACE_NOT_READY = 0x82,
So now we have status code type 0 (NVME_SCT_GENERIC = 0x0) and status code 0x02 (NVME_SC_INVALID_FIELD = 0x02). I have no idea what it means, but to me it does not sound like an issue with your NVMe drive itself. If SMART is 'clean', I don't think you have much to worry about.
Looking for a further explanation I found:
NVME_SC_INVALID_FIELD - Invalid Field in Command: A reserved coded value or an unsupported value in a defined field.
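To avoid doing the lookups by hand, the tables above can be put into dictionaries. The following is a rough sketch, not a complete decoder: the dictionaries only contain values listed in this post (the generic table is abbreviated), and describe_status is a hypothetical helper name.

```python
# Values copied from the enum listings in this post; the tables are not complete.
STATUS_CODE_TYPES = {
    0x0: "NVME_SCT_GENERIC",
    0x1: "NVME_SCT_COMMAND_SPECIFIC",
    0x2: "NVME_SCT_MEDIA_ERROR",
    0x7: "NVME_SCT_VENDOR_SPECIFIC",
}
GENERIC_CODES = {
    0x00: "NVME_SC_SUCCESS",
    0x01: "NVME_SC_INVALID_OPCODE",
    0x02: "NVME_SC_INVALID_FIELD",
    0x04: "NVME_SC_DATA_TRANSFER_ERROR",
    0x06: "NVME_SC_INTERNAL_DEVICE_ERROR",
    0x80: "NVME_SC_LBA_OUT_OF_RANGE",
}
MEDIA_ERROR_CODES = {
    0x80: "NVME_SC_WRITE_FAULTS",
    0x81: "NVME_SC_UNRECOVERED_READ_ERROR",
    0x86: "NVME_SC_ACCESS_DENIED",
}
CODES_BY_TYPE = {0x0: GENERIC_CODES, 0x2: MEDIA_ERROR_CODES}


def describe_status(raw):
    """Return (type name, code name) for a status value from the error log."""
    field = (raw >> 1) & 0x7FF           # drop Phase Tag, keep 11 bits
    sct, sc = field >> 8, field & 0xFF   # 3-bit type, 8-bit code
    type_name = STATUS_CODE_TYPES.get(sct, "reserved or unknown type")
    code_name = CODES_BY_TYPE.get(sct, {}).get(sc, "code not in table")
    return type_name, code_name


for raw in (0xC502, 0x8004):
    print(hex(raw), *describe_status(raw))
```

Codes that are missing from the abbreviated tables come back as "code not in table" rather than raising an error, so an unfamiliar status does not stop the script.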
Also, as far as I can tell CmdId (0xa013 in this case) is not an actual command but an ID for a 'structure' or packet that contains an actual command and parameters that you can pass on to the queue. So in itself CmdId 0xa013 tells us nothing about the actual command the host was sending to the drive.
Disclaimer: Math, bit-shifting and all that are not my strong point, so I may have made an error or a typo in the calculation; you should check it before relying on it.
A more complete explanation, including the codes, is in a different question and answer on superuser dot com:
Unable to identify SMART errors/issues of my NVMe disk
I recently did a random smart check on my root NVME drive (Samsung PM961). The smart overall-health self-assessment says PASSED. But there are error information log entries. Not sure if I should be worried or carry on using it.
The smartctl message:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.40-1-MANJARO] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVLW512HMJP-00000
Serial Number:                      S33UNB0XXXXXXX
Firmware Version:                   CXY7501Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Utilization:            151,663,030,272 [151 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 b771b5b78b
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Warning Comp. Temp. Threshold:      68 Celsius
Critical Comp. Temp. Threshold:     71 Celsius
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.60W       -        -    0  0  0  0        0       0
 1 +     6.00W       -        -    1  1  1  1        0       0
 2 +     5.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1500
 4 -   0.0050W       -        -    4  4  4  4     2200    6000
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    2%
Data Units Read:                    24,216,722 [12.3 TB]
Data Units Written:                 30,089,597 [15.4 TB]
Host Read Commands:                 284,196,394
Host Write Commands:                308,601,738
Controller Busy Time:               1,277
Power Cycles:                       1,259
Power On Hours:                     2,837
Unsafe Shutdowns:                   19
Media and Data Integrity Errors:    0
Error Information Log Entries:      679
Warning Comp. Temperature Time:     0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               36 Celsius
Temperature Sensor 2:               45 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        679     0  0x0010  0x4004      -            0     0     -
  1        678     0  0x0010  0x4004      -            0     0     -
  2        677     0  0x0010  0x4004      -            0     0     -
  3        676     0  0x0010  0x4004      -            0     0     -
  4        675     0  0x0010  0x4004      -            0     0     -
  5        674     0  0x0010  0x4004      -            0     0     -
  6        673     0  0x0010  0x4004      -            0     0     -
  7        672     0  0x0010  0x4004      -            0     0     -
  8        671     0  0x0010  0x4004      -            0     0     -
  9        670     0  0x0010  0x4004      -            0     0     -
 10        669     0  0x0010  0x4004      -            0     0     -
 11        668     0  0x0010  0x4004      -            0     0     -
 12        667     0  0x0010  0x4004      -            0     0     -
 13        666     0  0x0010  0x4004      -            0     0     -
 14        665     0  0x0010  0x4004      -            0     0     -
 15        664     0  0x0010  0x4004      -            0     0     -
... (48 entries not shown)
The nvme-cli error-log message (same message for all entries):
................. Entry[63] .................
error_count     : 616
sqid            : 0
cmdid           : 0x1c
status_field    : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc    : 0x2c
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
I tried dd to test the write speed:
dd if=/dev/zero of=/home/XXX/test.img bs=1G count=1 oflag=dsync
The average speed is around 700 MB/s, which is quite a bit lower than the rated value. I also used the same command to test my two other, newer NVMe drives; both reported normal speeds.
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.5015 s, 715 MB/s
I also tried
sudo hdparm -t /dev/nvme1n1
The speed is around 2000 MB/s, which is about the rated value:
Timing buffered disk reads: 6246 MB in 3.00 seconds = 2081.63 MB/sec
The system runs fine. But the error log is bugging me. Should I replace the disk or keep using it? If it should be replaced as a root disk, can I use it for other purposes?
I had the same problem. In my case, S.M.A.R.T had been working properly on the device for years under Ubuntu 12.04, and then under Ubuntu 14.04 exactly what you describe in the question happened.
The problem is related to a new kernel module that was introduced in Linux Kernel 3.15 called uas (USB Attached SCSI) (see release announcement).
That module is now responsible for managing USB Mass Storage Devices. There is a thread where people complain that uas in kernel 3.15 is causing their USB devices to fail. Another one says that it might be the cause of S.M.A.R.T problems.
Fortunately, those problems seem to be gone as of kernel 3.19 (which I am using), as my device is being detected correctly. Only the S.M.A.R.T problem remains.
To fix it, you need to disable the use of the uas module for the given device.
Disable uas without rebooting
First, unplug all USB devices that might be using it. Then, remove the uas and usb-storage modules:
sudo modprobe -r uas
sudo modprobe -r usb-storage
Then, load the usb-storage module with a parameter that tells it not to use uas for a given device:
sudo modprobe usb-storage quirks=VendorId:ProductId:u
VendorId and ProductId must be replaced by your device's vendor and product IDs, which can be obtained with the lsusb command (they are the characters after ID).
For example, I have the following device:
Bus 002 Device 011: ID 0bc2:3320 Seagate RSS LLC SRD00F2 [Expansion Desktop Drive]
So my vendor id is 0bc2, and my product id is 3320. My command is:
sudo modprobe usb-storage quirks=0bc2:3320:u
The last u tells usb-storage to ignore uas for the device (see source).
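If you want to avoid copying the IDs by hand, the quirk string can be built from an lsusb line. This is a small sketch under the assumption that the line has the "ID vendor:product" form shown above; quirks_option is a name made up for this example.

```python
import re

# Matches the "ID 0bc2:3320" part of an lsusb line (two 4-digit hex numbers).
LSUSB_ID = re.compile(r"\bID ([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\b")


def quirks_option(lsusb_line):
    """Build the usb-storage quirks parameter that disables uas for one device."""
    match = LSUSB_ID.search(lsusb_line)
    if match is None:
        raise ValueError("no 'ID vendor:product' pair found in line")
    vendor, product = match.groups()
    return f"quirks={vendor}:{product}:u"


line = "Bus 002 Device 011: ID 0bc2:3320 Seagate RSS LLC SRD00F2 [Expansion Desktop Drive]"
# prints: sudo modprobe usb-storage quirks=0bc2:3320:u
print("sudo modprobe usb-storage " + quirks_option(line))
```

The script only prints the command; running modprobe is still left to you.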
At this point, you can insert the USB device, and it will know not to use uas, making S.M.A.R.T work properly. You will see lines like these in dmesg when inserting the USB device:
usb 2-2: UAS is blacklisted for this device, using usb-storage instead
usb-storage 2-2:1.0: USB Mass Storage device detected
usb-storage 2-2:1.0: Quirks match for vid 0bc2 pid 3320: 800000
scsi host12: usb-storage 2-2:1.0
Make the change permanent
The previous quirk will only last until you reboot the system. To make it persistent, you need to follow the steps described here, which I copy below:
First, create a file named ignore_uas.conf in the /etc/modprobe.d/ directory with the following content:
options usb-storage quirks=VendorId:ProductId:u
As before, replace VendorId and ProductId with your device's vendor and product IDs obtained from lsusb.
Next, regenerate your initial ramdisk:
sudo mkinitcpio -p linux
or, on newer Ubuntu versions:
sudo update-initramfs -u
Finally, reboot your computer.
Edit: More background on the issue, and another way to get around it without disabling uas (which has better throughput than usb-storage) can be found here: https://www.smartmontools.org/ticket/971#comment:12
It seems that the kernel is blacklisting SAT ATA PASS-THROUGH on some devices when running in uas mode, as they have broken firmware.
So, the blacklisting can be disabled (at your own risk) by using the method I describe earlier in the answer, but removing the final u from the quirk, i.e.:
quirks=VendorId:ProductId:
Please note, however, that I have not tested this approach.
External drives (via USB, I assume) are tricky with SMART. Some don't work at all. The smartmontools people posted a list of hard drives with command-line switches to add to smartctl (see fifth column).
For Seagate Expansion drives in particular, it looks like you need either -d sat or -d sat,12. Try the following:
sudo smartctl -d sat --all /dev/sdb
sudo smartctl -d sat,12 --all /dev/sdb
If one of those works, it tells you which -d switch to add to your smartctl commands.
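The trial and error can also be scripted. The sketch below tries each -d value in turn and reports the first one that smartctl accepts. It rests on an assumption about smartctl's exit status (that bits 0 and 1 signal a command-line or device-open failure, as I read the man page), and first_working_type and its run parameter are hypothetical names introduced here so the logic can be exercised without a real drive.

```python
import subprocess


def first_working_type(device, types=("sat", "sat,12"), run=subprocess.run):
    """Return the first -d value for which smartctl can talk to the device."""
    for dev_type in types:
        cmd = ["smartctl", "-d", dev_type, "--all", device]
        result = run(cmd, capture_output=True, text=True)
        # Assumption: exit-status bit 0 = bad command line, bit 1 = device
        # open/identify failed. Higher bits describe disk health, not access.
        if (result.returncode & 0b11) == 0:
            return dev_type
    return None
```

Run it as root, for example with print(first_working_type("/dev/sdb")); a result of None means neither switch worked and another entry from the smartmontools list may be needed.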
The SMART status may not be available for the namespace (n1) or partition (p2). Hence you must call it for the device itself:
smartctl -x /dev/nvme0
You can override the namespace to be queried with -d nvme,$nsid, where 0xffffffff is the "broadcast namespace id". By default smartctl takes $nsid from the namespace id in the device node name (in your case 1).
So to query with broadcast:
smartctl -x -d nvme,0xffffffff /dev/nvme0n1p2
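To make the naming explicit, here is a short sketch that splits a device node into its controller, namespace and partition parts. It assumes the usual Linux naming scheme /dev/nvme<controller>n<namespace>p<partition>; split_nvme_node is a made-up helper, not part of smartctl.

```python
import re

# /dev/nvme0, /dev/nvme0n1 or /dev/nvme0n1p2 (a partition needs a namespace).
NVME_NODE = re.compile(r"^(/dev/nvme\d+)(?:n(\d+)(?:p(\d+))?)?$")


def split_nvme_node(path):
    """Return (controller device, namespace id or None, partition or None)."""
    match = NVME_NODE.match(path)
    if match is None:
        raise ValueError(f"not an NVMe device node: {path}")
    controller, nsid, partition = match.groups()
    return (
        controller,
        int(nsid) if nsid is not None else None,
        int(partition) if partition is not None else None,
    )


for node in ("/dev/nvme0", "/dev/nvme0n1", "/dev/nvme0n1p2"):
    print(node, split_nvme_node(node))
```

The first element of the result is the controller device to pass to smartctl when the namespace or partition node does not report SMART data.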
You need to run this: sudo smartctl -a /dev/nvme0
