rx overruns on TS-451D2

nightcustard
Starting out
Posts: 25
Joined: Wed Dec 05, 2007 5:40 am
Location: Blackpool UK

rx overruns on TS-451D2

Post by nightcustard »

Hi!

I wonder whether a good person can put my mind at rest about the significance of rx overruns?

Code: Select all

[~] # ifconfig
eth0 Link encap:Ethernet HWaddr 24:5E:[snip]
inet addr:192.168.[snip] Bcast:192.168.27.255 Mask:255.255.255.0
inet6 addr: fe80::[snip]:a1c5/64 Scope:Link
inet6 addr: fd92:[snip]:beff:fe41:a1c5/64 Scope:Global
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:80234105 errors:0 dropped:3 overruns:2184 frame:0
TX packets:35287324 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:115408741367 (107.4 GiB) TX bytes:13518076889 (12.5 GiB)
Memory:a1100000-a111ffff

I had previously increased the rx buffer size via ethtool, which appeared to solve the issue, but a manual reboot a few days ago must have reset it :cry:

Code: Select all

[~] # ethtool -G eth0 rx 1024
[~] # ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 1024 [this was 256]
RX Mini: 0
RX Jumbo: 0
TX: 256

Can anyone please advise whether overruns can lead to lost or corrupted data? I would presume not, as I would expect the interface hardware to flag missed data and the hardware at the transmitting end to re-send it, but as I learn more, I realise how little I know!
If there is the possibility of data loss then it's something I should pursue with QNAP; otherwise I'll just keep an occasional eye on it and maybe increase the buffer size when I remember...
P3R
Guru
Posts: 13190
Joined: Sat Dec 29, 2007 1:39 am
Location: Stockholm, Sweden (UTC+01:00)

Re: rx overruns on TS-451D2

Post by P3R »

nightcustard wrote: Thu May 13, 2021 1:40 am Can anyone please advise whether overruns can lead to lost or corrupted data?
No, not if you're using well-established protocols. They may error out but should handle the issue without any data loss.
If there is the possibility of data loss then it's something I should pursue with QNAP; otherwise I'll just keep an occasional eye on it and maybe increase the buffer size when I remember...
I've never heard of the issue. Could it be that you're overloading the unit with far too many apps, running out of RAM, and disk swapping is making all I/O slow?
RAID has never been a replacement for backups. Without backups on a different system (preferably placed at another site), you will eventually lose data!

A non-RAID configuration (including RAID 0, which isn't really RAID) with a backup on separate media protects your data far better than any RAID volume without backup.

All data storage consists of both the primary storage and the backups. It's your money and your data; spend the storage budget wisely or pay with your data!
Mousetick
Experience counts
Posts: 1081
Joined: Thu Aug 24, 2017 10:28 pm

Re: rx overruns on TS-451D2

Post by Mousetick »

It depends on the network protocol being used.

With TCP: no data loss or corruption. Only performance degradation, as lost or discarded packets need to be retransmitted.
With UDP: data loss, which may or may not be detected and handled by the application.
With other protocols: YMMV.
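One way to watch TCP doing that recovery work is the kernel's retransmission counters, assuming netstat is available on the NAS shell (the exact counter names vary by kernel version):

Code: Select all

# Retransmission totals should grow during a lossy transfer
netstat -s | grep -i retrans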
nightcustard
Starting out
Posts: 25
Joined: Wed Dec 05, 2007 5:40 am
Location: Blackpool UK

Re: rx overruns on TS-451D2

Post by nightcustard »

Thanks for the info! The NAS is almost fresh out of its box, none of the multimedia apps are enabled and RAM is showing typically under 15% usage. MSNetworking, HBS3, Qufirewall, NFS service and Malware remover are the top five apps - virtually everything else is disabled. The overruns seem to occur when my partner is copying her video files from SD to the NAS, which she does using MS file explorer on her Win10 PC. I presume the protocol will be TCP in that case. The network is Gigabit Ethernet.
This makes me think I need to check disk write performance - I'll use the Storage resource / Disk activity / throughput part of Resource Monitor unless advised otherwise. Any tips for an enthusiastic amateur?
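A crude shell-level cross-check of raw write speed, independent of the network, is a dd run like the sketch below. /share/Public is an assumed default share path, and oflag=direct (which bypasses the page cache) may not be supported by every dd build; drop it or use conv=fsync if dd rejects it:

Code: Select all

# Write a 4 GiB test file and note the reported throughput, then clean up
dd if=/dev/zero of=/share/Public/ddtest.bin bs=1M count=4096 oflag=direct
rm /share/Public/ddtest.bin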
P3R
Guru
Posts: 13190
Joined: Sat Dec 29, 2007 1:39 am
Location: Stockholm, Sweden (UTC+01:00)

Re: rx overruns on TS-451D2

Post by P3R »

nightcustard wrote: Sat May 15, 2021 4:43 am The overruns seem to occur when my partner is copying her video files from SD to the NAS, which she does using MS file explorer on her Win10 PC. I presume the protocol will be TCP in that case.
Correct, so no data loss.
This makes me think I need to check disk write performance...
Good idea!

What disks in what configuration do you have? Any SSD caching in play here?
nightcustard
Starting out
Posts: 25
Joined: Wed Dec 05, 2007 5:40 am
Location: Blackpool UK

Re: rx overruns on TS-451D2

Post by nightcustard »

I've got four Toshiba MG06ACA800E 8TB drives in a RAID 10 configuration. No SSD caching. They are arranged as an encrypted static single volume.
Mousetick
Experience counts
Posts: 1081
Joined: Thu Aug 24, 2017 10:28 pm

Re: rx overruns on TS-451D2

Post by Mousetick »

What is the rate of dropped packets/overruns? From your first post we only have a snapshot, and they are very low relative to the total number of packets. Furthermore, do they increase continuously, or are they just temporary "blips"?

If they are very low and random, they may not be worth worrying about. What is the NAS connected to?
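One way to tell is to sample the counter while a transfer is running, e.g. with a simple loop (the 10-second interval is arbitrary):

Code: Select all

# Print the RX counters line every 10 seconds; Ctrl-C to stop
while true; do
    ifconfig eth0 | grep 'RX packets'
    sleep 10
done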

Overruns should never happen regularly on modern hardware. Possible causes:
- NAS is completely overloaded
- Bad flow control configuration on either end's Ethernet adapter
- Faulty Ethernet adapter or driver on NAS
- Faulty Ethernet adapter or driver on other end

* "The other end" here means the other end of the Ethernet cable, such as a switch, not necessarily the NAS client.
nightcustard
Starting out
Posts: 25
Joined: Wed Dec 05, 2007 5:40 am
Location: Blackpool UK

Re: rx overruns on TS-451D2

Post by nightcustard »

Hi Mousetick - thanks for your reply - the NAS is connected to a TP-Link managed Gigabit switch, as is the PC used to transfer the video files.

I ran several ifconfigs during the file transfer (in the 256 buffer size case) and the overruns appeared to increase on each reading.

Regarding the possible causes you listed:
- NAS is completely overloaded - I think we can eliminate this based on the CPU & RAM usage figures
- Bad flow control configuration on either end's Ethernet adapter - I'll do some homework on this (see the pause-frame check after this list), but both Windows 10 PCs' TCP/IP settings are the defaults. I have played with IGMP snooping on my router (without really knowing what I was doing(!)) - could this be a factor?
- Faulty Ethernet adapter or driver on NAS - possible...
- Faulty Ethernet adapter or driver on other end - this could be at the switch end. I've tested with two originating PCs, both on the same switch as the NAS, which still leaves the switch (and cable) as possible culprits.
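For the flow-control point above, a quick way to see what the NAS-side adapter has negotiated, assuming the driver supports the query (on Windows, the equivalent setting lives under the NIC's Advanced driver properties):

Code: Select all

# Show pause-frame (flow control) settings on the NAS side
ethtool -a eth0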

The testing performed this afternoon is summarised as follows:

Rx buffer size set to 1024 (using ethtool -G eth0 rx 1024): Loaded 32 video files (total 24.9GB) from an SD card on the PC: eth0 rx overruns: 0

Rx buffer size set to 256 (default): Same 32 video files: eth0 rx overruns 4823

During both tests, the NAS Resource Monitor reported the following (no significant differences between the two):

Physical network usage: rx 33.6MB/s
Average CPU usage 8% to 14%
System resource memory: 14% max
Storage Resource:
Disk Activity throughput: Write 16.4MB/s
Latency (max): all disks typically 70ms
IOPS (max): all disks typically write 40/s

I conducted some iPerf3 tests to check raw throughput:
NAS as rx - average bandwidth 943Mb/s (NAS as tx - average bandwidth 949Mb/s)
There were no resulting rx overruns reported even after an extended 200 sec 20GB rx test.
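For reference, a typical set of invocations for this kind of test (<nas-ip> is a placeholder for the NAS address):

Code: Select all

# On the NAS (server side):
iperf3 -s
# On the PC, with the NAS receiving 20 GB:
iperf3 -c <nas-ip> -n 20G
# Same but with the NAS transmitting (reverse mode); use -t 200 instead of -n for a fixed 200 s run
iperf3 -c <nas-ip> -n 20G -R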

I also did a NAS internal file copy to check disk throughput without network and other constraints:

Same 32 files, 24.9GB as used above:

Storage Resource:
Disk Activity throughput: Read 45.6MB/s; Write 74.4MB/s
Latency (max): all disks typically 55ms
IOPS (max): all disks typically Read 90/s Write 180/s

Copied the same files, this time from PC SSD drive to the NAS:
Network typically 60MB/s
Disk Activity throughput: Write 27MB/s
Latency (max) 55ms
IOPS write 64/s

rx buffer 256: rx overruns 3564
rx buffer 512: rx overruns 49 (overruns happened in a burst)
rx buffer 1024: rx overruns 0

My conclusion so far is that the disks themselves are not a bottleneck, based on the 74MB/s internal file-copy write performance seen above. The basic Ethernet receiver speed is fine. A small number of overruns occurred in a burst with a buffer size of 512. The CPU is not a constraint, and neither is the RAM.
No overrun errors are seen with the buffer size increased to 1024.

A quick check on the TP-Link Ethernet switch's admin diagnostics panel showed no errors, and I haven't set up any QoS - every port has the same priority. I'll see if taking the switch out of the path changes anything.

Just a final thing... I have a just-retired TS-269 Pro NAS. I could power it up again and run a similar test to see what its network interface reports. That might eliminate (or implicate) a number of the possible culprits listed above.

Regards, Mike
Mousetick
Experience counts
Posts: 1081
Joined: Thu Aug 24, 2017 10:28 pm

Re: rx overruns on TS-451D2

Post by Mousetick »

I can observe the same behavior on a QNAP NAS with similar specs to yours, except the CPU is a quad-core Celeron (2 GHz base / 2.4 GHz turbo). This NAS is connected to a Windows 10 PC via two dumb switches. It uses Intel i210 GigE NICs and firmware 4.5.2.1630. What about yours?

So this got me curious and I went down the rabbit hole of figuring out what the heck may be going on. I think I was wrong in assuming this was likely a hardware-borne issue, so the possible causes I listed previously are pretty much irrelevant. After much Googling and some ad-hoc testing, here's my take on it, assuming your NAS is using Intel NICs and the Intel igb Linux driver:

Code: Select all

# ethtool -i eth0
driver: igb
version: 5.4.0-k
firmware-version: 3.16, 0x800004d8
expansion-rom-version:
bus-info: 0000:03:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
First, the rx overruns reported by ifconfig do not mean that packets were lost. There are several layers between the NIC, driver, kernel and the ifconfig tool, which only reports stats that have been combined and mistranslated through the various layers. A better way to look at the NIC statistics is with ethtool, which queries the driver directly:

Code: Select all

# ethtool -S eth0
NIC statistics:
     rx_packets: 94297199
...
     rx_crc_errors: 0
     rx_no_buffer_count: 0
     rx_missed_errors: 0
...
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_align_errors: 0
...
     rx_flow_control_xon: 10487
     rx_flow_control_xoff: 10855
...
     rx_errors: 0
...
     rx_length_errors: 0
     rx_over_errors: 0
     rx_frame_errors: 0
     rx_fifo_errors: 1941
...
Notice the rx_missed_errors and rx_fifo_errors above. There are two types of buffers involved: one hardware buffer (36 KB on the i210) on the NIC receiving raw packets, and a set of buffers in the NAS's RAM (so-called ring buffers, because they are arranged in a ring structure) used by the driver to transfer packets between the NIC and the kernel for processing by the networking stack.

When the NIC tries to store received packets into the ring buffers but none are available, the rx_fifo_errors counter is incremented, but the packets are not necessarily lost: they continue to be stored in the NIC's hardware buffer as long as there is room available. Only when the hardware buffer is full does the NIC drop packets, and those drops are recorded in the rx_missed_errors counter.

In the example above, despite 1941 fifo errors, there are 0 missed errors, which means the driver and kernel were eventually able to flush the ring buffers and make enough of them available again for the NIC to push more data through. The NIC is filling the ring buffers faster than the kernel and driver are draining them: this is a software issue.
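A convenient way to watch the two counters side by side during a transfer (the grep pattern just picks out the two lines of interest):

Code: Select all

# Both counters come straight from the driver
ethtool -S eth0 | grep -E 'rx_fifo_errors|rx_missed_errors'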

Some anecdotal observations based on very limited testing:
- CPU load is not a factor, as none of the cores are maxed out during transfers (~25% max utilization of a single core)
- rx_fifo_errors occur with SMB transfers whether average throughput is limited by disk I/O (~80 MB/s) or the 1 GigE link is near saturation (~115 MB/s)
- strangely, overwriting an existing file with SMB rather than copying a new file didn't produce rx_fifo_errors
- the same transfers via SCP (copy over SSH) do not produce rx_fifo_errors, even though they use a lot more CPU
- xon/xoff flow control counters don't increase

By my calculations based on the few tests, the rate of rx_fifo_errors is 0.00001% relative to the total number of packets received. I don't know about you, but I'm not going to worry about it, especially since no packets appear to be dropped, and I'm not going to bother trying to fix it.

If you want to fix it, the two main recommendations are either of the following:
1) Increase the number of buffers in the ring from 256 to whatever is adequate with ethtool -G, as you've already done. I found that 512 buffers was enough to get rid of the issue.
2) Increase the number of packets that the driver is allowed to fetch per "poll cycle" from the ring buffers and transfer to the kernel's networking stack. This parameter is called dev_weight and defaults to 64 packets. It can be read or written at /proc/sys/net/core/dev_weight:

Code: Select all

# cat /proc/sys/net/core/dev_weight
64
I found that a value of 80 (64 + 16) was enough to get rid of the issue.
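To apply a new value, either of the following works (the echo form is what you'd put in a startup script):

Code: Select all

# Raise the per-poll packet budget from 64 to 80
echo 80 > /proc/sys/net/core/dev_weight
# or, where sysctl is available:
sysctl -w net.core.dev_weight=80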

Changing the number of ring buffers causes the NIC to reinitialize, bringing the link down and up again, while changing dev_weight does not. Neither setting survives a reboot, so both must be reconfigured at startup, which requires some autorun shell-scripting shenanigans on QNAP NASes.
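As a rough sketch of what such a startup script might contain (how autorun.sh is located and enabled varies by QNAP model and firmware, so treat this as a hypothetical body, not a recipe):

Code: Select all

#!/bin/sh
# Re-apply network tuning lost at reboot
ethtool -G eth0 rx 1024                    # resize the rx ring (link goes down/up briefly)
echo 80 > /proc/sys/net/core/dev_weight    # raise the per-poll packet budget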

There are pros and cons to changing each setting, and there are a few more parameters that can be tweaked beside those two, but I've already gone on too long with this post so I'll skip the details, which I haven't investigated very deeply anyway.

Red Hat Article on Network Performance Tuning

Further reading if you want to learn all the gory details about the Intel i210 NIC, the Intel igb driver, and the NIC/driver/kernel interactions on Linux:
- Linux network interface driver statistics
- Intel Article about warning signs for dropped packets
^ A bit outdated but still relevant today. Think of RNBC (Receive No Buffers Count) as rx_fifo_errors, and of MPC (Missed Packet Count) as rx_missed_errors.
Note: the forum censors the word "kitchen" :S, so you need to replace the ** in the URL by the word kit-chen without the dash
- Monitoring and Tuning the Linux Networking Stack: Receiving Data
^ Slightly outdated.
- Intel i210 Datasheet
- Intel igb driver for Linux source code
nightcustard
Starting out
Posts: 25
Joined: Wed Dec 05, 2007 5:40 am
Location: Blackpool UK

Re: rx overruns on TS-451D2

Post by nightcustard »

Wow! Thanks for your detailed and interesting reply - certainly a lot to read and digest. I had already read part of the Red Hat tuning article which prompted me to try the buffer increase. You're no doubt right about the significance of these overruns, so I can relax about whether or not my new NAS is faulty (it ain't).
My NAS uses Intel i211 NICs and a Celeron J4025 CPU @ 2GHz. Firmware is 4.5.3.1652. Two desktop PCs running Windows 10 make most of the large file transfers (SMB) to the NAS via a single managed switch. I also have a considerable number of wired Ethernet Raspberry Pis transferring data via NFS, but the traffic from them is quite light.

Code: Select all

[~] # ethtool -i eth0
driver: igb
version: 5.4.0-k
firmware-version:  0. 6-1
expansion-rom-version:
bus-info: 0000:03:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
I haven't posted the results of ethtool -S eth0 as I rebooted the NAS after my testing to clear the overrun count, and the current data is uninteresting.
Presumably the autorun vehicle for any shell scripts I come up with, such as a crontab edit, will be overwritten by future firmware updates. I think I'll just keep a script ready in one of the folders to run manually after updates. I have to enter the disk decryption password manually anyway, so this would just be one extra step.
Many thanks again - I'll have fun working through it all!
Mike