Help a QUAI hacker out - non-standard hardware.

QNAP now provides a groundbreaking AI computing platform based on QNAP NAS called QuAI (pronounced "Q A I"): QNAP's AI Developer Package.
Data Mining/Machine Learning/Deep Learning
sboger
New here
Posts: 6
Joined: Sun Sep 20, 2015 5:42 pm

Post by sboger » Tue Oct 02, 2018 12:05 pm

My TS-563 backplane failed and QNAP was nice enough to help me with a new model. Thanks QNAP!

So, my little 563 workhorse is now sitting here all alone with zero load...

SO LET'S PUT QUAI ON IT!!!

First, comments I don't need from you: "This is not supported hardware.", "The 563 does not support...", "ZOMG!!! YOU HAXXOR!!!"

SO, I bought a supported card:
ASUS GeForce GTX 1050 Ti 4GB Phoenix Fan Edition, DVI-D / HDMI / DP 1.4 (PH-GTX1050TI-4G)

The power connector on the motherboard blocks the GPU from seating by a measly 6mm, so I have it on a PCIe extension.

The components aren't offered in the TS-563 App Center, so I installed these manually:
QuAI_0.9.1.67_x86_64.zip
NVIDIA_GPU_DRV_1.2.180625_x86_64.zip

Both installed successfully.

I altered the NVIDIA init script to fake the X86_SUMMITRIDGE platform (note the hard-coded PLATFORM line) and rebooted:

Code: Select all

[~] # head -10 /share/CACHEDEV1_DATA/.qpkg/NVIDIA_GPU_DRV/NVIDIA_GPU_DRV.sh
#!/bin/sh
CONF=/etc/config/qpkg.conf
QPKG_NAME="NVIDIA_GPU_DRV"
QPKG_ROOT=`/sbin/getcfg $QPKG_NAME Install_Path -f ${CONF}`
DRIVER_ROOT=/opt/${QPKG_NAME}
APACHE_ROOT=/share/`/sbin/getcfg SHARE_DEF defWeb -d Qweb -f /etc/config/def_share.info`
GPUHAL_CMD_L=`gpuhal_app -l`
GPUHAL_CMD="gpuhal_app"
PLATFORM="Platform = X86_SUMMITRIDGE" #`cat /etc/platform.conf | grep "Platform"`
DATE_CMD=`cat /etc/default_config/uLinux.conf | grep "Build Number"`
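
If anyone wants to try the same trick without permanently hard-coding the value, here is a rough sketch of how the platform lookup could be wrapped. The function name `get_platform` and the variable `PLATFORM_OVERRIDE` are my own inventions for illustration and are not part of the QNAP script; the only thing taken from the real script is that it greps the "Platform" line out of /etc/platform.conf.

```shell
#!/bin/sh
# Hypothetical helper (not in the QNAP script): print the platform line,
# using PLATFORM_OVERRIDE when it is set, otherwise the real config file.
get_platform() {
    conf_file="$1"
    if [ -n "${PLATFORM_OVERRIDE:-}" ]; then
        # Mimic the "Platform = NAME" form used by the hard-coded line above.
        echo "Platform = ${PLATFORM_OVERRIDE}"
    else
        # Same lookup the original script had before it was commented out.
        grep "Platform" "$conf_file"
    fi
}
```

Something like `PLATFORM_OVERRIDE=X86_SUMMITRIDGE get_platform /etc/platform.conf` would then give the faked value, and leaving the variable unset would fall back to whatever the NAS really reports.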



QuAI runs and actually passes all the tests. But Jupyterhub_0.9.2_x86_64.zip doesn't install correctly:

Code: Select all

[/share/CACHEDEV1_DATA/.qpkg] # cat Jupyterhub_install.log
[2018-10-01 05:01:39] NVIDIA gpu is exist
[2018-10-01 05:01:39] device_index=1
[2018-10-01 05:01:39] vendor_name=NVIDIA Corporation
[2018-10-01 05:01:39] device_name=GP107 [GeForce GTX 1050 Ti]
[2018-10-01 05:01:39] igpu_egpu=external
[2018-10-01 05:01:39] mode_support=7
[2018-10-01 05:01:39] gpu support QTS
[2018-10-01 05:01:40] active_status=4
[2018-10-01 05:01:40] real_status=0
[2018-10-01 05:01:40] driver_gpkg=NVIDIA_GPU_DRV
[2018-10-01 05:01:40] driver_installed=1
[2018-10-01 05:01:40] gpu status isn't QTS mode
[/share/CACHEDEV1_DATA/.qpkg] #
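
To pull individual values out of a log like the one above, a small helper along these lines might do. It assumes every line has the "[timestamp] key=value" shape seen in the log; `log_field` is just a name I made up. I don't know what the numeric values of active_status and real_status mean, only that the installer complains right after printing them.

```shell
#!/bin/sh
# Print the value of the given key from an install log read on stdin.
# Assumes lines of the form "[timestamp] key=value"; if the key appears
# more than once, the last value wins.
log_field() {
    sed -n "s/^\[[^]]*] $1=//p" | tail -n 1
}
```

For example, `log_field active_status < Jupyterhub_install.log` and `log_field real_status < Jupyterhub_install.log` would print the two status values side by side for comparison.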


'Control Panel -> Hardware -> Graphics card' is set to QTS.

Modules:

Code: Select all

[~] # lsmod
Module                  Size  Used by    Tainted: P 
xt_conntrack            3401  3
xt_ipvs                 1899  0
ip_vs_rr                1511  0
ip_vs_ftp               4045  0
ip_vs                 101308  6 xt_ipvs,ip_vs_rr,ip_vs_ftp
xt_nat                  1977  5
xt_addrtype             2957  7
rfcomm                 50263  0
ib_iser                38152  0
rdma_cm                41925  1 ib_iser
ib_cm                  33299  1 rdma_cm
iw_cm                  30404  1 rdma_cm
mlx5_ib               177935  0
mlx5_core             475991  1 mlx5_ib
mlx4_ib               174300  0
mlx4_core             283832  1 mlx4_ib
ib_core               185882  6 ib_iser,rdma_cm,ib_cm,iw_cm,mlx5_ib,mlx4_ib
mlx_compat              1497  9 ib_iser,rdma_cm,ib_cm,iw_cm,mlx5_ib,mlx5_core,mlx4_ib,mlx4_core,ib_core
iscsi_tcp               8942  0
libiscsi_tcp           12474  1 iscsi_tcp
libiscsi               38321  3 ib_iser,iscsi_tcp,libiscsi_tcp
scsi_transport_iscsi    66769  4 ib_iser,iscsi_tcp,libiscsi
iscsi_target_mod      254552  0
target_core_file       11860  0
target_core_iblock      9938  0
target_core_mod       392400  3 iscsi_target_mod,target_core_file,target_core_iblock
iscsi_target_qlog       2524  2 iscsi_target_mod
fbdisk                 20441  0
xt_LOG                  1807  0
ipt_MASQUERADE          1533  9
xt_REDIRECT             1742  0
nf_nat_redirect         1267  1 xt_REDIRECT
iptable_nat             1959  2
nf_nat_masquerade_ipv4     1865  1 ipt_MASQUERADE
nf_nat_ipv4             5147  1 iptable_nat
nf_nat                 11914  5 ip_vs_ftp,xt_nat,nf_nat_redirect,nf_nat_masquerade_ipv4,nf_nat_ipv4
xt_policy               2522  0
ipvlan                 13356  0
dummy                   3159  0
br_netfilter           13236  0
bridge                 80620  1 br_netfilter
stp                     1693  1 bridge
bonding               114377  0
xt_mark                 1317  8
xt_set                  7738  6
ip_set_hash_net        25076  7
ip_set                 27079  2 xt_set,ip_set_hash_net
xt_connmark             1821  2
8021q                  17067  0
ipv6                  317533 70 rdma_cm,ib_core,ipvlan,bridge,[permanent]
uvcvideo               73142  0
videobuf2_vmalloc       5094  1 uvcvideo
videobuf2_memops        2215  1 videobuf2_vmalloc
videobuf2_core         33372  1 uvcvideo
snd_usb_caiaq          39170  0
snd_usb_audio         139029  0
snd_usbmidi_lib        20836  1 snd_usb_audio
snd_seq_midi            5478  0
snd_rawmidi            18725  3 snd_usb_caiaq,snd_usbmidi_lib,snd_seq_midi
fnotify                27085  0
udf                    77803  0
isofs                  31658  0
sp5100_tco              5952  1
iTCO_wdt                5764  0
kcopy                  17895  0
nvidia_uvm            600346  0
nvidia_modeset        841851  0
nvidia              13054176  2 nvidia_uvm,nvidia_modeset
vfio_pci               27672  0
vfio_virqfd             2165  1 vfio_pci
vfio_iommu_type1        8294  0
vfio                   14991  2 vfio_pci,vfio_iommu_type1
ufsd                  652732  0
jnl                    27383  1 ufsd
pl2303                 11696  0
usbserial              29013  1 pl2303
qm2_i2c                 4415  0
intel_ips              11476  0
drbd                  331130  2
flashcache            142820  1
dm_tier_hro_algo       14164  1
dm_thin_pool          157020  4 target_core_mod,dm_tier_hro_algo
dm_bio_prison           4372  1 dm_thin_pool
dm_persistent_data     49941  1 dm_thin_pool
hal_netlink             4853  0
k10temp                 4830  0
igb                   162846  0
e1000e                198727  0
mpt3sas               171290  0
mpt2sas               169679  0
scsi_transport_sas     24764  2 mpt3sas,mpt2sas
raid_class              3572  2 mpt3sas,mpt2sas
usb_storage            49870  0
xhci_pci                4650  0
xhci_hcd              135128  1 xhci_pci
usblp                  12346  0
uhci_hcd               32595  0
ehci_pci                4359  0
ehci_hcd               60621  1 ehci_pci
[~] #


The nvidia_drm module (nvidia-drm.ko) is missing from the list.
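
Rather than eyeballing that long list, a quick check like this could report which of the four NVIDIA modules are absent. The module names are the ones from this thread; `missing_nvidia_modules` is a hypothetical helper name.

```shell
#!/bin/sh
# Read lsmod output on stdin and print each expected NVIDIA module
# that does not appear in the first column.
missing_nvidia_modules() {
    loaded=$(awk '{ print $1 }')
    for mod in nvidia nvidia_uvm nvidia_modeset nvidia_drm; do
        # -x matches the whole line, so "nvidia" does not match "nvidia_uvm".
        printf '%s\n' "$loaded" | grep -qx "$mod" || echo "$mod"
    done
}
```

Run as `lsmod | missing_nvidia_modules`; no output would mean all four are loaded.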

Let's see what insmod says:

Code: Select all

[~] # insmod /share/CACHEDEV1_DATA/.qpkg/NVIDIA_GPU_DRV/kernel_modules/2018032>
insmod: can't insert '/share/CACHEDEV1_DATA/.qpkg/NVIDIA_GPU_DRV/kernel_modules/20180327_4.3.4/X86_SUMMITRIDGE/nvidia-drm.ko': unknown symbol in module, or unknown parameter


Unknown symbol error...

I find this in dmesg:

Code: Select all

[17101.481534] nvidia_drm: Unknown symbol drm_atomic_clean_old_fb (err 0)
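
For anyone chasing the same kind of failure, here is a minimal sketch that pulls the module and symbol names out of "Unknown symbol" lines so they can be looked up afterwards. It assumes the dmesg line format shown above; `unknown_symbols` is a made-up helper name.

```shell
#!/bin/sh
# Read dmesg output on stdin and print "module symbol" for each
# "Unknown symbol" line, e.g. "nvidia_drm drm_atomic_clean_old_fb".
unknown_symbols() {
    sed -n 's/^.*] \([^ ]*\): Unknown symbol \([^ ]*\).*$/\1 \2/p'
}
```

Each reported symbol could then be checked against the running kernel with something like `grep -w drm_atomic_clean_old_fb /proc/kallsyms`. If nothing comes back, the kernel doesn't export that symbol, which might point to the module having been built against a different kernel or DRM version than the one running. That last part is only my guess.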



Here's some additional info on my system:

Code: Select all

Linux NASFA86A9 4.2.8 #1 SMP Fri Sep 14 01:20:07 CST 2018 x86_64 GNU/Linux


Code: Select all

Model:TS-563
Current firmware version:4.3.5.0699
Date:2018/09/14


Code: Select all

[~] # gpuhal_app -l   
1:NVIDIA Corporation:GP107 [GeForce GTX 1050 Ti]:external:7
[~] #


Code: Select all

[~] # gpuhal_app -L 1 
temperature=27
usage=2
memory=0/4038
fan=0
power=[Unknown/75.00
[~] #
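
The colon-separated line from `gpuhal_app -l` appears to line up with the fields printed in the Jupyterhub install log (device_index, vendor_name, device_name, igpu_egpu, mode_support). That mapping is my assumption from comparing the two outputs, not documented behavior. Under that assumption, a sketch to label the fields:

```shell
#!/bin/sh
# Split one `gpuhal_app -l` line into key=value pairs.
# Field names are borrowed from the install log; the mapping is assumed.
parse_gpuhal_line() {
    printf '%s\n' "$1" | awk -F: '{
        print "device_index=" $1
        print "vendor_name=" $2
        print "device_name=" $3
        print "igpu_egpu=" $4
        print "mode_support=" $5
    }'
}
```

Usage would be along the lines of `parse_gpuhal_line "$(gpuhal_app -l)"` on a box with a single GPU listed.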



So, do any QNAP engineers who hang out here have any thoughts? The other NVIDIA modules (nvidia, nvidia_uvm, and nvidia_modeset) load fine.


Re: Help a QUAI hacker out - non-standard hardware.

Post by sboger » Tue Oct 02, 2018 1:17 pm

Well now, look at this....

I went into the QuAI app, looked at the command-line code it uses, and ran:

Code: Select all

docker pull tensorflow/tensorflow:1.4.1-gpu
GPU=nvidia0 gpu-docker run -it --rm tensorflow/tensorflow:1.4.1-gpu bash


POOF! It shows up in Container Station:

Screenshot_2018-10-01_22-08-37.png


In the container:

Code: Select all

root@0665e81411d6:/data/TensorFlow-Examples/examples# python ./2_BasicModels/logistic_regression.py
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/data/train-images-idx3-ubyte.gz

Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz
2018-10-02 04:07:45.974710: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
2018-10-02 04:07:46.919212: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:878] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2018-10-02 04:07:46.919948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.392
pciBusID: 0000:01:00.0
totalMemory: 3.94GiB freeMemory: 3.89GiB
2018-10-02 04:07:46.920035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-10-02 04:07:46.920084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:732] Could not identify NUMA node of /job:localhost/replica:0/task:0/device:GPU:0, defaulting to 0.  Your kernel may not have been built with NUMA support.
Epoch: 0001 cost= 1.184197385
Epoch: 0002 cost= 0.665345256
Epoch: 0003 cost= 0.552803895
Epoch: 0004 cost= 0.498642099
Epoch: 0005 cost= 0.465526774
Epoch: 0006 cost= 0.442619401
Epoch: 0007 cost= 0.425524296
Epoch: 0008 cost= 0.412204633
Epoch: 0009 cost= 0.401410712
Epoch: 0010 cost= 0.392345204
Epoch: 0011 cost= 0.384807194
Epoch: 0012 cost= 0.378223047
Epoch: 0013 cost= 0.372398587
Epoch: 0014 cost= 0.367228177
Epoch: 0015 cost= 0.362676739
Epoch: 0016 cost= 0.358578357
Epoch: 0017 cost= 0.354859857
Epoch: 0018 cost= 0.351486227
Epoch: 0019 cost= 0.348287183
Epoch: 0020 cost= 0.345455661
Epoch: 0021 cost= 0.342704777
Epoch: 0022 cost= 0.340260157
Epoch: 0023 cost= 0.337901310
Epoch: 0024 cost= 0.335739347
Epoch: 0025 cost= 0.333669664
Optimization Finished!
Accuracy: 0.9137
root@0665e81411d6:/data/TensorFlow-Examples/examples#


It seems to be working from the command line. So the remaining issue is Container Station giving me this error when I try to pull the image from the web UI. It makes no difference whether I set the graphics card to QTS, CS, or VS...

Screenshot_2018-10-01_22-15-46.png


Re: Help a QUAI hacker out - non-standard hardware.

Post by sboger » Mon Oct 08, 2018 8:05 am

Cool. Jupyter Notebook is built into the tensorflow-gpu Docker image. I just forwarded port 8888:8888 and it comes right up.
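
For reference, one way the port forward could be expressed on the command line is sketched below. It only prints the command instead of running it. It assumes that the `gpu-docker` wrapper passes docker's standard `-p host:container` option through to `docker run`, and that the image's default command starts the notebook server on 8888; I haven't verified either beyond what is described in this thread. `jupyter_cmd` is a hypothetical helper name.

```shell
#!/bin/sh
# Print (do not run) a gpu-docker command that publishes the notebook port.
# Assumption: gpu-docker forwards -p to `docker run` unchanged.
jupyter_cmd() {
    host_port="${1:-8888}"
    echo "GPU=nvidia0 gpu-docker run -it --rm -p ${host_port}:8888 tensorflow/tensorflow:1.4.1-gpu"
}
```

Calling `jupyter_cmd 9999` would print the variant that maps the notebook to host port 9999 instead, in case 8888 is already taken on the NAS.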

Screenshot_2018-10-07_17-03-56.png

