About the "mxnet-mkl" package

The official page says the following:

MXNet offers MKL pip packages that will be much faster when running on Intel hardware.

The old official page said the following:

When using Intel Xeon CPUs for training and inference, the mxnet-mkl package is recommended.

How does it fare on a Core processor, though?
Let's find out by training on Fashion-MNIST.
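Both builds install straight from pip, with the package name selecting the variant. A minimal setup sketch (the two packages provide the same `mxnet` module, so only one should be installed at a time):

```shell
# plain CPU build
pip install mxnet

# MKL-DNN-accelerated CPU build; remove the plain build first,
# since both packages install the same `mxnet` module
pip uninstall -y mxnet
pip install mxnet-mkl
```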

  • Test environment
Windows 10 Pro (GPUなし)
Intel(R) Core(TM) i7-4600U
RAM 8.00GB
  • With the plain "mxnet" package
certifi==2019.6.16
chardet==3.0.4
graphviz==0.8.4
idna==2.6
mxnet==1.6.0b20190801
numpy==1.16.4
Pillow==6.1.0
requests==2.18.4
urllib3==1.22
start training...
1  epoch  train_acc = 0.82042    test_acc = 0.82330
2  epoch  train_acc = 0.87055    test_acc = 0.86870
3  epoch  train_acc = 0.89035    test_acc = 0.88580
4  epoch  train_acc = 0.89895    test_acc = 0.89620
5  epoch  train_acc = 0.90512    test_acc = 0.89930
6  epoch  train_acc = 0.91128    test_acc = 0.90250
7  epoch  train_acc = 0.91837    test_acc = 0.91450
8  epoch  train_acc = 0.92272    test_acc = 0.91410
9  epoch  train_acc = 0.92388    test_acc = 0.91250
10 epoch  train_acc = 0.92793    test_acc = 0.91880
11 epoch  train_acc = 0.92975    test_acc = 0.91820
12 epoch  train_acc = 0.93590    test_acc = 0.92190
13 epoch  train_acc = 0.93760    test_acc = 0.92340
14 epoch  train_acc = 0.94027    test_acc = 0.92520
15 epoch  train_acc = 0.94027    test_acc = 0.92380
16 epoch  train_acc = 0.94460    test_acc = 0.92730
17 epoch  train_acc = 0.94852    test_acc = 0.92810
18 epoch  train_acc = 0.94632    test_acc = 0.92770
19 epoch  train_acc = 0.95325    test_acc = 0.93090
20 epoch  train_acc = 0.95333    test_acc = 0.92790
elapsed_time:9282.481213569641[sec]
  • With the "mxnet-mkl" package
certifi==2019.6.16
chardet==3.0.4
graphviz==0.8.4
idna==2.6
mxnet-mkl==1.6.0b20190801
numpy==1.16.4
requests==2.18.4
urllib3==1.22
start training...
1  epoch  train_acc = 0.81878    test_acc = 0.81860
2  epoch  train_acc = 0.86795    test_acc = 0.86790
3  epoch  train_acc = 0.88388    test_acc = 0.88050
4  epoch  train_acc = 0.90225    test_acc = 0.89920
5  epoch  train_acc = 0.90517    test_acc = 0.90280
6  epoch  train_acc = 0.91527    test_acc = 0.91000
7  epoch  train_acc = 0.91190    test_acc = 0.90530
8  epoch  train_acc = 0.92408    test_acc = 0.91660
9  epoch  train_acc = 0.92755    test_acc = 0.91720
10 epoch  train_acc = 0.92767    test_acc = 0.91520
11 epoch  train_acc = 0.93137    test_acc = 0.92480
12 epoch  train_acc = 0.93088    test_acc = 0.91810
13 epoch  train_acc = 0.93788    test_acc = 0.92580
14 epoch  train_acc = 0.93667    test_acc = 0.92190
15 epoch  train_acc = 0.93228    test_acc = 0.91810
16 epoch  train_acc = 0.94577    test_acc = 0.92640
17 epoch  train_acc = 0.94562    test_acc = 0.92390
18 epoch  train_acc = 0.94523    test_acc = 0.92310
19 epoch  train_acc = 0.94813    test_acc = 0.92550
20 epoch  train_acc = 0.94443    test_acc = 0.92240
elapsed_time:2942.1362974643707[sec]

Roughly 9,000 seconds versus 3,000 seconds.
mxnet-mkl is clearly much faster.
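For reference, dividing the two measured wall-clock times (copied from the logs above) gives the speedup:

```python
# elapsed times reported above, in seconds
plain_mxnet_sec = 9282.481213569641
mxnet_mkl_sec = 2942.1362974643707

# mxnet-mkl finished the same 20 epochs roughly 3.2x faster on this CPU
speedup = plain_mxnet_sec / mxnet_mkl_sec
print(f"speedup: {speedup:.2f}x")
```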

  • Code used (Fashion-MNIST)
import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon import nn, data

import time

def transform(data, label):
    return mx.nd.transpose(data.astype('float32'), (2,0,1))/255, label

train_data = data.vision.datasets.FashionMNIST(train=True, transform=transform)
train_dataloader = data.DataLoader(train_data, batch_size=100, shuffle=True)

test_data = data.vision.datasets.FashionMNIST(train=False, transform=transform)
test_dataloader = data.DataLoader(test_data, batch_size=1000, shuffle=False)

net = nn.HybridSequential()
with net.name_scope():
    net.add(nn.Conv2D(channels=16, kernel_size=3, strides=(1, 1), padding=(1, 1), activation='relu'))
    net.add(nn.Conv2D(channels=16, kernel_size=3, strides=(1, 1), padding=(1, 1), activation='relu'))
    net.add(nn.MaxPool2D(pool_size=2, strides=2))

    net.add(nn.Conv2D(channels=32, kernel_size=3, strides=(1, 1), padding=(1, 1), activation='relu'))
    net.add(nn.Conv2D(channels=32, kernel_size=3, strides=(1, 1), padding=(2, 2), activation='relu'))
    net.add(nn.MaxPool2D(pool_size=2, strides=2))

    net.add(nn.Conv2D(channels=64, kernel_size=3, strides=(1, 1), padding=(1, 1), activation='relu'))
    net.add(nn.Conv2D(channels=64, kernel_size=3, strides=(1, 1), padding=(1, 1), activation='relu'))
    net.add(nn.MaxPool2D(pool_size=2, strides=2))

    net.add(nn.Dense(50, activation="relu"))
    net.add(nn.Dropout(0.5))
    net.add(nn.Dense(10))
    net.add(nn.Dropout(0.5))
net.initialize(mx.init.Xavier())
net.hybridize()

def evaluate_accuracy(dataloader, net):
    sample_n = 0
    acc = []
    for batch in dataloader:
        data = batch[0]
        label = batch[1]
        output = net(data)
        predictions = mx.nd.argmax(output, axis=1).astype('int32')
        sample_n += len(data)
        acc.append(mx.nd.sum(predictions==label).asscalar())
    return sum(acc) / sample_n

loss_func = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'adam')

print('start training...')
start = time.time()

epochs = 20

for epoch in range(1, epochs + 1):
    for batch in train_dataloader:
        data = batch[0]
        label = batch[1]
        # forward pass through the network
        with autograd.record():
            output = net(data)
            # compute the loss
            loss = loss_func(output, label)
        # backpropagate from the loss
        loss.backward()
        # update parameters, normalizing the gradient by the batch size
        trainer.step(data.shape[0])
    # log accuracy once per epoch
    train_acc = evaluate_accuracy(train_dataloader, net)
    test_acc = evaluate_accuracy(test_dataloader, net)
    
    print('{:<2} epoch  train_acc = {:<10,.5f} test_acc = {:<10,.5f}'.format(epoch, train_acc, test_acc))

net.save_parameters('lstm.params')
elapsed_time = time.time() - start
print ("elapsed_time:{0}".format(elapsed_time) + "[sec]")

I also ran the comparison on a PC with a GPU.

Windows 10 Pro
Intel(R) Core(TM) i7-7700K
RAM 32.0GB
NVIDIA GeForce GTX1080
CUDA9.2
cudnn7.2.1
  • With the plain "mxnet-cu92" package
1  epoch  train_acc = 0.83433    test_acc = 0.83510
2  epoch  train_acc = 0.87218    test_acc = 0.86980
3  epoch  train_acc = 0.88785    test_acc = 0.88480
4  epoch  train_acc = 0.88592    test_acc = 0.88030
5  epoch  train_acc = 0.90382    test_acc = 0.89900
6  epoch  train_acc = 0.90378    test_acc = 0.89780
7  epoch  train_acc = 0.91698    test_acc = 0.91190
8  epoch  train_acc = 0.91930    test_acc = 0.91320
9  epoch  train_acc = 0.91423    test_acc = 0.90700
10 epoch  train_acc = 0.92788    test_acc = 0.91560
11 epoch  train_acc = 0.92967    test_acc = 0.91770
12 epoch  train_acc = 0.92542    test_acc = 0.91350
13 epoch  train_acc = 0.93612    test_acc = 0.92090
14 epoch  train_acc = 0.93827    test_acc = 0.92280
15 epoch  train_acc = 0.93425    test_acc = 0.91950
16 epoch  train_acc = 0.94283    test_acc = 0.92890
17 epoch  train_acc = 0.94847    test_acc = 0.92630
18 epoch  train_acc = 0.94842    test_acc = 0.92960
19 epoch  train_acc = 0.94327    test_acc = 0.92220
20 epoch  train_acc = 0.95298    test_acc = 0.92840
elapsed_time:291.28389954566956[sec]
  • With the "mxnet-cu92mkl" package
1  epoch  train_acc = 0.84188    test_acc = 0.84590
2  epoch  train_acc = 0.87807    test_acc = 0.87660
3  epoch  train_acc = 0.89532    test_acc = 0.89180
4  epoch  train_acc = 0.90115    test_acc = 0.89940
5  epoch  train_acc = 0.91517    test_acc = 0.91030
6  epoch  train_acc = 0.91943    test_acc = 0.91200
7  epoch  train_acc = 0.92242    test_acc = 0.91200
8  epoch  train_acc = 0.92435    test_acc = 0.91620
9  epoch  train_acc = 0.93013    test_acc = 0.92240
10 epoch  train_acc = 0.93410    test_acc = 0.92330
11 epoch  train_acc = 0.93648    test_acc = 0.92310
12 epoch  train_acc = 0.93358    test_acc = 0.92000
13 epoch  train_acc = 0.94250    test_acc = 0.92730
14 epoch  train_acc = 0.93900    test_acc = 0.92180
15 epoch  train_acc = 0.94765    test_acc = 0.92750
16 epoch  train_acc = 0.94872    test_acc = 0.92900
17 epoch  train_acc = 0.94850    test_acc = 0.92520
18 epoch  train_acc = 0.95297    test_acc = 0.92730
19 epoch  train_acc = 0.95308    test_acc = 0.92620
20 epoch  train_acc = 0.94798    test_acc = 0.92350
elapsed_time:413.4318935871124[sec]

No advantage was observed for mxnet-cu92mkl.
If anything, it came out slower (about 413 seconds versus 291 seconds).
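When it is unclear whether the MKL-DNN backend is actually being exercised, MKL-DNN's verbose mode can help. A sketch for Windows cmd (`train_fashion_mnist.py` is a placeholder name for the script above):

```shell
rem enable MKL-DNN verbose logging; each operator that runs through
rem MKL-DNN then prints an "mkldnn_verbose" line at runtime
set MKLDNN_VERBOSE=1
python train_fashion_mnist.py
```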