Implementing C++ Extensions in PyTorch

Published: 2020-10-24 19:03:39  Source: 腳本之家  Views: 328  Author: jermmyhsu  Category: Development

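Today let's talk about writing C++ extensions for PyTorch.

Before we start, we need to understand how custom modules are written in PyTorch. The most common way is to subclass torch.nn.Module in Python and assemble your own module from operators that PyTorch already provides. This is easy to implement, but it is not necessarily the most efficient. Moreover, if the functionality we want is complex enough, PyTorch's existing functions may not be able to express it at all. In that case, extending PyTorch with C, C++, or CUDA is the best choice.

Most of today's deep learning frameworks (TensorFlow, PyTorch, and so on) have backends built in C or C++, so nearly all of them expose C/C++ extension interfaces. PyTorch is built on Torch, whose low-level code is written in C, so PyTorch is naturally compatible with C and extending it in C is not hard. With the release of PyTorch 1.0, the team began looking at replacing PyTorch's low-level code with Caffe2, and so they have also been gradually refactoring ATen, the C++ tensor library that PyTorch currently uses. Overall, C++ is the direction things are heading. As for CUDA, nearly every deep learning framework has used it from the very start, so a CUDA extension interface comes as standard.

This article uses a simple example to walk through the steps of writing a C++ extension. It does not go deep into implementation details.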

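C, C++, and CUDA Extensions for PyTorch

For C extensions, see the official tutorial or the blog posts listed in the references at the end. The procedure is not difficult: you use the <TH/TH.h> and <THC/THC.h> interfaces inherited from Torch, together with the torch.utils.ffi module that PyTorch provides. Be aware that this approach stops working as PyTorch is upgraded: torch.utils.ffi was deprecated in favor of C++ extensions.

This article mainly covers the C++ extension method (CUDA may be added in the future).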

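C++ Extensions

First, the basic workflow. Writing a C++/CUDA extension for PyTorch takes the following steps:

Install the pybind11 module (via pip, conda, or similar), which handles the binding between Python and C++;
Write the custom layer in C++, including the forward pass (forward) and the backward pass (backward);
Write setup.py and use Python's setuptools to compile and load the C++ code;
Build and install, then call the C++ extension from Python.

Next, we demonstrate these steps with a simple example (z=2x+y).

Step 1

Installing pybind11 is straightforward, so we skip it. We start by writing the C++ files:

Header file test.h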

#include <torch/extension.h>
#include <vector>

// Forward pass
torch::Tensor Test_forward_cpu(const torch::Tensor& inputA,
              const torch::Tensor& inputB);
// Backward pass
std::vector<torch::Tensor> Test_backward_cpu(const torch::Tensor& gradOutput);

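Note that the <torch/extension.h> header included here is essential. It pulls in three important pieces:

pybind11, for interaction between C++ and Python;
ATen, which contains Tensor and other important functions and classes;
Some helper headers that connect ATen and pybind11.

The source file test.cpp is as follows: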

#include "test.h"

// Forward pass: computes z = 2x + y. The focus here is the C++ extension workflow, not the implementation itself.
torch::Tensor Test_forward_cpu(const torch::Tensor& x,
              const torch::Tensor& y) {
  // TORCH_CHECK replaces the deprecated AT_ASSERTM macro in newer PyTorch releases.
  TORCH_CHECK(x.sizes() == y.sizes(), "x must be the same size as y");
  torch::Tensor z = 2 * x + y;
  return z;
}

// Backward pass.
// In this example, dz/dx = 2 and dz/dy = 1.
// Why the backward interface (arguments, return value) is designed this way is explained later.
std::vector<torch::Tensor> Test_backward_cpu(const torch::Tensor& gradOutput) {
  // ones_like matches gradOutput's dtype and device, unlike torch::ones(sizes).
  torch::Tensor gradOutputX = 2 * gradOutput * torch::ones_like(gradOutput);
  torch::Tensor gradOutputY = gradOutput * torch::ones_like(gradOutput);
  return {gradOutputX, gradOutputY};
}

// pybind11 bindings
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
 m.def("forward", &Test_forward_cpu, "TEST forward");
 m.def("backward", &Test_backward_cpu, "TEST backward");
}
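
The two gradients returned by Test_backward_cpu follow from the chain rule. As a short worked derivation, write L for whatever scalar loss is eventually computed from z (L is a symbol introduced here only for illustration; it does not appear in the code), so that gradOutput plays the role of the partial derivatives of L with respect to z:

```latex
z_i = 2x_i + y_i
\quad\Longrightarrow\quad
\frac{\partial z_i}{\partial x_i} = 2, \qquad \frac{\partial z_i}{\partial y_i} = 1

\frac{\partial L}{\partial x_i}
  = \frac{\partial L}{\partial z_i}\,\frac{\partial z_i}{\partial x_i}
  = 2\,\frac{\partial L}{\partial z_i},
\qquad
\frac{\partial L}{\partial y_i}
  = \frac{\partial L}{\partial z_i}\,\frac{\partial z_i}{\partial y_i}
  = \frac{\partial L}{\partial z_i}
```

In other words, gradOutputX is 2 times gradOutput and gradOutputY is gradOutput itself, element by element.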

Step 2

Create setup.py, the build and install configuration file. The directory layout is:

└── csrc
  ├── cpu
  │  ├── test.cpp
  │  └── test.h
  └── setup.py

The contents of setup.py are:

from setuptools import setup
import os
import glob
from torch.utils.cpp_extension import BuildExtension, CppExtension

# Header (include) directory
include_dirs = os.path.dirname(os.path.abspath(__file__))
# C++ source files
source_cpu = glob.glob(os.path.join(include_dirs, 'cpu', '*.cpp'))

setup(
  name='test_cpp', # Module name, used when importing from Python
  version="0.1",
  ext_modules=[
    CppExtension('test_cpp', sources=source_cpu, include_dirs=[include_dirs]),
  ],
  cmdclass={
    'build_ext': BuildExtension
  }
)

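Note that this C++ extension is named test_cpp, which means its C++ functions can be called from Python through the test_cpp module.

Step 3

In the csrc directory (where setup.py lives), run the following command to build and install the C++ code: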

python setup.py install

You will then see a pile of build output, and the C++ module is installed into Python's site-packages.
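
As a side note, PyTorch also offers a just-in-time route through torch.utils.cpp_extension.load, which compiles the sources on first use instead of installing a package. The sketch below is not part of the original workflow; the helper name load_test_cpp and the default csrc path are my own assumptions based on the directory layout shown earlier, and it needs a working C++ toolchain to actually run.

```python
from pathlib import Path


def load_test_cpp(csrc_dir="csrc"):
    """Hypothetical helper: JIT-compile csrc/cpu/*.cpp and return the module."""
    # Imported lazily so that merely defining this helper does not require
    # PyTorch or a compiler to be present.
    from torch.utils.cpp_extension import load

    # Collect the same source files that setup.py picks up with glob.
    sources = sorted(str(p) for p in Path(csrc_dir, "cpu").glob("*.cpp"))

    # load() builds the extension and imports it as a Python module; the name
    # here becomes TORCH_EXTENSION_NAME inside the C++ code.
    return load(name="test_cpp", sources=sources, verbose=True)
```

Calling test_cpp = load_test_cpp() would then give an object with the same forward and backward functions as the installed module.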

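Once the steps above are done, the C++ code can be called from Python. By convention in PyTorch, the C++ forward and backward passes are first wrapped into a function op (the code below goes in test.py):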

from torch.autograd import Function

import test_cpp

class TestFunction(Function):

  @staticmethod
  def forward(ctx, x, y):
    return test_cpp.forward(x, y)

  @staticmethod
  def backward(ctx, gradOutput):
    gradX, gradY = test_cpp.backward(gradOutput)
    return gradX, gradY

This effectively embeds the C++ extension functions into PyTorch's own framework.

I looked at the source of the Function class and found it rather interesting:

class Function(with_metaclass(FunctionMeta, _C._FunctionBase, _ContextMethodMixin, _HookMixin)):
 
  ...

  @staticmethod
  def forward(ctx, *args, **kwargs):
    r"""Performs the operation.

    This function is to be overridden by all subclasses.

    It must accept a context ctx as the first argument, followed by any
    number of arguments (tensors or other types).

    The context can be used to store tensors that can be then retrieved
    during the backward pass.
    """
    raise NotImplementedError

  @staticmethod
  def backward(ctx, *grad_outputs):
    r"""Defines a formula for differentiating the operation.

    This function is to be overridden by all subclasses.

    It must accept a context :attr:`ctx` as the first argument, followed by
    as many outputs did :func:`forward` return, and it should return as many
    tensors, as there were inputs to :func:`forward`. Each argument is the
    gradient w.r.t the given output, and each returned value should be the
    gradient w.r.t. the corresponding input.

    The context can be used to retrieve tensors saved during the forward
    pass. It also has an attribute :attr:`ctx.needs_input_grad` as a tuple
    of booleans representing whether each input needs gradient. E.g.,
    :func:`backward` will have ``ctx.needs_input_grad[0] = True`` if the
    first input to :func:`forward` needs gradient computated w.r.t. the
    output.
    """
    raise NotImplementedError

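Pay attention to the rules for implementing backward. The interface takes two kinds of arguments: ctx is a context object that carries information from forward to backward, and grad_outputs holds the gradients arriving from the layer that consumed this op's outputs. There is exactly one such gradient per value returned by forward. This matches the chain rule, which combines all the relevant incoming gradients with the current layer's local derivatives by multiplying or summing them. In turn, backward must return a gradient for every input of forward: if forward takes n arguments, backward returns n gradients, in the same order. So in the example above, our backward receives one argument (forward produces a single output) and returns two gradients (forward takes two inputs).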
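
To make the counting rule concrete, here is a minimal sketch in pure Python (no C++ involved) of an op with two inputs and two outputs. SumAndDiff is a made-up example of mine, not something from PyTorch or from this article's extension: since forward returns two tensors, backward receives two gradients, and since forward takes two inputs, backward returns two gradients.

```python
import torch
from torch.autograd import Function


class SumAndDiff(Function):
    """Hypothetical op: s = x + y, d = x - y."""

    @staticmethod
    def forward(ctx, x, y):
        # Two outputs, so backward will receive two incoming gradients.
        return x + y, x - y

    @staticmethod
    def backward(ctx, grad_s, grad_d):
        # ds/dx = 1, dd/dx = 1  ->  grad_x = grad_s + grad_d
        # ds/dy = 1, dd/dy = -1 ->  grad_y = grad_s - grad_d
        grad_x = grad_s + grad_d
        grad_y = grad_s - grad_d
        # Two inputs to forward, so exactly two gradients are returned.
        return grad_x, grad_y


x = torch.tensor([1., 2., 3.], requires_grad=True)
y = torch.tensor([4., 5., 6.], requires_grad=True)
s, d = SumAndDiff.apply(x, y)

# Loss = sum(s) + 3 * sum(d), so grad_s is all ones and grad_d is all threes.
(s.sum() + 3 * d.sum()).backward()
print(x.grad)  # tensor([4., 4., 4.])
print(y.grad)  # tensor([-2., -2., -2.])
```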

With the Function defined, the custom op can be used inside a Module:

import torch

class Test(torch.nn.Module):

  def __init__(self):
    super(Test, self).__init__()

  def forward(self, inputA, inputB):
    return TestFunction.apply(inputA, inputB)

The directory layout now becomes:

├── csrc
│  ├── cpu
│  │  ├── test.cpp
│  │  └── test.h
│  └── setup.py
└── test.py

After that, test.py can be used like any ordinary PyTorch module.

Testing

Now let's test the forward and backward passes:

import torch

from test import Test

# Variable has been deprecated since PyTorch 0.4; tensors take requires_grad directly.
x = torch.tensor([1., 2., 3.], requires_grad=True)
y = torch.tensor([4., 5., 6.], requires_grad=True)
test = Test()
z = test(x, y)
z.sum().backward()
print('x: ', x)
print('y: ', y)
print('z: ', z)
print('x.grad: ', x.grad)
print('y.grad: ', y.grad)

The output is:

x: tensor([1., 2., 3.], requires_grad=True)
y: tensor([4., 5., 6.], requires_grad=True)
z: tensor([ 6., 9., 12.], grad_fn=<TestFunctionBackward>)
x.grad: tensor([2., 2., 2.])
y.grad: tensor([1., 1., 1.])

As you can see, the forward pass satisfies z=2x+y, and the backward pass gives the expected result too.
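
As an independent sanity check on those gradients, here is a small sketch that does not touch the extension at all: plain-Python central finite differences on the loss sum(2x + y) used in the test. The helper names central_diff and loss are my own, for illustration only.

```python
def central_diff(f, v, eps=1e-6):
    """Approximate the gradient of scalar function f at list v, one coordinate at a time."""
    grads = []
    for i in range(len(v)):
        hi = v[:i] + [v[i] + eps] + v[i + 1:]
        lo = v[:i] + [v[i] - eps] + v[i + 1:]
        grads.append((f(hi) - f(lo)) / (2 * eps))
    return grads


def loss(x, y):
    # Same computation as z.sum() in the test: the sum over z = 2x + y.
    return sum(2 * a + b for a, b in zip(x, y))


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]

grad_x = central_diff(lambda v: loss(v, y), x)
grad_y = central_diff(lambda v: loss(x, v), y)

print([round(g, 4) for g in grad_x])  # [2.0, 2.0, 2.0]
print([round(g, 4) for g in grad_y])  # [1.0, 1.0, 1.0]
```

The numerical estimates agree with x.grad and y.grad above, which is what the hand-written backward should produce.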

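CUDA Extensions

Although code written in C++ can run on the GPU directly, its performance still falls short of code written directly in CUDA. After all, ATen has no way of knowing how to optimize the performance of your particular algorithm. However, since I still know next to nothing about CUDA, this part has to be skipped for now and filled in later.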

References

CUSTOM C EXTENSIONS FOR PYTORCH
CUSTOM C++ AND CUDA EXTENSIONS
Advanced PyTorch Extensions (1): Combining PyTorch with C and CUDA
Advanced PyTorch Extensions (2): Combining PyTorch with C++ and CUDA Extensions
