PyCUDAによる並列計算

September 18, 2024

概要

とある計算にモンテカルロシミュレーションを用いており、CPUで処理していたのですが、あまりに低速だったので、GPUによる並列化を行い、処理時間を短縮化させました。
サンプルコードを載せました。お手持ちのPCにGPUが搭載されていない場合、Google Colabならオンライン上で無料でGPUが利用可能です。

GPU

Google Colabで利用可能なGPUです。

T4の性能は、1世代前のGPU RTX2080なんかよりも劣りますが、CPUで計算するよりは遥かに高速です。
画像系のAIとかでメモリが沢山必要とかじゃない限りはT4で十分だと思います。
私のPCには、RTX 2080とRTX 2080 Superが挿さっているんですが、プライベートだとkaggleやってた時は使ってました。
ただ、もったいないことに最近は全く使ってないです。
将来的に強化学習やりたいと思っているので、近々また使うと思います。　　

PyCUDA

PyCUDAは、その名の通りPythonからCUDAを使うためのラッパーです。
同じようなやつだとPyOpenCLがあります。
今回はPyCUDA使ってます。

インストール

Google Colabにないものをインストールします。

!pip install pycuda
!pip install matplotlib==3.7.1
!wget https://raw.githubusercontent.com/googlefonts/noto-cjk/main/Sans/OTF/Japanese/NotoSansCJKjp-Medium.otf

サンプル1:円周率の計算

Wikipediaにも記載がありますが、考え方としては、ランダムに大量の点を打って、円の中に含まれる点の数をカウントして円周率を近似します。
ただ、このやり方は効率が良くないので、桁数記録なんかはラマヌジャン型の公式なんかを使ってます。
GPUプログラミン特有で、ブロック数とグリッド数を決めないといけないです。
サンプルだとapproximate_pi_kernelの引数で指定してます。
詳しくは、GPUプログラミング・基礎編等をご参照ください。

ソースコード

'''
Copyright (C) 2024 Ryo M.

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <https://www.gnu.org/licenses/>.
'''

import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy as np
import time

cuda_code = """
__device__ unsigned int lcg_rand(unsigned int *state)
{
    *state = 1664525u * (*state) + 1013904223u;
    return *state;
}

__device__ float lcg_uniform(unsigned int *state)
{
    return (float)(lcg_rand(state) & 0xFFFFFF) / (float)0x1000000;
}

__global__ void approximate_pi(unsigned int *count, unsigned int n, unsigned int seed)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    unsigned int state = seed + idx;

    unsigned int local_count = 0;
    for (unsigned int i = idx; i < n; i += stride) {
        float x = lcg_uniform(&state);
        float y = lcg_uniform(&state);
        if (x*x + y*y <= 1.0f) {
            local_count++;
        }
    }

    atomicAdd(count, local_count);
}
"""

# CUDAカーネルのコンパイル
mod = SourceModule(cuda_code)
approximate_pi_kernel = mod.get_function("approximate_pi")

def gpu_approximate_pi(n_points):
    # GPUメモリの確保と初期化
    count_gpu = cuda.mem_alloc(4)  # unsigned int = 4 bytes
    cuda.memset_d32(count_gpu, 0, 1)

    # グリッドとブロックの設定
    block_size = 256
    grid_size = (n_points + block_size - 1) // block_size

    # カーネルの実行
    start_time = time.time()
    approximate_pi_kernel(
        count_gpu,
        np.uint32(n_points),
        np.uint32(np.random.randint(1000000)),
        block=(block_size, 1, 1),
        grid=(grid_size, 1)
    )

    count = np.zeros(1, dtype=np.uint32)
    cuda.memcpy_dtoh(count, count_gpu)
    pi_approx = 4.0 * count[0] / n_points

    end_time = time.time()

    return pi_approx, end_time - start_time

if __name__ == '__main__':
    n_points = 1000000

    pi_approx, execution_time = gpu_approximate_pi(n_points)

    print(f"点の数: {n_points:,}")
    print(f"近似された円周率: {pi_approx:.10f}")
    print(f"実際の円周率:     {np.pi:.10f}")
    print(f"誤差(%):               {abs(pi_approx - np.pi)/np.pi*100:.3f}")
    print(f"実行時間:         {execution_time:.4f} 秒")

計算結果

点の数: 1,000,000
近似された円周率: 3.1415760000
実際の円周率:     3.1415926536
誤差(%):               0.001
実行時間:         0.0004 秒

サンプル2: 逆正弦法則

通称、運の法則。
ざっくり説明すると、確率二分の一の二者択一の選択を繰り返すことを大人数でやったら、バカみたいにツキがある人の集団とバカみたいにツキがない人の集団に分かれるっていう法則です。
普通に考えたら、中間のプラマイゼロな結果になる人が一番多そうですが、そうはならないです。

ソースコード

'''
Copyright (C) 2024 Ryo M.

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <https://www.gnu.org/licenses/>.
'''

import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import arcsine, probplot
from scipy import stats

cuda_code = """
__device__ unsigned int lcg_rand(unsigned int *state)
{
    *state = 1664525u * (*state) + 1013904223u;
    return *state;
}

__device__ float lcg_uniform(unsigned int *state)
{
    return (float)(lcg_rand(state) & 0xFFFFFF) / (float)0x1000000;
}

__global__ void simulate_random_walk(float *results, int num_steps, int num_simulations, unsigned int seed)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < num_simulations) {
        unsigned int state = seed + idx;
        float position = 0.0f;
        int positive_time = 0;

        for (int i = 0; i < num_steps; i++) {
            if (lcg_uniform(&state) < 0.5f) {
                position += 1.0f;
            } else {
                position -= 1.0f;
            }
            if (position > 0) {
                positive_time++;
            }
        }

        results[idx] = (float)positive_time / num_steps;
    }
}
"""

mod = SourceModule(cuda_code)
simulate_random_walk_kernel = mod.get_function("simulate_random_walk")

def gpu_simulate_random_walk(num_steps, num_simulations):
    results_gpu = cuda.mem_alloc(num_simulations * 4)  # float = 4 bytes
    block_size = 256
    grid_size = (num_simulations + block_size - 1) // block_size

    simulate_random_walk_kernel(
        results_gpu,
        np.int32(num_steps),
        np.int32(num_simulations),
        np.uint32(np.random.randint(1000000)),
        block=(block_size, 1, 1),
        grid=(grid_size, 1)
    )

    results = np.zeros(num_simulations, dtype=np.float32)
    cuda.memcpy_dtoh(results, results_gpu)
    return results

# シミュレーションパラメータ
num_steps = 10000  # ランダムウォークのステップ数
num_simulations = 1000000  # シミュレーション回数

# シミュレーション実行
results = gpu_simulate_random_walk(num_steps, num_simulations)

# プロットの設定
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(12, 18))

# PDF（確率密度関数）のプロット
num_bins = 100
hist, bin_edges, _ = ax1.hist(results, bins=num_bins, density=True, alpha=0.7, label='シミュレーション結果')
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

x = np.linspace(0, 1, 1000)
theoretical_pdf = arcsine.pdf(x)
ax1.plot(x, theoretical_pdf, 'r-', lw=2, label='理論的な逆正弦分布')

ax1.set_title(f'逆正弦法則のシミュレーション - PDF (Bins: {num_bins})')
ax1.set_xlabel('正の領域に滞在した時間の割合')
ax1.set_ylabel('確率密度')
ax1.legend()

# 残差プロット
ax1_residual = ax1.twinx()
interpolated_pdf = np.interp(bin_centers, x, theoretical_pdf)
residuals = hist - interpolated_pdf
ax1_residual.plot(bin_centers, residuals, 'g-', alpha=0.5)
ax1_residual.set_ylabel('残差 (シミュレーション - 理論)', color='g')
ax1_residual.tick_params(axis='y', labelcolor='g')

# CDF（累積分布関数）のプロット
sim_cdf = np.cumsum(hist) / np.sum(hist)
ax2.plot(bin_centers, sim_cdf, label='シミュレーション結果')
ax2.plot(x, arcsine.cdf(x), 'r-', lw=2, label='理論的な逆正弦分布')

ax2.set_title('逆正弦法則のシミュレーション - CDF')
ax2.set_xlabel('正の領域に滞在した時間の割合')
ax2.set_ylabel('累積確率')
ax2.legend()

# Q-Qプロット
probplot(results, dist=arcsine, plot=ax3)
ax3.set_title('Q-Qプロット (シミュレーション vs 理論的な逆正弦分布)')

plt.tight_layout()
plt.show()

# 統計情報の表示
print(f"シミュレーション回数: {num_simulations}")
print(f"ランダムウォークのステップ数: {num_steps}")
print(f"平均値: {np.mean(results):.4f}")
print(f"標準偏差: {np.std(results):.4f}")
print(f"理論的な平均値: 0.5")
print(f"理論的な標準偏差: {1 / (2 * np.sqrt(3)):.4f}")

# Kolmogorov-Smirnov検定
ks_statistic, p_value = stats.kstest(results, 'arcsine')
print(f"\nKolmogorov-Smirnov検定:")
print(f"KS統計量: {ks_statistic:.4f}")
print(f"p値: {p_value:.4f}")

# 分位数の比較
quantiles = [0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99]
sim_quantiles = np.quantile(results, quantiles)
theo_quantiles = arcsine.ppf(quantiles)

print("\n分位数の比較 (シミュレーション vs 理論):")
for q, sq, tq in zip(quantiles, sim_quantiles, theo_quantiles):
    print(f"{q*100:3.0f}%分位数: {sq:.4f} vs {tq:.4f}, 差: {sq-tq:.4f}")

計算結果

各自、実行してみてください。

PyCUDAによる並列計算

目次

概要

GPU

PyCUDA

インストール

サンプル1:円周率の計算

サンプル2: 逆正弦法則

その他参考

Please share it if you like!