Building a service that runs GPU models through a queue


Summary

This is an experiment log: optimizing resource release in a Flask server that runs model training on the GPU.

Running models on the CPU

When the service runs a training job, it forks a child process from the main process to execute the training code:

import multiprocessing as mp


while True:
    _param = self.q.get()
    model_prefix = _param["appid"]
    p = mp.Process(
        target=self.sync_train,
        name=model_prefix,
        args=(
            _param["appid"],
            _param["data"],
            _param["config"],
            _param["model_train"],
        ),
    )
    p.start()

When training finishes, the child process is killed to reclaim its resources:

os.kill(pid, signal.SIGKILL)  # os.kill returns None, so there is nothing to assign
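The kill-based cleanup can be reproduced in a minimal standalone form (a sketch assuming Unix with the default fork start method; train_stub is a hypothetical stand-in for the real training function, not from the service code):

```python
import multiprocessing as mp
import os
import signal
import time


def train_stub():
    time.sleep(60)  # stand-in for a long-running training job


p = mp.Process(target=train_stub)
p.start()
os.kill(p.pid, signal.SIGKILL)  # force-kill the worker to reclaim its resources
p.join()
print(p.exitcode)  # -9: a negative exit code means "terminated by that signal"
```

Because SIGKILL cannot be caught, the child gets no chance to clean up; that is acceptable here since the OS reclaims the process's memory wholesale.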

Running models on the GPU

Reusing the same approach directly raises an error:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

The official PyTorch documentation explains:

The CUDA runtime does not support the fork start method; either the spawn or forkserver start method are required to use CUDA in subprocesses.

The Python documentation on multiprocessing describes the fork, spawn, and forkserver start methods:

> Depending on the platform, multiprocessing supports three ways to start a process. These start methods are
>    spawn:
>        The parent process starts a fresh Python interpreter process. The child process will only inherit those resources necessary to run the process object's run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using fork or forkserver.
>        Available on Unix and Windows. The default on Windows.
>    fork:
>        The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.
>        Available on Unix only. The default on Unix.
>    forkserver:
>        When the program starts and selects the forkserver start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process. The fork server process is single threaded so it is safe for it to use os.fork(). No unnecessary resources are inherited.
>        Available on Unix platforms which support passing file descriptors over Unix pipes.

So on Unix, multiprocessing defaults to fork; to initialize CUDA inside a subprocess, the start method must be changed to spawn or forkserver.
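The available start methods can be inspected, and a specific one can be used per call via get_context without touching the global default (a sketch; the child function and its message are illustrative):

```python
import multiprocessing as mp


def child(q):
    q.put("started via spawn")


if __name__ == "__main__":
    print(mp.get_all_start_methods())  # on Linux: ['fork', 'spawn', 'forkserver']
    ctx = mp.get_context("spawn")  # a spawn context, without changing the global default
    q = ctx.Queue()
    p = ctx.Process(target=child, args=(q,))
    p.start()
    print(q.get())  # "started via spawn"
    p.join()
```

Note that spawn requires the `if __name__ == "__main__":` guard, because the child re-imports the main module.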

When starting the service, add a call to set the spawn start method:

from multiprocessing import set_start_method

if __name__ == "__main__":
    set_start_method("spawn")
    app.run(host="0.0.0.0", port=10001, debug=False, threaded=False)

But now the service fails on startup with:

RuntimeError: context has already been set

After some searching, the fix is to pass force=True to set_start_method:

from multiprocessing import set_start_method

if __name__ == "__main__":
    set_start_method("spawn", force=True)
    app.run(host="0.0.0.0", port=10001, debug=False, threaded=False)
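Both the error and the fix can be reproduced in isolation: the start-method context can only be set once per process, and force=True overrides it (a sketch run in a fresh interpreter):

```python
import multiprocessing as mp

mp.set_start_method("spawn")  # first call succeeds: the context is not yet set

try:
    mp.set_start_method("fork")  # second call without force=True fails
except RuntimeError as e:
    print(e)  # context has already been set

mp.set_start_method("fork", force=True)  # force=True replaces the existing context
print(mp.get_start_method())  # fork
```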

With that in place, the service errors again when it spawns a child process:

TypeError: can't pickle _thread.lock objects

According to what I found, this is a common Python multiprocessing problem: with spawn, every argument passed to the child process must be picklable, and shared objects holding thread locks are not.
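The failure is easy to reproduce: any object whose attributes include a lock cannot be serialized (TrainConfig here is a hypothetical example, not the service's actual class; the exact error message varies slightly across Python versions):

```python
import pickle
import threading


class TrainConfig:
    """Hypothetical config object that (perhaps accidentally) holds a lock."""

    def __init__(self):
        self.lock = threading.Lock()


try:
    pickle.dumps(TrainConfig())
except TypeError as e:
    print(type(e).__name__)  # TypeError: the lock cannot be serialized
```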

Some posts claimed that downgrading from Python 3.8 to 3.7 fixed the pickle problem, but I tried Python 3.6, 3.7, and 3.8 and none of them helped.

Since the arguments passed into the child process necessarily include custom objects, I had no choice but to give up on running the GPU job in a subprocess.

Switching to a child thread for model training

Start the training in a thread of the main process instead, replacing multiprocessing.Process with threading.Thread.

Given the limited GPU memory, only one model can train at a time, so p.join() is added after p.start():

import threading


while True:
    _param = self.q.get()
    model_prefix = _param["appid"]
    p = threading.Thread(
        target=self.sync_train,
        name=model_prefix,
        args=(
            _param["appid"],
            _param["data"],
            _param["config"],
            _param["model_train"],
        ),
    )
    p.start()
    p.join()

But with this approach, the earlier cleanup step of killing the process after training would kill the main process, i.e. the whole service.

  • Attempt: kill the thread instead of the process

Most references advise against killing threads:

In general, killing threads abruptly is considered a bad programming practice. Killing a thread abruptly might leave a critical resource that must be closed properly, open.
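The usual alternative is cooperative cancellation: the thread checks a flag at safe points and exits on its own, so resources are released properly. A minimal sketch (stop_event and train_loop are illustrative names, not from the service code):

```python
import threading
import time

stop_event = threading.Event()


def train_loop():
    # Check the flag at a safe point (e.g. between epochs) instead of being killed mid-step.
    while not stop_event.is_set():
        time.sleep(0.01)  # one unit of training work
    print("exited cleanly")  # cleanup (closing files, freeing buffers) would go here


t = threading.Thread(target=train_loop)
t.start()
time.sleep(0.05)
stop_event.set()  # request shutdown; the thread finishes its current step first
t.join()
```

The cost is latency: the thread only stops at the next check, which for long training steps may be a while.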

  • Attempt: find a PyTorch call that releases GPU memory directly

The official PyTorch way to clear cached GPU memory:

torch.cuda.empty_cache()

Running this command in the service's training thread after training finished, I observed:

nvidia-smi did not show the memory fully released: usage dropped from about 7000 MiB during training to about 3000 MiB.

Further digging turned up a post explaining that the cached memory had in fact been freed; nvidia-smi simply does not show it as zero.

Half convinced, I had the service run many training jobs in a row; sure enough, the memory footprint of every run stayed constant.

So the command really does free the GPU memory.


Author: Lowin Li
Copyright: Unless otherwise noted, all posts on this blog are licensed under CC BY 4.0. Please credit Lowin Li when reposting.