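The server process with the RPC service enabled is again crashing at random, irregular intervals because of the zmq module. Any ideas for troubleshooting?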

人称仲哥

Member

Joined: January 18, 2019

Posts: 24

Reputation: 2

March 30, 2020 06:38:27 UTC (last edited by 人称仲哥: March 30, 2020 15:11:13 UTC)

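I ran into this problem before, and then it did not show up for perhaps three or four months. Details are in my earlier post:
https://www.vnpy.com/forum/topic/1732-kai-qi-rpcfu-wu-de-fu-wu-duan-cheng-xu-bu-ding-shi-beng-kui-fu-windowsshi-jian-ri-zhi

Recently it has happened another three or four times. The error messages are much the same as before, and all of them point to the zmq module.
I searched Google and browsed the pyzmq GitHub repository, but found nothing useful.

The main difficulty is that I cannot reproduce the problem. Are there any troubleshooting approaches I could borrow? Or is there a way to make the problem reproduce?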

===========

I happened to come across this passage in the official documentation:
In ØMQ, Contexts are threadsafe objects, but Sockets are not. It is safe to use a single Context (e.g. via zmq.Context.instance()) in your entire multithreaded application, but you should create sockets on a per-thread basis. If you share sockets across threads, you are likely to encounter uncatchable c-level crashes of your application unless you use judicious application of threading.Lock, but this approach is not recommended.

Could the crashes be related to sharing a socket across multiple threads?

=============
Is there any way to print the error information from libzmq, the library underneath pyzmq?
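
One standard-library option that may help with this question is `faulthandler`. It does not expose libzmq's internal error state, but on a fatal error such as an access violation it can dump the Python traceback of every thread, which shows which thread was inside a zmq call at the time. The sketch below is only an illustration: the helper name and the log file name `crash_traceback.log` are assumptions, and some crash types (heap corruption in particular) may terminate the process before any handler runs.

```python
import faulthandler


def enable_crash_traceback(log_path="crash_traceback.log"):
    """Dump the Python traceback of all threads on a fatal error.

    The file object is returned because it must stay open for as long
    as faulthandler is enabled.
    """
    log_file = open(log_path, "a")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `enable_crash_traceback()` once at the very start of the server program, before the RPC server is started, would be the natural place for it.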

用Python的交易员

Administrator

Joined: December 10, 2018

Posts: 4583

Reputation: 337

March 31, 2020 01:57:51 UTC

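Which version of vn.py are you using? The rpc module has been upgraded several times since then and a call lock was added, which should solve the problem to some extent.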

人称仲哥

Member

Joined: January 18, 2019

Posts: 24

Reputation: 2

March 31, 2020 02:01:15 UTC (last edited by 人称仲哥: March 31, 2020 11:04:10 UTC)

用Python的交易员 wrote:

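Which version of vn.py are you using? The rpc module has been upgraded several times since then and a call lock was added, which should solve the problem to some extent.

I am still on the fairly old 2.0.7. The project is already in paper-trading stability testing, so I have not upgraded to the latest version. I will go look at the relevant commits on GitHub.

I just checked the repository. In 2.0.7 the client already has a thread lock; 2.0.8, 2.0.9 and 2.1.0 bring no major changes; 2.1.1 adds encrypted calls over the Internet.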

==============

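RpcServer publishes its heartbeat from the server's own threading.Thread(target=self.run), while RpcService pushes trading events (orders, trades, positions, etc.) to clients from EventEngine's Thread(target=self._run). Does this usage amount to what the documentation calls "share sockets across threads"?
If so, does RpcServer's publish need a thread lock as well? And would it help to obtain the context via zmq.Context.instance(), as the official documentation does?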
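
As a rough illustration of what a lock around publish could look like, here is a minimal sketch. It is not the vn.py implementation: `LockedPublisher`, `OverlapDetectingSocket` and `hammer` are hypothetical names, and the stand-in socket only mimics the `send_pyobj` method that a pyzmq socket provides, so the example runs without pyzmq.

```python
import threading
import time


class LockedPublisher:
    """Serialize every send on one shared socket with a single lock."""

    def __init__(self, socket):
        # `socket` is any object with send_pyobj(); in a real server it
        # would be the PUB socket (assumption for this sketch).
        self._socket = socket
        self._lock = threading.Lock()

    def publish(self, topic, data):
        with self._lock:
            self._socket.send_pyobj([topic, data])


class OverlapDetectingSocket:
    """Stand-in socket that counts send calls overlapping in time."""

    def __init__(self):
        self.sent = []
        self.overlaps = 0
        self._in_send = False

    def send_pyobj(self, obj):
        if self._in_send:
            self.overlaps += 1
        self._in_send = True
        time.sleep(0)  # give other threads a chance to run mid-send
        self.sent.append(obj)
        self._in_send = False


def hammer(publisher, threads=2, messages=500):
    """Call publish() from several threads at once."""
    def worker(name):
        for i in range(messages):
            publisher.publish(name, i)

    pool = [threading.Thread(target=worker, args=(f"t{n}",)) for n in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
```

With the lock in place, no two `send_pyobj` calls can overlap, whichever thread they come from. Note that the documentation quoted earlier describes the lock approach as possible but not recommended.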

人称仲哥

Member

Joined: January 18, 2019

Posts: 24

Reputation: 2

March 31, 2020 14:05:01 UTC (last edited by 人称仲哥: April 1, 2020 01:12:32 UTC)

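I read some Stack Overflow posts and the official zmq documentation. My understanding is limited and only partial, so I am quoting the relevant passages here for discussion.

What is known: a zmq socket must not be shared between threads, since that can cause random problems, and using locks or mutexes does not seem to be a good idea either.
It is not yet clear whether the random crashes are caused by threads sharing a socket, but the current RpcServer code does appear to share one.

Quotes follow (each paragraph is excerpted from a different article):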

You MUST NOT share ØMQ sockets between threads. ØMQ sockets are not threadsafe. Technically it's possible to do this, but it demands semaphores, locks, or mutexes. This will make your application slow and fragile. The only place where it's remotely sane to share sockets between threads are in language bindings that need to do magic like garbage collection on sockets.

If you need to start more than one proxy in an application, for example, you will want to run each in their own thread. It is easy to make the error of creating the proxy frontend and backend sockets in one thread, and then passing the sockets to the proxy in another thread. This may appear to work at first but will fail randomly in real use. Remember: Do not use or close sockets except in the thread that created them.

You can create lots of 0MQ sockets, certainly as many as you have threads. If you create a socket in one thread, and use it in another, you must execute a full memory barrier between the two operations. Anything else will result in weird random failures in libzmq, as socket objects are not threadsafe.

If you’re sharing sockets across threads, don’t. It will lead to random weirdness, and crashes.

===================================================
Official advice on using zmq with multiple threads:
http://zguide.zeromq.org/page:all#Multithreading-with-ZeroMQ
Sample code:
http://zguide.zeromq.org/py:mtserver

===================================================
An answer on Stack Overflow
Source: https://stackoverflow.com/questions/5841896/0mq-how-to-use-zeromq-in-a-threadsafe-manner

There are a few conventional patterns, though I don't know how these map specifically to .NET:

1. Create sockets in the threads that use them, period. Share contexts between threads that are tightly bound into one process, and create separate contexts in threads that are not tightly bound. In the high-level C API (czmq) these are called attached and detached threads.

2. Create a socket in a parent thread and pass at thread creation time to an attached thread. The thread creation call will execute a full memory barrier. From then on, use the socket only in the child thread. "use" means recv, send, setsockopt, getsockopt, and close.

3. Create a socket in one thread, and use in another, executing your own full memory barrier between each use. This is extremely delicate and if you don't know what a "full memory barrier" is, you should not be doing this.

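I am no expert in this area and only half understand some of it, so based on the three patterns above I would like to put forward the following ideas and ask whether they are feasible.

Idea 1:
I do not really understand this one.

Idea 2:
The socket(zmq.REP) in the current RpcServer seems to fit the second pattern above: the socket is created in the parent thread, and the parent thread then creates the child thread (which provides a memory barrier). The REP socket is used only inside that child thread. The socket(zmq.PUB) does not fit, however, because it is used both in RpcServer's child thread and in EventEngine's child thread. Could all use of the PUB socket be moved into RpcServer's child thread, or all of it into EventEngine's child thread?

Idea 3:
Building a full memory barrier myself seems rather complicated.

Other ideas, by inference:
1. Could the publish function in RpcServer create a new socket each time, for example with context.socket(zmq.PUB) as socket:? That would create and destroy sockets frequently; would it hurt performance much?
2. With a global context from zmq.Context.instance(), could the thread that publishes heartbeats use one fixed PUB socket, while trading events are published through a different PUB socket?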
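
The second inferred idea, one socket per publishing thread, can be sketched with `threading.local`. This is a minimal sketch under assumptions: `PerThreadSockets` is a hypothetical helper, and `factory` stands for whatever creates and configures a socket (with pyzmq, something that calls `context.socket(zmq.PUB)`). One caveat to keep in mind: two PUB sockets cannot bind the same TCP endpoint, so in a real server each socket would need its own endpoint or would have to feed a single forwarding socket.

```python
import threading


class PerThreadSockets:
    """Hand each thread its own socket, created lazily on first use."""

    def __init__(self, factory):
        # `factory` is a zero-argument callable returning a new socket
        # (hypothetical; supplied by the caller).
        self._factory = factory
        self._local = threading.local()

    def get(self):
        sock = getattr(self._local, "socket", None)
        if sock is None:
            # First call in this thread: the socket is created in the
            # same thread that will go on to use it.
            sock = self._factory()
            self._local.socket = sock
        return sock
```

Each thread that calls `get()` always receives the same socket it received before, and never a socket that belongs to another thread.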

人称仲哥

Member

Joined: January 18, 2019

Posts: 24

Reputation: 2

March 31, 2020 14:53:18 UTC (last edited by 人称仲哥: March 31, 2020 14:54:05 UTC)

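I had suspected that a bug in my own application-level code might be causing the crashes.
As of tonight, though, I have confirmed that my own code is not involved.

I ran two programs on the same machine: the server ran only the ctp gateway and RpcService, and the client ran only RpcGateway. The client therefore ran none of my own code at all, yet the server still crashed, so the crash should be unrelated to my code.

Connecting with the standard RpcGateway still triggers the problem, but this time the Windows Event Viewer shows a different error.

Three kinds of errors have appeared so far. The Event Viewer reports are as follows: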

ntdll.dll

Faulting application name: python.exe, version: 3.7.1150.1013, time stamp: 0x5bcb42b2
Faulting module name: ntdll.dll, version: 10.0.18362.719, time stamp: 0x64d10ee0
Exception code: 0xc0000374
Fault offset: 0x00000000000f92a9
Faulting process id: 0xc4
Faulting application start time: 0x01d60752783b355d
Faulting application path: C:\vnstudio\python.exe
Faulting module path: C:\WINDOWS\SYSTEM32\ntdll.dll
Report Id: 46f0b9c1-957f-48cd-910c-5b54bdc31c40

libzmq.cp37-win_amd64.pyd

Faulting application name: python.exe, version: 3.7.1150.1013, time stamp: 0x5bcb42b2
Faulting module name: libzmq.cp37-win_amd64.pyd, version: 0.0.0.0, time stamp: 0x5d14cb87
Exception code: 0xc0000005
Fault offset: 0x000000000002e0a6
Faulting process id: 0x19c0
Faulting application start time: 0x01d5fcbe0d367e6f
Faulting application path: C:\vnstudio\python.exe
Faulting module path: C:\vnstudio\lib\site-packages\zmq\libzmq.cp37-win_amd64.pyd
Report Id: d9bce774-a463-44bb-89a1-c46420a8d8b8

VCRUNTIME140.dll

Faulting application name: python.exe, version: 3.7.1150.1013, time stamp: 0x5bcb42b2
Faulting module name: VCRUNTIME140.dll, version: 14.12.25810.0, time stamp: 0x59dd7da2
Exception code: 0xc0000005
Fault offset: 0x000000000000caa7
Faulting process id: 0x3534
Faulting application start time: 0x01d5fcbdebf10185
Faulting application path: C:\vnstudio\python.exe
Faulting module path: C:\vnstudio\VCRUNTIME140.dll
Report Id: b2560516-da32-43c2-8746-c0baa2361663
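
For reference, the two exception codes in these reports are standard Windows NTSTATUS values: 0xc0000005 is an access violation and 0xc0000374 is heap corruption. Both are the kind of failure that memory being modified concurrently could plausibly produce, although the codes alone do not prove that cause. A small lookup sketch (the helper name is hypothetical, and the table covers only the two codes seen here):

```python
# NTSTATUS values that appear in the Event Viewer reports above.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000374: "STATUS_HEAP_CORRUPTION",
}


def describe_exception_code(code_text):
    """Map an 'Exception code' string such as '0xc0000374' to its name."""
    code = int(code_text, 16)
    return NTSTATUS_NAMES.get(code, "unknown (not in this small table)")
```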

人称仲哥

Member

Joined: January 18, 2019

Posts: 24

Reputation: 2

April 1, 2020 01:44:14 UTC (last edited by 人称仲哥: April 1, 2020 01:58:41 UTC)

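One more question: is there a bug in the keep_alive logic?

As I understand it, KEEP_ALIVE_INTERVAL controls the interval between heartbeats, and defaults to 1 second.
The code below is then supposed to publish once per second and do nothing if less than 1 second has passed since the last heartbeat.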

[Image: screenshot of the code, not preserved]

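However, the variable start is fixed once the loop begins, and only cur keeps changing. Because start never changes, once more than 1 second has elapsed the condition delta >= KEEP_ALIVE_INTERVAL is always True, so after the first second the if no longer has any effect. The print output in cmd shows delta growing continuously, and publish is not running once per second.

My guess: after calling publish, start should be reset to the time of that call, i.e. add start = datetime.utcnow() after the line self.publish(KEEP_ALIVE_TOPIC, cur), so that the comparison in the next iteration actually checks the interval.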
[Image: screenshot, not preserved]
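
The effect of the proposed reset can be checked with a small simulation. This is a sketch, not the vn.py code: `heartbeat_times` is a hypothetical function that replays the comparison described above over a list of injected timestamps, and it resets with `start = cur`, which has the same intent as the proposed `start = datetime.utcnow()`.

```python
from datetime import datetime, timedelta

KEEP_ALIVE_INTERVAL = timedelta(seconds=1)


def heartbeat_times(timestamps, reset_start):
    """Return the loop timestamps at which a heartbeat would be published.

    timestamps[0] plays the role of `start`; each later entry is the
    value of `cur` in one loop iteration.
    """
    start = timestamps[0]
    published = []
    for cur in timestamps[1:]:
        delta = cur - start
        if delta >= KEEP_ALIVE_INTERVAL:
            published.append(cur)
            if reset_start:
                start = cur  # proposed fix: measure from the last heartbeat
    return published


# Loop iterations every 250 ms over 3 seconds.
base = datetime(2020, 4, 1)
ticks = [base + timedelta(milliseconds=250 * i) for i in range(13)]
without_reset = heartbeat_times(ticks, reset_start=False)
with_reset = heartbeat_times(ticks, reset_start=True)
```

Without the reset, a heartbeat goes out on every iteration after the first second (9 times in this run); with the reset it goes out once per second (3 times).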

Also, the above confirms that two threads really are using the PUB socket.

人称仲哥

Member

Joined: January 18, 2019

Posts: 24

Reputation: 2

April 3, 2020 06:56:00 UTC

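A possible fix:

Since the suspicion was that the problem comes from multiple threads sharing a zmq socket, I set out to increase the load on the zmq publish socket. First I raised the keep_alive send rate, sending 100,000 messages in one go, and then I subscribed to a large number of futures contracts. This increases the workload of both threads that call the PUB socket, which in theory raises the probability of a failure.
With this approach I was able to reproduce the crash. With no subscriptions or only a few contracts it may take quite a while to crash; with many contracts subscribed, the time to crash ranged from 2 to 20 minutes and generally did not exceed 20 minutes.
I then ran two experiments. In one, all publish calls were moved into the event engine's event-processing thread, so that only one thread uses the PUB socket. In the other, a thread lock was added around publish. With either change, and everything else unchanged, the Rpc service ran all day without crashing.

I call this a possible fix because it is only a preliminary conclusion that the problem lies here; longer-term testing is still needed.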
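
The first experiment, confining every send to one thread, can also be arranged with a queue instead of moving code into the event engine. The following is a minimal sketch of that variant, not what was actually tested in this thread: `QueuedPublisher` and `ThreadRecordingSocket` are hypothetical names, and the stand-in socket only mimics pyzmq's `send_pyobj` so the example runs without pyzmq.

```python
import queue
import threading


class QueuedPublisher:
    """Let any thread request a publish; only one worker thread sends."""

    def __init__(self, socket):
        self._socket = socket  # touched only by the worker after start()
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def publish(self, topic, data):
        # Safe from any thread: this only enqueues, it never sends.
        self._queue.put((topic, data))

    def stop(self):
        self._queue.put(None)  # sentinel: drain the queue, then exit
        self._thread.join()

    def _run(self):
        while True:
            item = self._queue.get()
            if item is None:
                break
            topic, data = item
            self._socket.send_pyobj([topic, data])


class ThreadRecordingSocket:
    """Stand-in socket that records which threads called send_pyobj."""

    def __init__(self):
        self.sent = []
        self.sender_threads = set()

    def send_pyobj(self, obj):
        self.sender_threads.add(threading.get_ident())
        self.sent.append(obj)
```

Both the heartbeat loop and the event handlers would call `publish()`, while the socket itself is only ever used by the worker thread, which matches the "create in the parent, use only in the child" pattern quoted earlier.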

用Python的交易员

Administrator

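Joined: December 10, 2018

Posts: 4583

Reputation: 337

April 5, 2020 01:59:14 UTC

Thank you very much. We will look into the exact cause on our side as well; it may indeed still be a locking problem...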

用Python的交易员

Administrator

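Joined: December 10, 2018

Posts: 4583

Reputation: 337

April 10, 2020 03:28:30 UTC

A thread lock has been added to the publish function on the DEV branch, to prevent further crashes caused by this kind of resource conflict.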
