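We have an Ignite cluster with 10 nodes (each node has 12 CPUs and 62 GB of memory). Occasionally one of the Ignite nodes goes down. When we check the Ignite logs, we see no errors, warnings, or any other explanation. Shortly after the first node drops out, a second node drops out as well, and we start getting partition loss errors. While investigating, we found in the server logs that the first node to leave the cluster had received a SIGSEGV.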
Apr 19 07:12:25 tr-ignite-6 service.sh[478792]: # SIGSEGV (0xb) at pc=0x00007f97110526d0, pid=478876, tid=0x00007f9675dc6700
Apr 19 08:02:22 tr-ignite-9 service.sh[494185]: # SIGSEGV (0xb) at pc=0x00007fbe71052854, pid=494269, tid=0x00007fbdd9f0e700
Apr 19 08:02:35 tr-ignite-12 service.sh[387860]: # SIGSEGV (0xb) at pc=0x00007f2501052840, pid=387944, tid=0x00007f2475e77700
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: # A fatal error has been detected by the Java Runtime Environment:
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: #
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: # SIGSEGV (0xb) at pc=0x00007efccc77e6ee, pid=1607467, tid=0x00007efc4c43b700
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: #
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: # JRE version: Java(TM) SE Runtime Environment (8.0_281-b09) (build 1.8.0_281-b09)
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.281-b09 mixed mode linux-amd64 compressed oops)
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: # Problematic frame:
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: # v ~StubRoutines::jbyte_disjoint_arraycopy
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: #
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: # Core dump written. Default location: /var/lib/apache-ignite/core or core.1607467
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: #
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: # An error report file with more information is saved as:
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: # /var/lib/apache-ignite/hs_err_pid1607467.log
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: Compiled method (c2) 1403659804 15609 4 org.apache.ignite.internal.binary.BinaryWriterExImpl::write (13 bytes)
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: total in heap [0x00007efccf1c29d0,0x00007efccf1c3138] = 1896
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: relocation [0x00007efccf1c2af8,0x00007efccf1c2b40] = 72
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: main code [0x00007efccf1c2b40,0x00007efccf1c2e20] = 736
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: stub code [0x00007efccf1c2e20,0x00007efccf1c2e58] = 56
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: oops [0x00007efccf1c2e58,0x00007efccf1c2e60] = 8
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: metadata [0x00007efccf1c2e60,0x00007efccf1c2ea8] = 72
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: scopes data [0x00007efccf1c2ea8,0x00007efccf1c3038] = 400
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: scopes pcs [0x00007efccf1c3038,0x00007efccf1c30d8] = 160
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: dependencies [0x00007efccf1c30d8,0x00007efccf1c30e8] = 16
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: handler table [0x00007efccf1c30e8,0x00007efccf1c3118] = 48
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: nul chk table [0x00007efccf1c3118,0x00007efccf1c3138] = 32
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: #
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: # If you would like to submit a bug report, please visit:
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: # http://bugreport.java.com/bugreport/crash.jsp
Mar 24 08:28:02 tr-ignite-2 service.sh[1607384]: #
We have looked into this but could not find a cause. Why are we getting a SIGSEGV?
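To line up the crashes across nodes, it may help to pull the timestamp, host, and faulting address out of each SIGSEGV line. The following is only a minimal sketch, assuming the journal lines have the same shape as the excerpts above; the names `Crash` and `parse_crash` are hypothetical helpers made up for illustration and are not part of Ignite or the JVM.

```python
import re
from typing import NamedTuple, Optional


class Crash(NamedTuple):
    """One SIGSEGV event taken from a journal line (hypothetical helper type)."""
    timestamp: str
    host: str
    pc: str
    pid: int
    tid: str


# Assumes lines shaped like:
# "Apr 19 07:12:25 tr-ignite-6 service.sh[478792]: # SIGSEGV (0xb) at pc=..., pid=..., tid=..."
_SIGSEGV = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\s.*?"
    r"SIGSEGV \(0xb\) at pc=(?P<pc>0x[0-9a-f]+), "
    r"pid=(?P<pid>\d+), tid=(?P<tid>0x[0-9a-f]+)"
)


def parse_crash(line: str) -> Optional[Crash]:
    """Return a Crash for a SIGSEGV line, or None for any other line."""
    m = _SIGSEGV.search(line)
    if m is None:
        return None
    return Crash(m["ts"], m["host"], m["pc"], int(m["pid"]), m["tid"])


if __name__ == "__main__":
    sample = (
        "Apr 19 07:12:25 tr-ignite-6 service.sh[478792]: "
        "# SIGSEGV (0xb) at pc=0x00007f97110526d0, pid=478876, "
        "tid=0x00007f9675dc6700"
    )
    print(parse_crash(sample))
```

Feeding every journal line through `parse_crash` and keeping the non-`None` results gives one record per crash, which makes it easier to see which node failed first and how closely the later crashes followed.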