
Spark / Mesos / Tasks lost, slaves blacklisted, executors removed

  •  astro_asz · 7 years ago

    I am running spark-submit jobs with Spark 2.2.0, Scala 2.11.11, and SBT, on Mesos 1.4.2.

    I am having problems with tasks being lost and executors not registering. The symptoms are as follows:

    MesosCoarseGrainedSchedulerBackend launches tasks until spark.cores.max is reached. For example, here it launches 6 tasks:

    18/06/11 12:49:54 DEBUG MesosCoarseGrainedSchedulerBackend: Received 2 resource offers.
    18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
    18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Accepting offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585462 with attributes: Map() mem: 423417.0 cpu: 55.5 ports: List((1025,2180), (2182,3887), (3889,5049), (5052,5507), (5509,8079), (8082,8180), (8182,8792), (8794,9177), (9179,12396), (12398,16297), (16299,16839), (16841,18310), (18312,21795), (21797,22269), (22271,32000)).  Launching 2 Mesos tasks.
    18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 2 with mem: 11264.0 cpu: 20.0 ports: 
    18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 0 with mem: 11264.0 cpu: 20.0 ports: 
    18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Accepting offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585463 with attributes: Map() mem: 300665.0 cpu: 71.5 ports: List((1025,2180), (2182,2718), (2721,3887), (3889,5049), (5052,5455), (5457,8079), (8082,8180), (8182,8262), (8264,8558), (8560,8792), (8794,10231), (10233,16506), (16508,18593), (18595,32000)).  Launching 3 Mesos tasks.
    18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 4 with mem: 11264.0 cpu: 20.0 ports: 
    18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 3 with mem: 11264.0 cpu: 20.0 ports: 
    18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 1 with mem: 11264.0 cpu: 20.0 ports: 
    18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Received 2 resource offers.
    18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Accepting offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585464 with attributes: Map() mem: 423417.0 cpu: 55.5 ports: List((1025,2180), (2182,3887), (3889,5049), (5052,5507), (5509,8079), (8082,8180), (8182,8792), (8794,9177), (9179,12396), (12398,16297), (16299,16839), (16841,18310), (18312,21795), (21797,22269), (22271,32000)).  Launching 1 Mesos tasks.
    18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 5 with mem: 11264.0 cpu: 20.0 ports: 
    18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Declining offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585465 with attributes: Map() mem: 300665.0 cpu: 71.5 port: List((1025,2180), (2182,2718), (2721,3887), (3889,5049), (5052,5455), (5457,8079), (8082,8180), (8182,8262), (8264,8558), (8560,8792), (8794,10231), (10233,16506), (16508,18593), (18595,32000)) for 120 seconds  (reason: reached spark.cores.max)
    

    Then, immediately afterwards, it starts losing tasks and blacklisting slaves, even though I have set spark.blacklist.enabled=false:

    18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 2 is now TASK_LOST
    18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 0 is now TASK_LOST
    18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave a6031461-f185-424d-940e-b45fb64a2aaf-S0 due to too many failures; is Spark installed on it?
    18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 4 is now TASK_LOST
    18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 3 is now TASK_LOST
    18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave a6031461-f185-424d-940e-b45fb64a2aaf-S1 due to too many failures; is Spark installed on it?
    18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 1 is now TASK_LOST
    18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave a6031461-f185-424d-940e-b45fb64a2aaf-S1 due to too many failures; is Spark installed on it?
    

    After that, the non-existent executors are removed:

    18/06/11 12:49:56 DEBUG MesosCoarseGrainedSchedulerBackend: Received 2 resource offers.
    18/06/11 12:49:56 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 2 with reason Executor finished with state LOST
    18/06/11 12:49:56 INFO BlockManagerMaster: Removal of executor 2 requested
    18/06/11 12:49:56 DEBUG MesosCoarseGrainedSchedulerBackend: Declining offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585466 with attributes: Map() mem: 300665.0 cpu: 71.5 port: List((1025,2180), (2182,2718), (2721,3887), (3889,5049), (5052,5455), (5457,8079), (8082,8180), (8182,8262), (8264,8558), (8560,8792), (8794,10231), (10233,16506), (16508,18593), (18595,32000)) 
    18/06/11 12:49:56 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 2
    18/06/11 12:49:56 DEBUG MesosCoarseGrainedSchedulerBackend: Declining offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585467 with attributes: Map() mem: 412153.0 cpu: 35.5 port: List((1025,2180), (2182,3887), (3889,5049), (5052,5507), (5509,8079), (8082,8180), (8182,8792), (8794,9177), (9179,12396), (12398,16297), (16299,16839), (16841,18310), (18312,21795), (21797,22269), (22271,32000)) 
    18/06/11 12:49:56 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 0 with reason Executor finished with state LOST
    18/06/11 12:49:56 INFO BlockManagerMaster: Removal of executor 0 requested
    18/06/11 12:49:56 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
    18/06/11 12:49:56 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 4 with reason Executor finished with state LOST
    18/06/11 12:49:59 INFO BlockManagerMaster: Removal of executor 4 requested
    18/06/11 12:49:59 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 4
    18/06/11 12:49:59 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 3 with reason Executor finished with state LOST
    18/06/11 12:49:59 INFO BlockManagerMaster: Removal of executor 3 requested
    18/06/11 12:49:59 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 3
    18/06/11 12:49:59 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 1 with reason Executor finished with state LOST
    18/06/11 12:49:59 INFO BlockManagerMaster: Removal of executor 1 requested
    18/06/11 12:49:59 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
    18/06/11 12:49:59 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 5 is now TASK_RUNNING
    18/06/11 12:49:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
    18/06/11 12:49:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
    18/06/11 12:49:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
    18/06/11 12:49:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
    18/06/11 12:49:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
    

    Note that one task, task 5, is not lost, and executor 5 is not removed:

    18/06/11 12:49:59 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 5 is now TASK_RUNNING
    18/06/11 12:50:01 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (SlaveIp:46884) with ID 5
    18/06/11 12:50:01 INFO BlockManagerMasterEndpoint: Registering block manager SpaveIP:32840 with 5.2 GB RAM, BlockManagerId(5, SlaveIP, 32840, None)
    

    Here is my SparkSession setup:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .config("spark.executor.cores", 20)
      .config("spark.executor.memory", "10g")
      .config("spark.sql.shuffle.partitions", numPartitionsShuffle)
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.network.timeout", "1200s")
      .config("spark.blacklist.enabled", false)
      .config("spark.blacklist.maxFailedTaskPerExecutor", 100)
      .config("spark.dynamicAllocation.enabled", false)
      .getOrCreate()
    

    Here is my spark-submit script:

    spark-submit \
      --class MyMainClass \
      --master mesos://masterIP:7077 \
      --total-executor-cores 120 \
      --driver-memory 200g \
      --deploy-mode cluster \
      --name MyMainClass \
      --conf "spark.shuffle.service.enabled=false" \
      --conf "spark.dynamicAllocation.enabled=false" \
      --conf "spark.blacklist.enabled=false" \
      --conf "spark.blacklist.maxFailedTaskPerExecutor=100" \
      --verbose \
      myJar-assembly-0.1.0-SNAPSHOT.jar
    

    Notes:

    • I noticed that if I let the cluster rest for a while before running the job, it works fine. But if I try to run jobs in quick succession, or start a job right after killing the previous one, the problem described above appears.
    • There are enough resources on my cluster to run these tasks.
    • I duplicate the settings between the SparkSession and spark-submit because it is not always clear which of config vs --conf takes effect (one way to verify the resolved values is sketched right after these notes).
    • It is important that I run in non-dynamic allocation mode.
    • The executors that get lost are
    • I compared the debug logs with logs from an older cluster installation based on Spark 2.0.1, which is being decommissioned but is still active. There, exactly the same code launches tasks that immediately reach TASK_RUNNING status.
    • My Google and StackOverflow searches have not turned up anything useful.
    • Setting spark.blacklist.maxFailedTaskPerExecutor and spark.blacklist.enabled does not seem to help.
    • Related unanswered question: Spark on Mesos (DC/OS) loses tasks before doing anything
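
    One way to verify the resolved values mentioned in the notes above is to read the configuration back from the running session. This is a minimal sketch using the standard SparkSession / SparkContext APIs; the key prefixes filtered here are just the ones from my setup.

    // Print the configuration exactly as the driver resolved it, so the
    // precedence of .config(...) vs --conf can be checked directly.
    spark.sparkContext.getConf.getAll
      .filter { case (k, _) =>
        k.startsWith("spark.blacklist") || k.startsWith("spark.dynamicAllocation")
      }
      .foreach { case (k, v) => println(s"$k = $v") }

    // Individual keys can also be read through the runtime config:
    println(spark.conf.get("spark.blacklist.enabled"))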

    I am completely lost as to what is going on.

    Questions:

    1. Do you need more information to help me diagnose this?
    2. Why does the job lose most of its tasks right at the start? I have looked at the Task Reasons, but none of them seem to explain it.
    3. Why does it say "Asked to remove non-existent executor"?
    4. Which direction should I be looking in?
    5. Could it be related to the previous job being killed and not waiting long enough before starting the next one?
    1 Answer

  •   astro_asz · 7 years ago

    I am answering my own question:

    We found that our problem was twofold.

    1. Some unconfirmed communication/connectivity problem between the Mesos master and the workers caused the loss of the Mesos tasks (executors). There was nothing in the logs explaining this problem.
    2. Every time at least two Mesos tasks were lost on a worker, that worker got blacklisted. In Spark 2.2 the limit of 2 is hardcoded and cannot be changed (a toy model of this behaviour is sketched after this list). For details see: Blacklist is always active for MesosCoarseGrainedSchedulerBackend
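
    For what it's worth, the blacklisting in point 2 is done by the Mesos scheduler backend itself and is independent of the spark.blacklist.* settings. The snippet below is only a toy model of that behaviour (not the Spark source): each slave is blacklisted after a hardcoded number of failed Mesos tasks, which in Spark 2.2 is 2.

    import scala.collection.mutable

    object BlacklistModel {
      // Hardcoded in Spark 2.2's MesosCoarseGrainedSchedulerBackend;
      // not configurable through spark.blacklist.* settings.
      val MaxSlaveFailures = 2

      private val failuresBySlave = mutable.Map.empty[String, Int].withDefaultValue(0)

      // Models what happens when a status update reports a failed state
      // such as TASK_LOST for a task that ran on the given slave.
      def taskFailed(slaveId: String): Unit = {
        failuresBySlave(slaveId) += 1
        if (failuresBySlave(slaveId) >= MaxSlaveFailures) {
          println(s"Blacklisting Mesos slave $slaveId due to too many failures; " +
            "is Spark installed on it?")
        }
      }

      def main(args: Array[String]): Unit = {
        // Two TASK_LOST updates on the same worker are enough to blacklist it:
        taskFailed("a6031461-f185-424d-940e-b45fb64a2aaf-S1")
        taskFailed("a6031461-f185-424d-940e-b45fb64a2aaf-S1")
      }
    }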

    As a result:

    • Sometimes the communication problem did not occur, and the job ran to completion normally.
    • Most of the time, all executors were lost right at the start of the job. With two workers in our cluster, we could only run 3 executors at a time. At the start of a job, all executors would be lost (2 on worker 1 and 1 on worker 2), but only worker 1 would get blacklisted; the lost executors would be restarted on worker 2 and would then continue to run normally.

    Solution:

    I am not sure whether this is a general solution to the problem, but, somewhat blindly, we went looking for configurations that control the various Mesos timeout mechanisms, and we found this bug in Mesos 1.4:

    Using a failoverTimeout of 0 with Mesos native scheduler client can result in infinite subscribe loop

    As a test we set the SparkSession config spark.mesos.driver.failoverTimeout=1.0. This seems to have solved our problem. We no longer lose our executors at the start of a job.
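
    For completeness, this is roughly how the workaround looks when added to the SparkSession setup from the question (a sketch of our change, not a general fix; the timeout value is in seconds):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      // Workaround for the Mesos 1.4 failoverTimeout = 0 bug referenced above:
      // use a small non-zero driver failover timeout instead of the default 0.
      .config("spark.mesos.driver.failoverTimeout", "1.0")
      .config("spark.executor.cores", 20)
      .config("spark.executor.memory", "10g")
      .config("spark.blacklist.enabled", false)
      .config("spark.dynamicAllocation.enabled", false)
      .getOrCreate()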