
Unable to import sparknlp after installing sparknlp

  •  5 votes
  •  Clay  ·  7 years ago

    The following runs successfully on a Cloudera CDSW cluster gateway.

    import pyspark
    from pyspark.sql import SparkSession
    spark = (SparkSession
                .builder
                .config("spark.jars.packages","JohnSnowLabs:spark-nlp:1.2.3")
                .getOrCreate()
             )
    

    It produces this output:

    Ivy Default Cache set to: /home/cdsw/.ivy2/cache
    The jars for the packages stored in: /home/cdsw/.ivy2/jars
    :: loading settings :: url = jar:file:/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
    JohnSnowLabs#spark-nlp added as a dependency
    :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found JohnSnowLabs#spark-nlp;1.2.3 in spark-packages
        found com.typesafe#config;1.3.0 in central
        found org.fusesource.leveldbjni#leveldbjni-all;1.8 in central
    downloading http://dl.bintray.com/spark-packages/maven/JohnSnowLabs/spark-nlp/1.2.3/spark-nlp-1.2.3.jar ...
        [SUCCESSFUL ] JohnSnowLabs#spark-nlp;1.2.3!spark-nlp.jar (3357ms)
    downloading https://repo1.maven.org/maven2/com/typesafe/config/1.3.0/config-1.3.0.jar ...
        [SUCCESSFUL ] com.typesafe#config;1.3.0!config.jar(bundle) (348ms)
    downloading https://repo1.maven.org/maven2/org/fusesource/leveldbjni/leveldbjni-all/1.8/leveldbjni-all-1.8.jar ...
        [SUCCESSFUL ] org.fusesource.leveldbjni#leveldbjni-all;1.8!leveldbjni-all.jar(bundle) (382ms)
    :: resolution report :: resolve 3836ms :: artifacts dl 4095ms
        :: modules in use:
        JohnSnowLabs#spark-nlp;1.2.3 from spark-packages in [default]
        com.typesafe#config;1.3.0 from central in [default]
        org.fusesource.leveldbjni#leveldbjni-all;1.8 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   3   |   3   |   3   |   0   ||   3   |   3   |
        ---------------------------------------------------------------------
    :: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        3 artifacts copied, 0 already retrieved (5740kB/37ms)
    Setting default log level to "ERROR".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    

    But when I try to import sparknlp, as described by John Snow Labs for pyspark...

    import sparknlp
    # or 
    from sparknlp.annotator import *
    

    I get:

    ImportError: No module named sparknlp
    ImportError: No module named sparknlp.annotator 
    

    What do I need to do to use sparknlp? Presumably this generalizes to any Spark package.

    3 answers  |  last activity 4 years ago
        1
  •  4 votes
  •   David Hall  ·  7 years ago

    You can use the SparkNLP package in PySpark with the following command:

    pyspark --packages JohnSnowLabs:spark-nlp:1.3.0
    

    But that doesn't tell Python where to find the bindings. Following the instructions for a similar report here, you can fix this by adding the jar to your PYTHONPATH:

    export PYTHONPATH="~/.ivy2/jars/JohnSnowLabs_spark-nlp-1.3.0.jar:$PYTHONPATH"
    

    or, from within Python:

    import sys, glob, os
    sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), ".ivy2/jars/*.jar")))
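
    Either way works because Python can import packages directly out of zip archives (and a jar is just a zip), provided the archive carries the Python sources at its top level, which this answer implies the spark-nlp jar does. A quick sanity check after extending the path (a sketch; the error handling is only illustrative):

    import sys, glob, os

    # Put every Ivy-downloaded jar on the import path; Python's zipimport
    # machinery can load modules straight out of zip archives such as jars.
    sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), ".ivy2/jars/*.jar")))

    try:
        import sparknlp
        print("sparknlp found at:", sparknlp.__file__)
    except ImportError:
        # The jar does not bundle the Python wrappers; see the answers below.
        print("sparknlp is still not importable")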
    
        2
  •  3 votes
  •   Clay  ·  7 years ago

    I figured it out. The jar files that loaded correctly contain only the compiled Scala classes. I still had to put the Python files containing the wrapper code somewhere they could be imported from. Once I did that, everything worked fine.
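
    Concretely, one way to do that (a minimal sketch; the checkout path and version tag are assumptions, not part of this answer) is to take the python/ directory from a spark-nlp source checkout matching the jar version and put it on sys.path before importing:

    import sys, os

    # Assumed location of a spark-nlp source checkout whose tag matches the
    # jar loaded via spark.jars.packages; python/ holds the wrapper code.
    sys.path.insert(0, os.path.expanduser("~/spark-nlp/python"))

    import sparknlp                   # now resolvable
    from sparknlp.annotator import *  # and so are the annotators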

        3
  •  0 votes
  •   Madhup Kumar  ·  4 years ago

    Thanks to Clay. Here is how I set my PYTHONPATH:

    git clone --branch 3.0.3 https://github.com/JohnSnowLabs/spark-nlp
    export PYTHONPATH="./spark-nlp/python:$PYTHONPATH"
    

    It then worked for me, because my ./spark-nlp/python folder now contained the elusive sparknlp module.

    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.3
    
    >>> import sparknlp
    >>>
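
    For completeness, the same setup can be driven entirely from within Python instead of the shell (a sketch, assuming the 3.0.3 clone from above sits in the current directory):

    import sys, os
    from pyspark.sql import SparkSession

    # Make the Python wrappers from the cloned repo importable.
    sys.path.insert(0, os.path.abspath("./spark-nlp/python"))
    import sparknlp

    # Pull in the matching Scala jar, same coordinates as the pyspark command above.
    spark = (SparkSession.builder
                .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.3")
                .getOrCreate())

    print(sparknlp.version())  # should print 3.0.3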