How to index a PDF in Elasticsearch 6.1 with the ingest attachment plugin and the JavaScript client?

  •  3
  • aeronesto  · Tech Community  · 7 years ago

    I tried to follow the instructions in the answer to this question:

    How to index a pdf file in Elasticsearch 5.0.0 with ingest-attachment plugin?

    I couldn't find many examples for the Elasticsearch JavaScript client, so this is what I have so far:

    Create the index:

    // elasticsearch Client
    var elasticsearch = require('elasticsearch');
    var client = new elasticsearch.Client({hosts: [ 'http://localhost:9200/']});
    
    // Create index (client.create indexes one document, which implicitly creates the 'pdfs' index)
    client.create({index: 'pdfs', type: 'pdf', id: 'my-index-id', 
        body: {description: 'Test pdf indexing'}
    })
    .then(function () {console.log("Index created");})
    .catch(function (error) {console.log(error);});
    

    Define the index mapping:

    var body = {
        pdf:{
            properties:{
                title : {"type" : "keyword", "index" : "false"},
                type  : {"type" : "keyword", "index" : "false"},
                "attachment.pdf" : {"type" : "keyword"}
            }
        }
    }
    
    client.indices.putMapping({index:"pdfs", type:"pdf", body:body})
    .then((response) => {addPipeline()})
    .catch((error) => {console.log("putMapping error: " + error)})
    

    Define the ingest pipeline in the cluster with the put pipeline API:

    function addPipeline(){
      client.ingest.putPipeline({
        id: 'my-pipeline-id',
        body: {
          "description" : "parse pdfs and index into ES",
          "processors" : [
            { "attachment" : { "field" : "pdf", "indexed_chars" : -1 } },
            { "remove" : { "field" : "pdf" } }
          ]
        }
      })
      .then(function () {
         console.log("putPipeline Resolved");
       })
      .catch(function (error) {
         console.log("putPipeline error: " + error);
       });
    };
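
    One way to check a pipeline like this before indexing real files is the simulate API. The sketch below assumes the legacy `elasticsearch` npm client used above, which exposes `client.ingest.simulate`; the helper names `buildSimulateRequest` and `simulatePipeline` and the sample text are made up for illustration.

```javascript
// Build the parameters for client.ingest.simulate (hypothetical helper).
// The attachment processor expects the "pdf" field to hold Base64 data.
function buildSimulateRequest(pipelineId, fileBuffer) {
  return {
    id: pipelineId,
    body: {
      docs: [
        { _source: { pdf: fileBuffer.toString('base64') } }
      ]
    }
  };
}

// Usage sketch: needs a running node and the `client` created earlier.
// A short text buffer stands in for a file here; a real PDF would be
// read with fs.readFileSync and passed in the same way.
function simulatePipeline(client) {
  var request = buildSimulateRequest('my-pipeline-id', Buffer.from('Hello attachment'));
  return client.ingest.simulate(request)
    .then(function (response) { console.log(JSON.stringify(response, null, 2)); })
    .catch(function (error) { console.log('simulate error: ' + error); });
}
```

    If the pipeline is set up as intended, the simulated document should come back with an `attachment` object and without the original `pdf` field.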
    

    Before trying to upload a PDF, I checked that the index had been created:

    curl -XGET 'localhost:9200/_cat/indices?v&pretty'
    

    Result:

    health status index   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
    yellow open   .kibana EaUbEQCETVKQbYThrhPGaA   1   1          1            0      3.6kb          3.6kb
    yellow open   pdfs    Z2SR-ApFR9SYsvY08tgSZw   5   1          1            0      4.6kb          4.6kb
    

    When I try to index the PDF with the following command, I get an error:

    curl -H 'Content-Type: application/pdf' -XPUT 'localhost:9200/my_index/my_type/id?pipeline=my-pipeline-id' -d'
    {
        "pdf": @/Users/user/path/to/pdf/file.pdf
    }'
    

    Error:

    {"error":"Content-Type header [application/pdf] is not supported","status":406}
    

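    Is this because my PDF is not Base64 encoded, or am I doing something else wrong? I'm trying to build a digital library for searching PDFs.

    Update:

    I encoded the PDF with: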

    openssl base64 -in /Users/user/path/to/pdf/file.pdf -out base64_encoded_file
    

    I recreated my index and ran the following command against base64_encoded_file:

    curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/my_index/my_type/id?pipeline=my-pipeline-id' -d @/base64_encoded_file
    

    I got the following error:

    Warning: Couldn't read data from file "/base64_encoded_file", this makes an empty POST.
    {"error":{"root_cause":[{"type":"parse_exception","reason":"request body is required"}],"type":"parse_exception","reason":"request body is required"},"status":400}
    

    I then tried adding the file as the body:

    curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/my_index/my_type/id?pipeline=my-pipeline-id' -d '
            {
              "pdf" : @/base64_encoded_file
            }'
    

    Error:

    {"error":{"root_cause":[{"type":"parse_exception","reason":"Failed to parse content to map"}],"type":"parse_exception","reason":"Failed to parse content to map","caused_by":{"type":"json_parse_exception","reason":"Unexpected character ('@' (code 64)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@6db5a3dc; line: 3, column: 16]"}},"status":400}
    

    Halp!

    1 Answer
  •  6
  •   aeronesto    7 years ago

    I found the answer to my question:

    Elasticsearch does not fetch the data from a source, so

    curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/my_index/my_type/id?pipeline=my-pipeline-id' -d '
            {
              "pdf" : @/base64_encoded_file
            }'
    

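    does not work. The "field" from the attachment options ("pdf" in my example) must contain the data itself, not a file path. This thread explains three options for sending [PDF] content to Elasticsearch: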

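    1. You can extract the content [from the PDF] yourself and send only the text you want indexed to Elasticsearch.
    2. You can send the Base64-encoded binary to an Elasticsearch ingest pipeline, which performs the extraction.
    3. You can send the binary to FSCrawler, which performs the extraction before sending the result to Elasticsearch.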
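
    Option 2 could look roughly like this with the JavaScript client from the question. This is a sketch, not tested against a 6.1 cluster: it assumes the legacy `elasticsearch` npm client accepts a `pipeline` parameter on `client.index`, and the helper names `buildIndexRequest` and `indexPdf` are hypothetical.

```javascript
var fs = require('fs');

// Build the parameters for client.index (hypothetical helper).
// `pipeline` routes the document through the ingest pipeline defined
// in the question; the index and type names mirror the curl commands.
function buildIndexRequest(id, pdfBuffer) {
  return {
    index: 'my_index',
    type: 'my_type',
    id: id,
    pipeline: 'my-pipeline-id',
    // The attachment processor reads Base64 data from the "pdf" field.
    body: { pdf: pdfBuffer.toString('base64') }
  };
}

// Usage sketch: `client` is an elasticsearch.Client and `pdfPath`
// points at a local PDF file. Returns the promise from client.index.
function indexPdf(client, id, pdfPath) {
  var request = buildIndexRequest(id, fs.readFileSync(pdfPath));
  return client.index(request);
}
```

    Keeping the request construction separate from the network call makes it easy to inspect the payload before anything is sent.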

    In short, the data passed to Elasticsearch must look like the example in the documentation:

    curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/my_index/my_type/id?pipeline=my-pipeline-id' -d '
        {
            "pdf" : "base64_encoded_data"
        }'
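
    For anyone who wants to stay with curl: it reads a file only when the whole `-d` argument is `@path`, so the JSON wrapper has to be inside the file. A minimal Node.js sketch that produces such a file is below; the function name `writePayload` and both paths are placeholders.

```javascript
var fs = require('fs');

// Write {"pdf": "<Base64 data>"} to a JSON file that curl can send as
// the request body. Both paths are supplied by the caller.
function writePayload(pdfPath, payloadPath) {
  var encoded = fs.readFileSync(pdfPath).toString('base64');
  fs.writeFileSync(payloadPath, JSON.stringify({ pdf: encoded }));
  return payloadPath;
}
```

    The resulting file can then be passed as `-d @payload.json` (or `--data-binary @payload.json`) together with the `Content-Type: application/json` header and the `?pipeline=my-pipeline-id` query parameter.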