
ElasticSearch and Python: problem with the search function

  • Essex  ·  6 years ago

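    I'm using ElasticSearch 6.4 for the first time, with an existing web application written in Python/Django. I have some issues, and I would like to understand why they happen and how to solve them.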

    ############

    #Existing:#

    ############

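    In my application, it is possible to upload document files (.pdf or .doc, for example). The application also has a search function that searches the uploaded documents, which are indexed by ElasticSearch at upload time.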

    Document titles are always written the same way:

    YEAR - DOC_TYPE - ORGANISATION - document_title.extension
    

    For example:

    1970_ANNUAL_REPORT_APP-TEST_1342 - loremipsum.pdf
    

    The search function always runs with doc_type = ANNUAL_REPORT, because there are several document types (annual report, other, …).
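
The title convention can be split into its parts with a regular expression. This is only a minimal sketch: the pattern is inferred from the single example title above, and it assumes a four-digit year, an upper-case document type, an organisation without underscores, and a numeric identifier before the " - " separator. `TITLE_RE` and `parse_title` are hypothetical names, not part of the application.

```python
import re

# Assumed pattern, inferred only from the one example title in the question.
TITLE_RE = re.compile(
    r"^(?P<year>\d{4})_"          # e.g. 1970
    r"(?P<doc_type>[A-Z_]+)_"     # e.g. ANNUAL_REPORT
    r"(?P<organisation>[^_]+)_"   # e.g. APP-TEST (assumed to contain no underscore)
    r"(?P<doc_id>\d+) - "         # e.g. 1342
    r"(?P<name>.+)\.(?P<extension>\w+)$"  # e.g. loremipsum.pdf
)


def parse_title(title):
    """Return the title parts as a dict, or None if the title does not match."""
    match = TITLE_RE.match(title)
    return match.groupdict() if match else None


parts = parse_title("1970_ANNUAL_REPORT_APP-TEST_1342 - loremipsum.pdf")
print(parts["year"], parts["doc_type"], parts["organisation"])
```

Splitting the title like this would make it possible to index the year or the organisation as separate fields, instead of relying on how the whole title string is tokenized.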

    ##################

    #My environment:#

    ##################

    Here is some data about my ElasticSearch setup. I'm still learning the ES commands as well.

    $ curl -XGET http://127.0.0.1:9200/_cat/indices?v
    health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
    yellow open   app  5T0HZTbmQU2-ZNJXlNb-zg   5   1        742            2    396.4kb        396.4kb
    

    So my index is app.

    For the example above, if I look up this document: 1970_ANNUAL_REPORT_APP-TEST_1342 - loremipsum.pdf, I get:

    $ curl -XGET http://127.0.0.1:9200/app/annual-report/1343?pretty
    {
      "_index" : "app",
      "_type" : "annual-report",
      "_id" : "1343",
      "_version" : 33,
      "found" : true,
      "_source" : {
        "attachment" : {
          "date" : "2010-03-04T12:08:00Z",
          "content_type" : "application/pdf",
          "author" : "manshanden",
          "language" : "et",
          "title" : "Microsoft Word - Test document Word.doc",
          "content" : "some text ...",
          "content_length" : 3926
        },
        "relative_path" : "app_docs/APP-TEST/1970_ANNUAL_REPORT_APP-TEST_1342.pdf",
        "title" : "1970_ANNUAL_REPORT_APP-TEST_1342 - loremipsum.pdf"
      }
    }
    

    Now, through the search feature of my web application, I would like to find this document by searching for: 1970.

    def search_in_annual(self, q):
        try:
            response = self.es.search(
                index='app', doc_type='annual-report',
                q=q, _source_exclude=['data'], size=5000)
        except ConnectionError:
            return -1, None
    
        total = 0
        hits = []
        if response:
            for hit in response["hits"]["hits"]:
                hits.append({
                    'id': hit['_id'],
                    'title': hit['_source']['title'],
                    'file': hit['_source']['relative_path'],
                })
    
            total = response["hits"]["total"]
    
        return total, hits
    

    But when q=1970, the result is 0.

    If I write:

    response = self.es.search(
                    index='app', doc_type='annual-report',
                    q="q*", _source_exclude=['data'], size=5000)
    

    It returns my document, but also many documents that do not have 1970 in their title or content.
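
One way to investigate why `q=1970` finds nothing is to ask ElasticSearch how it tokenizes the title. The sketch below assumes the elasticsearch-py 6.x client and that the `title` field uses the default `standard` analyzer (the mapping is not shown in the question). `analyze_title` is a hypothetical helper. A possible explanation it would let you check: the standard analyzer generally does not treat underscores as word boundaries, so the title may be indexed as one long token starting with `1970_`, which the bare term `1970` does not match.

```python
def analyze_title(es, title, index="app", analyzer="standard"):
    """Ask ElasticSearch how `title` is tokenized and return the token strings.

    `es` is an elasticsearch-py client; the analyzer name is an assumption,
    since the real mapping of the `title` field is not known here.
    """
    response = es.indices.analyze(
        index=index,
        body={"analyzer": analyzer, "text": title},
    )
    return [token["token"] for token in response["tokens"]]


# Usage against a running cluster (not executed here):
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch([{"host": "localhost", "port": 9200}])
#   print(analyze_title(es, "1970_ANNUAL_REPORT_APP-TEST_1342 - loremipsum.pdf"))
```

If the returned tokens do not include `1970` on its own, a plain term search cannot match it, and a prefix or wildcard query on the title becomes necessary.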

    #################

    #My global code:#

    #################

    This is the global class that manages the indexing functions:

    class EdqmES(object):
        host = 'localhost'
        port = 9200
        es = None
    
        def __init__(self, *args, **kwargs):
            self.host = kwargs.pop('host', self.host)
            self.port = kwargs.pop('port', self.port)
    
            # Connect to ElasticSearch server
            self.es = Elasticsearch([{
                'host': self.host,
                'port': self.port
            }])
    
        def __str__(self):
            return self.host + ':' + str(self.port)
    
        @staticmethod
        def file_encode(filename):
            with open(filename, "rb") as f:
                return b64encode(f.read()).decode('utf-8')
    
        def create_pipeline(self):
            body = {
                "description": "Extract attachment information",
                "processors": [
                    {"attachment": {
                        "field": "data",
                        "target_field": "attachment",
                        "indexed_chars": -1
                    }},
                    {"remove": {"field": "data"}}
                ]
            }
            self.es.index(
                index='_ingest',
                doc_type='pipeline',
                id='attachment',
                body=body
            )
    
        def index_document(self, doc, bulk=False):
            filename = doc.get_filename()
    
            try:
                data = self.file_encode(filename)
            except IOError:
                data = ''
                print('ERROR with ' + filename)
                # TODO: log error
    
            item_body = {
                '_id': doc.id,
                'data': data,
                'relative_path': str(doc.file),
                'title': doc.title,
            }
    
            if bulk:
                return item_body
    
            result1 = self.es.index(
                index='app', doc_type='annual-report',
                id=doc.id,
                pipeline='attachment',
                body=item_body,
                request_timeout=60
            )
            print(result1)
            return result1
    
        def index_annual_reports(self):
            list_docs = Document.objects.filter(category=Document.OPT_ANNUAL)
    
            print(list_docs.count())
            self.create_pipeline()
    
            bulk = []
            inserted = 0
            for doc in list_docs:
                inserted += 1
                bulk.append(self.index_document(doc, True))
    
                if inserted == 20:
                    inserted = 0
                    try:
                        print(helpers.bulk(self.es, bulk, index='app',
                                           doc_type='annual-report',
                                           pipeline='attachment',
                                           request_timeout=60))
                    except BulkIndexError as err:
                        print(err)
                    bulk = []
    
            if inserted:
                print(helpers.bulk(
                    self.es, bulk, index='app',
                    doc_type='annual-report',
                    pipeline='attachment', request_timeout=60))
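
The `create_pipeline` method above registers the pipeline by indexing a document into `_ingest`. As an alternative sketch, the elasticsearch-py 6.x client also exposes the ingest API directly through `es.ingest.put_pipeline`. The pipeline body is copied from the class above; `ATTACHMENT_PIPELINE` and `create_attachment_pipeline` are hypothetical names, and the `attachment` processor is assumed to be available (it comes from the ingest-attachment plugin).

```python
# Same pipeline definition as in create_pipeline above.
ATTACHMENT_PIPELINE = {
    "description": "Extract attachment information",
    "processors": [
        {"attachment": {
            "field": "data",
            "target_field": "attachment",
            "indexed_chars": -1,
        }},
        {"remove": {"field": "data"}},
    ],
}


def create_attachment_pipeline(es, pipeline_id="attachment"):
    """Register the pipeline through the ingest API of the client `es`."""
    return es.ingest.put_pipeline(id=pipeline_id, body=ATTACHMENT_PIPELINE)


# Usage against a running cluster (not executed here):
#   from elasticsearch import Elasticsearch
#   create_attachment_pipeline(Elasticsearch([{"host": "localhost", "port": 9200}]))
```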
    

    My document is indexed when it is submitted through a Django form, thanks to a signal:

    @receiver(signals.post_save, sender=Document, dispatch_uid='add_new_doc')
    def add_document_handler(sender, instance=None, created=False, **kwargs):
        """ When a document is created index new annual report (only) with Elasticsearch and update conformity date if the
        document is a new declaration of conformity
    
        :param sender: Class which is concerned
        :type sender: the model class
        :param instance: Object which was just saved
        :type instance: model instance
        :param created: True for a creation, False for an update
        :type created: boolean
        :param kwargs: Additional parameter of the signal
        :type kwargs: dict
        """
    
        if not created:
            return
    
        # Index only annual reports
        elif instance.category == Document.OPT_ANNUAL:
            es = EdqmES()
            es.index_document(instance)
    
    1 answer

  • Essex  ·  6 years ago

    This is what I did, and it seems to work:

    def search_in_annual(self, q):
        try:
            response = self.es.search(
                index='app', doc_type='annual-report', q=q, _source_exclude=['data'], size=5000)
    
            if response['hits']['total'] == 0:
    
                response = self.es.search(
                    index='app', doc_type='annual-report',
                    body={
                        "query":
                            {"prefix": {"title": q}},
                    }, _source_exclude=['data'], size=5000)
    
        except ConnectionError:
            return -1, None
    
        total = 0
        hits = []
        if response:
            for hit in response["hits"]["hits"]:
                hits.append({
                    'id': hit['_id'],
                    'title': hit['_source']['title'],
                    'file': hit['_source']['relative_path'],
                })
    
            total = response["hits"]["total"]
        return total, hits
    

    It lets me search by title, by title prefix, and by content to find my documents.
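
The two requests above could also be expressed as a single request body. This is a minimal sketch, not a drop-in replacement: it combines a `query_string` clause (the body equivalent of the `q=` parameter) with the `prefix` clause in a `bool` query, so it returns the union of both result sets rather than falling back only when the first search is empty. The prefix term is lowercased on the assumption that `title` uses the default analyzer, which lowercases tokens, while `prefix` terms are not analyzed. `build_search_body` is a hypothetical helper.

```python
def build_search_body(q):
    """Build one query that matches either the query string or a title prefix."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Same behaviour as passing q=... to es.search()
                    {"query_string": {"query": q}},
                    # Prefix terms are not analyzed, so lowercase by hand
                    {"prefix": {"title": q.lower()}},
                ],
                "minimum_should_match": 1,
            }
        }
    }


# Usage inside search_in_annual (not executed here):
#   response = self.es.search(
#       index='app', doc_type='annual-report',
#       body=build_search_body(q), _source_exclude=['data'], size=5000)
print(build_search_body("1970"))
```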