代码之家  ›  专栏  ›  技术社区  ›  Alex Baidan

显示每种类型值执行时间的最大n个结果

  •  -1
  • Alex Baidan  · 技术社区  · 7 年前

    我需要为多个模型构建suggester(autocomplete)。我有一个特殊的表格,其中包含所有需要建议的模型的搜索向量记录。

    希望显示每个 searchable_type 列值。例如,如果我们有4个类型的值: ['Artist', 'Artwork', 'Category', 'Nation'] 并且具有与搜索查询匹配的类型的记录,则结果必须与示例中的类似。

    例如n=3,query='titl':

    id title          search             searchable_type searchable_id
    1  Title 1        'title':1A         'Artist'        121
    1  Titlimbo       'titlimbo':1A      'Artist'        122
    1  Titlover       'titlover':1A      'Artist'        123
    1  Titleart       'titleart':1A      'Artwork'       124
    1  Titless        'titless':1A       'Artwork'       125
    1  Titlecat       'titlecat':1A      'Category'      126
    1  Titledog       'titledog':1A      'Category'      127
    1  TitleNation 1  'titlenation':1A   'Nation'        128
    

    我有一个查询可以完成这个任务,并且在表中有2000条记录。但当我决定用150k条记录在表中测试这个查询时,我感到不快。查询执行时间提高到24分钟!这是相当大的时间为实时提示。

    所以事实上,我做错了。我正在寻求帮助,以实现这样的疑问和解释为什么会发生这种情况。


    创建表查询

    CREATE TABLE pg_search_documents (
      id bigint NOT NULL CONSTRAINT pg_search_documents_pkey PRIMARY KEY,
      title character varying,
      search tsvector,
      searchable_type character varying,
      searchable_id bigint,
      created_at timestamp without time zone NOT NULL,
      updated_at timestamp without time zone NOT NULL
    );
    
    CREATE INDEX index_pg_search_documents_on_search ON pg_search_documents USING gin (search);
    CREATE INDEX index_pg_search_documents_on_searchable_type_and_searchable_id ON pg_search_documents USING btree (searchable_type, searchable_id);
    

    查询(限制=5,搜索=FIR)

      SELECT DISTINCT t_outer.searchable_type, t_top.id, t_top.title, t_top.searchable_id, t_top.updated_at FROM pg_search_documents t_outer
        JOIN LATERAL (
             SELECT * FROM pg_search_documents t_inner
             WHERE t_inner.searchable_type = t_outer.searchable_type AND ((t_inner.search) @@ (to_tsquery('simple', ''' ' || 'fir' || ' ''' || ':*')))
             ORDER BY t_inner.updated_at DESC
             LIMIT 5
             ) t_top ON true
      ORDER BY t_top.updated_at DESC
    

    150K记录的执行时间最长为25分钟。


    EXPLAIN (ANALYZE, BUFFERS) SELECT DISTINCT t_top.id, t_top.title, t_top.searchable_id, t_top.updated_at FROM pg_search_documents AS t_outer
                                                                                                                 JOIN LATERAL (
        SELECT * FROM pg_search_documents AS t_inner
        WHERE t_inner.searchable_type = t_outer.searchable_type AND ((t_inner.search) @@ (to_tsquery('simple', ''' ' || 'fir' || ' ''' || ':*')))
        ORDER BY t_inner.updated_at DESC
        LIMIT 5
        ) AS t_top ON true
        ORDER BY t_top.updated_at DESC;
    
    
    Unique  (cost=301102555.51..301111909.01 rows=5 width=60) (actual time=161305.761..161796.379 rows=10 loops=1)
      Buffers: shared hit=74382891, temp read=19008 written=19046
      ->  Sort  (cost=301102555.51..301104426.21 rows=748280 width=60) (actual time=161305.759..161616.199 rows=748065 loops=1)
            Sort Key: t_inner.updated_at DESC, t_inner.id, t_inner.title, t_inner.searchable_id
            Sort Method: external merge  Disk: 90312kB
            Buffers: shared hit=74382891, temp read=19008 written=19046
            ->  Nested Loop  (cost=2010.95..300973275.75 rows=748280 width=60) (actual time=0.904..160242.631 rows=748065 loops=1)
                  Buffers: shared hit=74382891
                  ->  Seq Scan on pg_search_documents t_outer  (cost=0.00..5355.56 rows=149656 width=7) (actual time=0.008..49.066 rows=149656 loops=1)
                        Buffers: shared hit=3859
                  ->  Limit  (cost=2010.95..2010.96 rows=5 width=132) (actual time=1.067..1.068 rows=5 loops=149656)
                        Buffers: shared hit=74379032
                        ->  Sort  (cost=2010.95..2011.45 rows=201 width=132) (actual time=1.065..1.066 rows=5 loops=149656)
                              Sort Key: t_inner.updated_at DESC
                              Sort Method: top-N heapsort  Memory: 26kB
                              Buffers: shared hit=74379032
                              ->  Bitmap Heap Scan on pg_search_documents t_inner  (cost=30.09..2007.61 rows=201 width=132) (actual time=0.338..0.803 rows=795 loops=149656)
                                    Recheck Cond: (search @@ '''fir'':*'::tsquery)
                                    Filter: ((searchable_type)::text = (t_outer.searchable_type)::text)
                                    Rows Removed by Filter: 98
                                    Heap Blocks: exact=73780408
                                    Buffers: shared hit=74379032
                                    ->  Bitmap Index Scan on index_pg_search_documents_on_search  (cost=0.00..30.04 rows=805 width=0) (actual time=0.277..0.277 rows=893 loops=149656)
                                          Index Cond: (search @@ '''fir'':*'::tsquery)
                                          Buffers: shared hit=598624
    Planning time: 0.220 ms
    Execution time: 161893.484 ms
    

    如果从侧面连接的位置改变 t_inner.searchable_type = t_outer.searchable_type t_inner.searchable_type = 'Artist' 执行时间为464毫秒(正常),但输出结果错误(不正常)。

    解释

    Sort  (cost=26369.27..26369.32 rows=20 width=67)
      Sort Key: t_top.updated_at DESC
      ->  HashAggregate  (cost=26368.64..26368.84 rows=20 width=67)
            Group Key: t_top.updated_at, t_outer.searchable_type, t_top.id, t_top.title, t_top.searchable_id
            ->  Nested Loop  (cost=2273.74..17000.20 rows=749475 width=67)
                  ->  Seq Scan on pg_search_documents t_outer  (cost=0.00..5357.95 rows=149895 width=7)
                  ->  Materialize  (cost=2273.74..2273.82 rows=5 width=60)
                        ->  Subquery Scan on t_top  (cost=2273.74..2273.80 rows=5 width=60)
                              ->  Limit  (cost=2273.74..2273.75 rows=5 width=132)
                                    ->  Sort  (cost=2273.74..2275.53 rows=718 width=132)
                                          Sort Key: t_inner.updated_at DESC
                                          ->  Bitmap Heap Scan on pg_search_documents t_inner  (cost=282.23..2261.81 rows=718 width=132)
                                                Recheck Cond: (search @@ '''fir'':*'::tsquery)
                                                Filter: ((searchable_type)::text = 'Artwork'::text)
                                                ->  Bitmap Index Scan on index_pg_search_documents_on_search  (cost=0.00..282.05 rows=806 width=0)
                                                      Index Cond: (search @@ '''fir'':*'::tsquery)
    

    所以据我所知,问题在于检查这类等号。

    更新

    查询速度慢的原因之一是表中搜索列的gin索引可能已损坏。Postgres建议在大数据迁移之前关闭它(我没有)。因此,在删除并创建索引之后,我的搜索查询开始变得更快-每个请求2.5分钟-但这也是一个巨大的时间。

    1 回复  |  直到 7 年前
        1
  •  0
  •   Alex Baidan    7 年前

    由推荐的具有窗口功能的解决方案 a_horse_with_no_name .

    这个解决方案对150K行的表中的执行时间没有问题,我将用这个解决方案替换横向查询。

    SELECT rank_filter.* FROM (
                              SELECT pg_search_documents.*,
                                     rank() OVER (
                                       PARTITION BY searchable_type
                                       ORDER BY created_at DESC
                                       )
                              FROM pg_search_documents
                              WHERE ((search) @@ (to_tsquery('simple', ''' ' || '#{query}' || ' ''' || ':*')))
                              ) rank_filter WHERE RANK <= 5
    

    EXPLAIN (ANALYZE, BUFFERS) SELECT rank_filter.* FROM (
                              SELECT pg_search_documents.*,
                                     rank() OVER (
                                       PARTITION BY searchable_type
                                       ORDER BY created_at DESC
                                       )
                              FROM pg_search_documents
                              WHERE ((search) @@ (to_tsquery('simple', ''' ' || 'fir' || ' ''' || ':*')))
                              ) rank_filter WHERE RANK <= 5;
    
    Subquery Scan on rank_filter  (cost=2044.61..2070.77 rows=268 width=184) (actual time=1.628..2.275 rows=10 loops=1)
      Filter: (rank_filter.rank <= 5)
      Rows Removed by Filter: 883
      Buffers: shared hit=497
      ->  WindowAgg  (cost=2044.61..2060.71 rows=805 width=184) (actual time=1.627..2.206 rows=893 loops=1)
            Buffers: shared hit=497
            ->  Sort  (cost=2044.61..2046.62 rows=805 width=176) (actual time=1.622..1.684 rows=893 loops=1)
                  Sort Key: pg_search_documents.searchable_type, pg_search_documents.created_at DESC
                  Sort Method: quicksort  Memory: 417kB
                  Buffers: shared hit=497
                  ->  Bitmap Heap Scan on pg_search_documents  (cost=30.24..2005.75 rows=805 width=176) (actual time=0.300..1.007 rows=893 loops=1)
                        Recheck Cond: (search @@ '''fir'':*'::tsquery)
                        Heap Blocks: exact=493
                        Buffers: shared hit=497
                        ->  Bitmap Index Scan on index_pg_search_documents_on_search  (cost=0.00..30.04 rows=805 width=0) (actual time=0.251..0.251 rows=893 loops=1)
                              Index Cond: (search @@ '''fir'':*'::tsquery)
                              Buffers: shared hit=4
    Planning time: 0.180 ms
    Execution time: 2.317 ms