
Better SQL: :group vs. :select => 'DISTINCT'

activerecord postgresql ruby-on-rails sql

cite · 16 years ago

class Mailbox < ActiveRecord::Base
  has_many :addresses
  has_many :domains, :through => :addresses
end

class Address < ActiveRecord::Base
  belongs_to :mailbox
  belongs_to :domain
end

class Domain < ActiveRecord::Base
  has_many :addresses
  has_many :mailboxes, :through => :addresses
end

Now, obviously, if you want to know in which domains a given mailbox has addresses, you have two possible ways:

m = Mailbox.first
# either: SELECT DISTINCT domains.id, domains.name FROM "domains" INNER JOIN 
#         "addresses" ON "domains".id = "addresses".domain_id WHERE 
#         (("addresses".mailbox_id = 1))
m.domains.all(:select => 'DISTINCT domains.id, domains.name')
# or: SELECT domains.id, domains.name FROM "domains" INNER JOIN "addresses" ON
#     "domains".id = "addresses".domain_id WHERE (("addresses".mailbox_id = 1))
#      GROUP BY domains.id, domains.name
m.domains.all(:select => 'domains.id, domains.name', 
  :group => 'domains.id, domains.name')

Using DISTINCT:

 Unique  (cost=16.56..16.57 rows=1 width=150)
   ->  Sort  (cost=16.56..16.56 rows=1 width=150)
         Sort Key: domains.name, domains.id
         ->  Nested Loop  (cost=0.00..16.55 rows=1 width=150)
               ->  Index Scan using index_addresses_on_mailbox_id on addresses  (cost=0.00..8.27 rows=1 width=4)
                     Index Cond: (mailbox_id = 1)
               ->  Index Scan using domains_pkey on domains  (cost=0.00..8.27 rows=1 width=150)
                     Index Cond: (domains.id = addresses.domain_id)
                     Filter: (domains.active AND domains.selfmgmt)
(9 rows)

Using GROUP BY:

Group  (cost=16.56..16.57 rows=1 width=150)
   ->  Sort  (cost=16.56..16.56 rows=1 width=150)
         Sort Key: domains.name, domains.id
         ->  Nested Loop  (cost=0.00..16.55 rows=1 width=150)
               ->  Index Scan using index_addresses_on_mailbox_id on addresses  (cost=0.00..8.27 rows=1 width=4)
                     Index Cond: (mailbox_id = 1)
               ->  Index Scan using domains_pkey on domains  (cost=0.00..8.27 rows=1 width=150)
                     Index Cond: (domains.id = addresses.domain_id)
                     Filter: (domains.active AND domains.selfmgmt)
(9 rows)

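I'm really not sure how to determine which is the better way to retrieve this data. My gut tells me to use GROUP BY, but I couldn't find any documentation specific enough to settle the question.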

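Should I use :group or :select => 'DISTINCT'? Would the choice be the same for other modern RDBMSs such as Oracle, DB2, or MySQL? (I don't have access to any of those, so I can't run the tests myself.)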

2 Answers

user80168 · 16 years ago

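If you are using PostgreSQL < 8.4 (which, judging by the plans, I guess you are), it is generally better to use GROUP BY instead of DISTINCT, because it is planned more efficiently.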

In 8.4 there is no difference, because DISTINCT was "taught" to use the group operators as well.
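To make the distinction a bit more concrete, here is a minimal pure-Ruby sketch (not PostgreSQL's actual code) of two ways to deduplicate rows: sort-then-unique, which is what both plans in the question show, and hash-based deduplication, which, as far as I know, is the strategy 8.4 allowed DISTINCT to use as well. The row data and the method names `sort_unique` and `hash_distinct` are made up for illustration.

```ruby
require 'set'

# Hypothetical join output: one [id, name] row per address of the mailbox.
ROWS = [
  [2, 'example.org'],
  [1, 'example.com'],
  [2, 'example.org'],
  [1, 'example.com'],
  [3, 'example.net']
].freeze

# Sort-based strategy (Sort -> Unique / Group in the plans above):
# sort first, cut the sorted list wherever the row changes,
# and keep the first row of each run.
def sort_unique(rows)
  rows.sort.slice_when { |a, b| a != b }.map(&:first)
end

# Hash-based strategy: remember the rows already seen; no sort needed.
# Set#add? returns nil when the row is already in the set.
def hash_distinct(rows)
  seen = Set.new
  rows.select { |row| seen.add?(row) }
end
```

Both methods return the same three domains; only the order differs, since the hash-based version keeps first-seen order instead of sorted order.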

mjv · 16 years ago

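The SQL engine has to perform the same steps to gather the required information in preparation for either the GROUP BY or the "equivalent" DISTINCT query: same filter, same sort, and so on. The difference is in the last step, which PostgreSQL here calls "Group" or "Unique".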

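When the GROUP BY query includes no aggregates in its SELECT, "Group" and "Unique" are essentially the same thing (even if they boil down to different methods in the server code). In the presence of aggregates such as COUNT(*) or MAX(some_field), the "Group" step requires more resources, because it needs extra storage for the counts, minimums, and so on, and has to increment or compare against them for each row.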
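The point about the final step can be sketched in plain Ruby. This is only an illustration under assumed data, not how the server implements it: `unique_step` and `group_count_step` are hypothetical names, and the rows stand in for the already-sorted output of the Sort node.

```ruby
# Rows already sorted by key, as delivered by the Sort node (hypothetical data).
SORTED = [
  [1, 'example.com'],
  [1, 'example.com'],
  [2, 'example.org'],
  [3, 'example.net'],
  [3, 'example.net'],
  [3, 'example.net']
].freeze

# "Unique", or "Group" without aggregates: emit a row whenever the key changes.
def unique_step(sorted_rows)
  out = []
  sorted_rows.each { |row| out << row if out.empty? || out.last != row }
  out
end

# "Group" with COUNT(*): the same single pass, plus one counter per group
# that has to be stored and incremented for every input row.
def group_count_step(sorted_rows)
  out = []
  sorted_rows.each do |row|
    if out.empty? || out.last[:key] != row
      out << { :key => row, :count => 1 }
    else
      out.last[:count] += 1
    end
  end
  out
end
```

Without an aggregate, the two passes yield the same keys; the counting variant carries the extra per-group state described above.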

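I would go with the GROUP BY approach, because it allows aggregates to be added to the SELECT later, should such information be needed, without much change to the query. For example, even without showing the actual count, the application could list the domains in descending order of that count.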
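As a rough sketch of that extension, the options from the question's GROUP BY call could grow a COUNT(*) column and an ORDER BY. The alias `address_count` and the method name `domain_count_options` are assumptions for illustration; the finder call itself is shown only in a comment because it needs the Rails app and the models from the question.

```ruby
# Hypothetical extension of the question's :group variant.
# "address_count" is an assumed alias, not a column in the schema.
def domain_count_options
  {
    :select => 'domains.id, domains.name, COUNT(*) AS address_count',
    :group  => 'domains.id, domains.name',
    :order  => 'address_count DESC'
  }
end

# Intended use inside the Rails app from the question (not run here):
#   m = Mailbox.first
#   m.domains.all(domain_count_options)
# This should yield the same GROUP BY query as before, with the extra
# COUNT(*) column and "ORDER BY address_count DESC" appended.
```

The :select and :group values are the ones from the question plus the aggregate, so switching between the plain and the counted variant stays a small change.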