MySQL: matching "same" emails

self-join regex sql mysql

user3783243 · 6 years ago

I have a table with two columns, email and id. I need to find closely related emails. For example:

john.smith12@example.com

and

john.smith12@some.subdomains.example.com

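These should be considered the same because the username (john.smith12) and the top-most domain (example.com) are identical. They are currently two different rows in my table. ~~I've written the expression below, which should do the comparison, but it takes hours to execute (possibly because of the regex). Is there a better way to write this:~~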

  select c1.email, c2.email 
  from table as c1
  join table as c2
   on (
             c1.leadid <> c2.leadid 
        and 
             c1.email regexp replace(replace(c2.email, '.', '[.]'), '@', '@[^@]*'))

The EXPLAIN for this query returns:

id, select_type, table, type, possible_keys, key, key_len, ref,  rows,   Extra
1,  SIMPLE,      c1,    ALL,   NULL,         NULL,  NULL,  NULL, 577532, NULL
1,  SIMPLE,      c2,    ALL,   NULL,         NULL,  NULL,  NULL, 577532, Using where; Using join buffer (Block Nested Loop)

The CREATE TABLE is:

CREATE TABLE `table` (
 `ID` int(11) NOT NULL AUTO_INCREMENT,
 `Email` varchar(100) DEFAULT NULL,
 KEY `Table_Email` (`Email`),
 KEY `Email` (`Email`)
) ENGINE=InnoDB AUTO_INCREMENT=667020 DEFAULT CHARSET=latin1

I guess the indexes aren't being used because of the regexp.

The regex comes out as:

john[.]smith12@[^@]*example[.]com

which should match both addresses.
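As a quick sanity check of what the nested REPLACE builds, the sketch below runs it on the two sample addresses as string literals, so no table is needed. The last two statements are an addition, not part of the original query: they illustrate that REGEXP is unanchored, and the ^ and $ anchors are an assumption about the intended behaviour.

```sql
-- Build the pattern from the shorter address.
SELECT REPLACE(REPLACE('john.smith12@example.com', '.', '[.]'), '@', '@[^@]*') AS pattern;
-- pattern: john[.]smith12@[^@]*example[.]com

-- Both sample addresses match it (each column returns 1).
SELECT 'john.smith12@example.com'
           REGEXP 'john[.]smith12@[^@]*example[.]com' AS same_host,
       'john.smith12@some.subdomains.example.com'
           REGEXP 'john[.]smith12@[^@]*example[.]com' AS sub_host;

-- Unanchored, the pattern also matches inside a longer, unrelated address (returns 1) ...
SELECT 'xjohn.smith12@example.com.au'
           REGEXP 'john[.]smith12@[^@]*example[.]com' AS loose_match;

-- ... while anchoring with ^ and $ rejects it (returns 0).
SELECT 'xjohn.smith12@example.com.au'
           REGEXP '^john[.]smith12@[^@]*example[.]com$' AS strict_match;
```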

Update:

I've modified the on clause to be:

on (c1.email <> '' and c2.email <> '' and c1.leadid <> c2.leadid
and substr(c1.email, 1, (locate('@', c1.email) - 1)) = substr(c2.email, 1, (locate('@', c2.email) - 1))
and substr(c1.email, locate('@', c1.email) + 1) like concat('%', substr(c2.email, locate('@', c2.email) + 1)))

and with this approach the explain at least uses the indexes:

id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, c1, range, table_Email,Email, table_Email, 103, NULL, 288873, Using where; Using index
1, SIMPLE, c2, range, table_Email,Email, table_Email, 103, NULL, 288873, Using where; Using index; Using join buffer (Block Nested Loop)

So far this has been executing for 5 minutes; I will update if there is a big improvement.

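Update 2:

I've split the email so that the username is one column and the domain is another. I've stored the domain in reverse order so that its index can be used with a trailing wildcard.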

CREATE TABLE `table` (
     `ID` int(11) NOT NULL AUTO_INCREMENT,
     `Email` varchar(100) DEFAULT NULL,
     `domain` varchar(100) CHARACTER SET utf8 DEFAULT NULL,
     `username` varchar(500) CHARACTER SET utf8 DEFAULT NULL,
     KEY `Table_Email` (`Email`),
     KEY `Email` (`Email`),
     KEY `domain` (`domain`)
    ) ENGINE=InnoDB AUTO_INCREMENT=667020 DEFAULT CHARSET=latin1

Query to populate the new columns:

update table
set username = trim(SUBSTRING_INDEX(trim(email), '@', 1)), 
domain = reverse(trim(SUBSTRING_INDEX(SUBSTRING_INDEX(trim(email), '@', -1), '.', -3)));
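To make the reversed-domain idea concrete, here is a small sketch using the two sample addresses as literals. It shows what the UPDATE above would store and why a suffix match on the host turns into a prefix match. The notexample.com address is a hypothetical input added to illustrate a boundary issue, and the dot-boundary variant is one possible way to handle it, not something taken from the question.

```sql
-- Last three labels of each host, reversed, as the UPDATE above stores them.
SELECT REVERSE(SUBSTRING_INDEX(SUBSTRING_INDEX('john.smith12@example.com', '@', -1), '.', -3))
           AS short_domain,
       REVERSE(SUBSTRING_INDEX(SUBSTRING_INDEX('john.smith12@some.subdomains.example.com', '@', -1), '.', -3))
           AS long_domain;
-- short_domain: moc.elpmaxe
-- long_domain:  moc.elpmaxe.sniamodbus

-- A suffix match on the original host becomes a prefix match on the reversed value.
SELECT 'moc.elpmaxe.sniamodbus' LIKE CONCAT('moc.elpmaxe', '%') AS is_subdomain;       -- 1

-- A bare prefix also matches a different domain such as notexample.com ...
SELECT REVERSE('notexample.com') LIKE CONCAT('moc.elpmaxe', '%') AS false_positive;    -- 1

-- ... so one option is to require equality or a dot boundary.
SELECT REVERSE('notexample.com') = 'moc.elpmaxe'
    OR REVERSE('notexample.com') LIKE CONCAT('moc.elpmaxe', '.%') AS with_boundary;    -- 0
```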

New query:

select c1.email, c2.email, c2.domain, c1.domain, c1.username, c2.username, c1.leadid, c2.leadid
from table as c1
join table as c2
on (c1.email is not null and c2.email is not null and c1.leadid <> c2.leadid
    and c1.username = c2.username and c1.domain like concat(c2.domain, '%'))

New explain results:

1, SIMPLE, c1, ALL, table_Email,Email, NULL, NULL, NULL, 649173, Using where
1, SIMPLE, c2, ALL, table_Email,Email, NULL, NULL, NULL, 649173, Using where; Using join buffer (Block Nested Loop)

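From this explain it looks like the domain index isn't being used. I also tried to force its use with USE INDEX, but that didn't work either and resulted in no indexes being used: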

select c1.email, c2.email, c2.domain, c1.domain, c1.username, c2.username, c1.leadid, c2.leadid
from table as c1
USE INDEX (domain)
join table as c2
USE INDEX (domain)
on (c1.email is not null and c2.email is not null and c1.leadid <> c2.leadid
    and c1.username = c2.username and c1.domain like concat(c2.domain, '%'))

Explain with USE INDEX:

1, SIMPLE, c1, ALL, NULL, NULL, NULL, NULL, 649173, Using where
1, SIMPLE, c2, ALL, NULL, NULL, NULL, NULL, 649173, Using where; Using join buffer (Block Nested Loop)

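4 answers

Vladimir Baranov · 6 years ago

You told us that the table has 700K rows.

This is not much, but you are joining the table to itself, so in the worst case the engine has to process 700K * 700K = 490 000 000 000 = 490B rows.

An index can definitely help here.

The best index depends on the data distribution.

What does the following query return?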

SELECT COUNT(DISTINCT username) 
FROM table

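If the result is large, say 100K or more (approaching the 700K row count), there are many distinct usernames and you are better off focusing on them rather than on domain. If the result is low, say 100, then an index on username is unlikely to be useful.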
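A slightly extended version of that check, as a sketch: it reports both candidate columns side by side so the more selective one is easy to spot. It assumes the username and domain columns from the question's second update; the output column names are made up for illustration.

```sql
-- Compare how selective each candidate join column is.
-- A ratio near 1 means almost every row has its own value.
SELECT COUNT(*)                            AS total_rows,
       COUNT(DISTINCT username)            AS distinct_usernames,
       COUNT(DISTINCT domain)              AS distinct_domains,
       COUNT(DISTINCT username) / COUNT(*) AS username_selectivity,
       COUNT(DISTINCT domain)   / COUNT(*) AS domain_selectivity
FROM `table`;
```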

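I expect there are many distinct usernames, so I would create an index on username: the query joins on that column with a simple equality comparison, and this join would benefit greatly from such an index.

Another option to consider is a composite index on (username, domain), or even a covering index on (username, domain, leadid, email). The order of the columns in the index definition matters.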
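A minimal sketch of the composite index, followed by an EXPLAIN to see whether the optimizer picks it up. The index name is hypothetical, and the prefix length on username is an assumption made to keep the key small, since the column is declared as varchar(500) in the question. The leadid column is used as in the question's queries, even though the posted CREATE TABLE shows ID.

```sql
-- Hypothetical index name; username(100) is an assumed prefix length.
ALTER TABLE `table`
    ADD INDEX idx_username_domain (username(100), domain);

-- Re-check the plan. The hope is that one side of the join now shows
-- the new index in the "key" column instead of a full scan (type ALL).
EXPLAIN
SELECT c1.email, c2.email
FROM `table` AS c1
JOIN `table` AS c2
  ON  c1.username = c2.username
  AND c1.domain LIKE CONCAT(c2.domain, '%')
  AND c1.leadid <> c2.leadid;
```

Whether the plan actually changes depends on the data distribution, so the EXPLAIN output is worth comparing before and after adding the index.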

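I would drop all the other indexes, so that the optimizer cannot make a different choice, unless other queries need them.

Most likely it would not hurt to define a primary key on the table as well.

One more, less important thing to consider: does your data really contain NULLs? If not, define the columns as NOT NULL. Also, in many cases it is better to store empty strings rather than NULLs, unless you have very specific requirements and must distinguish between NULL and ''.

The query would be slightly simpler: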

select 
    c1.email, c2.email, 
    c1.domain, c2.domain, 
    c1.username, c2.username, 
    c1.leadid, c2.leadid
from 
    table as c1
    join table as c2
        on  c1.username = c2.username 
        and c1.domain like concat(c2.domain, '%')
        and c1.leadid <> c2.leadid

Rick James diyism · 6 years ago

No REGEXP_REPLACE is needed, so this works in all versions of MySQL/MariaDB:

UPDATE tbl
    SET email = CONCAT(SUBSTRING_INDEX(email, '@', 1),
                       '@',
                       SUBSTRING_INDEX(
                           SUBSTRING_INDEX(email, '@', -1),
                           '.',
                           -2));

Since no index would be useful here, you may as well not bother with a WHERE clause.
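Because the UPDATE overwrites the stored address, it may be worth previewing the expression first. The sketch below applies the same CONCAT/SUBSTRING_INDEX expression to literals. The third address is a hypothetical input added to show a limitation: keeping only the last two labels does not handle two-part suffixes such as co.uk.

```sql
-- Preview the normalised address without touching any table.
SELECT e AS original,
       CONCAT(SUBSTRING_INDEX(e, '@', 1),
              '@',
              SUBSTRING_INDEX(SUBSTRING_INDEX(e, '@', -1), '.', -2)) AS normalised
FROM (
    SELECT 'john.smith12@example.com' AS e
    UNION ALL SELECT 'john.smith12@some.subdomains.example.com'
    UNION ALL SELECT 'jane@mail.example.co.uk'   -- hypothetical address
) AS samples;
-- The first two rows both normalise to john.smith12@example.com.
-- The third becomes jane@co.uk, which would merge unrelated co.uk domains.
```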

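Anthony BONNIER · 6 years ago

If you are searching for related data, you should look at data mining tools or Elasticsearch, for instance, which work the way you need.

I have another possible "database-only" solution, but I don't know whether it would work or whether it would be the best solution. If I had to do this, I would try to build a "word reference" table, filled by splitting all the emails on every non-alphanumeric character.

In your example, this table would be filled with: john, smith12, some, subdomains, example and com. Each word has a unique id. Then there is another table, a union table, which links each email to its own words. Both tables need indexes.

To search for closely related emails, you would split the source email with a regex and loop over each sub-word, like this one in the answer (with the CONNECT BY), then look up each word in the word reference table, and then find the emails that match it in the union table.

In this request, you can rank all the matched emails: group by email to count how many words each found email matches, and keep only the best-matching emails (excluding the source email, of course).

Sorry for this "not-sure answer", but it was too long for a comment. I will try to make an example.

Here is an example with some data (written for Oracle; the approach should carry over to MySQL, although the CONNECT BY and sequence syntax are Oracle-specific):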

---------------------------------------------
-- Table containing emails and people info
CREATE TABLE PEOPLE (
     ID NUMBER(11) PRIMARY KEY NOT NULL,
     EMAIL varchar2(100) DEFAULT NULL,
     USERNAME varchar2(500) DEFAULT NULL
);

-- Table containing word references
CREATE TABLE WORD_REF (
     ID number(11) NOT NULL PRIMARY KEY,
     WORD varchar2(20) DEFAULT NULL
);

-- Table containing the ids of both previous tables
CREATE TABLE UNION_TABLE (
     EMAIL_ID number(11) NOT NULL,
     WORD_ID number(11) NOT NULL,
     CONSTRAINT EMAIL_FK FOREIGN KEY (EMAIL_ID) REFERENCES PEOPLE (ID),
     CONSTRAINT WORD_FK FOREIGN KEY (WORD_ID) REFERENCES WORD_REF (ID)
);

-- Here is my oracle sequence to simulate the auto increment
CREATE SEQUENCE MY_SEQ
  MINVALUE 1
  MAXVALUE 999999
  START WITH 1
  INCREMENT BY 1
  CACHE 20;

---------------------------------------------
-- Some data in the people table
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'john.smith12@example.com', 'jsmith12');
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'john.smith12@some.subdomains.example.com', 'admin');
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'john.doe@another.domain.eu', 'jdo');
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'nathan.smith@example.domain.com', 'nsmith');
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'david.cayne@some.domain.st', 'davidcayne');
COMMIT;

-- Word reference data from the people data
INSERT INTO WORD_REF (ID, WORD) 
  (select MY_SEQ.NEXTVAL, WORD FROM
   (select distinct REGEXP_SUBSTR(EMAIL, '\w+',1,LEVEL) WORD
    from PEOPLE
    CONNECT BY REGEXP_SUBSTR(EMAIL, '\w+',1,LEVEL) IS NOT NULL
  ));
COMMIT;

-- Union table filling
INSERT INTO UNION_TABLE (EMAIL_ID, WORD_ID)
select words.ID EMAIL_ID, word_ref.ID WORD_ID
FROM 
(select distinct ID, REGEXP_SUBSTR(EMAIL, '\w+',1,LEVEL) WORD
 from PEOPLE
 CONNECT BY REGEXP_SUBSTR(EMAIL, '\w+',1,LEVEL) IS NOT NULL) words
left join WORD_REF on word_ref.word = words.WORD;
COMMIT;    

---------------------------------------------
-- Finally, the request which orders the emails that match the source email 'john.smith12@example.com'
SELECT COUNT(1) email_match
      ,email
FROM   (SELECT word_ref.id
              ,words.word
              ,uni.email_id
              ,ppl.email
        FROM   (SELECT DISTINCT regexp_substr('john.smith12@example.com'
                                             ,'\w+'
                                             ,1
                                             ,LEVEL) word
                FROM   dual
                CONNECT BY regexp_substr('john.smith12@example.com'
                                        ,'\w+'
                                        ,1
                                        ,LEVEL) IS NOT NULL) words
        LEFT   JOIN word_ref
        ON     word_ref.word = words.word
        LEFT   JOIN union_table uni
        ON     uni.word_id = word_ref.id
        LEFT   JOIN people ppl
        ON     ppl.id = uni.email_id)
WHERE  email <> 'john.smith12@example.com'
GROUP  BY email
ORDER  BY email_match DESC;

Request result:

    4    john.smith12@some.subdomains.example.com
    2    nathan.smith@example.domain.com
    1    john.doe@another.domain.eu

Thorsten Kettner · 6 years ago

You get the name with

substring_index(email, '@', 1)

and the domain with

substring_index(replace(email, '@', '.'), '.', -2)

(because if we replace the '@' with a dot, the domain is always the part after the second-to-last dot).
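A short worked example of that replace-then-split step, using literals. The second statement is a hypothetical edge case added for illustration: when the host has no dot, the result also contains the username part.

```sql
-- Replacing '@' with '.' and keeping the last two dot-separated parts.
SELECT REPLACE('john.smith12@some.subdomains.example.com', '@', '.') AS dotted,
       SUBSTRING_INDEX(
           REPLACE('john.smith12@some.subdomains.example.com', '@', '.'),
           '.', -2) AS domain;
-- dotted: john.smith12.some.subdomains.example.com
-- domain: example.com

-- Hypothetical edge case: a host without any dot.
SELECT SUBSTRING_INDEX(REPLACE('root@localhost', '@', '.'), '.', -2) AS domain;
-- domain: root.localhost
```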

So you can find the duplicates with

select *
from users
where exists
(
  select *
  from users other
  where other.id <> users.id
    and substring_index(other.email, '@', 1) = 
        substring_index(users.email, '@', 1)
    and substring_index(replace(other.email, '@', '.'), '.', -2) =
        substring_index(replace(users.email, '@', '.'), '.', -2)
);

If this is too slow, you may want to create a computed column that combines the two parts and index it:

-- Generated column (MySQL 5.7+ / MariaDB syntax needs a data type and parentheses)
alter table users add main_email varchar(255) as
  (concat(substring_index(email, '@', 1), '@', substring_index(replace(email, '@', '.'), '.', -2)));

create index idx on users(main_email);

select *
from users
where exists
(
  select *
  from users other
  where other.id <> users.id
    and other.main_email = users.main_email
);

Of course, you can also keep the two parts separate and index them:

alter table users add email_name varchar(100) as (substring_index(email, '@', 1));
alter table users add email_domain varchar(100) as (substring_index(replace(email, '@', '.'), '.', -2));

create index idx on users(email_name, email_domain);

select *
from users
where exists
(
  select *
  from users other
  where other.id <> users.id
    and other.email_name = users.email_name
    and other.email_domain = users.email_domain
);

Of course, if you allow both upper and lower case in the email address column, you will also want to apply LOWER in the expressions above (lower(email)).
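As an alternative way to list the duplicates, the sketch below groups on the lower-cased, normalised address instead of using EXISTS. This is a different technique from the queries above, shown only as an illustration. It assumes a users table with id and email columns; the alias normalised_email is made up and deliberately differs from any column name so that GROUP BY resolves to the expression.

```sql
-- One row per normalised address that occurs more than once,
-- with the ids of the rows that share it.
SELECT LOWER(CONCAT(SUBSTRING_INDEX(email, '@', 1),
                    '@',
                    SUBSTRING_INDEX(REPLACE(email, '@', '.'), '.', -2))) AS normalised_email,
       COUNT(*)                     AS occurrences,
       GROUP_CONCAT(id ORDER BY id) AS ids
FROM users
WHERE email LIKE '%@%'              -- skip NULL, empty or malformed values
GROUP BY normalised_email
HAVING COUNT(*) > 1;
```

Note that GROUP_CONCAT output is truncated at group_concat_max_len, so very large groups may not list every id.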