代码之家  ›  专栏  ›  技术社区  ›  Regressor

如何解析Google BigQuery中的电子邮件地址

  •  1
  • Regressor  · 技术社区  · 6 年前

    我在bigquery中有一个如下的表-

    with temp as (
    select "john.doe@company-Y.com" as email_id
    union all 
    select "hello.world@company-X.com" as email_id
    )
    select * from temp
    

    我想生成3个新列( firstname, lastname, company )来自 email_id 字段,以便输出-

    firstname, lastname, company
    john         doe      company-Y
    hello        world    company-X
    

    哪个bigquery函数可以用于此操作?

    1 回复  |  直到 6 年前
        1
  •  3
  •   Mikhail Berlyant    6 年前

    下面是BigQuery标准SQL

    有太多的方法可以做到这一点,下面是快速(首先想到)一:

    #standardSQL
    WITH temp AS (
      SELECT "john.doe@company-Y.com" AS email_id UNION ALL 
      SELECT "hello.world@company-X.com" AS email_id
    )
    SELECT 
      SPLIT(SPLIT(email_id, '@')[SAFE_OFFSET(0)], '.')[SAFE_OFFSET(0)] firstname,
      SPLIT(SPLIT(email_id, '@')[SAFE_OFFSET(0)], '.')[SAFE_OFFSET(1)] lastname,
      SPLIT(SPLIT(email_id, '@')[SAFE_OFFSET(1)], '.')[SAFE_OFFSET(0)] company
    FROM temp    
    

    以结果

    Row firstname   lastname    company  
    1   john        doe         company-Y    
    2   hello       world       company-X      
    

    但真正正确的解决方案将取决于数据的性质和模式,以及明显的个人偏好等。

    另一个快速的选择是

    #standardSQL
    WITH temp AS (
      SELECT "john.doe@company-Y.com" AS email_id UNION ALL 
      SELECT "hello.world@company-X.com" AS email_id 
    )
    SELECT 
      REGEXP_EXTRACT(email_id, r'^(.*?)[.@]') firstname,
      REGEXP_EXTRACT(email_id, r'\.(.*?)@') lastname,
      REGEXP_EXTRACT(email_id, r'@(.*?)\.') company
    FROM temp
    

    结果相同

    只需稍微扩展一点—这样您就可以看到改进的方向—例如,如果名称是由bu分隔的, . -

    #standardSQL
    WITH temp AS (
      SELECT "john.doe@company-Y.com" AS email_id UNION ALL 
      SELECT "hello.world@company-X.com" AS email_id UNION ALL
      SELECT "hello-world@company-X.com" AS email_id UNION ALL
      SELECT "hello@company-X.com" AS email_id
    )
    SELECT email_id,
      REGEXP_EXTRACT(email_id, r'^(.*?)[-.@]') firstname,
      REGEXP_EXTRACT(email_id, r'[-.](.*?)@') lastname,
      REGEXP_EXTRACT(email_id, r'@(.*?)\.') company
    FROM temp
    

    以结果

    Row email_id                    firstname   lastname    company  
    1   john.doe@company-Y.com      john        doe         company-Y    
    2   hello.world@company-X.com   hello       world       company-X    
    3   hello-world@company-X.com   hello       world       company-X    
    4   hello@company-X.com         hello       null        company-X