
Flow control based on Python CSV row values

  •  2
  • dongle  · Tech Community  · 12 years ago

    I'm working with a CSV that has the following structure:

    "2012-09-01 20:03:15","http://example.com"
    

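    The data is a cleaned-up dump of my browsing history. I'm interested in counting the first five unique domains visited each day. Here is what I have so far: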

    from urlparse import urlparse
    import csv
    from collections import Counter
    
    domains = Counter()
    
    with open("history.csv") as f:
        for row in csv.reader(f):
            d = row[0]
            dt = d[11:19]
            dt = dt.replace(":","")
            dd = d[0:10]
            if (dt < "090000") and (dt > "060000"):
                url = row[1]
                p = urlparse(url)
                ph = p.hostname
                print dd + "," + dt + "," + ph
                domains += Counter([ph])
    t = str(domains.most_common(20))
    

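    I use d, dt, and dd to separate the date and the time. For the example row above, dt is 20:03:15 (stored as 200315 once the colons are stripped) and dd is 2012-09-01. The test 'if (dt < "090000") and (dt > "060000")' just says that I only care about sites visited between 6 a.m. and 9 a.m. How would I say "count only the first five websites visited before 6 a.m. each day"? Any given day has hundreds of rows, and the rows are in chronological order.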

    2 answers  |  last active 5 years ago
        1
  •  3
  •   jfs    12 years ago

    "I'm interested in counting the first five unique domains visited each day."

    import csv
    from collections import defaultdict
    from datetime import datetime
    from urlparse import urlsplit
    
    domains = defaultdict(lambda: defaultdict(int))
    with open("history.csv", "rb") as f:
        for timestr, url in csv.reader(f):
            dt = datetime.strptime(timestr, "%Y-%m-%d %H:%M:%S")
            if 6 <= dt.hour < 9:  # between 6am and 9am
                today_domains = domains[dt.date()]  # per given day
                domain = urlsplit(url).hostname
                if len(today_domains) < 5 or domain in today_domains:
                    today_domains[domain] += 1  # count the first 5 unique domains
    
    print(domains)
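
For anyone on Python 3, here is a rough sketch of the same idea wrapped in a function so it can be tried without a file on disk. The function name `first_domains_per_day`, its `limit`, `start_hour`, and `end_hour` parameters, and the sample rows are all made up for illustration; `urlsplit` is imported from `urllib.parse`, which is where it lives in Python 3. It assumes, as the question states, that the rows are in chronological order.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime
from urllib.parse import urlsplit  # Python 3 location of urlsplit


def first_domains_per_day(rows, limit=5, start_hour=6, end_hour=9):
    """Count hits for the first `limit` unique domains seen each day.

    `rows` yields (timestamp, url) pairs in chronological order.
    Only rows with start_hour <= hour < end_hour are considered.
    """
    per_day = defaultdict(lambda: defaultdict(int))
    for timestr, url in rows:
        dt = datetime.strptime(timestr, "%Y-%m-%d %H:%M:%S")
        if not (start_hour <= dt.hour < end_hour):
            continue
        today = per_day[dt.date()]
        domain = urlsplit(url).hostname
        # Admit a new domain only while fewer than `limit` are tracked;
        # domains that are already tracked keep accumulating hits.
        if len(today) < limit or domain in today:
            today[domain] += 1
    return per_day


# Hypothetical sample data standing in for history.csv.
sample = io.StringIO(
    '"2012-09-01 05:59:59","http://early.example.org/"\n'
    '"2012-09-01 06:10:00","http://example.com/a"\n'
    '"2012-09-01 06:15:00","http://example.org/"\n'
    '"2012-09-01 07:00:00","http://example.net/"\n'
    '"2012-09-01 08:30:00","http://example.com/b"\n'
    '"2012-09-02 06:05:00","http://example.net/"\n'
)
result = first_domains_per_day(csv.reader(sample), limit=2)
for day, counts in sorted(result.items()):
    print(day, dict(counts))
```

With `limit=2`, the third domain seen on 2012-09-01 is ignored, while the later repeat visit to example.com still counts. To read the real file in Python 3, something like `open("history.csv", newline="")` in text mode should be passed to `csv.reader` instead of the `"rb"` mode used above.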
    
        2
  •  1
  •   dongle    12 years ago
    import csv
    from collections import defaultdict, Counter
    from datetime import datetime
    from urlparse import urlsplit
    
    indiv = Counter()
    
    domains = defaultdict(lambda: defaultdict(int))
    with open("history.csv", "rb") as f:
        for timestr, url in csv.reader(f):
            dt = datetime.strptime(timestr, "%Y-%m-%d %H:%M:%S")
            if 6 <= dt.hour < 11: # between 6am and 11am
                today_domains = domains[dt.date()]
                domain = urlsplit(url).hostname
                if len(today_domains) < 5 and domain not in today_domains:
                    today_domains[domain] += 1
                    indiv += Counter([domain])
    for domain in indiv:
        print '%s,%d' % (domain, indiv[domain])
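
The variant above counts each domain at most once per day and then sums across days, so the final number is "on how many days was this domain among the first five". A minimal Python 3 sketch of that logic follows, using a set per day instead of a nested defaultdict. The name `days_in_first_n`, its parameters, and the sample rows are hypothetical and only there to make the example self-contained.

```python
from collections import Counter
from datetime import datetime
from urllib.parse import urlsplit  # Python 3 location of urlsplit


def days_in_first_n(rows, limit=5, start_hour=6, end_hour=11):
    """Return a Counter: domain -> number of days on which it was one of
    the first `limit` unique domains visited inside the time window."""
    seen_by_day = {}    # date -> set of domains already admitted that day
    totals = Counter()
    for timestr, url in rows:
        dt = datetime.strptime(timestr, "%Y-%m-%d %H:%M:%S")
        if not (start_hour <= dt.hour < end_hour):
            continue
        seen = seen_by_day.setdefault(dt.date(), set())
        domain = urlsplit(url).hostname
        # Count a domain once per day, and only while the day's quota is open.
        if len(seen) < limit and domain not in seen:
            seen.add(domain)
            totals[domain] += 1
    return totals


# Hypothetical rows in chronological order, as (timestamp, url) pairs.
rows = [
    ("2012-09-01 06:10:00", "http://example.com/a"),
    ("2012-09-01 06:20:00", "http://example.com/b"),
    ("2012-09-01 07:00:00", "http://example.org/"),
    ("2012-09-02 10:59:59", "http://example.com/"),
    ("2012-09-02 11:00:00", "http://example.net/"),
]
for domain, days in days_in_first_n(rows).most_common():
    print(f"{domain},{days}")
# prints:
# example.com,2
# example.org,1
```

The 11:00:00 row falls outside the window because the upper bound is exclusive, matching the `6 <= dt.hour < 11` test in the original.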