Control flow based on Python CSV row values

urlparse control-flow csv python

dongle · 12 years ago

I'm working with a CSV that has the following structure:

"2012-09-01 20:03:15","http://example.com"

The data is a cleaned-up dump of my browsing history. I'm interested in counting the first five unique domains per day. Here's what I have so far:

from urlparse import urlparse
import csv
from collections import Counter

domains = Counter()

with open("history.csv") as f:
    for row in csv.reader(f):
        d = row[0]
        dt = d[11:19]
        dt = dt.replace(":","")
        dd = d[0:10]
        if (dt < "090000") and (dt > "060000"):
            url = row[1]
            p = urlparse(url)
            ph = p.hostname
            print dd + "," + dt + "," + ph
            domains += Counter([ph])
t = str(domains.most_common(20))

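I use d, dt, and dd to separate the date from the time. For the sample row above, dt = 20:03:15 (stored as 200315 once the colons are removed) and dd = 2012-09-01. The test `if (dt < "090000") and (dt > "060000")` just says that I only care about sites visited between 6 a.m. and 9 a.m. How would I say "only count the first five sites visited before 6 a.m. each day"? Any given day has hundreds of rows, and the rows are in chronological order.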
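To make the slicing concrete, here is a minimal sketch that applies the same string operations to the sample row. The variable names `timestamp`, `day`, `time_of_day`, and `in_morning_window` are illustrative and do not appear in the code above.

```python
# Sample timestamp taken from the row shown in the question.
timestamp = "2012-09-01 20:03:15"

day = timestamp[0:10]                              # "2012-09-01"
time_of_day = timestamp[11:19].replace(":", "")    # "200315"

# Zero-padded, fixed-width strings sort in the same order as the times
# they encode, so plain string comparison selects the 06:00-09:00 window.
in_morning_window = "060000" < time_of_day < "090000"

print(day, time_of_day, in_morning_window)
```

The comparison only works because every field is zero-padded to a fixed width; parsing the timestamp with `datetime.strptime` avoids relying on that.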

2 answers | last active 5 years ago

jfs · 12 years ago

I'm interested in counting the first five unique domains per day.

import csv
from collections import defaultdict
from datetime import datetime
from urlparse import urlsplit

domains = defaultdict(lambda: defaultdict(int))
with open("history.csv", "rb") as f:
    for timestr, url in csv.reader(f):
        dt = datetime.strptime(timestr, "%Y-%m-%d %H:%M:%S")
        if 6 <= dt.hour < 9:  # between 6am and 9am
            today_domains = domains[dt.date()]  # per given day
            domain = urlsplit(url).hostname
            if len(today_domains) < 5 or domain in today_domains:
                today_domains[domain] += 1  # count the first 5 unique domains

print(domains)
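The code above targets Python 2 (the `urlparse` module and a file opened in `"rb"` mode). For readers on Python 3, the same idea might be sketched as below. The function name `first_domains_per_day`, its parameters, and the in-memory sample rows are assumptions made for illustration. They are not part of the original answer or the asker's data.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime
from urllib.parse import urlsplit  # Python 3 home of urlsplit


def first_domains_per_day(lines, limit=5, start_hour=6, end_hour=9):
    """Count visits to the first `limit` unique domains seen each day
    between start_hour (inclusive) and end_hour (exclusive).

    `lines` is any iterable of CSV text lines in chronological order.
    """
    domains = defaultdict(dict)
    for timestr, url in csv.reader(lines):
        dt = datetime.strptime(timestr, "%Y-%m-%d %H:%M:%S")
        if not start_hour <= dt.hour < end_hour:
            continue  # outside the time window
        today = domains[dt.date()]
        domain = urlsplit(url).hostname
        # Keep counting a domain already tracked today; admit a new
        # domain only while fewer than `limit` are tracked.
        if domain in today or len(today) < limit:
            today[domain] = today.get(domain, 0) + 1
    return domains


# Hypothetical sample data in the question's format.
sample = io.StringIO(
    '"2012-09-01 06:10:00","http://a.example/x"\n'
    '"2012-09-01 06:20:00","http://b.example/"\n'
    '"2012-09-01 06:30:00","http://a.example/y"\n'
    '"2012-09-01 06:40:00","http://c.example/"\n'
    '"2012-09-01 20:03:15","http://example.com"\n'
    '"2012-09-02 07:00:00","http://c.example/"\n'
)
result = first_domains_per_day(sample, limit=2)
for day, counts in sorted(result.items()):
    print(day, counts)
```

With a real file, `open("history.csv", newline="")` can be passed in place of the `io.StringIO` object, since `csv.reader` accepts any iterable of text lines.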

dongle · 12 years ago

import csv
from collections import defaultdict, Counter
from datetime import datetime
from urlparse import urlsplit

indiv = Counter()

domains = defaultdict(lambda: defaultdict(int))
with open("history.csv", "rb") as f:
    for timestr, url in csv.reader(f):
        dt = datetime.strptime(timestr, "%Y-%m-%d %H:%M:%S")
        if 6 <= dt.hour < 11: # between 6am and 11am
            today_domains = domains[dt.date()]
            domain = urlsplit(url).hostname
            if len(today_domains) < 5 and domain not in today_domains:
                today_domains[domain] += 1
                indiv += Counter([domain])
for domain in indiv:
    print '%s,%d' % (domain, indiv[domain])
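The code above prints, for each domain, the number of days on which it was among that day's first five unique domains. A minimal Python 3 sketch of that tallying step, separated from the CSV parsing, could look like this. The helper name `tally_first_domains` and the `per_day` sample mapping are hypothetical.

```python
from collections import Counter
from datetime import date


def tally_first_domains(per_day):
    """For each domain, count the days on which it appears in that
    day's set of first-seen domains."""
    tally = Counter()
    for day_domains in per_day.values():
        # Pass a list of keys so each domain adds 1 per day; passing the
        # dict itself would add its per-day visit counts instead.
        tally.update(list(day_domains))
    return tally


# Hypothetical per-day result: {day: {domain: visits}}.
per_day = {
    date(2012, 9, 1): {"a.example": 2, "b.example": 1},
    date(2012, 9, 2): {"a.example": 1},
}
tally = tally_first_domains(per_day)
for domain, days in tally.most_common():
    print("%s,%d" % (domain, days))
```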
