
Game of Thrones Wikipedia Python scraper

  •  -3
  • Tadej Lorber  · Tech Community  · 8 years ago

    Something like this: S01E1: 2.2 million, S02E2: 2.2 million. Total views for season 1: xy

    Total: 398 million

    If anyone has done something similar, please help :)

    import re
    import urllib
    
    from BeautifulSoup import BeautifulSoup
    
    wiki_url = 'https://en.wikipedia.org/wiki/Game_of_Thrones'
    wiki_html = urllib.urlopen(wiki_url).read()
    wiki_content = BeautifulSoup(wiki_html)
    
    seasons_table = wiki_content.find('table', attrs={'class': 'wikitable'})
    seasons = seasons_table.findAll('a', attrs={'href': re.compile(r'/wiki/Game_of_Thrones_\(season_?[0-9]+\)')})
    
    views = 0
    
    for season in seasons:
        season_url = 'https://en.wikipedia.org' + season['href']
        season_html = urllib.urlopen(season_url).read()
        season_content = BeautifulSoup(season_html)
    
        episodes_table = season_content.find('table', attrs={'class': 'wikitable plainrowheaders wikiepisodetable'})
    
        if episodes_table:
            episode_rows = episodes_table.findAll('tr', attrs={'class': 'vevent'})
    
            if episode_rows:
                for episode_row in episode_rows:
                    episode_views = episode_row.findAll('td')[-1]
    
                    views += float(re.sub(r'\[?[0-9]+\]', '', episode_views.text))  # strip footnote markers such as [12] from the cell text, then convert what remains to a float
    
    print 'The total number of views is ' + str(views) + ' millions'
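
A note on the re.sub call in the loop above: it does not search for numbers, it removes bracketed footnote markers such as [12] from the cell text so that float() can parse what is left. A minimal sketch of that step, assuming cell texts shaped like the samples below (the footnote numbers are invented, and parse_views is a hypothetical helper name). The pattern here requires the opening bracket, which is slightly stricter than the one in the question.

```python
import re

# Matches a bracketed footnote marker such as "[12]".
FOOTNOTE = re.compile(r'\[[0-9]+\]')


def parse_views(cell_text):
    """Return the viewer count (in millions) found in a table cell's text."""
    cleaned = FOOTNOTE.sub('', cell_text).strip()
    return float(cleaned)


# Hypothetical cell texts; the footnote numbers are made up for illustration.
samples = ['2.22[12]', '2.20[13]', '3.04']
parsed = [parse_views(s) for s in samples]
print(parsed)  # prints [2.22, 2.2, 3.04]
```

If a cell ever held something other than a number (for example "N/A"), float() would raise ValueError, so a real scraper may want to catch that case.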
    
    2 replies  |  as of 8 years ago
        1
  •  0
  •   Miriam Farber    8 years ago

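    You can do it the way Ali showed, except that instead of only adding to the total you print each value as you go, and keep the sum for each season in a separate variable (totalViewsPerSeason in my example):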
    

    import re
    import urllib
    
    from BeautifulSoup import BeautifulSoup
    
    wiki_url = 'https://en.wikipedia.org/wiki/Game_of_Thrones'
    wiki_html = urllib.urlopen(wiki_url).read()
    wiki_content = BeautifulSoup(wiki_html)
    
    seasons_table = wiki_content.find('table', attrs={'class': 'wikitable'})
    seasons = seasons_table.findAll('a', attrs={'href': re.compile(r'/wiki/Game_of_Thrones_\(season_?[0-9]+\)')})
    
    views = 0
    grandTotalViews = 0
    season_num = 1
    
    for season in seasons:
        season_url = 'https://en.wikipedia.org' + season['href']
        season_html = urllib.urlopen(season_url).read()
        season_content = BeautifulSoup(season_html)
    
        episodes_table = season_content.find('table', attrs={'class': 'wikitable plainrowheaders wikiepisodetable'})
    
        if episodes_table:
            episode_rows = episodes_table.findAll('tr', attrs={'class': 'vevent'})
    
            if episode_rows:
                episode_num = 1
                totalViewsPerSeason = 0
                for episode_row in episode_rows:
                    episode_views = episode_row.findAll('td')[-1]
    
                    views = float(re.sub(r'\[?[0-9]+\]', '', episode_views.text))  # strip footnote markers such as [12] from the cell text, then convert what remains to a float
                    grandTotalViews += views
                    totalViewsPerSeason += views
                    print 'S' + str(season_num) + "E" + str(episode_num) + " : " + str(views) + " Millions"
                    episode_num += 1
    
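                # print the season total only when episode rows were found, so totalViewsPerSeason is always defined and never stale
                print "Total season " + str(season_num) + " views: " + str(totalViewsPerSeason) + " Millions\n"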
        season_num += 1
    
    print 'The total number of views is ' + str(grandTotalViews) + ' millions'
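
The bookkeeping described in this answer can be tried without any network access. The sketch below is only an illustration under stated assumptions: the numbers are the first three episodes of seasons 1 and 2 from the output listed in Ali's answer, and the dict layout and the names views_by_season, season_totals and grand_total are hypothetical. The key point is that the per-season sum is reset at the start of every season, while the grand total keeps accumulating.

```python
# Viewer counts in millions: first three episodes of seasons 1 and 2,
# taken from the output in this thread. The dict layout is an assumption.
views_by_season = {
    1: [2.22, 2.2, 2.44],
    2: [3.86, 3.76, 3.77],
}

grand_total = 0.0
season_totals = {}

for season_num in sorted(views_by_season):
    season_total = 0.0  # reset for every season
    for episode_num, views in enumerate(views_by_season[season_num], start=1):
        print('S{}E{} : {} Millions'.format(season_num, episode_num, views))
        season_total += views
    season_totals[season_num] = season_total
    grand_total += season_total
    print('Total season {} views: {:.2f} Millions'.format(season_num, season_total))

print('The total number of views is {:.2f} millions'.format(grand_total))
```

Formatting the totals with {:.2f} hides the small floating-point rounding noise that repeated addition of decimals can produce.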
    
        2
  •  0
  •   Ali    8 years ago

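    No extra parsing work is needed here. All I had to do was work out how to print the results on screen in the format you want, which is more a matter of string manipulation.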

    import re
    import urllib
    from bs4 import BeautifulSoup
    
    wiki_url = 'https://en.wikipedia.org/wiki/Game_of_Thrones'
    wiki_html = urllib.urlopen(wiki_url).read()
    wiki_content = BeautifulSoup(wiki_html, 'html.parser')
    seasons_table = wiki_content.find('table', attrs={'class': 'wikitable'})
    seasons = seasons_table.findAll('a', attrs={'href': re.compile(r'/wiki/Game_of_Thrones_\(season_?[0-9]+\)')})
    
    views = 0
    total = 0
    season_num = 1
    for season in seasons:
        season_url = 'https://en.wikipedia.org' + season['href']
        season_html = urllib.urlopen(season_url).read()
        season_content = BeautifulSoup(season_html,'html.parser')
        episodes_table = season_content.find('table', attrs={'class': 'wikitable plainrowheaders wikiepisodetable'})
        if episodes_table:
            episode_rows = episodes_table.findAll('tr', attrs={'class': 'vevent'})
            if episode_rows:
                episode_num = 1
                for episode_row in episode_rows:
                    episode_views = episode_row.findAll('td')[-1]
                    # strip footnote markers such as [12] from the cell text, then convert what remains to a float
                    views = float(re.sub(r'\[?[0-9]+\]', '', episode_views.text))
                    total += views  # reuse the parsed value instead of running the regex a second time
                    print 'S' + str(season_num) + "E" + str(episode_num) + " : " + str(views) + " Millions"
                    episode_num += 1
        season_num += 1
    
    print 'The total number of views is ' + str(total) + ' millions'
    

    Output:

    S1E1 : 2.22 Millions
    S1E2 : 2.2 Millions
    S1E3 : 2.44 Millions
    S1E4 : 2.45 Millions
    S1E5 : 2.58 Millions
    S1E6 : 2.44 Millions
    S1E7 : 2.4 Millions
    S1E8 : 2.72 Millions
    S1E9 : 2.66 Millions
    S1E10 : 3.04 Millions
    S2E1 : 3.86 Millions
    S2E2 : 3.76 Millions
    S2E3 : 3.77 Millions
    S2E4 : 3.65 Millions
    S2E5 : 3.9 Millions
    S2E6 : 3.88 Millions
    S2E7 : 3.69 Millions
    S2E8 : 3.86 Millions
    S2E9 : 3.38 Millions
    S2E10 : 4.2 Millions
    S3E1 : 4.37 Millions
    S3E2 : 4.27 Millions
    S3E3 : 4.72 Millions
    S3E4 : 4.87 Millions
    S3E5 : 5.35 Millions
    S3E6 : 5.5 Millions
    S3E7 : 4.84 Millions
    S3E8 : 5.13 Millions
    S3E9 : 5.22 Millions
    S3E10 : 5.39 Millions
    S4E1 : 6.64 Millions
    S4E2 : 6.31 Millions
    S4E3 : 6.59 Millions
    S4E4 : 6.95 Millions
    S4E5 : 7.16 Millions
    S4E6 : 6.4 Millions
    S4E7 : 7.2 Millions
    S4E8 : 7.17 Millions
    S4E9 : 6.95 Millions
    S4E10 : 7.09 Millions
    S5E1 : 8.0 Millions
    S5E2 : 6.81 Millions
    S5E3 : 6.71 Millions
    S5E4 : 6.82 Millions
    S5E5 : 6.56 Millions
    S5E6 : 6.24 Millions
    S5E7 : 5.4 Millions
    S5E8 : 7.01 Millions
    S5E9 : 7.14 Millions
    S5E10 : 8.11 Millions
    S6E1 : 7.94 Millions
    S6E2 : 7.29 Millions
    S6E3 : 7.28 Millions
    S6E4 : 7.82 Millions
    S6E5 : 7.89 Millions
    S6E6 : 6.71 Millions
    S6E7 : 7.8 Millions
    S6E8 : 7.6 Millions
    S6E9 : 7.66 Millions
    S6E10 : 8.89 Millions
    S7E1 : 10.11 Millions
    S7E2 : 9.27 Millions
    S7E3 : 9.25 Millions
    S7E4 : 10.17 Millions
    S7E5 : 10.72 Millions
    S7E6 : 10.24 Millions
    S7E7 : 12.07 Millions
    The total number of views is 398.73 millions
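
Since the remaining work is string formatting, here is a small sketch of one way to build the labels with str.format. It assumes zero-padded season and episode numbers (S01E01), which is only one reading of the format asked for in the question; episode_label and episode_line are hypothetical helper names.

```python
def episode_label(season_num, episode_num):
    """Build a label such as 'S01E02' with zero-padded numbers."""
    return 'S{:02d}E{:02d}'.format(season_num, episode_num)


def episode_line(season_num, episode_num, views_millions):
    """Format one output line, with the view count rounded to two decimals."""
    label = episode_label(season_num, episode_num)
    return '{}: {:.2f} million'.format(label, views_millions)


print(episode_line(1, 1, 2.22))   # S01E01: 2.22 million
print(episode_line(7, 7, 12.07))  # S07E07: 12.07 million
```

Using {:.2f} keeps every line the same width, so 2.2 is shown as 2.20.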