
How do I select a changing selector in CSS?

  •  2
  • user3656280  · 7 years ago

    I want to get the titles from Tmdb, but each title has a different selector. Is there a way to grab them all at once?

    For example: the CSS selector for Birdman is 7, for Star Wars it is 9, and other movies have different ones again.

    You might ask why I don't just get the titles like this — but I need to visit each movie's page in order to get the genres.

    class PosterSpider(scrapy.Spider):
        name = "movieposter - imgsearch"
        start_urls = ["https://www.themoviedb.org/?language=en"]

        def parse(self, response):
            url = response.css('.logo ~ li:nth-child(3) > a')
            yield scrapy.Request(url.xpath("@href").extract_first(), self.parse_page)
    
        def parse_page(self, response):
            """
            Press the 'next' button and walk through each movie poster.
            """

            for href in response.css('.view_more .result'):
                yield scrapy.Request(href.xpath('@href').extract_first(), self.parse_covers)

            next = response.css('.glyphicons-circle-arrow-right')
            yield scrapy.Request(next.xpath("@href").extract_first(), self.parse_page)

        def parse_covers(self, response):
            img = response.css('.zoom a')

            # what to put for this selector?
            title = response.css().extract_first()

            genre = response.css('.genres a').extract_first()

            yield MoviePoster(title=title, genre=genre, file_urls=[])
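    For reference, the poster links on the listing page carry the movie name in a `title` attribute (visible in the markup quoted in the answers below), so one attribute-based selector covers every movie regardless of its position. A minimal sketch using parsel (the selector library underneath Scrapy) on an invented sample:

    ```python
    from parsel import Selector

    # Invented sample mirroring the listing markup: each poster link
    # carries the movie title in its `title` attribute.
    html = '''
    <a class="result" id="movie_150540" title="Inside Out" href="/movie/150540"></a>
    <a class="result" id="movie_194662" title="Birdman" href="/movie/194662"></a>
    '''

    sel = Selector(text=html)
    # One attribute-based selector, no per-movie nth-child indices needed.
    titles = sel.css('a.result::attr(title)').getall()
    print(titles)  # ['Inside Out', 'Birdman']
    ```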
    
    2 Answers  |  7 years ago
        1
  •  0
  •   G_M    7 years ago

    When a site has an API (and it exposes the information you are looking for), I think you should use that instead of scraping. The TheMovieDB API seems to allow 4 requests per second, and registering for a key only takes a minute.

    The script below (written against Python 3.6.4) uses total_pages=100 (you can set up to 1000 according to the API); each page returns 20 movies as JSON. I had to make a separate API call to translate genre ids into human-readable names, but everything seems to work fine. At 100 pages the code took about 40 seconds to run, and it saves all of the results to a file so you can reuse them later.

    import json
    import time
    
    import requests
    
    
    class PopularMovies:
        API_KEY = 'YOUR_API_KEY'
        BASE_URL = 'https://api.themoviedb.org/3'
    
        def __init__(self):
            self.session = requests.Session()
            self.genres = self._get_genres()
            self.popular_movies = []
    
        def _get_genres(self):
            params = {'api_key': self.API_KEY}
            r = self.session.get(
                '{}/genre/movie/list'.format(self.BASE_URL),
                params=params
            )
            r.raise_for_status()
            result = {}
            for genre in r.json()['genres']:
                result[genre['id']] = genre['name']
            return result
    
        def _add_readable_genres(self):
            for i in range(len(self.popular_movies)):
                current = self.popular_movies[i]
                genre_ids = current['genre_ids']
                current.update({
                    'genres': sorted(self.genres[g_id] for g_id in genre_ids)
                })
    
        def _get_popular_movies_page(self, *, page_num):
            params = {
                'api_key': self.API_KEY,
                'page': page_num,
                'sort_by': 'popularity.desc'
            }
            r = self.session.get(
                '{}/discover/movie'.format(self.BASE_URL),
                params=params
            )
            r.raise_for_status()
            return r.json()
    
        def get_popular_movie_pages(self, *, total_pages=1):
            if not (1 <= total_pages <= 1000):
                raise ValueError('total_pages must be between 1-1000')
    
            for page_num in range(1, total_pages + 1):
                movies = self._get_popular_movies_page(page_num=page_num)
                self.popular_movies.extend(movies['results'])
                time.sleep(0.25)  # 40 requests every 10 seconds, 1 every 0.25sec
    
            self._add_readable_genres()
    
        def write_to_file(self, *, filename='popular_movies.json'):
            with open(filename, 'w') as f:
                json.dump(self.popular_movies, f, indent=4)
    
    
    if __name__ == '__main__':
        movies = PopularMovies()
        movies.get_popular_movie_pages(total_pages=100)
        movies.write_to_file()
    
        # just to show that you can easily pick out the data you want
        with open('popular_movies.json', 'r') as f:
            movies = json.load(f)
            for i, movie in enumerate(movies, start=1):
                print('Title:\n\t{}'.format(movie['title']))
                print('Genre:')
                for genre in movie['genres']:
                    print('\t{}'.format(genre))
                print('-' * 20)
    

    The console output of this script is too long to include in this answer, but here is a link to it.

    And here is a link to popular_movies.json, which shows the extra information you get for each movie (so you can expand beyond just titles and genres in the future).
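    Once the JSON is on disk, picking data out of it is plain Python. A minimal sketch, using a hypothetical two-entry sample that mirrors the shape written by `write_to_file()` (real entries carry many more fields from the `/discover/movie` endpoint):

    ```python
    import json

    # Hypothetical sample in the same shape as popular_movies.json.
    data = json.loads('''[
        {"title": "Inside Out", "genres": ["Animation", "Family"]},
        {"title": "Birdman", "genres": ["Comedy", "Drama"]}
    ]''')

    # Pick out every title tagged with a given genre.
    comedies = [m['title'] for m in data if 'Comedy' in m['genres']]
    print(comedies)  # ['Birdman']
    ```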

        2
  •  0
  •   Bill Bell    7 years ago

    Not what you asked for, I think, but here is one way of doing what you want.

    The usual preparations:

    >>> import requests
    >>> page = requests.get('https://www.themoviedb.org/movie?page=3&language=en').text
    >>> import bs4
    >>> soup = bs4.BeautifulSoup(page, 'lxml')
    

    Now use find_all with a Python function that identifies id attributes matching 'movie_'.

    >>> def movie_id(id):
    ...     return id and bs4.re.compile(r'^movie_').match(id)
    ... 
    >>> movies = soup.find_all(id=movie_id)
    

    There are 61 of them in the page you highlighted for consideration.

    >>> len(movies)
    61
    

    Here are the contents of the first item.

    >>> movies[0]
    <a alt="Inside Out" class="result" href="/movie/150540?language=en" id="movie_150540" title="Inside Out">
    <img alt="Inside Out" class="poster lazyload fade" data-sizes="auto" data-src="https://image.tmdb.org/t/p/w185_and_h278_bestv2/aAmfIX3TT40zUHGcCKrlOZRKC7u.jpg" data-srcset="https://image.tmdb.org/t/p/w185_and_h278_bestv2/aAmfIX3TT40zUHGcCKrlOZRKC7u.jpg 1x, https://image.tmdb.org/t/p/w370_and_h556_bestv2/aAmfIX3TT40zUHGcCKrlOZRKC7u.jpg 2x"/>
    <div class="meta">
    <span class="hide popularity_rank_value" id="popularity_50cdfd9c19c2957b79385f6e_value">
    <div class="tooltip_popup popularity">
    <h3>Popularity Rank</h3>
    <p>Today: 42</p>
    <p>Last Week: 132</p>
    </div>
    </span>
    <span class="glyphicons glyphicons-cardio x1 popularity_rank" id="popularity_50cdfd9c19c2957b79385f6e"></span>
    <span class="right">
    </span>
    </div>
    </a>
    

    You can pick out the title like this.

    >>> movies[0].attrs['title']
    'Inside Out'
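    Putting the two steps together, the titles for every movie on a page can be collected in one pass. A minimal offline sketch of the same approach (the sample HTML is invented for illustration, so it is reproducible without a network request):

    ```python
    import re
    from bs4 import BeautifulSoup

    # Invented sample: two movie links plus one unrelated link.
    html = '''
    <a class="result" id="movie_150540" title="Inside Out" href="/movie/150540"></a>
    <a class="result" id="movie_194662" title="Birdman" href="/movie/194662"></a>
    <a class="other" id="nav_home" title="Home" href="/"></a>
    '''

    def movie_id(id):
        # Keep only elements whose id starts with "movie_".
        return id and re.match(r'^movie_', id)

    soup = BeautifulSoup(html, 'html.parser')
    movies = soup.find_all(id=movie_id)
    titles = [m.attrs['title'] for m in movies]
    print(titles)  # ['Inside Out', 'Birdman']
    ```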