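In my opinion, when a site has an API (and it has the information you're looking for), you should use it rather than scraping. The TheMovieDB API appears to allow 40 requests every 10 seconds (about 4 per second), and signing up takes only a minute.

The script below (written with Python 3.6.4) uses total_pages=100 (according to the API you can set this as high as 1000), and each page returns 20 movies as JSON. I had to make a separate API call to get the human-readable genre names, but everything seems to work well. For 100 pages, the code took about 40 seconds to run, after which it saves all the results to a file for later use.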

import json
import time

import requests


class PopularMovies:
    API_KEY = 'YOUR_API_KEY'
    BASE_URL = 'https://api.themoviedb.org/3'

    def __init__(self):
        self.session = requests.Session()
        self.genres = self._get_genres()
        self.popular_movies = []

    def _get_genres(self):
        # Separate call that maps genre ids to human-readable names.
        params = {'api_key': self.API_KEY}
        r = self.session.get(
            '{}/genre/movie/list'.format(self.BASE_URL),
            params=params
        )
        r.raise_for_status()
        result = {}
        for genre in r.json()['genres']:
            result[genre['id']] = genre['name']
        return result

    def _add_readable_genres(self):
        # Add a sorted 'genres' list of names next to each movie's 'genre_ids'.
        for current in self.popular_movies:
            genre_ids = current['genre_ids']
            current.update({
                'genres': sorted(self.genres[g_id] for g_id in genre_ids)
            })

    def _get_popular_movies_page(self, *, page_num):
        params = {
            'api_key': self.API_KEY,
            'page': page_num,
            'sort_by': 'popularity.desc'
        }
        r = self.session.get(
            '{}/discover/movie'.format(self.BASE_URL),
            params=params
        )
        r.raise_for_status()
        return r.json()

    def get_popular_movie_pages(self, *, total_pages=1):
        if not (1 <= total_pages <= 1000):
            raise ValueError('total_pages must be between 1-1000')
        for page_num in range(1, total_pages + 1):
            movies = self._get_popular_movies_page(page_num=page_num)
            self.popular_movies.extend(movies['results'])
            time.sleep(0.25)  # 40 requests every 10 seconds, 1 every 0.25sec
        self._add_readable_genres()

    def write_to_file(self, *, filename='popular_movies.json'):
        with open(filename, 'w') as f:
            json.dump(self.popular_movies, f, indent=4)


if __name__ == '__main__':
    movies = PopularMovies()
    movies.get_popular_movie_pages(total_pages=100)
    movies.write_to_file()

    # just to show that you can easily pick out the data you want
    with open('popular_movies.json', 'r') as f:
        saved_movies = json.load(f)

    for movie in saved_movies:
        print('Title:\n\t{}'.format(movie['title']))
        print('Genre:')
        for genre in movie['genres']:
            print('\t{}'.format(genre))
        print('-' * 20)

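
A side note on the fixed time.sleep(0.25): it waits a full quarter second on top of however long each request itself takes, which is probably part of why 100 pages took closer to 40 seconds than 25. Below is a minimal sketch of a throttle that only sleeps for the part of the interval that has not already elapsed. Throttle is a hypothetical helper of my own, not something from requests or the TheMovieDB API, and the clock and sleep parameters exist only so it can be exercised without real waiting.

```python
import time


class Throttle:
    """Hypothetical helper: allow at most one call per `min_interval` seconds."""

    def __init__(self, min_interval, *, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous wait(), if any

    def wait(self):
        """Sleep only for whatever part of the interval has not yet elapsed."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

To try it, you could create throttle = Throttle(0.25) once and call throttle.wait() before each page request in place of the time.sleep(0.25) line. I have not measured how much time this saves in practice.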
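The console output of this script is too long to include in this post, but here is a link to it. And here is a link to popular_movies.json, which shows the extra information you get for each movie (so you can expand beyond just title and genre in the future).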
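
As a rough sketch of that kind of extension, the snippet below pulls a couple of extra fields out of each saved record. The field names release_date and vote_average are assumptions about what the records contain, so check them against your own popular_movies.json; .get() is used so that a missing field gives None instead of an error. summarize and load_summaries are hypothetical helper names, not part of the script above.

```python
import json


def summarize(movie):
    """Return a compact view of one movie record from popular_movies.json."""
    return {
        'title': movie['title'],
        'genres': movie['genres'],
        # Assumed field names; .get() returns None if a record lacks them.
        'release_date': movie.get('release_date'),
        'vote_average': movie.get('vote_average'),
    }


def load_summaries(filename='popular_movies.json'):
    """Load the saved file and summarize every movie in it."""
    with open(filename, 'r') as f:
        return [summarize(movie) for movie in json.load(f)]
```

Calling load_summaries() after the main script has run gives a list of small dicts that is easier to filter or sort than the full records.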