BeautifulSoup returns a 403 error for some sites

  • Axle Max  · 7 years ago

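    If I visit the URLs manually, the pages load fine. There is no error message other than the 403 response, so I don't know how to diagnose the problem.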

    from bs4 import BeautifulSoup
    import requests    
    
    test_sites = [
     'http://fashiontoast.com/',
     'http://becauseimaddicted.net/',
     'http://www.lefashion.com/',
     'http://www.seaofshoes.com/',
     ]
    
    for site in test_sites:
        print(site)
        # get page source
        response = requests.get(site)
        print(response)
        #print(response.text)
    

    Running the code above produces:

    http://fashiontoast.com/
    <Response [403]>
    http://becauseimaddicted.net/
    <Response [403]>
    http://www.lefashion.com/
    <Response [200]>
    http://www.seaofshoes.com/
    <Response [200]>

    Can anyone help me understand the cause of the problem and how to fix it?

    1 answer
  •   chitown88    7 years ago

    Sometimes a site rejects GET requests that do not identify themselves with a User-Agent header.
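
    To see what such a server actually receives, it can help to look at the headers that requests sends when none are supplied. The sketch below only prepares a request and never sends it; `http://example.com/` is a placeholder URL chosen for illustration, not one of the sites from the question.

```python
import requests

# The User-Agent that requests sends when none is supplied,
# of the form "python-requests/<version>". Some servers answer it with a 403.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# Prepare (but do not send) a GET request so its outgoing headers can be inspected.
# "http://example.com/" is a placeholder URL used only for illustration.
session = requests.Session()
prepared = session.prepare_request(requests.Request("GET", "http://example.com/"))
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

    If the User-Agent printed here is the library default, a server that filters on that header can tell the request did not come from a browser.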

    Open the page in a browser (Chrome), right-click and choose "Inspect". Then copy the User-Agent header of the GET request (look in the Network tab).

    from bs4 import BeautifulSoup
    import requests

    test_sites = [
        'http://fashiontoast.com/',
        'http://becauseimaddicted.net/',
        'http://www.lefashion.com/',
        'http://www.seaofshoes.com/',
    ]

    with requests.Session() as se:
        # Send browser-like headers with every request made through this session.
        se.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
            "Accept-Encoding": "gzip, deflate",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Language": "en"
        })

        # Keep the requests inside the with-block so the session is still open.
        for site in test_sites:
            print(site)
            # get page source
            response = se.get(site)
            print(response)
            #print(response.text)
    

    Output:

    http://fashiontoast.com/
    <Response [200]>
    http://becauseimaddicted.net/
    <Response [200]>
    http://www.lefashion.com/
    <Response [200]>
    http://www.seaofshoes.com/
    <Response [200]>
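
    To check whether the User-Agent really is what a given site objects to, one option is to log the status next to the User-Agent that was sent. This is a minimal sketch: `describe_response` and `fetch_and_describe` are hypothetical helper names, and the 10-second timeout is an arbitrary choice.

```python
import requests


def describe_response(response):
    """Summarize a response together with the User-Agent of the request behind it."""
    sent_ua = None
    if response.request is not None:
        sent_ua = response.request.headers.get("User-Agent")
    return (
        f"{response.url} -> {response.status_code} {response.reason} "
        f"(User-Agent sent: {sent_ua})"
    )


def fetch_and_describe(url, user_agent=None):
    """Hypothetical helper: GET `url`, optionally overriding the User-Agent."""
    headers = {"User-Agent": user_agent} if user_agent else None
    response = requests.get(url, headers=headers, timeout=10)
    return describe_response(response)
```

    Calling `fetch_and_describe(site)` and then `fetch_and_describe(site, user_agent=...)` with the string copied from the browser shows both results side by side. If a site still returns 403 with a browser User-Agent, something other than that header is being checked.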