
Cannot get image URLs, only data:image/gif;base64, when scraping a website

  •  -1
  • Gavin Alfaro  · Tech Community  · 2 years ago

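    I set up a simple Python script to scrape product names and image URLs from H&M. The names come back without any problem, but real image URLs are only returned for the first few products; after that, every value has the following form: "data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7". I have tried both requests and Selenium with chromedriver. What am I missing?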

    First attempt (requests):

    import requests
    from bs4 import BeautifulSoup
    
    # URL of the H&M men's section
    url = "https://www2.hm.com/en_us/men/products/view-all.html?page=1"
    
    # Headers to mimic a browser visit
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Referer": "https://www.google.com/",
        "Connection": "keep-alive"
    }
    
    # Send a GET request to the webpage
    response = requests.get(url, headers=headers)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content
        soup = BeautifulSoup(response.content, 'html.parser')
    
        # Find all the product items
        items = soup.find_all('article', class_='f0cf84')
    
        # Iterate over the items and extract the name and image URL
        for item in items:
            # Extract the product name
            name = item.find('a', class_='db7c79')['title']
            
            # Extract the image URL (the 'src' attribute of the <img> tag)
            img_tag = item.find('img', imagetype='PRODUCT_IMAGE')
            img_url = img_tag['src'] if img_tag else 'No image'
    
            # Print the name and image URL
            print(f"Product Name: {name}")
            print(f"Image URL: {img_url}\n")
    else:
        print(f"Failed to retrieve the page. Status code: {response.status_code}")
    
    

    Second attempt (Selenium):

    from bs4 import BeautifulSoup
    from selenium import webdriver
    
    driver = webdriver.Chrome()
    
    # URL of the H&M men's section
    url = "https://www2.hm.com/en_us/men/products/view-all.html?page=1"
    
    # Open the webpage
    driver.get(url)
    
    # Get the page source and parse it with BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    
    # Find all the product items
    items = soup.find_all('article', class_='f0cf84')
    
    # Iterate over the items and extract the name and image URL
    for item in items:
        # Extract the product name
        name = item.find('a', class_='db7c79')['title']
    
        # Extract the image URL (the 'src' attribute of the <img> tag)
        img_tag = item.find('img', imagetype='PRODUCT_IMAGE')
        img_url = img_tag['src'] if img_tag else 'No image'
    
        # Print the name and image URL
        print(f"Product Name: {name}")
        print(f"Image URL: {img_url}\n")
    
    # Quit the WebDriver
    driver.quit()
    

    The output was the same both times:

    Product Name: Baggy Jeans
    Image URL: https://image.hm.com/assets/hm/9e/53/9e53035efef96606bc4b50eaf6a0eee4f08a152c.jpg?imwidth=1536
    
    Product Name: Regular Fit Cotton Shorts
    Image URL: https://image.hm.com/assets/hm/8f/d8/8fd8d52f2e2c778041410f9a2727b448053ca8b7.jpg?imwidth=1536
    
    Product Name: Regular Fit Linen-blend Shorts
    Image URL: https://image.hm.com/assets/hm/d7/54/d7546a095c04387d1ad98575588c84e0426fb4be.jpg?imwidth=1536
    
    Product Name: Muscle Fit Cotton Shirt
    Image URL: https://image.hm.com/assets/hm/c7/d4/c7d49cef60f9d196d2f5347815f416bba7d4b636.jpg?imwidth=1536
    
    Product Name: Slim Fit Ribbed Tank Top
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Slim Fit Jacket
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: 5-pack Slim Fit T-shirts
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Regular Fit Linen-blend Resort Shirt
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Slim Fit Suit Pants
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Regular Fit Cotton Shorts
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Slim Fit Suit Pants
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Slim Fit Polo Shirt
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Baggy Jeans
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Slim Fit Half-zip Polo Shirt
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Slim Fit Linen Jacket
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Loose Fit Cargo Jeans
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Regular Fit Nylon Cargo Shorts
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Loose Fit T-shirt
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Loose Jeans
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Swim Shorts
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Regular Fit Chino Shorts
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Muscle Fit Polo Shirt
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Regular Fit Linen-blend Pants
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: 5-pack Short Cotton Boxer Shorts
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Regular Fit Cropped Cotton Chinos
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Regular Fit Linen-blend Shirt
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Slim Fit Linen Suit Pants
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Regular Fit Linen-blend Shorts
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Swim Shorts
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Patterned Swim Shorts
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Patterned Swim Shorts
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Regular Fit T-shirt
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Regular Fit T-shirt
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Regular Fit Sweatshorts
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Regular Fit Cotton Shorts
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    Product Name: Slim Fit T-shirt
    Image URL: data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
    
    2 answers  |  up to 2 years ago
        1
  •  0
  •   Sergey K    2 years ago

    You can retrieve the information with requests from the static JSON embedded at the end of the page:

    import json  # required in addition to the imports from the question
    
    if response.status_code == 200:
        # Parse the HTML content
        soup = BeautifulSoup(response.content, 'html.parser')
    
        # Load the JSON that Next.js embeds in the __NEXT_DATA__ script tag
        data = json.loads(soup.find('script', {'id': '__NEXT_DATA__'}).get_text())
    
        # Each entry in 'hits' describes one product on the listing page
        for item in data['props']['pageProps']['plpProps']['productListingProps']['hits']:
            print(f"Product Name: {item['title']}")
            print(f"Image URL: {'https://image.hm.com/' + item['imageProductSrc']}\n")
    

    Output:

    Product Name: Baggy Jeans
    Image URL: https://image.hm.com/assets/hm/3d/dd/3ddd1d7ee3dece637a88557a759b3502868b6ccd.jpg
    
    Product Name: Regular Fit Cotton Shorts
    Image URL: https://image.hm.com/assets/hm/00/79/00792d85a6f093d63513805bb50755be65e625b6.jpg
    
    Product Name: Regular Fit Linen-blend Shorts
    Image URL: https://image.hm.com/assets/hm/84/57/8457f6ee69e78e52a6a066f59dab6d416f4755d6.jpg
    
    Product Name: Muscle Fit Cotton Shirt
    Image URL: https://image.hm.com/assets/hm/c7/d4/c7d49cef60f9d196d2f5347815f416bba7d4b636.jpg
    
    Product Name: Slim Fit Ribbed Tank Top
    Image URL: https://image.hm.com/assets/hm/c1/2a/c12a71d4223049325463e8858352d9c88e5d1590.jpg
    
    Product Name: Slim Fit Jacket
    Image URL: https://image.hm.com/assets/hm/19/63/1963f78945f45ae8ea8c7f784c48197d7579675e.jpg
    
    Product Name: 5-pack Slim Fit T-shirts
    Image URL: https://image.hm.com/assets/hm/ef/84/ef847140f2084137e9142930801734502ab52ace.jpg
    
    Product Name: Regular Fit Linen-blend Resort Shirt
    Image URL: https://image.hm.com/assets/hm/64/5d/645da6d33d00dff6c973409498e0165435c0f35e.jpg
    
    Product Name: Slim Fit Suit Pants
    Image URL: https://image.hm.com/assets/hm/9a/11/9a113712bb917e853c24d444d7bf6dda63e84f0b.jpg
    
    Product Name: Regular Fit Cotton Shorts
    Image URL: https://image.hm.com/assets/hm/dd/a6/dda65d63ce74413808875fda0348e03878832232.jpg
    
    Product Name: Slim Fit Suit Pants
    Image URL: https://image.hm.com/assets/hm/c7/ff/c7ff2ba7d6eca8119908fcd7daf9066d3a8412dd.jpg
    
    Product Name: Slim Fit Polo Shirt
    Image URL: https://image.hm.com/assets/hm/fc/d7/fcd760bafdfb48ea8cccde14c5e3ad338dd96bfd.jpg
    
    Product Name: Baggy Jeans
    Image URL: https://image.hm.com/assets/hm/bc/fd/bcfd3f19e72773a735a3261355f490f6e2554238.jpg
    
    Product Name: Slim Fit Half-zip Polo Shirt
    Image URL: https://image.hm.com/assets/hm/03/3d/033dd2b17620eac8ebdf949c76cdc2b046a6bbd6.jpg
    
    Product Name: Slim Fit Linen Jacket
    Image URL: https://image.hm.com/assets/hm/5e/96/5e96bc27780b7002c2d97993b4f94bbfde01d610.jpg
    
    Product Name: Loose Fit Cargo Jeans
    Image URL: https://image.hm.com/assets/hm/fc/38/fc382304cef5c2a33a9a5a1b3c8cfc2e2c056f8b.jpg
    
    Product Name: Regular Fit Nylon Cargo Shorts
    Image URL: https://image.hm.com/assets/hm/a3/7d/a37db012d5f826763fece602dab3c7d44d8911c0.jpg
    
    Product Name: Loose Fit T-shirt
    Image URL: https://image.hm.com/assets/hm/e3/56/e3568a1492d1a9149da0401120fd82357a020eb0.jpg
    
    Product Name: Loose Jeans
    Image URL: https://image.hm.com/assets/hm/2c/77/2c77a9ff7cf1bc0cd4f2c2c94c23cff06ea3d555.jpg
    
    Product Name: Swim Shorts
    Image URL: https://image.hm.com/assets/hm/53/e6/53e6dbccb7a06a0d875217791b48d5a4c3c1def7.jpg
    
    Product Name: Regular Fit Chino Shorts
    Image URL: https://image.hm.com/assets/hm/f6/77/f677a6aab0df3447d0d6f6ab3146b1a78a7b5048.jpg
    
    Product Name: Muscle Fit Polo Shirt
    Image URL: https://image.hm.com/assets/hm/20/6f/206fbe107fe2aa85222b7b231e874274bf2421c6.jpg
    
    Product Name: Regular Fit Linen-blend Pants
    Image URL: https://image.hm.com/assets/hm/bb/79/bb79892f3ca98c59acdc959168fa7501c686c057.jpg
    
    Product Name: 5-pack Short Cotton Boxer Shorts
    Image URL: https://image.hm.com/assets/hm/3a/c7/3ac702fb6c64fe556b5033b0656e87bc64a5f921.jpg
    
    Product Name: Regular Fit Cropped Cotton Chinos
    Image URL: https://image.hm.com/assets/hm/94/52/945293e8bca2e00d9498a1250f541e5a372506ad.jpg
    
    Product Name: Regular Fit Linen-blend Shirt
    Image URL: https://image.hm.com/assets/hm/d5/9a/d59a0a9ccd4cc6ddeff0ffec007ae718b82e70fe.jpg
    
    Product Name: Slim Fit Linen Suit Pants
    Image URL: https://image.hm.com/assets/hm/eb/28/eb28c996f3b65e20bdf182e2d082016e61aa469c.jpg
    
    Product Name: Regular Fit Linen-blend Shorts
    Image URL: https://image.hm.com/assets/hm/f8/a8/f8a885cf83338303825afbda849304ba099f2d92.jpg
    
    Product Name: Swim Shorts
    Image URL: https://image.hm.com/assets/hm/31/05/3105df2f9e33d9c2c9665a819c0b55eef54e466a.jpg
    
    Product Name: Patterned Swim Shorts
    Image URL: https://image.hm.com/assets/hm/cb/e6/cbe6dafb3fb3ab98dbf1e502bf1af24bec4f2b1d.jpg
    
    Product Name: Patterned Swim Shorts
    Image URL: https://image.hm.com/assets/hm/74/1a/741a2c7c93a9e266060411d83c0d26435248fa7b.jpg
    
    Product Name: Regular Fit T-shirt
    Image URL: https://image.hm.com/assets/hm/44/42/4442fbac4e3080ec20b2f14e353fea267249b0dd.jpg
    
    Product Name: Regular Fit T-shirt
    Image URL: https://image.hm.com/assets/hm/bd/e4/bde4ef42f917ccb678c4ff1d218520ce2f10ff6d.jpg
    
    Product Name: Regular Fit Sweatshorts
    Image URL: https://image.hm.com/assets/hm/34/54/3454e1358929cdf81bccf06ac6e38372d00807f2.jpg
    
    Product Name: Regular Fit Cotton Shorts
    Image URL: https://image.hm.com/assets/hm/a2/a1/a2a105ee22bf93da3b28deb11f9d408b2b0bff4b.jpg
    
    Product Name: Slim Fit T-shirt
    Image URL: https://image.hm.com/assets/hm/09/58/0958cc08f86b7127b5dd8e0d0091824a337b6588.jpg
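
    The same idea can be sketched as a self-contained function that works on an HTML string and returns an empty list instead of raising when the JSON is missing or shaped differently. This is only a sketch under assumptions: it uses the standard-library HTML parser rather than BeautifulSoup, the names NextDataParser and extract_products are hypothetical, the key path is the one used in the answer above and may change whenever the site changes, and sample_html is made-up test input rather than a real response.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_products(html, base="https://image.hm.com/"):
    """Return (title, image_url) pairs, or [] if the expected JSON is absent."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks)
    if not raw.strip():
        return []
    node = json.loads(raw)
    # Key path taken from the answer above; treated here as an assumption.
    for key in ("props", "pageProps", "plpProps", "productListingProps", "hits"):
        node = node.get(key) if isinstance(node, dict) else None
        if node is None:
            return []
    if not isinstance(node, list):
        return []
    return [
        (hit.get("title"), base + hit["imageProductSrc"])
        for hit in node
        if isinstance(hit, dict) and "imageProductSrc" in hit
    ]


# Made-up sample page that mimics the structure the answer relies on.
sample_html = """<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"plpProps": {"productListingProps": {"hits": [
  {"title": "Baggy Jeans",
   "imageProductSrc": "assets/hm/3d/dd/3ddd1d7ee3dece637a88557a759b3502868b6ccd.jpg"}
]}}}}}
</script></body></html>"""

for name, url in extract_products(sample_html):
    print(f"Product Name: {name}")
    print(f"Image URL: {url}\n")
```

    With a live page, response.text from the question's requests call would be passed in place of sample_html.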
    
        2
  •  0
  •   BSimjoo    2 years ago

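    They are image data embedded in base64 format, with no URL pointing to any external resource. You just need to decode the base64 and save the result in its original format.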
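
    Decoding the exact string from the question's output shows what it contains. The following is a minimal sketch using only the standard library; decode_data_uri and gif_dimensions are hypothetical helper names, not part of any library. For this particular string the decoded bytes are a 1x1 pixel GIF, which suggests a lazy-loading placeholder rather than the product photo itself.

```python
import base64
import struct


def decode_data_uri(uri):
    """Split a base64 data: URI into (mime_type, raw_bytes)."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    mime_type = header[len("data:"):-len(";base64")]
    # Re-add any '=' padding that may have been stripped from the payload.
    payload += "=" * (-len(payload) % 4)
    return mime_type, base64.b64decode(payload)


def gif_dimensions(data):
    """Read (width, height) from a GIF header.

    Both are little-endian 16-bit integers stored right after the
    6-byte signature.
    """
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF")
    return struct.unpack("<HH", data[6:10])


# The value printed for most products in the question's output.
placeholder = (
    "data:image/gif;base64,"
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)

mime_type, data = decode_data_uri(placeholder)
width, height = gif_dimensions(data)
print(mime_type, width, height)  # prints: image/gif 1 1
```

    Writing data to a file opened in 'wb' mode would save it in its original format, but for this string the file is only the single-pixel image.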