
How do I collect all hrefs with XPath? Selenium - Python

  • johns7843  · Tech Community  · 3 years ago

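    I am trying to collect all five social media links for the artist in this example. At the moment my output contains only the last (fifth) link. I am using Selenium; I know it is not the best tool for gathering this data, but it is all I know right now. Note that I have included only the code relevant to my question. Thanks in advance for any help or insight.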

        import time
        from random import randint

        import pandas as pd
        from selenium import webdriver
        from selenium.webdriver.support.wait import WebDriverWait
        from selenium.webdriver.common.by import By
        from selenium.webdriver.support import expected_conditions as EC
        from selenium.webdriver.chrome.options import Options

        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('disable-infobars')
        chrome_options.add_argument('--disable-extensions')
        chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
        driver = webdriver.Chrome(options=chrome_options)

        # Only the example page is listed here.
        urls = ['https://soundcloud.com/flux-pavilion']

        for url in urls:
            driver.get(url)

            # Crude pause to give the page time to render.
            time.sleep(randint(3, 4))

            try:
                links = driver.find_elements(By.XPATH, '//*[@id="content"]/div/div[4]/div[2]/div/article[1]/div[2]/ul/li//a[@href]')
                for elem in links:
                    socialmedia = elem.get_attribute("href")
            except Exception:
                links = "none"

            artist = {
                'socialmedia': socialmedia,
            }

            print(artist)
    
    1 answer  |  as of 3 years ago
  •   zx485 potemkin    3 years ago

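    The problem is not the XPath expression but the (non-existent) list handling in the code that produces the output.

    Your code outputs only the last item of the list the XPath query returns, because `socialmedia` is overwritten on every pass through the loop. That is why you receive a single link (the last one).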
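
    The effect can be reproduced without a browser. The following is only a minimal sketch: the `hrefs` list is a hypothetical stand-in for the values Selenium would return, and the example.com URLs are placeholders, not real data.

```python
# Hypothetical stand-in for the hrefs Selenium would return; no browser needed.
hrefs = [
    "https://example.com/twitter",
    "https://example.com/instagram",
    "https://example.com/facebook",
]

# Pattern from the question: each pass overwrites the previous value.
for href in hrefs:
    socialmedia = href
print(socialmedia)  # only the last href remains

# Pattern that keeps everything: append each value to a list.
collected = []
for href in hrefs:
    collected.append(href)
print(collected)  # every href, in order
```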

    So change the output part of your code to:

        [...]

        driver.get("https://soundcloud.com/flux-pavilion")
        time.sleep(randint(3, 4))
        artist = []  # collects every href instead of overwriting one variable

        try:
            links = driver.find_elements(By.XPATH, '//*[@id="content"]/div/div[4]/div[2]/div/article[1]/div[2]/ul/li//a[@href]')
            for elem in links:
                artist.append(elem.get_attribute("href"))
        except Exception:
            links = "none"

        # Print each collected link on its own line.
        for link in artist:
            print(link)
    

    The output will contain all the values (links) you want:

    https://gate.sc/?url=https%3A%2F%2Ftwitter.com%2FFluxpavilion&token=da4a8d-1-1653430570528
    https://gate.sc/?url=https%3A%2F%2Finstagram.com%2FFluxpavilion&token=277ea0-1-1653430570529
    https://gate.sc/?url=https%3A%2F%2Ffacebook.com%2FFluxpavilion&token=4c773c-1-1653430570530
    https://gate.sc/?url=https%3A%2F%2Fyoutube.com%2FFluxpavilion&token=1353f7-1-1653430570531
    https://gate.sc/?url=https%3A%2F%2Fopen.spotify.com%2Fartist%2F7muzHifhMdnfN1xncRLOqk%3Fsi%3DbK9XeoW5RxyMlA-W9uVwPw&token=bc2936-1-1653430570532
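
    The links above are gate.sc redirect wrappers. If you want the underlying social media URLs, one possible approach is sketched below. It assumes the real destination is always stored in the `url` query parameter, as the output above suggests; `unwrap_gate_link` is a hypothetical helper name, not part of Selenium or SoundCloud.

```python
from urllib.parse import urlparse, parse_qs


def unwrap_gate_link(link):
    """Return the target stored in the `url` query parameter, or the link itself."""
    query = parse_qs(urlparse(link).query)  # parse_qs also percent-decodes values
    targets = query.get("url")
    return targets[0] if targets else link


# Two of the wrapped links from the output above.
wrapped = [
    "https://gate.sc/?url=https%3A%2F%2Ftwitter.com%2FFluxpavilion&token=da4a8d-1-1653430570528",
    "https://gate.sc/?url=https%3A%2F%2Finstagram.com%2FFluxpavilion&token=277ea0-1-1653430570529",
]

# Store the whole list under one key, mirroring the dict from the question.
artist = {"socialmedia": [unwrap_gate_link(link) for link in wrapped]}
print(artist)
```

    Links without a `url` parameter are returned unchanged, so the helper can be applied to every collected href.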