
Finding URLs that contain a fixed string using Python + Selenium + Beautiful Soup

  • mikezang  · 7 years ago

    I have some image URLs, as in the tags below:

    images = 
    <img class="wni-logo" src="https://smtgvs.weathernews.jp/s/topics/img/wnilogo_kana@2x.png"/>
    <img alt="top" id="top_img" src="//smtgvs.weathernews.jp/s/topics/img/201808/201808170115_top_img_A.jpg?1534474260" style="width: 100%;"/>
    <img alt="box0" id="box_img0" src="//smtgvs.weathernews.jp/s/topics/img/201808/201808170115_box_img0_A.png?1534474573" style="width:100%"/>
    <img alt="box1" class="lazy" data-original="https://smtgvs.weathernews.jp" id="box_img1" src="https://smtgvs.weathernews.jp/s/topics/img/dummy.png" style="width: 100%; display: none;"/>
    <img alt="recommend thumb0" height="70" src="https://smtgvs.weathernews.jp/s/topics/thumb/article/201808080245_top_img_A_320x240.jpg?1534473603" width="100px"/>
    

    The output I want is:

    ['https://smtgvs.weathernews.jp/s/topics/img/201808/201808170115_top_img_A.jpg']
    ['https://smtgvs.weathernews.jp/s/topics/img/201808/201808170115_box_img0_A.png']
    

    I tried this code:

    for image in images:
        imageURL = re.findall('https://smtgvs.weathernews.jp/s/topics/img/.+', urljoin(baseURL, image['src']))
    
        if imageURL:
            print(imageURL)
    

    I got the result below. Can you help me correct it?

    ['https://smtgvs.weathernews.jp/s/topics/img/201808/201808170115_top_img_A.jpg?1534474260']
    ['https://smtgvs.weathernews.jp/s/topics/img/201808/201808170115_box_img0_A.png?1534474573']
    ['https://smtgvs.weathernews.jp/s/topics/img/dummy.png']
    
    1 Answer

  • Erwan  · 7 years ago

    You can change the regex to use a capturing group directly:

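    import re
    from urllib.parse import urljoin

    # images and baseURL are the variables from the question
    for image in images:
        # Group 1 captures the URL up to, but not including, the ?digits query string
        imageURL = re.findall(r"(https://smtgvs\.weathernews\.jp/s/topics/img/[0-9]+/.+)\?[0-9]+", urljoin(baseURL, image['src']))
    
        if imageURL:
            print(imageURL)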
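
    To see how the capturing group behaves, here is a minimal, self-contained sketch that runs the pattern against the src values from the question. The value of BASE_URL and the helper name dated_image_urls are assumptions for illustration; the original post does not show what baseURL is set to. It uses re.fullmatch rather than re.findall so that each URL either matches as a whole or is skipped.

```python
import re
from urllib.parse import urljoin

# Assumed base URL; the question does not show the value of baseURL.
BASE_URL = "https://smtgvs.weathernews.jp/s/topics/"

# src values copied from the <img> tags in the question.
SRCS = [
    "https://smtgvs.weathernews.jp/s/topics/img/wnilogo_kana@2x.png",
    "//smtgvs.weathernews.jp/s/topics/img/201808/201808170115_top_img_A.jpg?1534474260",
    "//smtgvs.weathernews.jp/s/topics/img/201808/201808170115_box_img0_A.png?1534474573",
    "https://smtgvs.weathernews.jp/s/topics/img/dummy.png",
    "https://smtgvs.weathernews.jp/s/topics/thumb/article/201808080245_top_img_A_320x240.jpg?1534473603",
]

# Group 1 is the URL without the query string. The dots are escaped so they
# match literally, and [0-9]+/ requires the dated directory (e.g. 201808/).
PATTERN = re.compile(
    r"(https://smtgvs\.weathernews\.jp/s/topics/img/[0-9]+/.+)\?[0-9]+"
)


def dated_image_urls(srcs, base_url=BASE_URL):
    """Return matching absolute URLs with the trailing ?digits removed."""
    urls = []
    for src in srcs:
        # urljoin turns scheme-relative //host/path values into https:// URLs.
        match = PATTERN.fullmatch(urljoin(base_url, src))
        if match:
            urls.append(match.group(1))
    return urls


for url in dated_image_urls(SRCS):
    print(url)
```

    The logo and dummy.png are skipped because no numeric directory follows img/, and the thumbnail is skipped because its path is under thumb/ rather than img/.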
    

    Edit: to read the data-original attribute instead of the src field:

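    from bs4 import BeautifulSoup

    # html_doc: the page's HTML source as a string
    soup = BeautifulSoup(html_doc, 'html.parser')
    for image in soup.find_all("img"):
        print(image.get("data-original"))  # None if the attribute is absent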
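
    Because only lazily loaded images carry data-original, one possible approach is to fall back to src when it is missing. The sketch below assumes that fallback is what you want; the markup is modeled on the tags in the question, but the data-original path and the helper name image_sources are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the question's tags; the data-original
# path is invented for this example.
HTML_DOC = """
<img id="top_img" src="//smtgvs.weathernews.jp/s/topics/img/201808/201808170115_top_img_A.jpg?1534474260"/>
<img class="lazy" id="box_img1" data-original="//smtgvs.weathernews.jp/s/topics/img/201808/example_box_img1.png?1" src="https://smtgvs.weathernews.jp/s/topics/img/dummy.png"/>
"""


def image_sources(html_doc):
    """Return one source per <img>, preferring data-original over src."""
    soup = BeautifulSoup(html_doc, "html.parser")
    sources = []
    for image in soup.find_all("img"):
        # Tag.get returns None when the attribute is absent, so `or`
        # falls through to src for images that are not lazily loaded.
        source = image.get("data-original") or image.get("src")
        if source:
            sources.append(source)
    return sources


for source in image_sources(HTML_DOC):
    print(source)
```

    The returned values can then be passed through urljoin and the regex from the first snippet in the same way as the src values.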
    