
Getting the original link and title from a Facebook post

  •  0
  • aviss  · 7 years ago

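    I need to collect some information that Facebook analytics does not provide, for example the original URL and title of an article that was promoted on Facebook as a link post. This information is buried in the HTML of the Facebook post, but I am having trouble getting at it. I would appreciate your help.

    An example: https://www.facebook.com/bbcnews/posts/10156428513547217

    I identified the class for the link (bbc.in…) as "_6ks" and the class for the title as "mbs _6m6 _2cnj _5s6c".

    The code below returns nothing: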

    from bs4 import BeautifulSoup
    import requests
    link = 'https://www.facebook.com/bbcnews/posts/10156428513547217'
    r = requests.get(link)
    soup = BeautifulSoup(r.content, "lxml")
    for paragraph in soup.find_all("div", class_="_6ks"):
        for a in paragraph("a"):
           print(a.get('href'))
    for paragraph in soup.find_all("div", class_='mbs _6m6 _2cnj _5s6c'):
        for a in paragraph("a"):
           print(a.get('hover'))
    
    2 answers  |  7 years ago
        1
  •  1
  •   robots.txt    7 years ago

    Another way to achieve this is as follows:

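    from bs4 import BeautifulSoup
    import requests

    link = 'https://www.facebook.com/bbcnews/posts/10156428513547217'

    # Send a browser-like User-Agent with the request.
    res = requests.get(link, headers={'User-Agent': 'Mozilla/5.0'})
    # Strip the comment markers so the markup inside them gets parsed.
    html = res.text.replace("-->", "").replace("<!--", "")
    soup = BeautifulSoup(html, "lxml")
    # First <a> inside an element with class "mbs" (the title block).
    item = soup.select_one('.mbs a')
    print(item.get("href") + "\n", item.text)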
    
        2
  •  1
  •   Bitto    7 years ago

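    You get no output because both divs are cleverly placed inside comment tags (<!-- -->), and the parser does not treat the content of a comment as markup. If you print the soup, both divs are there, but inside a comment.

    We can work around this by extracting the comments and building a new soup from them.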
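The effect can be seen in isolation without touching Facebook at all. The sketch below is only an illustration: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs offline, and the markup in `sample`, the `LinkCollector` class and the `collect_links` helper are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag the parser actually sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Made-up markup: the interesting div is wrapped in an HTML comment.
sample = (
    '<div class="wrapper"><!-- <div class="_6ks">'
    '<a href="https://example.com/story">Story</a></div> --></div>'
)

# Parsed as-is, the <a> is just comment text and is never reported as a tag.
before = collect_links(sample)
# With the comment markers stripped, the same markup is parsed normally.
after = collect_links(sample.replace("<!--", "").replace("-->", ""))

print(before)  # []
print(after)   # ['https://example.com/story']
```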

    from bs4 import BeautifulSoup
    from bs4 import Comment
    import requests

    link = 'https://www.facebook.com/bbcnews/posts/10156428513547217'
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0'}
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, "lxml")
    # Collect every HTML comment node in the page.
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    # Parse the first comment as HTML; for this page it holds the post markup.
    soup = BeautifulSoup(comments[0], "lxml")
    # Link: anchors inside the div with class "_6ks".
    for paragraph in soup.find_all("div", class_="_6ks"):
        for a in paragraph("a"):
            print(a.get('href'))
    print('-------------------------------------------------------------------')
    # Title: anchor text inside the div with the exact class string below.
    for paragraph in soup.find_all("div", class_='mbs _6m6 _2cnj _5s6c'):
        for a in paragraph("a"):
            print(a.text)
    

    Output

    https://l.facebook.com/l.php?u=https%3A%2F%2Fbbc.in%2F2FP4EgR&h=AT3jWrl9cgJEY-8NBLgbvOEtDSZ8dBABo4TJaVJ66QBbWdCsBypvAkN6MD7VhJoOgy_LGJeomQAlcwtex_Ab-7TvWXhKkLB1m_TjzxOSk3R2uP8qTUL3aTTj4Pcz2ZSZunWxZsPtOlJSpay_AtQfNTuLTUQ80OrtvRiDMs8duN3b27IH2UPnGThQ_YGJAcYJdPE3R9JbyxSQNhJ8yTmaRJe8pMNbgVkentXU4p3liys2IQvphwRd0V8ANmo-4xvKj1dRADHy3hOyUkcv_L2u8Z4WpLx1AZQCTitvfSLvhQRMZ0cK1vIjkuv3gfurRf250p3D54GxQZIsVLymDzNtLbOnigIuFRHfQFAUSBDzJGTqQB3hs4lilYyFXIqaC2cdXwDp8GDrmYbgRWmEMmN6A5fHDdRlF4m7MXJO0vJ_7uqkh0TAdcvTSc0dqt5Wv3wOoEN5S1b2ddLZOp3DFwApAGkSHsOtW7Pjc-STFljuV045ERsUWUbmnALSl9vxB6tiZ0poa3aGxZqnlFqsaTB-A8plwCWp5ed9JALlurBco447aELbpuRexqoOajxTvS_yW9BdSXaufzpbPFKaNt5go7uf4GjdekpITCApJo2JoAOzzsfKHdg1MXasOCw
    -------------------------------------------------------------------
    MPs put forward rival Brexit plans
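
The href printed above is a Facebook redirect (`l.facebook.com/l.php`), and the article's own address is carried percent-encoded in its `u` query parameter. A minimal sketch for recovering it with the standard library follows; the helper name `original_url` is made up for illustration, and the `h` value in the sample is shortened.

```python
from urllib.parse import urlparse, parse_qs


def original_url(href):
    """Return the target of an l.facebook.com redirect, or href unchanged."""
    parsed = urlparse(href)
    if parsed.netloc == "l.facebook.com" and parsed.path == "/l.php":
        # parse_qs percent-decodes the values and returns a list per key.
        targets = parse_qs(parsed.query).get("u")
        if targets:
            return targets[0]
    return href


# Same link as in the output above, with the long "h" token shortened.
redirect = "https://l.facebook.com/l.php?u=https%3A%2F%2Fbbc.in%2F2FP4EgR&h=AT3jWrl9"
print(original_url(redirect))  # https://bbc.in/2FP4EgR
```

Links that are not redirects are returned untouched, so the helper can be applied to every href the scraper finds.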