
Extract the URLs and names from an HTML file stored on disk and print them separately - Python

  •  -1
  • Yannis Dran  ·  8 years ago

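    I am trying to extract and print the URLs and their names (the NAME in <a href='url' title='smth'>NAME</a>) found in an HTML file saved on disk, without using BeautifulSoup or any other library. Just beginner-level Python code. The output should look like this: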

    http://..filepath/filename.pdf
    File's Name
    so on...
    

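    I can extract and print all the URLs, or all the names, separately, but I cannot get each name collected alongside its tag's URL and printed below that URL. My code has become messy and I am pretty stuck. This is my code so far: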

    import os
    with open (os.path.expanduser('~/SomeFolder/page.html'),'r') as html:
        txt = html.read()
    # for urls
    nolp = 0
    urlarrow = []
    while nolp == 0:
        pos = txt.find("href")
        if pos >= 0:
          txtcount = len(txt)
          txt = txt[pos:txtcount]
          pos = txt.find('"')
          txtcount = len(txt)
          txt = txt[pos+1:txtcount]
          pos = txt.find('"')
          url = txt[0:pos]
          if url.startswith("http") and url.endswith("pdf"):
              urlarrow.append(url)
        else:
          nolp = 1
    for item in urlarrow:
      print(item)
    
    # for names
    # (almost identical code to the above)
    

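    How can I make this work? I need to combine the two parts into one function or definition, but how? P.S. I have posted an answer below, but I think there may be a simpler solution.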

    1 answer  |  as of 8 years ago
        1
  •  0
  •   Yannis Dran    8 years ago

    This produces the correct output I need, but I believe there is a better way.

    import os
    with open(os.path.expanduser('~/SomeFolder/page.html'), 'r') as html:
        txt = html.read()
        text = txt  # second reference to the same contents, consumed by the name loop
    #for urls    
    nolp = 0
    urlarrow = []
    while nolp == 0:
        pos = txt.find("href")
        if pos >= 0:
          txtcount = len(txt)
          txt = txt[pos:txtcount]
          pos = txt.find('"')
          txtcount = len(txt)
          txt = txt[pos+1:txtcount]
          pos = txt.find('"')
          url = txt[0:pos]
          if url.startswith("http") and url.endswith("pdf"):
              urlarrow.append(url)
        else:
          nolp = 1
    
    # text already holds the full page contents (copied from txt above), so no second read is needed
    
    #for names  
    noloop = 0
    namearrow = []
    while noloop == 0:
        posB = text.find("title")
        if posB >= 0:
          textcount = len(text)
          text = text[posB:textcount]
          posB = text.find('"')
          textcount = len(text)
          text = text[posB+19:textcount]  # on this page the name starts 19 chars after posB
          posB = text.find('</')
          name = text[1:posB]
          if text.startswith('>'):  # also safe when text is empty
              namearrow.append(name)
        else:
          noloop = 1
    
    fullarrow = []
    for pair in zip(urlarrow, namearrow):
        for item in pair:
            fullarrow.append(item)
    for instance in fullarrow:
        print(instance)
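
    One possible way to merge the two loops into a single function is sketched below. It still uses only str.find and slicing, with no libraries. The function name extract_links and the sample string are made up for illustration. The sketch assumes double-quoted href values (as the code above does), anchors that open with "<a " and close with a plain "</a>", and no ">" character inside attribute values. It walks the text once and pairs each URL with the text of the same anchor, so no fixed character offset is needed.

```python
def extract_links(html_text):
    """Return a list of (url, name) pairs for <a href="...">NAME</a> anchors.

    Assumptions (not guaranteed for arbitrary HTML): href values are
    double-quoted, and attribute values contain no '>' character.
    Only URLs that start with "http" and end with "pdf" are kept,
    mirroring the filter used in the original loops.
    """
    pairs = []
    pos = 0
    while True:
        start = html_text.find("<a ", pos)
        if start == -1:
            break  # no more anchors
        tag_end = html_text.find(">", start)
        if tag_end == -1:
            break  # malformed tail: opening tag never closes
        close = html_text.find("</a>", tag_end)
        if close == -1:
            break  # malformed tail: anchor never closes

        tag = html_text[start:tag_end]                # e.g. <a href="..." title="..."
        name = html_text[tag_end + 1:close].strip()   # text between > and </a>
        pos = close + len("</a>")                     # continue after this anchor

        href = tag.find('href="')
        if href == -1:
            continue  # anchor without a double-quoted href
        url_start = href + len('href="')
        url_end = tag.find('"', url_start)
        if url_end == -1:
            continue  # unterminated href value
        url = tag[url_start:url_end]

        if url.startswith("http") and url.endswith("pdf"):
            pairs.append((url, name))
    return pairs


# Hypothetical sample standing in for the contents of the saved page.
sample = (
    '<p><a href="http://example.com/a.pdf" title="A">First File</a> '
    '<a href="/local.pdf">skipped</a> '
    '<a title="B" href="http://example.com/b.pdf">Second File</a></p>'
)

for url, name in extract_links(sample):
    print(url)
    print(name)
```

    To run it on the saved page, read the file inside a with block, as in the code above, and pass the resulting string to extract_links instead of the sample. Because each name is taken from the same anchor as its URL, the two can never drift out of step the way two separately built lists can.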