Extract URLs and names from an HTML file stored on disk and print them separately - Python

extract html-parsing python

Yannis Dran · Tech Community · 8 years ago

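I am trying to extract and print URLs and their names (the NAME part of <a href='url' title='smth'>NAME</a>) from an HTML file saved on disk, without using BeautifulSoup or any other library. Just beginner-level Python code. The desired output format is: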

http://..filepath/filename.pdf
File's Name
so on...

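I am able to extract and print all the URLs, or all the names, on their own. What I cannot do is collect the names, which appear a little later in the markup (just before the closing tag), and print each one below its URL. My code is getting messy and I am quite stuck. This is my code so far: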

import os
with open (os.path.expanduser('~/SomeFolder/page.html'),'r') as html:
    txt = html.read()
# for urls
nolp = 0
urlarrow = []
while nolp == 0:
    pos = txt.find("href")
    if pos >= 0:
      txtcount = len(txt)
      txt = txt[pos:txtcount]
      pos = txt.find('"')
      txtcount = len(txt)
      txt = txt[pos+1:txtcount]
      pos = txt.find('"')
      url = txt[0:pos]
      if url.startswith("http") and url.endswith("pdf"):
          urlarrow.append(url)
    else:
      nolp = 1
for item in urlarrow:
  print(item)

# for names
# (almost identical code to the above)

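How can I make this work? I need to combine the two parts into one function or definition, but how? P.S. I have posted an answer below, but I think there may be a simpler solution.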

1 answer | up to 8 years ago

Yannis Dran · 8 years ago

This produces the correct output I need, but I am sure there is a better way.

import os
with open(os.path.expanduser('~/SomeFolder/page.html'), 'r') as html:
    txt = html.read()
    text = txt
#for urls    
nolp = 0
urlarrow = []
while nolp == 0:
    pos = txt.find("href")
    if pos >= 0:
      txtcount = len(txt)
      txt = txt[pos:txtcount]
      pos = txt.find('"')
      txtcount = len(txt)
      txt = txt[pos+1:txtcount]
      pos = txt.find('"')
      url = txt[0:pos]
      if url.startswith("http") and url.endswith("pdf"):
          urlarrow.append(url)
    else:
      nolp = 1

with open (os.path.expanduser('~/SomeFolder/page.html'),'r') as html:
    text = html.read()

#for names  
noloop = 0
namearrow = []
while noloop == 0:
    posB = text.find("title")
    if posB >= 0:
      textcount = len(text)
      text = text[posB:textcount]
      posB = text.find('"')
      textcount = len(text)
      text = text[posB+19:textcount] #because string starts 19 chars after the posB
      posB = text.find('</')
      name = text[1:posB]
      if text[0].startswith('>'):
          namearrow.append(name)
    else:
      noloop = 1

fullarrow = []
for pair in zip(urlarrow, namearrow):
    for item in pair:
        fullarrow.append(item)
for instance in fullarrow:
    print(instance)

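One possible way to merge the two loops into a single function is sketched below. It is only a sketch under assumptions: every link looks like <a href="..." title="...">NAME</a>, the href value is quoted, and no attribute value contains a '>' character. The names extract_pdf_links and print_links are hypothetical and not from the post. Only str.find and slicing are used, in keeping with the no-libraries constraint, and the fixed 19-character offset is replaced by searching for the end of the opening tag.

```python
import os


def extract_pdf_links(html_text):
    """Return (url, name) pairs for each <a href="...">name</a> whose URL
    starts with "http" and ends with "pdf" (the same filter as the post)."""
    pairs = []
    pos = 0
    while True:
        start = html_text.find("<a ", pos)        # next opening anchor tag
        if start == -1:
            break
        tag_end = html_text.find(">", start)      # end of the opening tag
        if tag_end == -1:
            break
        close = html_text.find("</a>", tag_end)   # matching closing tag
        if close == -1:
            break
        tag = html_text[start:tag_end]            # e.g. <a href="..." title="..."
        name = html_text[tag_end + 1:close].strip()
        pos = close + len("</a>")                 # continue after this link

        href = tag.find("href=")
        if href == -1:
            continue
        value_start = href + len("href=")
        quote = tag[value_start:value_start + 1]  # either " or '
        if quote not in ('"', "'"):
            continue                              # unquoted href: skipped here
        value_end = tag.find(quote, value_start + 1)
        if value_end == -1:
            continue
        url = tag[value_start + 1:value_end]
        if url.startswith("http") and url.endswith("pdf"):
            pairs.append((url, name))
    return pairs


def print_links(path):
    """Print each URL followed by its name, one per line."""
    with open(os.path.expanduser(path), "r") as html:
        pairs = extract_pdf_links(html.read())
    for url, name in pairs:
        print(url)
        print(name)
```

Calling print_links('~/SomeFolder/page.html') would then print each URL with its name on the following line. Because each URL and name come from the same tag, the pairs stay aligned even when some links are filtered out, which the two separate lists joined with zip cannot guarantee.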
