How do I get a specific part of a URL when using urlparse in Python?

urlparse python-re urllib2 python-3.x python

boyenec · Tech Community · 3 years ago

I have a URL like this:

url = 'https://grabagun.com/firearms/handguns/semi-automatic-handguns/glock-19-gen-5-polished-nickel-9mm-4-02-inch-barrel-15-rounds-exclusive.html'

When I use the urlparse() function, I get the following result:

>>> url = urlparse(url) 
>>> url.path
'/firearms/handguns/semi-automatic-handguns/glock-19-gen-5-polished-nickel-9mm-4-02-inch-barrel-15-rounds-exclusive.html'

Is it possible to get something like this:

path1 = "firearms"
path2 = "handguns"
path3 = "semi-automatic-handguns"

I don't want any segment that ends with ".html".

4 replies | up to 3 years ago

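Bhargav 3 years ago

Some of your paths use a single / and some use a double //. If you want to apply this directly to the full URL, first replace them all with the same separator. For url.path you can do this directly: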

url = '/firearms/handguns/semi-automatic-handguns/glock-19-gen-5-polished-nickel-9mm-4-02-inch-barrel-15-rounds-exclusive.html'

url = url.split('/')
url = list(filter(None, url))  # remove empty elements
url.pop()  # drop the last element (the .html page)
print(url)

Output list:

['firearms', 'handguns', 'semi-automatic-handguns']
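The split, filter, and drop-last steps above can be wrapped in a small helper that starts from the full URL. This is only a sketch: the function name dir_segments and the doubled-slash URL are made up for illustration, and it assumes the last segment is always the page you want to discard.

```python
from urllib.parse import urlparse

def dir_segments(url):
    """Return the non-empty path segments of url, minus the final one."""
    # filter(None, ...) discards the empty strings that "//" and the leading "/" produce
    segments = list(filter(None, urlparse(url).path.split('/')))
    # slicing, unlike pop(), does not raise on an empty list
    return segments[:-1]

print(dir_segments('https://grabagun.com//firearms//handguns/glock.html'))
# ['firearms', 'handguns']
```

Using a slice instead of pop() means a URL with no path at all simply returns an empty list.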

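Part 2

If you want them as individual variables, iterate over the list and create the variables:

# start=1 so that the first segment becomes path1 rather than path0
for n, val in enumerate(url, start=1):
    globals()["path%d" % n] = val

print(path1)

Output:

firearms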
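Writing into globals() works, but a dictionary keyed by the same names is usually easier to inspect and does not touch the module namespace. A minimal sketch of that alternative, where the names segments and paths are illustrative and not part of the answer above:

```python
segments = ['firearms', 'handguns', 'semi-automatic-handguns']

# build {"path1": ..., "path2": ...}; start=1 matches the numbering in the question
paths = {"path%d" % n: val for n, val in enumerate(segments, start=1)}

print(paths["path1"])
# firearms
```

The dictionary also copes with URLs of different depth, since you can check which keys exist before reading them.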

arielkaluzhny 3 years ago

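# url is the result of urlparse(); skip the empty strings produced by the leading "/"
path_list = [part for part in url.path.split('/') if part]

# drop the last entry if it is the .html page
if path_list and ".html" in path_list[-1]:
    path_list = path_list[:-1]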

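This gives you a list with each part of the path as an entry, and leaves out the last one if it contains ".html".

You can adapt this depending on exactly what you want and on how specific or general your use case is.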
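As one example of a more general variant, the check can look for any file extension instead of just ".html". The sketch below uses posixpath.splitext for that; the function name directory_parts and the example.com URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

def directory_parts(url):
    """Return path segments, dropping a final segment that has a file extension."""
    parts = [p for p in urlparse(url).path.split('/') if p]
    # splitext returns (root, ext); ext is '' when the segment has no extension
    if parts and posixpath.splitext(parts[-1])[1]:
        parts = parts[:-1]
    return parts

print(directory_parts('https://example.com/a/b/index.php'))
# ['a', 'b']
print(directory_parts('https://example.com/a/b'))
# ['a', 'b']
```

One caveat: a final directory whose name contains a dot, such as "v1.2", would also be treated as a file and dropped, so this is only suitable when directory names have no dots.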

imxitiz HANY Gh 3 years ago

A one-line solution to your problem could be:

path=urlparse(url).path[1:]

splittedpath=[sp for sp in path.split("/") if not sp.endswith(".html")]
"""
['firearms', 'handguns', 'semi-automatic-handguns']
"""

You can access the items like this:

print(splittedpath[0]) # 0,1,2... 
# firearms

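What we do here is remove the first character of the path, which is "/", with urlparse(url).path[1:], then split the path string at every "/" with .split("/"), and check whether each resulting piece ends with ".html"; if it does not, we keep it.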
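Because the comprehension tests every piece, it removes any segment ending in ".html", not only the last one. The sketch below contrasts the two behaviours on a made-up URL; the variable names every and last_only are illustrative.

```python
from urllib.parse import urlparse

url = "https://example.com/docs/old.html/archive/page.html"
segments = urlparse(url).path[1:].split("/")

# filter applied to every segment, as in the one-liner above
every = [sp for sp in segments if not sp.endswith(".html")]

# filter applied to the final segment only
last_only = segments[:-1] if segments[-1].endswith(".html") else segments

print(every)
# ['docs', 'archive']
print(last_only)
# ['docs', 'old.html', 'archive']
```

For the URL in the question both forms give the same list, so the choice only matters if an intermediate segment can end in ".html".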

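dsds 3 years ago

Yes, you can extract the individual path components of a URL like this with the urlparse function from Python's urllib.parse module.

Here is one way to do it: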

from urllib.parse import urlparse

url = 'https://grabagun.com/firearms/handguns/semi-automatic-handguns/glock-19-gen-5-polished-nickel-9mm-4-02-inch-barrel-15-rounds-exclusive.html'

parsed_url = urlparse(url)

path = parsed_url.path

path_components = path.split('/')

# remove the empty string at the beginning of the list
path_components = path_components[1:]

# remove the last element if it ends with '.html'
if path_components[-1].endswith('.html'):
    path_components = path_components[:-1]

print(path_components)
# Output: ['firearms', 'handguns', 'semi-automatic-handguns']

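This code first parses the URL with urlparse, then splits the path into its components with the split method. It removes the empty string at the beginning of the list, and removes the last element if it ends with ".html". The resulting list contains the individual path components of the URL.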
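One edge case in the steps above: for a URL with no path, the list is empty after the leading empty string is removed, and indexing it with [-1] raises IndexError. A hedged sketch of a guarded version follows; the function name path_components_of is hypothetical.

```python
from urllib.parse import urlparse

def path_components_of(url):
    """Split the URL path into components, dropping a trailing .html page."""
    # [1:] removes the empty string that precedes the leading "/"
    components = urlparse(url).path.split('/')[1:]
    # the truthiness check avoids IndexError when the URL has no path
    if components and components[-1].endswith('.html'):
        components = components[:-1]
    return components

print(path_components_of('https://grabagun.com/firearms/handguns/glock.html'))
# ['firearms', 'handguns']
print(path_components_of('https://grabagun.com'))
# []
```

A path with a trailing slash still leaves an empty string as the last component, so filter empty entries as well if your URLs can end in "/".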

bener07 3 years ago

You can put them in a list by splitting on /

url.path.split('/')

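If you want them in path1, path2, and so on, you can assign the values from the list to variables. The path begins with "/", so the first item of the split is an empty string and the slice starts at index 1.

path1, path2, path3 = url.path.split('/')[1:4]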

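I used the slice only to get the first three values. If you don't want the text ending in .html, you can find the index of the last value and use it in the list slice, as below.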

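paths = url.path.split('/')
# default slice end: len(paths) keeps the whole list when there is no .html entry
html_text_index = len(paths)
if '.html' in paths[-1]:
    # the .html entry is the last one, so stop the slice just before it
    html_text_index = len(paths) - 1
no_html_paths = paths[:html_text_index]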
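Unpacking into exactly three names fails with ValueError when the URL has fewer than three segments. One possible workaround, sketched here with a hypothetical helper named first_three, is to pad the list with None before unpacking.

```python
from urllib.parse import urlparse

def first_three(url):
    """Return the first three non-.html path segments, padded with None."""
    parts = [p for p in urlparse(url).path.split('/')
             if p and not p.endswith('.html')]
    # pad to at least three items, then cut to exactly three
    return tuple((parts + [None] * 3)[:3])

path1, path2, path3 = first_three('https://grabagun.com/firearms/handguns/glock.html')
print(path1, path2, path3)
# firearms handguns None
```

Whether None is the right filler depends on how the variables are used afterwards; an empty string would work the same way.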