
Extracting the required values from text below a matching header string

python-re text beautifulsoup string python


mark · Tech Community · 2 years ago

I have the following string.

*******************************************************************************
*                                                                             *
*                         int. normalized  values  of  :                      *
*                         ---------------------------                         *
*                      % of irradiance at ground level                        *
*     % of direct  irr.    % of diffuse irr.    % of enviro. irr              *
*               0.488               0.418               0.093                 *
*                       reflectance at satellite level                        *
*     atm. intrin. ref.   background  ref.  pixel  reflectance                *
*               0.127               0.146               0.170                 *
*                                                                             *
*                         int. absolute values of                             *
*                         -----------------------                             *
*                      irr. at ground level (w/m2/mic)                        *
*     direct solar irr.    atm. diffuse irr.    environment  irr              *
*             592.299             507.010             113.283                 *
*                      rad at satel. level (w/m2/sr/mic)                      *
*     atm. intrin. rad.    background  rad.    pixel  radiance                *
*              58.837              67.355              78.685                 *
*                                                                             *
*                                                                             *
*                      sol. spect (in w/m2/mic)                               *
*                                2054.457                                     *
*                                                                             *
*******************************************************************************

I am trying to extract the values corresponding to "direct solar irr.", "atm. diffuse irr." and "environment irr".

import re

def extract_values(text):
    pattern = r"direct solar irr\.\s*atm. diffuse irr\.\s*environment irr\s*([\d\.]+)\s*([\d\.]+)\s*([\d\.]+)"
    match = re.search(pattern, text)
    if match:
       return {
            "direct solar irr.": match.group(1),
            "atm. diffuse irr.": match.group(2),
            "environment irr.": match.group(3)
        }

But it does not produce any result.

Can anyone help me?

Edit: I also tried using BeautifulSoup:

import re
from bs4 import BeautifulSoup

def extract_values(text):
    soup = BeautifulSoup(text, 'html.parser')

    # Get all text elements
    lines = [line.strip() for line in soup.get_text().splitlines() if line.strip() != ""]

    # Identify the line after the "direct solar irr." label
    for i, line in enumerate(lines):
        if "direct solar irr." in line:
            # Look for the next line with a number
            for subsequent_line in lines[i+1:]:
                if re.search(r'\d', subsequent_line):  # Check if the line has a digit
                    values = subsequent_line.split()
                    return {
                        "direct solar irr.": float(values[0]),
                        "atm. diffuse irr.": float(values[1]),
                        "environment irr.": float(values[2])
                    }

direct_solar_irr = extract_values(text)

    "direct solar irr.": float(values[0]),
ValueError: could not convert string to float: '*'


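chrisfang 2 years ago

Here is the regex pattern I came up with. It works on this example, but I am not sure whether it applies to your specific scenario. This is really a regular-expression matching problem rather than a Python code problem.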

text = """
*******************************************************************************
*                                                                             *
*                         int. normalized  values  of  :                      *
*                         ---------------------------                         *
*                      % of irradiance at ground level                        *
*     % of direct  irr.    % of diffuse irr.    % of enviro. irr              *
*               0.488               0.418               0.093                 *
*                       reflectance at satellite level                        *
*     atm. intrin. ref.   background  ref.  pixel  reflectance                *
*               0.127               0.146               0.170                 *
*                                                                             *
*                         int. absolute values of                             *
*                         -----------------------                             *
*                      irr. at ground level (w/m2/mic)                        *
*     direct solar irr.    atm. diffuse irr.    environment  irr              *
*             592.299             507.010             113.283                 *
*                      rad at satel. level (w/m2/sr/mic)                      *
*     atm. intrin. rad.    background  rad.    pixel  radiance                *
*              58.837              67.355              78.685                 *
*                                                                             *
*                                                                             *
*                      sol. spect (in w/m2/mic)                               *
*                                2054.457                                     *
*                                                                             *
*******************************************************************************
"""

import re

def extract_values(text):
    # The original pattern failed for two reasons: the header contains two
    # spaces in "environment  irr", and the values sit on the next line,
    # separated from the header by "*", a newline and another "*".
    pattern = r"direct solar irr\.\s*atm\. diffuse irr\.\s*environment  irr.*\n.*?\s*([\d\.]+)\s*([\d\.]+)\s*([\d\.]+)"
    match = re.search(pattern, text)
    if match:
        return {
            "direct solar irr.": match.group(1),
            "atm. diffuse irr.": match.group(2),
            "environment irr.": match.group(3)
        }


if __name__ == '__main__':
    data = extract_values(text)
    print(data)
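
A possible hardening of the pattern above, as a sketch only. It assumes the header words may be separated by any amount of whitespace and that the values always sit on the very next line. The names `extract_irradiance`, `PATTERN` and the shortened `sample` string are made up for illustration and are not part of the original answer.

```python
import re

# Shortened, hypothetical sample: just the header line and the value line.
sample = (
    "*     direct solar irr.    atm. diffuse irr.    environment  irr              *\n"
    "*             592.299             507.010             113.283                 *\n"
)

# \s+ between words tolerates one or more whitespace characters;
# [^\n]*\n skips the rest of the header line;
# \*\s* skips the leading asterisk and padding of the value line.
PATTERN = re.compile(
    r"direct\s+solar\s+irr\.\s+atm\.\s+diffuse\s+irr\.\s+environment\s+irr"
    r"[^\n]*\n\*\s*([\d.]+)\s+([\d.]+)\s+([\d.]+)"
)

def extract_irradiance(text):
    """Return the three values as floats, or None if the header is not found."""
    match = PATTERN.search(text)
    if match is None:
        return None
    direct, diffuse, environment = (float(g) for g in match.groups())
    return {
        "direct solar irr.": direct,
        "atm. diffuse irr.": diffuse,
        "environment irr.": environment,
    }

print(extract_irradiance(sample))
```

Returning floats instead of strings saves the caller a conversion step, and returning `None` on a miss keeps the behaviour of the function above.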

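You need to find the rule first: is your data guaranteed to sit on the line directly below this header string? If it is, it can also be extracted line by line, as follows. This works even for a long multi-line text; the only thing left to consider is performance, and judging that would need a more precise description of your scenario.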

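lines = text.split('\n')
# Index of the header line; the values are expected on the line right after it
data_index = [i for i, line in enumerate(lines) if 'direct solar irr.    atm. diffuse irr.    environment  irr' in line]
data_index = data_index[0] if data_index else None
if data_index is None:
    raise ValueError("header line not found")
# Drop the surrounding "*" border and padding, then split on whitespace
value_line = lines[data_index+1].strip().strip("*").strip()
for v in value_line.split():
    print(v)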
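
The line-by-line idea can also be wrapped in a small function. The following is a minimal sketch, assuming the values are always on the line directly below the label; `values_below`, its `label` parameter and the shortened `sample` are hypothetical names introduced here. Collapsing whitespace before comparing means the exact column spacing of the header no longer matters.

```python
# Shortened, hypothetical sample: just the header line and the value line.
sample = (
    "*     direct solar irr.    atm. diffuse irr.    environment  irr              *\n"
    "*             592.299             507.010             113.283                 *\n"
)

def values_below(text, label="direct solar irr."):
    """Return the floats on the line right below the first line containing `label`."""
    lines = text.splitlines()
    # lines[:-1] guarantees that a following line exists for every candidate.
    for i, line in enumerate(lines[:-1]):
        # Collapse runs of whitespace so spacing differences do not matter.
        if label in " ".join(line.split()):
            # Strip the "*" border and padding, then split into cells.
            cells = lines[i + 1].strip("* \t").split()
            return [float(c) for c in cells]
    raise ValueError(f"label not found: {label!r}")

print(values_below(sample))
```

Stripping the border characters before splitting also avoids the `ValueError` from the question, where the first token of the value line was the leading `*`.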


Xukrao 2 years ago

Use the following regex pattern:

pattern = r"direct solar irr\.\s*atm. diffuse irr\.\s*environment  irr\s*\*\n\*\s*([\d\.]+)\s*([\d\.]+)\s*([\d\.]+)"

See the regex demo here for a detailed breakdown.
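
A quick way to try this pattern locally, as a sketch. The `sample` string is a shortened, hypothetical excerpt containing only the header line and the value line, and the conversion to `float` is an addition for illustration.

```python
import re

# Pattern copied from the answer above.
pattern = r"direct solar irr\.\s*atm. diffuse irr\.\s*environment  irr\s*\*\n\*\s*([\d\.]+)\s*([\d\.]+)\s*([\d\.]+)"

# Shortened, hypothetical sample: header line ends in "*" directly before "\n".
sample = (
    "*     direct solar irr.    atm. diffuse irr.    environment  irr              *\n"
    "*             592.299             507.010             113.283                 *\n"
)

match = re.search(pattern, sample)
# Convert the captured strings to floats; empty list if nothing matched.
values = [float(v) for v in match.groups()] if match else []
print(values)

# The pattern requires "*" to be followed immediately by "\n", so the same
# text with Windows line endings ("\r\n") does not match.
crlf_match = re.search(pattern, sample.replace("\n", "\r\n"))
print(crlf_match)
```

Because the pattern is anchored on the `*`, newline, `*` border, it is strict about the layout: trailing whitespace after the closing `*` or `\r\n` line endings would need to be normalized first, for example with `text.replace("\r\n", "\n")`.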