Using XPath (in .NET) to select (sibling) nodes between two tags

xpath .net

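John Bartholomew · Tech Community · 15 years ago

I'm using .NET 3.5 (C#) and the HTML Agility Pack to do some web scraping. Some of the fields I need to extract are structured as paragraphs, with components separated by line-break tags. I'd like to be able to select the individual components between the line breaks. Each component may be made up of several elements (i.e. it may not be just a single string). Example: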

<h3>Section title</h3>
<p>
  <b>Component A</b><br />
  Component B <i>includes</i> <strong>multiple elements</strong><br />
  Component C
</p>

I want to select:

<b>Component A</b>

Then:

Component B <i>includes</i> <strong>multiple elements</strong>

Then:

Component C

There may be more (<br /> separated) components.

I can easily get the first component:

p/br[1]/preceding-sibling::node()

I can also easily get the last component:

p/br[2]/following-sibling::node()

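However, I haven't worked out how to extract the set of nodes between two other tags (i.e. nodes that are siblings, but come after node X and before node Y).

The alternative is to scan through the elements manually. If that is the simplest approach then that's what I'll do, but so far I've been impressed by how concise XPath is, so I was hoping there is a way to do this with it as well.

Edit

Since I need to handle cases with more than 3 components, it seems the answer requires multiple XPath calls at the very least, so I'll go ahead with a solution based on that (this is the answer I've "accepted"). AakashM's answer also helped me understand XPath, so I've upvoted it.

Thanks for all your help! Hopefully one day I can return the favour.

Edit 2

The new answer from Dimitre Novatchev, with a bit of tweaking, does work correctly.

The solution: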

int i = 0;
do
{
    yield return para.SelectNodes(String.Format(
        "node()[not(self::br) and count(preceding-sibling::br) = {0}]", i));
    ++i;
} while (para.SelectSingleNode(String.Format("br[{0}]", i)) != null);

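I should note that this loop is somewhat inefficient, because it repeats an XPath query on every iteration to find out whether there are any more br tags. In my case the inefficiency isn't a problem, but bear it in mind if you want to use this answer elsewhere (equally, if you do want to use it in a performance-sensitive situation, you should probably scan manually rather than use XPath).

And the full test code (a modified version of the test code provided by AakashM):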
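For comparison, the manual scan mentioned above might look something like the following. This is only a minimal sketch, not code from the original thread: the class name LineBreakSplitter and the method SplitOnLineBreakManually are hypothetical, and it assumes the br elements are direct children of the paragraph and have no namespace prefix. It walks the child nodes once, so no XPath query is repeated.

```csharp
using System.Collections.Generic;
using System.Xml;

static class LineBreakSplitter
{
    // Hypothetical helper (not from the original post): walks the children of
    // `para` once and starts a new group every time a <br> element is seen.
    public static List<List<XmlNode>> SplitOnLineBreakManually(XmlNode para)
    {
        var groups = new List<List<XmlNode>>();
        var current = new List<XmlNode>();

        foreach (XmlNode child in para.ChildNodes)
        {
            if (child.NodeType == XmlNodeType.Element && child.Name == "br")
            {
                // A <br> closes the current component; it is not added itself.
                groups.Add(current);
                current = new List<XmlNode>();
            }
            else
            {
                current.Add(child);
            }
        }

        // Whatever follows the last <br> (possibly nothing) is the final component.
        groups.Add(current);
        return groups;
    }
}
```

Like the XPath loop, this produces one group more than there are br elements, so a paragraph that ends in a br yields an empty final group.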

using System;
using System.Collections.Generic;
using System.Xml;

namespace TestXPath
{
    class Program
    {
        static void Main(string[] args)
        {
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(@"
<x>
 <h3>Section title</h3>
 <p>
  <b>Component A</b><br />
  Component B <i>includes</i> multiple <strong>elements</strong><br />
  Component C
 </p>
</x>
            ");


            foreach (var nodes in SplitOnLineBreak(doc.SelectSingleNode("x/p")))
            {
                Dump(nodes);
                Console.WriteLine();
            }

            Console.ReadLine();
        }

        private static IEnumerable<XmlNodeList> SplitOnLineBreak(XmlNode para)
        {
            int i = 0;
            do
            {
                yield return para.SelectNodes(String.Format(
                    "node()[not(self::br) and count(preceding-sibling::br) = {0}]", i));
                ++i;
            } while (para.SelectSingleNode(String.Format("br[{0}]", i)) != null);
        }

        private static void Dump(XmlNodeList nodes)
        {
            foreach (XmlNode node in nodes)
            {
                Console.WriteLine(string.Format("-->{0}<---", 
                                  node.OuterXml));                    
            }
        }
    }
}

4 Answers | latest 15 years ago

Dimitre Novatchev 15 years ago

This can be done easily with XPath 2.0, or with XPath 1.0 hosted by XSLT.

With XPath 1.0 hosted by .NET, it can be done in a few steps:

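1. Make the respective "p" node the current node.
2. Find the number of all <br /> children of the current "p" node:

count(br)

3. If N is the count determined in step 2, then for $k in 0 to N do:

3.1 Find all nodes that are preceded by exactly $k <br /> elements:

node()[not(self::br) and count(preceding::br) = $k]

3.2 For each such node found, get its string value.

3.3 Concatenate all the string values obtained in step 3.2. The result of this concatenation is all the text contained in the given paragraph.

Note: to substitute the value that $k stands for in step 3.1, this expression has to be constructed dynamically.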
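The steps above could be sketched in C# roughly as follows. This is an illustration under assumptions, not code from the answer: the names BrComponents and ComponentTexts are made up, and the expression uses preceding-sibling::br (as in the question's tweaked solution) rather than preceding::br, so that br elements in earlier paragraphs are not counted.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;
using System.Xml.XPath;

static class BrComponents
{
    // Hypothetical helper: returns the concatenated string value of each
    // <br>-separated component of `para` (steps 2 to 3.3).
    public static List<string> ComponentTexts(XmlNode para)
    {
        // Step 2: count(br) is an XPath number, which .NET returns as a boxed double.
        XPathNavigator nav = para.CreateNavigator();
        int n = (int)(double)nav.Evaluate("count(br)");

        var result = new List<string>();
        for (int k = 0; k <= n; k++)
        {
            // Step 3.1: build the expression dynamically, substituting k.
            string expr = String.Format(
                "node()[not(self::br) and count(preceding-sibling::br) = {0}]", k);

            // Steps 3.2 and 3.3: take each node's string value and concatenate.
            var text = new StringBuilder();
            foreach (XmlNode node in para.SelectNodes(expr))
            {
                text.Append(node.InnerText);
            }
            result.Add(text.ToString());
        }
        return result;
    }
}
```

Counting the br elements once up front avoids re-querying for the next br on every iteration.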

AakashM 15 years ago

If in your case you always have three "pieces", separated by br elements, you can use this XPath to get the middle "piece":

//node()[preceding::br and following::br]

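This uses the preceding and following axes to return all nodes that lie between two br elements, anywhere in the document.

Edit: this is my test app (please excuse the XmlDocument, I'm still working in .NET 2.0...)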

using System;
using System.Xml;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(@"
<x>
 <h3>Section title</h3>
 <p>
  <b>Component A</b><br />
  Component B <i>includes</i> <strong>multiple elements</strong><br />
  Component C
 </p>
</x>
            ");

            XmlNodeList nodes = doc.SelectNodes(
                "//node()[preceding::br and following::br]");

            Dump(nodes);

            Console.ReadLine();
        }

        private static void Dump(XmlNodeList nodes)
        {
            foreach (XmlNode node in nodes)
            {
                Console.WriteLine(string.Format("-->{0}<---", 
                                  node.OuterXml));                    
            }
        }
    }
}

And this is the output:

-->
      Component B <---
--><i>includes</i><---
-->includes<---
--><strong>multiple elements</strong><---
-->multiple elements<---

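As you can see, you get an XmlNodeList of everything between the br elements.

The way I think of it is: this XPath returns any node, as long as, for that node, the preceding axis contains a br and the following axis contains a br.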
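Because the preceding and following axes are not limited to siblings, the output above lists nested text nodes as well as the elements that contain them. If only the top-level nodes of one paragraph are wanted, a variant using the sibling axes may help. This is a sketch rather than part of the answer above: MiddleComponent and SelectMiddle are hypothetical names, and it assumes the br elements are direct children of the paragraph.

```csharp
using System.Xml;

static class MiddleComponent
{
    // Hypothetical helper: selects the direct children of `para` that have a
    // <br> sibling before them and a <br> sibling after them.
    public static XmlNodeList SelectMiddle(XmlNode para)
    {
        return para.SelectNodes(
            "node()[not(self::br) and preceding-sibling::br and following-sibling::br]");
    }
}
```

With exactly two br elements this returns just the middle component. With more than two, it returns everything between the first and the last br in a single list.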

LorenVS 15 years ago

How about:

p/*[not(local-name()='br')]

Then index into this expression using whatever terms you want.

Edit:

For your indexing problem:

p/*[not(local-name()='br') and position() < x and position() > y]

John Fisher 15 years ago

Try using the position() or count() functions. Here's a guess that might help you get to the correct syntax.

p/*[position() > position(/p/br[1]) and position() < position(/p/br[2])]

Edit: Please read the comments before voting or commenting.
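In XPath 1.0, position() takes no arguments, so the guess above is not valid as written. The same idea can be approximated with count(preceding-sibling::node()), which gives a node's zero-based index among its siblings. The sketch below is an assumption-laden illustration, not part of the answer: BetweenBreaks and SelectBetween are hypothetical names, and both br elements are assumed to exist as direct children of the paragraph.

```csharp
using System;
using System.Xml;

static class BetweenBreaks
{
    // Hypothetical helper: selects the children of `para` positioned strictly
    // between its first-th and second-th <br> children (1-based indices).
    public static XmlNodeList SelectBetween(XmlNode para, int first, int second)
    {
        // count(preceding-sibling::node()) is the sibling index of a node;
        // ../br[n] is the n-th <br> child of that node's parent.
        string expr = String.Format(
            "node()[count(preceding-sibling::node()) > count(../br[{0}]/preceding-sibling::node())" +
            " and count(preceding-sibling::node()) < count(../br[{1}]/preceding-sibling::node())]",
            first, second);
        return para.SelectNodes(expr);
    }
}
```

For adjacent indices, such as SelectBetween(para, 1, 2), the result is one component. For non-adjacent indices the br elements in between are included in the result.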