
Scrapy shell fails to scrape data even though the XPath works in the Chrome console

  •  0
  • user8314628  ·  8 years ago

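    The faculty page is dynamic. I found the underlying request in the Chrome Network tab. However, the XPath that works in the browser console returns nothing in the Scrapy shell. I even tried adding request headers.

    [Screenshot: scrapy shell result]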

    [Screenshot: Chrome console result]

    import scrapy
    from universities.items import UniversitiesItem
    
    
    class UniversityOfHouston(scrapy.Spider):
        name = 'University_of_Houston'
        allowed_domains = ['uh.edu']
        start_urls = ['http://www.uh.edu/directory/']
    
        def __init__(self):
            self.lastName = ''
    
        def parse(self, response):
            self.lastName = 'An'
            query = "http://www.uh.edu/directory/proxy.php?q=" + self.lastName + \
                    "&submit=Search&limit=250&loc=HR730&pos=faculty%7Cstaff&faculty=faculty&staff=staff&student=student"
    
            yield scrapy.Request(query, callback=self.parse_staff)
    
        def parse_staff(self, response):
            results = response.xpath('//dt/a/@href').extract()
            for result in results:
                query = 'http://www.uh.edu/directory/' + result
                yield scrapy.Request(query, callback=self.parse_item)
    
        def parse_item(self, response):
    
            item = UniversitiesItem()
    
            item['full_name'] = response.xpath('//h2[@class="single_title"]/text()').extract_first()
            item['university'] = 'University of Houston'
            item['discipline'] = response.xpath('//td/a[@class="org"]/text()').extract_first()
            item['title'] = response.xpath('//tr/td[@class="title"]/text()').extract_first()
            item['email'] = response.xpath('//td/a[@title="email address"]/text()').extract_first()[7:]
            item['phone'] = response.xpath('//td[@class="tel"]/a/text()').extract_first()
    
            yield item
    

    Test version:

    import scrapy
    from universities.items import UniversitiesItem
    
    
    class UniversityOfHouston(scrapy.Spider):
        #name = 'University_of_Houston'
        name = 'uh2'
        allowed_domains = ['uh.edu']
        start_urls = ['http://www.uh.edu/directory/']
    
        def parse(self, response):
            # kw.txt is expected to hold one last name per line; the loop below
            # still uses a hardcoded list for testing.
            with open('kw.txt') as file_object:
                last_names = file_object.readlines()
    
            for ln in ['Lee', 'Zhao']:
                last_name = ln.strip()
                print('-----------------------------------------------------')
                print("scraping last name: ", last_name)
                query = "http://www.uh.edu/directory/proxy.php?q=" + last_name + \
                        "&submit=Search&limit=250&loc=HR730&pos=faculty%7Cstaff&faculty=faculty&staff=staff&student=student"
    
                # Requests are processed asynchronously, so carry the name with the
                # request (meta) instead of on self, where the next iteration overwrites it.
                yield scrapy.Request(query, callback=self.parse_staff, meta={'last_name': last_name})
    
        def parse_staff(self, response):
            results = response.xpath('//dt/a/@href').extract()
            for result in results:
                query_proxy = 'http://www.uh.edu/directory/' + result.replace("index.php", "proxy.php")
                yield scrapy.Request(query_proxy, callback=self.parse_item,
                                     meta={'last_name': response.meta['last_name']})
    
        def parse_item(self, response):
            last_name = response.meta['last_name']
            full_name = response.xpath('//h2[@class="single_title"]/text()').extract_first()
            if full_name and last_name in full_name.split():
                item = UniversitiesItem()
                item['fullname'] = full_name
                # last_name = full_name.split()[-1]
                # item['lastname'] = last_name
                # item['firstname'] = full_name[:-len(last_name)].strip()
                item['university'] = 'University of Houston'
                # extract_first() returns None when nothing matches, so no try/except is needed.
                item['department'] = response.xpath('//td/a[@class="org"]/text()').extract_first()
                item['title'] = response.xpath('//tr/td[@class="title"]/text()').extract_first()
                item['email'] = response.xpath('//td/a[@title="email address"]/text()').extract_first()
                item['phone'] = response.xpath('//td[@class="tel"]/a/text()').extract_first()
    
                yield item
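
As a side note on the query string used in both spiders: the same URL can probably be assembled with `urllib.parse.urlencode` instead of string concatenation, which also takes care of escaping (for example `|` becomes `%7C`). This is only a sketch. `build_directory_query` is a hypothetical helper name, and the parameter values are simply copied from the URL in the code above, not taken from any documented API of the directory.

```python
from urllib.parse import urlencode

# Endpoint copied from the spider above.
PROXY_ENDPOINT = "http://www.uh.edu/directory/proxy.php"


def build_directory_query(last_name):
    """Return the proxy.php search URL for one last name (hypothetical helper)."""
    # A list of tuples keeps the parameters in the same order as the original URL.
    params = [
        ("q", last_name),
        ("submit", "Search"),
        ("limit", 250),
        ("loc", "HR730"),
        ("pos", "faculty|staff"),  # urlencode escapes "|" as %7C
        ("faculty", "faculty"),
        ("staff", "staff"),
        ("student", "student"),
    ]
    return PROXY_ENDPOINT + "?" + urlencode(params)


query_for_an = build_directory_query("An")
print(query_for_an)
```

Names containing characters such as apostrophes or spaces are escaped as well, which plain concatenation would not do.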
    
    1 Answer
  •  1
  •   Tarun Lalwani    8 years ago

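    The problem is that the data is fetched by an AJAX call made from the web page. When you fetch the main page itself, as Scrapy does, the data is not in the response.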

    [Screenshot: AJAX call]
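
To illustrate why the same XPath can match in the browser but not in the Scrapy shell, here is a minimal sketch using only the standard library. Both markup strings are hypothetical stand-ins (the real uh.edu responses are not reproduced here): one mimics the page as downloaded, before any JavaScript runs, and the other mimics the fragment that the AJAX call returns.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, for illustration only.
# What a plain HTTP client receives: an empty container plus a script tag.
MAIN_PAGE = """<html><body>
<div id="results"></div>
<script src="directory.js"></script>
</body></html>"""

# What the AJAX endpoint might return, and what the browser injects into the page.
AJAX_FRAGMENT = """<div>
<h2 class="single_title">Jane Doe</h2>
<table><tr><td class="title">Professor</td></tr></table>
</div>"""


def find_name(markup):
    """Return the text of the first h2.single_title element, or None."""
    root = ET.fromstring(markup)
    node = root.find(".//h2[@class='single_title']")
    return node.text if node is not None else None


main_name = find_name(MAIN_PAGE)      # no match: the data is not in this document
ajax_name = find_name(AJAX_FRAGMENT)  # match: the data lives in the AJAX response
print(main_name, ajax_name)
```

The browser console queries the DOM after the script has run, so it sees the second document; Scrapy only ever sees the first unless it requests the AJAX URL directly.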

    Change your parse_staff to:

    def parse_staff(self, response):
        results = response.xpath('//dt/a/@href').extract()
        for result in results:
            # The detail pages (index.php) are filled in by AJAX; request the
            # proxy.php endpoint that actually returns the data.
            query_proxy = "https://ssl.uh.edu/directory/" + result.replace("index.php", "proxy.php")
            yield response.follow(query_proxy, callback=self.parse_item)
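
The URL rewrite in `parse_staff` can be tried outside Scrapy. The sketch below is a hedged, standard-library version of the same idea: `to_proxy_url` is a hypothetical helper name, and the sample href is made up, since the exact links on the results page are not shown here.

```python
from urllib.parse import urljoin

# Base URL taken from the parse_staff snippet above.
BASE = "https://ssl.uh.edu/directory/"


def to_proxy_url(href, base=BASE):
    """Map a results-page link to the AJAX endpoint (hypothetical helper)."""
    # Swap the page script for the proxy script, then resolve against the base.
    return urljoin(base, href.replace("index.php", "proxy.php"))


# Made-up href, assumed to look roughly like the links on the results page.
example_url = to_proxy_url("index.php?q=ZHAO&id=123")
print(example_url)
```

Unlike plain concatenation, `urljoin` also copes with hrefs that are already absolute or start with `/`, although whether the directory emits such links is an assumption.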