
Parsing Python queue objects

  • MishaVacic  · Tech Community  · 8 years ago

    I am trying to work out what is wrong with my code:

    from queue import Queue
    from threading import Thread
    from html.parser import HTMLParser
    import urllib.request
    
    hosts = ["http://yahoo.com", "http://google.com", "http://ibm.com"]
    
    queue = Queue()
    
    class ThreadUrl(Thread):
       def __init__(self, queue):
           Thread.__init__(self)
           self.queue = queue
    
       def run(self):
          while True:
             host = self.queue.get()
             url=urllib.request.urlopen(host)
             url.read(4096)
             self.queue.task_done()
    
    
    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print("Start tag:", tag)
            for attr in attrs:
                print("     attr:", attr)
    
    
    
    def consumer():
        for i in range(3):
            t = ThreadUrl(queue)
            t.setDaemon(True)
            t.start()
    
        for host in hosts:
            parser = MyHTMLParser()
            parser.feed(host)
            queue.put(host) 
        queue.join()
    
    consumer()
    

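    My goal is to fetch the content of each URL, read it from the queue, and finally parse it. When I run the code, it prints nothing. Where should I put the parser?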

    1 answer  |  as of 8 years ago
  •   lcastillov    8 years ago

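    Here is an example. The parser is created inside the worker's run() method and fed the downloaded content rather than the host string: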

    from queue import Queue
    from threading import Thread
    from html.parser import HTMLParser
    import urllib.request
    
    
    NUMBER_OF_THREADS = 3
    
    
    HOSTS = ["http://yahoo.com", "http://google.com", "http://ibm.com"]
    
    
    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print("Start tag:", tag)
            for attr in attrs:
                print("\tattr:", attr)
    
    
    class ThreadUrl(Thread):
       def __init__(self, queue):
           Thread.__init__(self)
           self.queue = queue
    
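       def run(self):
           while True:
               host = self.queue.get()
               try:
                   with urllib.request.urlopen(host) as response:
                       # Decode the bytes; str() on bytes would give the "b'...'" repr.
                       content = response.read(4096).decode("utf-8", errors="replace")
                   parser = MyHTMLParser()
                   parser.feed(content)
               except Exception as exc:
                   print("Failed to process", host, exc)
               finally:
                   # Always mark the item done so that queue.join() can return.
                   self.queue.task_done()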
    
    
    def consumer():
        queue = Queue()
        for i in range(NUMBER_OF_THREADS):
            thread = ThreadUrl(queue)
            thread.daemon = True  # setDaemon() is deprecated since Python 3.10
            thread.start()
        for host in HOSTS:
            queue.put(host) 
        queue.join()
    
    
    if __name__ == '__main__':
        consumer()
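
To see why the code in the question prints nothing, and to try the worker pattern without touching the network, a minimal sketch along the following lines may help. It is not part of the original answer. The names `FAKE_PAGES`, `TagCollector`, `collect_tags`, `worker` and `run_all` are made up for illustration, and the canned pages stand in for real HTTP responses. The parser collects start tags in a list instead of printing them, so the result can be inspected after the queue has been drained.

```python
from html.parser import HTMLParser
from queue import Queue
from threading import Thread

# Hypothetical canned pages standing in for real HTTP responses.
FAKE_PAGES = {
    "http://example.com/a": '<html><body><a href="/x">x</a></body></html>',
    "http://example.com/b": "<p class='intro'>hello</p>",
}


class TagCollector(HTMLParser):
    """Record start tags in a list instead of printing them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def collect_tags(text):
    """Feed text to a fresh parser and return the start tags it saw."""
    parser = TagCollector()
    parser.feed(text)
    parser.close()
    return parser.tags


def worker(tasks, results):
    """Take hosts from tasks, parse their (fake) content, store the tags."""
    while True:
        host = tasks.get()
        try:
            # A real worker would download the page here instead.
            results.put((host, collect_tags(FAKE_PAGES[host])))
        finally:
            # Mark the item done even on failure so tasks.join() returns.
            tasks.task_done()


def run_all(hosts, number_of_threads=2):
    """Parse every host with a small pool of daemon threads."""
    tasks, results = Queue(), Queue()
    for _ in range(number_of_threads):
        Thread(target=worker, args=(tasks, results), daemon=True).start()
    for host in hosts:
        tasks.put(host)
    tasks.join()  # all results are queued once every task is done
    found = {}
    while not results.empty():
        host, tags = results.get()
        found[host] = tags
    return found
```

With this sketch, `collect_tags("http://yahoo.com")` returns an empty list, because a bare URL string contains no tags. That is what happens when the question's code calls `parser.feed(host)`. Calling `run_all(list(FAKE_PAGES))` returns a dictionary that maps each fake URL to the start tags found in its content.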