
How can I fetch JSON data automatically instead of copying and pasting it by hand?

  • showkey · Tech Community · 4 years ago

    I want to fetch the JSON data at this target URL:
    https://xueqiu.com/stock/cata/stocktypelist.json?page=1&size=300

    Fetching it by hand works: open the URL in a browser, then copy and paste. I want an easier way, programmatic and automated. I have tried several approaches and all of them failed.
    Method 1, the traditional way with wget or curl (note that the unquoted & makes the shell cut the query string off at page=1, as the log below shows; quoting the URL fixes that part, but the 403 comes back either way):

    wget  https://xueqiu.com/stock/cata/stocktypelist.json?page=1&size=300
    --2021-02-09 11:55:44--  https://xueqiu.com/stock/cata/stocktypelist.json?page=1
    Resolving xueqiu.com (xueqiu.com)... 39.96.249.191
    Connecting to xueqiu.com (xueqiu.com)|39.96.249.191|:443... connected.
    HTTP request sent, awaiting response... 403 Forbidden
    2021-02-09 11:55:44 ERROR 403: Forbidden.
    

    Method 2, scraping with Selenium:

    >>> from selenium import webdriver
    >>> browser = webdriver.Chrome()
    >>> url="https://xueqiu.com/stock/cata/stocktypelist.json?page=1&size=300"
    >>> browser.get(url)
    

    The browser then shows this instead of the data (the error_description translates to "Encountered an error, please refresh the page or log in to your account again and retry"):

    {"error_description":"遇到错误,请刷新页面或者重新登录帐号后再试","error_uri":"/stock/cata/stocktypelist.json","error_code":"400016"}
    
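    One workaround worth trying (a sketch, not verified against the site's current checks) is to load the site root first, so the browser session picks up the cookies the endpoint appears to validate, and only then open the JSON URL; the <pre> lookup assumes Chrome's plain rendering of a raw JSON response:

    import json

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    browser = webdriver.Chrome()
    # Visit the homepage first so the session acquires the cookies
    # (xq_a_token and friends) that the JSON endpoint appears to check.
    browser.get("https://xueqiu.com")
    browser.get("https://xueqiu.com/stock/cata/stocktypelist.json?page=1&size=300")
    # Chrome renders a raw JSON response inside a <pre> element.
    data = json.loads(browser.find_element(By.TAG_NAME, "pre").text)
    print(data)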

    Method 3, putting mitmproxy in the middle:

    mitmweb   --listen-host  127.0.0.1  -p  8080
    

    Set the proxy in the browser and open the target URL in the browser.

    Error messages in the terminal:

    Web server listening at http://127.0.0.1:8081/
    Opening in existing browser session.
    Proxy server listening at http://127.0.0.1:8080
    127.0.0.1:41268: clientconnect
    127.0.0.1:41270: clientconnect
    127.0.0.1:41268: HTTP/2 connection terminated by client: error code: 0, last stream id: 0, additional data: None
    

    Error message in the browser (the same error as in method 2):

    error_description   "遇到错误,请刷新页面或者重新登录帐号后再试"
    error_uri   "/stock/cata/stocktypelist.json"
    error_code  "400016"
    
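    For what it's worth, if the browser request ever does go through the proxy successfully, a small mitmproxy addon can save the body automatically instead of copying it by hand. A minimal sketch (assuming mitmproxy's standard scripting API; the filename save_json.py is my own), loaded with mitmweb -s save_json.py:

    # save_json.py
    from mitmproxy import http

    def response(flow: http.HTTPFlow) -> None:
        # Write the body of the target endpoint to disk whenever the
        # browser loads it through the proxy.
        if "stocktypelist.json" in flow.request.pretty_url:
            with open("stocktypelist.json", "wb") as f:
                f.write(flow.response.content)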

    The site guards its data this aggressively; is there really no way to fetch it automatically?

  • Answer 1 · Samsul Islam Rute Figueiredo · 4 years ago

    You can use the requests module:

    import requests
    
    # Cookies copied from a logged-in browser session; these tokens
    # expire, so refresh them from your own browser when the request
    # starts failing.
    cookies = {
        'xq_a_token': '176b14b3953a7c8a2ae4e4fae4c848decc03a883',
        'xqat': '176b14b3953a7c8a2ae4e4fae4c848decc03a883',
        'xq_r_token': '2c9b0faa98159f39fa3f96606a9498edb9ddac60',
        'xq_id_token': 'eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTYxMzQ0MzE3MSwiY3RtIjoxNjEyODQ5MDY2ODI3LCJjaWQiOiJkOWQwbjRBWnVwIn0.VuyNicSjIvVkp9FrCzIlRyx8487XM4HH1C3X9KsFA2FipFiilSifBhux9pMNRyziHHiEifhX-xOgccc8IG1mn8cOylOVy3b-L1YG2T5Hs8MKgx7qm4gnV5Mzm_5_G5BiNtO44aczUcmp0g53dp7-0_Bvw3RlwXzT1DTvCKTV-s_zfBsOPyFTfiqyDUxU-oBRvkz1GpgVJzJL4EmZ8zDE2PBqeW00ueLLC7qPW50WeDCsEFS4ZPAvd2SbX9JPk-lU2WzlcMck2S9iFYmpDwuTeQuPbSeSl6jt5suwTImSgJDIUP9o2TX_Z7nNRDTYxvbP8XlejSt8X0pRDPDd_zpbMQ',
        'u': '661612849116563',
        'device_id': '24700f9f1986800ab4fcc880530dd0ed',
        'Hm_lvt_1db88642e346389874251b5a1eded6e3': '1612849123',
        's': 'c111f3y1kn',
        'Hm_lpvt_1db88642e346389874251b5a1eded6e3': '1612849252',
    }
    
    headers = {
        'Connection': 'keep-alive',
        'Cache-Control': 'no-cache',
        'sec-ch-ua': '"Chromium";v="88", "Google Chrome";v="88", ";Not A Brand";v="99"',
        'sec-ch-ua-mobile': '?0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
        'Accept': 'image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-Mode': 'no-cors',
        'Sec-Fetch-User': '?1',
        'Sec-Fetch-Dest': 'image',
        'Accept-Language': 'en-US,en;q=0.9',
        'Pragma': 'no-cache',
        'Referer': '',
    }
    
    params = (
        ('page', '1'),
        ('size', '300'),
    )
    
    response = requests.get('https://xueqiu.com/stock/cata/stocktypelist.json', headers=headers, params=params, cookies=cookies)
    print(response.status_code)
    json_data = response.json()
    print(json_data)
    
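    The hard-coded cookies above are session tokens and will expire. One variation (a sketch, assuming the endpoint accepts whatever cookies the plain homepage hands out, which may change on the site's side) is to let a requests.Session pick up fresh cookies first instead of pasting them in:

    import requests
    
    session = requests.Session()
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
    })
    # Hitting the homepage first lets the session collect the cookies
    # (xq_a_token and friends) that the JSON endpoint checks.
    session.get('https://xueqiu.com')
    response = session.get(
        'https://xueqiu.com/stock/cata/stocktypelist.json',
        params={'page': 1, 'size': 300},
    )
    print(response.status_code)
    print(response.json())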
  • Answer 2 · basckerwil · 4 years ago

    You can use scrapy:

    import json
    
    import scrapy
    
    
    class StockSpider(scrapy.Spider):
        name = 'stock_spider'
        start_urls = ['https://xueqiu.com/stock/cata/stocktypelist.json?page=1&size=300']
        custom_settings = {
            'DEFAULT_REQUEST_HEADERS': {
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:85.0) Gecko/20100101 Firefox/85.0',
                'Host': 'xueqiu.com',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                'Accept-Language': 'en-US',
                'Accept-Encoding': 'gzip,deflate,br',
                'Connection': 'keep-alive',
                'Cache-Control': 'no-cache',
                'Sec-Fetch-Dest': 'document',
                'Sec-Fetch-Mode': 'navigate',
                'Sec-Fetch-Site': 'none',
                'Sec-Fetch-User': '?1',
                'Upgrade-Insecure-Requests': '1',
                'Pragma': 'no-cache',
                'Referer': '',
            },
            'ROBOTSTXT_OBEY': False
        }
        handle_httpstatus_list = [400]  # let parse() receive the 400 response instead of Scrapy discarding it
    
        def parse(self, response):
            json_result = json.loads(response.body)
            yield json_result
    
    

    Run the spider: scrapy crawl stock_spider
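    If the spider is saved as a standalone file rather than inside a Scrapy project, scrapy runspider also works on the file directly, and adding -o stocks.json to either command writes the yielded item out to a file.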