WebScraper Code for Scraping A-Share Free-Float Market Cap

While learning WebScraper, I wrote a sitemap that scrapes the A-share free-float market cap ranking from Sina Finance, as preparation for some data analysis.

{"_id":"xinglang-caijing","startUrl":["http://vip.stock.finance.sina.com.cn/datacenter/hqstat.html#ltsz"],"selectors":[{"id":"page","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"div.grey_border","multiple":false,"delay":"2000","clickElementSelector":"div.pages a:nth-of-type(n+2)","clickType":"clickOnce","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"every","type":"SelectorElement","parentSelectors":["page"],"selector":"tbody tr","multiple":true,"delay":0},{"id":"code","type":"SelectorText","parentSelectors":["every"],"selector":"th.colorize:nth-of-type(1) a","multiple":false,"regex":"","delay":0},{"id":"name","type":"SelectorText","parentSelectors":["every"],"selector":"th.colorize:nth-of-type(2) a","multiple":false,"regex":"","delay":0},{"id":"close","type":"SelectorText","parentSelectors":["every"],"selector":"td:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"change","type":"SelectorText","parentSelectors":["every"],"selector":"td:nth-of-type(2)","multiple":false,"regex":"","delay":0},{"id":"liutong","type":"SelectorText","parentSelectors":["every"],"selector":"td.colorize.sort_down","multiple":false,"regex":"","delay":0},{"id":"all","type":"SelectorText","parentSelectors":["every"],"selector":"td.colorize:nth-of-type(6)","multiple":false,"regex":"","delay":0}]}

Testing showed that the code above fails with a JSON format error, so here is the latest version:

{"_id":"xinglang-caijing","startUrl":["http://vip.stock.finance.sina.com.cn/datacenter/hqstat.html#ltsz"],"selectors":[{"id":"page","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"tr.red","multiple":true,"delay":0,"clickElementSelector":"div.pages a:nth-of-type(n+2), span.pagecurr","clickType":"clickOnce","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"code","type":"SelectorText","parentSelectors":["page"],"selector":"th.colorize:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"name","type":"SelectorText","parentSelectors":["page"],"selector":"th.colorize:nth-of-type(2)","multiple":false,"regex":"","delay":0},{"id":"close","type":"SelectorText","parentSelectors":["page"],"selector":"td:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"per-chang","type":"SelectorText","parentSelectors":["page"],"selector":"td:nth-of-type(2)","multiple":false,"regex":"","delay":0},{"id":"vol","type":"SelectorText","parentSelectors":["page"],"selector":"td.colorize:nth-of-type(3)","multiple":false,"regex":"","delay":0},{"id":"market","type":"SelectorText","parentSelectors":["page"],"selector":"td.colorize.sort_down","multiple":false,"regex":"","delay":0},{"id":"all","type":"SelectorText","parentSelectors":["page"],"selector":"td.colorize:nth-of-type(6)","multiple":false,"regex":"","delay":0}]}

In testing, the scrape came back incomplete: it returned roughly 2,500 rows, while a complete scrape should be around 3,800.

The number of rows returned also differs on every run; my guess is that Sina Finance's anti-scraping measures are kicking in.
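One workaround is to run the sitemap several times, export each run to CSV, and merge the results, keeping one row per stock code. Below is a minimal Python sketch of that merge. It assumes pandas is installed and that the exports follow WebScraper's convention of one column per selector id (code, name, close, per-chang, vol, market, all); the file names are just placeholders.

import glob

import pandas as pd

# Load every export of the "xinglang-caijing" sitemap (placeholder file pattern).
frames = [pd.read_csv(path, dtype=str) for path in glob.glob("xinglang-caijing*.csv")]
merged = pd.concat(frames, ignore_index=True)

# Keep one row per stock code; rows without a code (failed clicks, blank cells) are dropped.
merged = merged.dropna(subset=["code"]).drop_duplicates(subset=["code"])

print(f"{len(merged)} unique stocks collected so far")
merged.to_csv("ltsz_merged.csv", index=False)

Deduplicating on the stock code rather than on the whole row avoids counting the same stock twice when prices change between runs.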

Strangely, copying the code from this page back into the browser extension throws an error! The most likely culprit is that the blog engine converts straight quotes into typographic (curly) quotes, which breaks the JSON, so make sure every quote is a plain ASCII double quote before importing the sitemap.
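If the import still fails, it can help to validate the sitemap text outside the extension first. Here is a small Python sketch that replaces the usual curly quotes and reports exactly where the JSON parser fails; the file name is a placeholder.

import json

# Read the sitemap text copied from the page (placeholder file name).
with open("sitemap.txt", encoding="utf-8") as f:
    text = f.read()

# Replace the typographic quotes that blog engines commonly substitute.
for bad, good in {"\u201c": '"', "\u201d": '"', "\u2033": '"', "\u2019": "'"}.items():
    text = text.replace(bad, good)

try:
    sitemap = json.loads(text)
    print("Valid JSON:", sitemap["_id"], "-", len(sitemap["selectors"]), "selectors")
except json.JSONDecodeError as err:
    print(f"Invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")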
