
An improved scraper for Guba (stock forum) comments: ten thousand per run

2020/11/4 14:02:30

1. I collected a few stable IP proxies and put them into a small proxy pool.
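As a rough sketch of the proxy-pool idea on its own, the helper below picks one proxy at random and builds the dictionary that `requests` expects. The name `pick_proxies` and the pool addresses are assumptions made for illustration (the addresses are reserved documentation IPs, not working proxies); the full script follows.

```python
import random

# Hypothetical pool for illustration only: these are reserved
# documentation addresses (RFC 5737), not working proxies.
PROXY_POOL = [
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
    "http://192.0.2.12:8080",
]


def pick_proxies(pool, rng=random):
    """Return a requests-style proxies dict for one randomly chosen proxy."""
    proxy = rng.choice(pool)
    # requests selects the proxy by the scheme of the target URL, so the
    # same proxy is registered for both http and https targets.
    return {"http": proxy, "https": proxy}


if __name__ == "__main__":
    proxies = pick_proxies(PROXY_POOL)
    print(proxies)
    # A request would then pass the dict along, for example:
    # requests.get(url, headers=headers, proxies=proxies, timeout=10)
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which can help when debugging which proxy a failing request went through.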

import random
import re
import time
from urllib import parse

import requests
import xlwt
from lxml import etree


def main():
    start_time = time.time()
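    b = 0  # index of the last spreadsheet row written
    # Column headers of the output spreadsheet.
    columns = ['URL', 'Author', 'Posted', 'Title', 'Content', 'Reads', 'Comments']
    workbook = xlwt.Workbook(encoding="utf-8")
    worksheet = workbook.add_sheet('My Worksheet')
    for k in range(len(columns)):
        worksheet.write(0, k, columns[k])
    # Save once after the header row is complete instead of once per column.
    workbook.save('movie.xls')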
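    a = int(input("Enter the start page: "))
    k = int(input("Enter the end page: "))
    n = input("Enter the stock forum (guba) code: ")
    # Note: the loop starts at page a + 1, so the start page itself is skipped.
    for i in range(a + 1, k + 1):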
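        proxy_list = [
            'https://114.98.144.34',
            'https://113.218.243.142',
            'https://114.100.61.13',
            'https://114.233.219.49',
            'https://119.132.72.255',
            'https://1.196.116.113',
        ]
        proxy = random.choice(proxy_list)  # pick one proxy at random
        # requests selects a proxy by the scheme of the target URL, and the
        # pages below are fetched over http, so register the proxy for both.
        proxies = {'http': proxy, 'https': proxy}
        print(proxies)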
        url = "http://guba.eastmoney.com/list,{},f_{}.html".format(n,i)
        headers = {'Referer': "http://guba.eastmoney.com",
                   'User-Agent':
                       'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) '
                       'Gecko/20100101 Firefox/80.0',
                   }
        response0 = requests.get(url=url,headers=headers,proxies=proxies)
        news_comment_urls0 = re.findall(r'/news,{},\S+html'.format(n), response0.text)
        html = etree.HTML(response0.text)  # parse the list page into an element tree
        read_num = html.xpath(
            '//*[@id="articlelistnew"]/div/span[3]/a[contains(@href,"new")]/../../span[1]/text()')  # read counts
        comment_num = html.xpath(
            '//*[@id="articlelistnew"]/div/span[3]/a[contains(@href,"new")]/../../span[2]/text()')  # comment counts
        a = 0  # index into read_num / comment_num for the current list page
        for comment_url0 in news_comment_urls0:
            b += 1  # move to the next spreadsheet row
            worksheet.write(b, 5, read_num[a])
            worksheet.write(b, 6, comment_num[a])
            a += 1
            list_url = "http://guba.eastmoney.com"
            whole_url0 = parse.urljoin(list_url, comment_url0)
            print(whole_url0)
            worksheet.write(b, 0, whole_url0)
            workbook.save('movie.xls')
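            # Fetch the post page with the same headers and proxy as the list page.
            response1 = requests.get(whole_url0, headers=headers, proxies=proxies)
            name = re.findall('<font>(.*?)</font>', response1.text)
            if name:
                # Write only the first match: xlwt rejects list values and
                # raises if the same cell is written twice.
                print(name[0])
                worksheet.write(b, 1, name[0])
                workbook.save('movie.xls')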
            tim = re.findall('<div class="zwfbtime">(.*?)</div>', response1.text)
            # Raw string avoids invalid-escape warnings; keep only the date part.
            tim1 = re.findall(r'\d{4}-\d{2}-\d{2}', str(tim))
            if tim1:
                print(tim1[0])
                worksheet.write(b, 2, tim1[0])
                workbook.save('movie.xls')
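            title = re.findall('<title>(.*?)</title>', response1.text)
            title = str(title)
            # The pattern runs on the list's string form and keeps the text
            # before the first underscore of a title ending in "股吧".
            title1 = re.findall(r".*'(.*?)_.*股吧']", title)
            if title1:
                print(title1[0])
                worksheet.write(b, 3, title1[0])
                workbook.save('movie.xls')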
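            content = re.findall('<div class="stockcodec .xeditor">(.*?)</div>', response1.text, re.DOTALL)
            if content:
                # Crude cleanup: drops ASCII letters and markup characters.
                # Only the first block is written, since xlwt raises when a
                # cell is written twice.
                m = re.sub("[a-zA-Z<>_=#\"/]", '', content[0])
                print(m)
                worksheet.write(b, 4, m)
                workbook.save('movie.xls')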
            print(read_num[a - 1])
            print(comment_num[a - 1])
    end_time = time.time()
    print(f"Total time elapsed: {end_time - start_time} s")
    print(b)
if __name__ == '__main__':
    main()
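
The per-post parsing in the script above can also be separated from the network and spreadsheet code, which makes it testable offline. The following is a minimal sketch under stated assumptions: `extract_post_fields`, `first_group`, and the `SAMPLE` page are hypothetical names and data written for illustration, shaped only by the regular expressions the script already uses. Unlike the script, it strips tags from the post body with a simple tag-matching regex rather than deleting individual characters.

```python
import re


def first_group(pattern, text, flags=0):
    """Return the first capture group of the first match, or None."""
    match = re.search(pattern, text, flags)
    return match.group(1).strip() if match else None


def extract_post_fields(html):
    """Extract author, date, title and content from one post page."""
    author = first_group(r'<font>(.*?)</font>', html)
    time_div = first_group(r'<div class="zwfbtime">(.*?)</div>', html, re.DOTALL)
    date = first_group(r'(\d{4}-\d{2}-\d{2})', time_div) if time_div else None
    # Text before the first underscore of a title that ends in "股吧".
    title = first_group(r'<title>([^_<]*)_[^<]*股吧</title>', html)
    body = first_group(
        r'<div class="stockcodec .xeditor">(.*?)</div>', html, re.DOTALL)
    # Assumption: removing anything that looks like a tag is good enough here.
    content = re.sub(r'<[^>]+>', '', body).strip() if body else None
    return {"author": author, "date": date, "title": title, "content": content}


# Hypothetical page written to match the patterns above; not real site data.
SAMPLE = (
    '<html><head><title>Sample post_Demo(000001)股吧_东方财富网股吧</title></head>'
    '<body><font>alice</font>'
    '<div class="zwfbtime">发表于 2020-11-04 14:02:30 东方财富Android版</div>'
    '<div class="stockcodec .xeditor"><p>hello <b>world</b></p></div>'
    '</body></html>'
)

if __name__ == "__main__":
    print(extract_post_fields(SAMPLE))
```

Every field comes back as `None` when its pattern does not match, so the caller can decide whether to skip the row or write an empty cell instead of failing partway through a run.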

Article link: http://www.dtmao.cc/news_show_350064.shtml
