
I scraped the popular, highly upvoted Q&A from university-related topics on Zhihu — can you find traces of your own university life in them?

旁观者 • 2021-03-01

How has your university life been? Fulfilling? Fun? Full of regrets? In this post we use Python to scrape the popular, highly upvoted Q&A from university-related topics on Zhihu and see whether any familiar scenes show up.

Scraping

First, we search Zhihu for a few of the most active university-related topics, as shown in the screenshot below:

We judge how "hot" a topic is by its follower count, number of questions, amount of featured (essence) content, and so on. Then we click into one of the topics, as shown below:

Note down the topic ID in the page URL — the string of digits right after topic — as we will need it when scraping.
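
For instance, if the topic page URL looked like https://www.zhihu.com/topic/19552330/top-answers (both the ID and the path suffix here are purely hypothetical), the ID can be pulled out with a small regex — a minimal sketch:

import re

# hypothetical topic URL -- the digits after "topic/" are the topic ID
topic_url = "https://www.zhihu.com/topic/19552330/top-answers"
match = re.search(r"/topic/(\d+)", topic_url)
if match:
    topic_id = int(match.group(1))
    print(topic_id)  # 19552330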

Now let's look at the scraping implementation. First we import the Python libraries we need:

import re, json, random, requests, urllib3

# requests.get() is called with verify=False below, so silence the
# InsecureRequestWarning that urllib3 would otherwise print on every request
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

The implementation for scraping the Q&A content looks like this:

def get_answers_by_page(topic_id, page_no):
    # module-level state shared with the rest of the script: answer_ids holds the
    # IDs already saved, maxnum is how many answers are still wanted; db is part
    # of the full source and not used in this function
    global db, answer_ids, maxnum
    limit = 10
    offset = page_no * limit
    # essence (featured) feed of the topic; the long "include" parameter asks the
    # API to return answer content, vote counts, comment counts, etc. in one call
    url = "https://www.zhihu.com/api/v4/topics/" + str(
        topic_id) + "/feeds/essence?include=data%5B%3F(target.type%3Dtopic_sticky_module)%5D.target.data%5B%3F(target.type%3Danswer)%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F(target.type%3Dtopic_sticky_module)%5D.target.data%5B%3F(target.type%3Danswer)%5D.target.is_normal%2Ccomment_count%2Cvoteup_count%2Ccontent%2Crelevant_info%2Cexcerpt.author.badge%5B%3F(type%3Dbest_answerer)%5D.topics%3Bdata%5B%3F(target.type%3Dtopic_sticky_module)%5D.target.data%5B%3F(target.type%3Darticle)%5D.target.content%2Cvoteup_count%2Ccomment_count%2Cvoting%2Cauthor.badge%5B%3F(type%3Dbest_answerer)%5D.topics%3Bdata%5B%3F(target.type%3Dtopic_sticky_module)%5D.target.data%5B%3F(target.type%3Dpeople)%5D.target.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics%3Bdata%5B%3F(target.type%3Danswer)%5D.target.annotation_detail%2Ccontent%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F(target.type%3Danswer)%5D.target.author.badge%5B%3F(type%3Dbest_answerer)%5D.topics%3Bdata%5B%3F(target.type%3Darticle)%5D.target.annotation_detail%2Ccontent%2Cauthor.badge%5B%3F(type%3Dbest_answerer)%5D.topics%3Bdata%5B%3F(target.type%3Dquestion)%5D.target.annotation_detail%2Ccomment_count&limit=" + str(
        limit) + "&offset=" + str(offset)
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    }
    try:
        r = requests.get(url, verify=False, headers=headers)
    except requests.exceptions.ConnectionError:
        return False
    content = r.content.decode("utf-8")
    data = json.loads(content)
    is_end = data["paging"]["is_end"]  # True once the API has no more pages
    items = data["data"]
    if len(items) <= 0:
        return True
    # crude tag stripper: keep only the text between ">" and "<"
    pre = re.compile(">(.*?)<")
    for item in items:
        if maxnum <= 0:  # collected enough answers, stop early
            return True
        answer_id = item["target"]["id"]
        if answer_id in answer_ids:  # already saved this answer
            continue
        if item["target"]["type"] != "answer":  # skip articles, people, etc.
            continue
        if int(item["target"]["voteup_count"]) < 10000:  # keep 10k+ upvotes only
            continue
        answer = ''.join(pre.findall(item["target"]["content"].replace("\n", "").replace(" ", "")))
        if len(answer) == 0:
            continue
        if len(answer) > 200:  # keep only short answers
            continue
        answer_ids.append(answer_id)
        question = item["target"]["question"]["title"].replace("\n", "")
        vote_num = item["target"]["voteup_count"]
        if answer.find("<") > -1 and answer.find(">") > -1:
            continue  # leftover HTML the regex could not strip, skip this answer
        sline = "=" * 50
        content = sline + "\nQ: {}\nA: {}\nvote: {}\n".format(question, answer, vote_num)
        print(content)
        save2file(content)
        maxnum -= 1
    return is_end

The two parameters of this function are the topic ID and the page number of the content to fetch.
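
Only the page-level function is shown here; the complete script is behind the Lanzou link below. As a rough idea of how it could be driven, here is a minimal sketch — the topic ID, the 100-answer cap, and the pause between pages are illustrative assumptions, not values from the original:

import time

db = None            # placeholder for whatever the full script keeps here
answer_ids = []      # answer IDs that have already been written out
maxnum = 100         # assumed cap on how many qualifying answers to collect

topic_id = 19552330  # hypothetical topic ID copied from the topic URL
page_no = 0
while maxnum > 0:
    done = get_answers_by_page(topic_id, page_no)
    if done:                          # feed reports is_end, or the cap was hit
        break
    page_no += 1
    time.sleep(random.uniform(1, 3))  # pause between pages to avoid hammering the API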

We save the scraped content to a file; the implementation is:

def save2file(content):
    with open('result', 'a', encoding='utf-8') as file:
        file.write(content)
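
Note that save2file opens the result file in append mode, so repeated runs keep adding to the same file; delete or rename it before a fresh run if you do not want old answers mixed in.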

Source code: Lanzou Cloud link

