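Background

The Leetcode website frequently fails to load for me, so I decided to build a Leetcode client that fetches the problems through my own server.
That requires two things: the problem data on Leetcode, and a way to implement user login.
A crawler is the natural first choice for this.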

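Inspecting the Web Requests

Clear the browser's cookies, then start from the login page and visit each page you want to crawl, watching how the data is requested.

Leetcode has separate Chinese and English sites, so crawling both sets of data means running the same flow twice, once per site.
Host:

  1. Chinese Leetcode: leetcode-cn.com
  2. English Leetcode: leetcode.com

Home page:

  1. Chinese Leetcode: https://leetcode-cn.com
  2. English Leetcode: https://leetcode.com

The rest of this post walks through the crawl using the Chinese Leetcode as the example.
In the API paths that follow, ~ stands for the home page URL prefix.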
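To make the ~ shorthand concrete, here is a minimal sketch of a helper that expands it into a full URL for either site. The names BASE_URLS and expand are my own, chosen for illustration only.

```python
# Home page prefixes for the two sites, as listed above.
BASE_URLS = {
    "cn": "https://leetcode-cn.com",
    "en": "https://leetcode.com",
}


def expand(path, site="cn"):
    """Replace the leading "~" of an API path with the site's home page URL."""
    if not path.startswith("~/"):
        raise ValueError("expected a path starting with '~/', got %r" % path)
    return BASE_URLS[site] + path[1:]


print(expand("~/accounts/login"))
print(expand("~/graphql", site="en"))
```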

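Login

API endpoint: ~/accounts/login
First send a GET request to simulate opening the login page.
Then, with the username and password filled in, send the login POST request with the following parameters: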

{
"csrfmiddlewaretoken": "the most recently updated token value",
"login": "katherineleeyq@163.com",
"password": "",
"next": "/problems"
}

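When the server receives the GET request, it returns a cookie containing a token. That token must be included in the login POST request; otherwise the server treats the login request as invalid.

When the server receives the login POST request, it verifies the user's credentials. If the login succeeds, it again returns a cookie containing the token, but with an updated value. This cookie then identifies the logged-in user.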
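The two-step flow can be sketched with the Python standard library roughly as follows. Several details here are assumptions rather than things observed above: that the token cookie is named csrftoken, that the POST body is form-encoded, and that a Referer header is needed. The function names are hypothetical, and log_in is not called here because it needs network access and real credentials.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

LOGIN_URL = "https://leetcode-cn.com/accounts/login"


def find_csrf_token(jar, cookie_name="csrftoken"):
    """Return the token stored in the cookie jar, or None if it is absent.

    The cookie name "csrftoken" is an assumption.
    """
    for cookie in jar:
        if cookie.name == cookie_name:
            return cookie.value
    return None


def build_login_form(token, login, password, next_path="/problems"):
    """Build the login parameters shown above."""
    return {
        "csrfmiddlewaretoken": token,
        "login": login,
        "password": password,
        "next": next_path,
    }


def log_in(login, password):
    """GET the login page to obtain a token, then POST the credentials."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.open(LOGIN_URL)  # step 1: the GET that makes the server set the cookie
    token = find_csrf_token(jar)
    if token is None:
        raise RuntimeError("the server did not set a token cookie")
    form = build_login_form(token, login, password)
    body = urllib.parse.urlencode(form).encode("utf-8")  # assumed: form-encoded body
    request = urllib.request.Request(
        LOGIN_URL, data=body, headers={"Referer": LOGIN_URL}  # assumed header
    )
    opener.open(request)  # step 2: the login POST; the jar now holds the updated cookie
    return opener
```

The returned opener carries the cookie jar, so later requests made through it are sent as the logged-in user.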

User Progress (login required)

API: ~/progress/all
Send a GET request to fetch the current logged-in user's problem-solving progress.
The server returns JSON:

{
"solvedPerDifficulty": {
"Easy": 38,
"Medium": 12,
"Hard": 2
},
"XP": 47,
"solvedTotal": 52,
"questionTotal": 832,
"attempted": 0,
"sessionList": [{
"id": 189400,
"name": ""
}],
"leetCoins": 47,
"sessionName": "",
"unsolved": 780
}
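As a small illustration of how a client might consume this response, the sketch below checks that the per-difficulty counts add up to solvedTotal and computes the solved fraction. The function name summarize_progress is hypothetical, and the sample is a trimmed copy of the response above.

```python
def summarize_progress(progress):
    """Summarize a ~/progress/all response (already parsed from JSON)."""
    per_difficulty = progress["solvedPerDifficulty"]
    solved = progress["solvedTotal"]
    total = progress["questionTotal"]
    return {
        "solved": solved,
        "total": total,
        # True when Easy + Medium + Hard equals the reported total.
        "consistent": sum(per_difficulty.values()) == solved,
        "ratio": solved / total,
    }


sample = {
    "solvedPerDifficulty": {"Easy": 38, "Medium": 12, "Hard": 2},
    "solvedTotal": 52,
    "questionTotal": 832,
}
print(summarize_progress(sample))
```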

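Problem Summaries (no login required)

API endpoint: ~/api/problemset/all
Send a GET request to fetch summary information for all problems, including:

  1. Backend problem id
  2. Frontend display id
  3. Title
  4. Difficulty
  5. AC (accepted) count

The server returns JSON with the following content: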
{
"user_name": "",
"num_solved": 0,
"num_total": 832,
"ac_easy": 0,
"ac_medium": 0,
"ac_hard": 0,
"stat_status_pairs": [{
"stat": {
"question_id": 1043,
"question__article__live": null,
"question__article__slug": null,
"question__title": "Grid Illumination",
"question__title_slug": "grid-illumination",
"question__hide": false,
"total_acs": 175,
"total_submitted": 665,
"frontend_question_id": 1001,
"is_new_question": false
},
"status": null,
"difficulty": {
"level": 3
},
"paid_only": false,
"is_favor": false,
"frequency": 0,
"progress": 0
},
...]
}

question_id is the backend id, and frontend_question_id is the id displayed on the frontend.
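The five pieces of information listed for this endpoint can be pulled out of each stat_status_pairs entry as sketched below. The output key names are my own, and the mapping of level 1/2/3 to Easy/Medium/Hard is an assumption; the sample above only shows level 3.

```python
# Assumed mapping; the response above only shows "level": 3.
DIFFICULTY_NAMES = {1: "Easy", 2: "Medium", 3: "Hard"}


def flatten_problem(pair):
    """Reduce one stat_status_pairs entry to the fields the client needs."""
    stat = pair["stat"]
    level = pair["difficulty"]["level"]
    return {
        "backend_id": stat["question_id"],
        "frontend_id": stat["frontend_question_id"],
        "title": stat["question__title"],
        "slug": stat["question__title_slug"],
        "difficulty": DIFFICULTY_NAMES.get(level, str(level)),
        "accepted": stat["total_acs"],
    }


def flatten_problemset(payload):
    """Flatten a whole ~/api/problemset/all response."""
    return [flatten_problem(pair) for pair in payload["stat_status_pairs"]]


sample = {
    "stat_status_pairs": [{
        "stat": {
            "question_id": 1043,
            "question__title": "Grid Illumination",
            "question__title_slug": "grid-illumination",
            "total_acs": 175,
            "frontend_question_id": 1001,
        },
        "difficulty": {"level": 3},
    }]
}
print(flatten_problemset(sample))
```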

Problem Title Translations (not needed for the English Leetcode)

API: ~/graphql
Params: getQuestionTranslation
Send a POST request to fetch the translated (Chinese) title for every problem.

{
"data": {
"translations": [{
"title": "\u4e24\u6570\u4e4b\u548c",
"question": {
"questionId": "1",
"__typename": "QuestionNode"
},
"__typename": "AppliedTranslationNode"
},
...]
}
}

Iterate over the problem information fetched earlier and, matching on the backend id, add each problem's translated title to its entry.
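A minimal sketch of that merge step is shown here. Note that questionId arrives as a string in the GraphQL response while question_id is an integer in the problem list, so one side has to be converted. The function name and the translated_title key are hypothetical.

```python
def add_translated_titles(problemset, translation_payload):
    """Attach each translated title to the matching problem, by backend id."""
    titles = {
        int(item["question"]["questionId"]): item["title"]
        for item in translation_payload["data"]["translations"]
    }
    for pair in problemset["stat_status_pairs"]:
        stat = pair["stat"]
        # None marks problems that have no translation in the response.
        stat["translated_title"] = titles.get(stat["question_id"])
    return problemset


problems = {"stat_status_pairs": [
    {"stat": {"question_id": 1, "question__title": "Two Sum"}},
    {"stat": {"question_id": 1043, "question__title": "Grid Illumination"}},
]}
translations = {"data": {"translations": [
    {"title": "\u4e24\u6570\u4e4b\u548c", "question": {"questionId": "1"}},
]}}
add_translated_titles(problems, translations)
```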

Problem Tags (no login required)

API: ~/problems/api/tags
Send a GET request to fetch all the categories that problems are tagged with.

{
"companies": [],
"topics": [{
"slug": "stack",
"name": "Stack",
"questions": [20, 42, 71, 84, 85, 94, 103, 144, 145, 150, 155, 173, 224, 225, 232, 255, 272, 316, 331, 341, 385, 394, 402, 439, 785, 456, 496, 503, 591, 636, 682, 726, 735, 739, 781, 874, 883, 886, 916, 931, 937, 943, 957, 983, 1017],
"translatedName": "栈"
}, {
"slug": "heap",
"name": "Heap",
"questions": [23, 215, 218, 239, 253, 264, 295, 313, 347, 355, 358, 373, 378, 407, 451, 502, 659, 692, 719, 744, 761, 778, 789, 794, 802, 803, 836, 887, 895, 902, 918],
"translatedName": "堆"
},
...
]
}
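The response maps each topic to its questions, but a client usually wants the reverse: the topics for a given question. The sketch below inverts the mapping. It treats the numbers in questions as opaque ids, since the response does not say whether they are backend or frontend ids; the function name is hypothetical and the sample lists are shortened.

```python
from collections import defaultdict


def topics_by_question(tags_payload):
    """Invert topic -> questions into question id -> list of topic slugs."""
    index = defaultdict(list)
    for topic in tags_payload["topics"]:
        for question_id in topic["questions"]:
            index[question_id].append(topic["slug"])
    return dict(index)


sample = {
    "companies": [],
    "topics": [
        {"slug": "stack", "name": "Stack", "questions": [20, 42, 71]},
        {"slug": "heap", "name": "Heap", "questions": [23, 215]},
    ],
}
index = topics_by_question(sample)
```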

Problem Details (no login required)

API: ~/graphql
Params:

{
"operationName": "questionData",
"variables": {
"titleSlug": "two-sum"
},
"query": "query questionData($titleSlug: String!) {\n question(titleSlug: $titleSlug) {\n questionId\n questionFrontendId\n boundTopicId\n title\n titleSlug\n content\n translatedTitle\n translatedContent\n isPaidOnly\n difficulty\n likes\n dislikes\n isLiked\n similarQuestions\n contributors {\n username\n profileUrl\n avatarUrl\n __typename\n }\n langToValidPlayground\n topicTags {\n name\n slug\n translatedName\n __typename\n }\n companyTagStats\n codeSnippets {\n lang\n langSlug\n code\n __typename\n }\n stats\n hints\n solution {\n id\n canSeeDetail\n __typename\n }\n status\n sampleTestCase\n metaData\n judgerAvailable\n judgeType\n mysqlSchemas\n enableRunCode\n enableTestMode\n envInfo\n __typename\n }\n}\n"
}
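Parameters like the ones above could be packaged into a POST request roughly as follows, using only the standard library. The query here is a trimmed-down version of the full one, the function names are hypothetical, and sending the body as JSON with a Content-Type header is an assumption; the server may also require extra headers or cookies that are not covered here. fetch_question is not called because it needs network access.

```python
import json
import urllib.request

GRAPHQL_URL = "https://leetcode-cn.com/graphql"

# A shortened version of the questionData query shown above.
QUESTION_QUERY = """
query questionData($titleSlug: String!) {
  question(titleSlug: $titleSlug) {
    questionId
    title
    translatedTitle
    content
  }
}
"""


def build_question_request(title_slug):
    """Build the POST request for one problem, identified by its titleSlug."""
    payload = {
        "operationName": "questionData",
        "variables": {"titleSlug": title_slug},
        "query": QUESTION_QUERY,
    }
    return urllib.request.Request(
        GRAPHQL_URL,
        data=json.dumps(payload).encode("utf-8"),  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},  # assumed encoding
    )


def fetch_question(title_slug):
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(build_question_request(title_slug)) as response:
        return json.load(response)
```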

Send a POST request with the problem's titleSlug.
The server returns JSON:

{
"data": {
"question": {
"questionId": "1",
"questionFrontendId": "1",
"boundTopicId": 2,
"title": "Two Sum",
"titleSlug": "two-sum",
"content": "<p>Given an array of integers, return <strong>indices</strong> of the two numbers such that they add up to a specific target.</p>",
"translatedTitle": "两数之和",
"translatedContent": "<p>给定一个整数数组 <code>nums</code>&nbsp;和一个目标值 <code>target</code>,请你在该数组中找出和为目标值的那&nbsp;<strong>两个</strong>&nbsp;整数,并返回他们的数组下标。</p>",
"isPaidOnly": false,
"difficulty": "Easy",
"likes": 4300,
"dislikes": 0,
"isLiked": null,
"similarQuestions": "[{\"title\": \"3Sum\", \"titleSlug\": \"3sum\", \"difficulty\": \"Medium\", \"translatedTitle\": \"三数之和\"}, ...]",
"contributors": [],
"langToValidPlayground": "{\"cpp\": true, \"java\": true, \"python\": true, \"python3\": true, \"mysql\": false, \"mssql\": false, \"oraclesql\": false, \"c\": false, \"csharp\": false, \"javascript\": false, \"ruby\": false, \"bash\": false, \"swift\": false, \"golang\": false, \"scala\": false, \"html\": false, \"pythonml\": false, \"kotlin\": false, \"rust\": false, \"php\": false}",
"topicTags": [{
"name": "Array",
"slug": "array",
"translatedName": "数组",
"__typename": "TopicTagNode"
}, {
"name": "Hash Table",
"slug": "hash-table",
"translatedName": "哈希表",
"__typename": "TopicTagNode"
}],
"companyTagStats": null,
"codeSnippets": [{
"lang": "C++",
"langSlug": "cpp",
"code": "class Solution {\r\npublic:\r\n vector<int> twoSum(vector<int>& nums, int target) {\r\n \r\n }\r\n};",
"__typename": "CodeSnippetNode"
}, {
"lang": "Java",
"langSlug": "java",
"code": "class Solution {\r\n public int[] twoSum(int[] nums, int target) {\r\n \r\n }\r\n}",
"__typename": "CodeSnippetNode"
}, ...
],
"stats": "{\"totalAccepted\": \"256.4K\", \"totalSubmission\": \"572.8K\", \"totalAcceptedRaw\": 256444, \"totalSubmissionRaw\": 572768, \"acRate\": \"44.8%\"}",
"hints": [],
"solution": {
"id": "30",
"canSeeDetail": true,
"__typename": "ArticleNode"
},
"status": null,
"sampleTestCase": "[2,7,11,15]\n9",
"metaData": "{\r\n \"name\": \"twoSum\",\r\n \"params\": [\r\n {\r\n \"name\": \"nums\",\r\n \"type\": \"integer[]\"\r\n },\r\n {\r\n \"name\": \"target\",\r\n \"type\": \"integer\"\r\n }\r\n ],\r\n \"return\": {\r\n \"type\": \"integer[]\",\r\n \"size\": 2\r\n }\r\n}",
"judgerAvailable": true,
"judgeType": "small",
"mysqlSchemas": [],
"enableRunCode": true,
"enableTestMode": false,
"envInfo": "{...}",
"__typename": "QuestionNode"
}
}
}
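One thing worth noticing in this response is that several fields, such as stats, similarQuestions, langToValidPlayground and metaData, are JSON documents stored as strings, so they need a second decoding pass. A minimal sketch, with a hypothetical function name and a shortened sample:

```python
import json

# Fields that hold JSON encoded as a string in the response above.
JSON_STRING_FIELDS = ("stats", "similarQuestions", "langToValidPlayground", "metaData")


def decode_embedded_json(question):
    """Return a copy of the question with its string-encoded JSON fields parsed."""
    decoded = dict(question)
    for field in JSON_STRING_FIELDS:
        raw = decoded.get(field)
        if isinstance(raw, str):
            decoded[field] = json.loads(raw)
    return decoded


sample = {
    "title": "Two Sum",
    "stats": '{"totalAccepted": "256.4K", "totalAcceptedRaw": 256444, "acRate": "44.8%"}',
    "hints": [],
}
decoded = decode_embedded_json(sample)
```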

Favorites Lists & Official Lists

API: ~/graphql
Params:

{
"operationName": "allFavorites",
"variables": {},
"query": "query allFavorites {\n favoritesLists {\n allFavorites {\n idHash\n name\n isPublicFavorite\n questions {\n questionId\n __typename\n }\n __typename\n }\n officialFavorites {\n idHash\n name\n questions {\n questionId\n __typename\n }\n __typename\n }\n __typename\n }\n}\n"
}

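Send a POST request to fetch the current user's favorites lists. (The response also includes the otherwise hidden official lists.)
The server returns JSON: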

{
"data": {
"favoritesLists": {
"allFavorites": [{
"idHash": "9lwifn7",
"name": "Favorite",
"isPublicFavorite": false,
"questions": [{
"questionId": "1",
"__typename": "QuestionNode"
}, {
"questionId": "240",
"__typename": "QuestionNode"
},
...
],
"__typename": "FavoriteNode"
}],
"officialFavorites": [{
"idHash": "ex0k24j",
"name": "腾讯精选练习(50 题)",
"questions": [{
"questionId": "2",
"__typename": "QuestionNode"
}, {
"questionId": "4",
"__typename": "QuestionNode"
},
...
],
"__typename": "FavoriteNode"
}],
"__typename": "FavoritesNode"
}
}
}
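To use this response in a client, it helps to collapse each list into its name plus a plain list of question ids. The sketch below keys the result by idHash, since two lists could share a name; the function name and output shape are my own choices, and the sample is shortened from the response above.

```python
def favorites_by_hash(payload):
    """Map each list's idHash to its name, kind, and integer question ids."""
    lists = payload["data"]["favoritesLists"]
    result = {}
    for kind in ("allFavorites", "officialFavorites"):
        for favorite in lists[kind]:
            result[favorite["idHash"]] = {
                "name": favorite["name"],
                "kind": kind,
                # questionId is a string in the response; convert for matching.
                "question_ids": [int(q["questionId"]) for q in favorite["questions"]],
            }
    return result


sample = {"data": {"favoritesLists": {
    "allFavorites": [{
        "idHash": "9lwifn7",
        "name": "Favorite",
        "questions": [{"questionId": "1"}, {"questionId": "240"}],
    }],
    "officialFavorites": [{
        "idHash": "ex0k24j",
        "name": "腾讯精选练习(50 题)",
        "questions": [{"questionId": "2"}, {"questionId": "4"}],
    }],
}}}
favorites = favorites_by_hash(sample)
```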

Submission History

API: ~/graphql
Params:

{
"operationName": "Submissions",
"variables": {
"offset": 0,
"limit": 20,
"lastKey": null,
"questionSlug": "two-sum"
},
"query": "query Submissions($offset: Int!, $limit: Int!, $lastKey: String, $questionSlug: String!) {\n submissionList(offset: $offset, limit: $limit, lastKey: $lastKey, questionSlug: $questionSlug) {\n lastKey\n hasNext\n submissions {\n id\n statusDisplay\n lang\n runtime\n timestamp\n url\n isPending\n memory\n __typename\n }\n __typename\n }\n}\n"
}

The server returns JSON:

{
"data": {
"submissionList": {
"lastKey": "prjjt165",
"hasNext": false,
"submissions": [{
"id": "13196529",
"statusDisplay": "Accepted",
"lang": "cpp",
"runtime": "6 ms",
"timestamp": "1503338328",
"url": "/submissions/detail/13196529/",
"isPending": "Not Pending",
"memory": "N/A",
"__typename": "SubmissionDumpNode"
}, ...],
"__typename": "SubmissionListNode"
}
}
}
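The offset, limit, lastKey and hasNext fields suggest the list is paginated. The sketch below walks through all pages under the assumption that the next page is requested by advancing offset by limit and passing back the previous page's lastKey; I have not confirmed that this is how the server expects it. fetch_page is a hypothetical callable that takes the variables object, sends the Submissions query, and returns the parsed JSON.

```python
def iter_submissions(fetch_page, question_slug, limit=20):
    """Yield every submission for a problem, following hasNext across pages."""
    offset = 0
    last_key = None
    while True:
        variables = {
            "offset": offset,
            "limit": limit,
            "lastKey": last_key,
            "questionSlug": question_slug,
        }
        page = fetch_page(variables)["data"]["submissionList"]
        yield from page["submissions"]
        if not page["hasNext"]:
            break
        # Assumed paging rule: move the offset forward and echo lastKey back.
        offset += limit
        last_key = page["lastKey"]
```

Passing the fetcher in as an argument keeps the paging logic separate from the HTTP code, so it can be exercised with a stub that returns canned pages.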

Interview Information

API: ~/graphql
Params:

{
"operationName": "interviewOptions",
"variables": {},
"query": "query interviewOptions {\n interviewed {\n interviewedUrl\n companies {\n id\n name\n slug\n __typename\n }\n timeOptions {\n id\n name\n __typename\n }\n stageOptions {\n id\n name\n __typename\n }\n __typename\n }\n}\n"
}

The server returns JSON:

{
"data": {
"interviewed": {
"interviewedUrl": "/problems/api/interviewed/",
"companies": [{
"id": 584,
"name": "58赶集",
"slug": "58",
"__typename": "InterviewCompanyOption"
}, {
"id": 552,
"name": "Adobe",
"slug": "adobe",
"__typename": "InterviewCompanyOption"
}, {
"id": 480,
"name": "Aetion",
"slug": "aetion",
"__typename": "InterviewCompanyOption"
},...],
"timeOptions": [{
"id": 0,
"name": "last week",
"__typename": "InterviewTimeOption"
}, {
"id": 1,
"name": "last month",
"__typename": "InterviewTimeOption"
}, {
"id": 2,
"name": "last 3 months",
"__typename": "InterviewTimeOption"
}, {
"id": 3,
"name": "last 6 months",
"__typename": "InterviewTimeOption"
}, {
"id": 4,
"name": "more than 6 months",
"__typename": "InterviewTimeOption"
}, {
"id": 5,
"name": "other",
"__typename": "InterviewTimeOption"
}],
"stageOptions": [{
"id": 0,
"name": "Online Assessment",
"__typename": "InterviewStageOption"
}, {
"id": 1,
"name": "Phone Interview",
"__typename": "InterviewStageOption"
}, {
"id": 4,
"name": "On Campus Interview",
"__typename": "InterviewStageOption"
}, {
"id": 2,
"name": "Onsite Interview",
"__typename": "InterviewStageOption"
}, {
"id": 3,
"name": "Don't know",
"__typename": "InterviewStageOption"
}],
"__typename": "InterviewSurveyNode"
}
}
}