Web Crawler Recap

Scraped a small site. The target seems mostly concerned with blocking injection (?) and barely does any anti-scraping, so it wasn't really hard. The less glorious part: my code broke in quite a few places, and I was too lazy to incrementally update the results I'd already scraped, so I ended up crawling the whole thing three times over, putting a fair amount of load on their server.

This was actually my first time writing a simulated login in Python (embarrassing).

First up, the patient list: paging through it produced no network requests in the packet capture. Poking around DevTools showed the data was stashed in IndexedDB, so I had to brush up on how to export it:

var db;
var res;  // will hold everything exported from the store
var DB_NAME = '';
var DB_VERSION = 1;
var STORE_NAME = '';
var request = indexedDB.open(DB_NAME, DB_VERSION);
request.onupgradeneeded = function() {
  // Create a new object store if this is the first time we're using
  // this DB_NAME/DB_VERSION combo.
  request.result.createObjectStore(STORE_NAME, {autoIncrement: true});
};
request.onsuccess = function() {
  db = request.result;
  // db is only set inside this callback, so the read has to start here too.
  var transaction = db.transaction(STORE_NAME, 'readonly');
  var objectStore = transaction.objectStore(STORE_NAME);
  if ('getAll' in objectStore) {
    // IDBObjectStore.getAll() returns the full set of items in the store.
    objectStore.getAll().onsuccess = function(event) {
      res = event.target.result;
    };
  }
};
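
Once res is populated, Chrome's DevTools console can dump it straight to the clipboard with the built-in copy() helper, e.g. copy(JSON.stringify(res)), which is enough to hand the data off to the Python side.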

Then it was the usual routine, simulated login and all that. One gotcha: when the requests library hits a 302 it happily follows the redirect, but the cookies set along the way don't get saved = =

The fix is to open a Session:

import requests

REQ = requests.Session()  # persists cookies across requests
r = REQ.post(loginUrl, headers=header, data=payload)
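
A Session keeps a single cookie jar behind the scenes, so cookies set by any response, including the ones handed out mid-redirect, get carried along on every subsequent request.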

For cookie handling I went with a plain dict, and thanks to StackOverflow-driven development I picked up a shiny new way to merge dicts:

dic = REQ.cookies.get_dict()
cookie = {**cookie, **dic}  # merge; on duplicate keys, dic wins
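
The {**a, **b} unpacking syntax needs Python 3.5+, and the right-hand dict wins on duplicate keys, which is exactly the behavior you want when refreshing cookies.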

I also learned how to turn the headers copied out of Chrome's DevTools into a Python dict, and how to turn a dict back into a cookie string = =
Looking back, past me really was way too naive.

# Convert raw headers copied from Chrome into a requests-style dict
# (.strip() drops the space that follows each colon)
dict([h.partition(':')[0], h.partition(':')[2].strip()] for h in rawheaders.split('\n'))

def gen(s):  # convert a cookie dict back into a "k=v;" cookie string
    res = ""
    for k in s:
        res += k + '=' + s[k] + ';'
    return res
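
A quick sanity check of the two directions, with made-up sample values:

rawheaders = "Host: example.com\nAccept: text/html"
dict([h.partition(':')[0], h.partition(':')[2].strip()] for h in rawheaders.split('\n'))
# {'Host': 'example.com', 'Accept': 'text/html'}
gen({'sessionid': 'abc123', 'lang': 'en'})
# 'sessionid=abc123;lang=en;'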

Also learned that adding parentheses in a regex gives you a capture group, so you can pick out just one part of the matched string. Love it!
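
A minimal sketch of that in Python, with a made-up pattern and URL:

import re

m = re.search(r'picId=(\d+)', 'http://example.com/photo?picId=12345&size=L')
if m:
    print(m.group(1))  # '12345' -- just the captured group, not the full match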

How to catch and log exceptions:

import logging

try:
    do_something()  # hypothetical stand-in for the step that may blow up
except Exception as e:
    logging.exception("message")  # logs at ERROR level with the full traceback

Creating directories:

import errno
import os

if not os.path.exists('{}_res/{}'.format(CRAWL_ID, idx)):
    try:
        os.makedirs('{}_res/{}'.format(CRAWL_ID, idx))
    except OSError as exc:  # guard against a race between the check and makedirs
        if exc.errno != errno.EEXIST:
            raise
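
On Python 3.2+ the whole dance collapses into one call, since exist_ok swallows the already-exists case:

os.makedirs('{}_res/{}'.format(CRAWL_ID, idx), exist_ok=True)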

Downloading images:

photoRes = requests.get(curUrl, headers=newHeader)
if photoRes.status_code == 200:  # skip error pages instead of saving them as images
    # keep the original extension; 'wb' so a re-crawl overwrites instead of appending
    with open('{}_res/{}/{}.{}'.format(CRAWL_ID, idx, idNum, curName.split('.')[-1]), 'wb') as f:
        f.write(photoRes.content)
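
For bigger files it's nicer to stream the body instead of holding it all in memory; a sketch with the same curUrl/newHeader and a hypothetical localPath:

photoRes = requests.get(curUrl, headers=newHeader, stream=True)
with open(localPath, 'wb') as f:
    for chunk in photoRes.iter_content(chunk_size=8192):
        f.write(chunk)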

One last gotcha:

photoNames = [x['picId'] for x in jsonRes['result'] if "X光" in str(x['tags'])]

Note that x['tags'] can be None, hence the str() wrapper.
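
str(None) is just the string "None", so the substring test quietly comes back False instead of raising a TypeError. The same thing with the None guard spelled out, assuming the same jsonRes shape:

photoNames = [x['picId'] for x in jsonRes['result']
              if x['tags'] is not None and "X光" in str(x['tags'])]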