A single-page image-download crawler example for an image website
The site is www.sj96.com
I wrote this program mainly to learn web-scraping techniques. The goal is to download all of the images on one page. The program is basically working; the code is as follows:
import requests
import re
import time
import os

# =============== fetch the page ===============
url = 'http://www.sj96.com/beauty/photos/64743.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
}
html = requests.get(url=url, headers=headers).text

# =============== parse out the image URLs and the page title ===============
urls = re.findall('<img src="(.*?)"/>', html)
title = re.findall('<title>(.*?)_四季图片</title>', html)

# ------ filter and rewrite the URLs ------
# Keep only real photo URLs (they contain 'caiji' and are absolute),
# and point them at the mirror host instead of the original image host.
# Note: the original version called urls.remove() while iterating over urls,
# which skips elements; filtering with `continue` avoids that bug.
old_imgurl = 'https://img.99ym.cn'
imgurlhost = 'http://192.250.198.123'
new_url = []
for value in urls:
    if 'caiji' not in value or 'http' not in value:
        continue
    if old_imgurl in value:
        new_url.append(value.replace(old_imgurl, imgurlhost))
try:
    new_url.remove('/static/index/img/mob/icon-navlist.png')
except ValueError:
    pass
print(new_url)

# =============== create the image directory ===============
firstdir = 'upload'
if not os.path.exists(firstdir):
    os.mkdir(firstdir)
dirname = firstdir + '/' + title[0]
if not os.path.exists(dirname):
    os.mkdir(dirname)

num = len(new_url)
print('共将采集' + str(num) + '张图片')  # "about to download N images"

# =============== download each image ===============
for i, url in enumerate(new_url, start=1):
    time.sleep(2)  # be polite to the server
    try:
        file_name = url.split('/')[-1]
        response = requests.get(url, headers=headers)
        with open(dirname + '/' + file_name, 'wb') as f:
            f.write(response.content)
        print('第' + str(i) + '张图片下载成功')  # "image i downloaded"
    except Exception:
        print('第' + str(i) + '张图片下载失败')  # "image i failed"
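The URL filtering and rewriting step is the subtle part of the program, so here it is as a small standalone function, a minimal sketch assuming the same host constants as above; the sample URLs are made up for illustration:

```python
def rewrite_image_urls(urls, old_host='https://img.99ym.cn',
                       new_host='http://192.250.198.123'):
    """Keep only absolute 'caiji' image URLs from old_host, repointed at new_host."""
    result = []
    for u in urls:
        # skip site icons and relative paths
        if 'caiji' not in u or 'http' not in u:
            continue
        if old_host in u:
            result.append(u.replace(old_host, new_host))
    return result

sample = [
    'https://img.99ym.cn/caiji/2022/01.jpg',       # real photo -> kept, rewritten
    '/static/index/img/mob/icon-navlist.png',      # site icon -> dropped
    'https://img.99ym.cn/logo.png',                # no 'caiji' -> dropped
]
print(rewrite_image_urls(sample))
# ['http://192.250.198.123/caiji/2022/01.jpg']
```

Filtering with `continue` inside the loop, rather than calling `remove()` on the list being iterated, ensures no element is skipped.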
Unless otherwise noted, copyright belongs to the original author; please credit the source when reposting.