Day01

1. Understanding the structure of a web page

HTML + CSS + JavaScript
A good way to get familiar with the tags is to build a simple page using only HTML and CSS.

[Screenshot: classroom exercise result]
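
To make the nesting of tags visible, here is a minimal sketch that prints the tag outline of a small page using Python's standard `html.parser` module. The page content and the `OutlineParser` / `outline_of` names are made up for illustration; they are not the classroom exercise itself.

```python
from html.parser import HTMLParser

# A tiny hypothetical page: structure in HTML, one CSS rule in <style>.
PAGE = """<html>
<head><style>h1 { color: teal; }</style></head>
<body>
  <div class="main-content">
    <h1>My first page</h1>
    <p>Hello</p>
  </div>
</body>
</html>"""


class OutlineParser(HTMLParser):
    """Collects a (depth, tag) pair for every opening tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


def outline_of(html):
    parser = OutlineParser()
    parser.feed(html)
    return parser.outline


# Print each tag indented by its nesting depth.
for depth, tag in outline_of(PAGE):
    print('  ' * depth + tag)
```

The sketch only handles pages where every tag is closed explicitly, which is enough for a hand-written practice page.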

2. Scraping a local web page and filtering highly rated articles

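Learn how to parse a web page with BeautifulSoup
Learn how to describe the position of the target elements with CSS selectors
Filter out the information you need and store it in a dictionary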

    pip install beautifulsoup4
    pip install lxml

    from bs4 import BeautifulSoup as BS

    data = []
    path = '.\\web\\new_index.html'

    with open(path, 'r', encoding='utf-8') as f:
        soup = BS(f.read(), 'lxml')

    # One CSS selector per field; each call returns a list of matching tags.
    titles = soup.select('ul > li > div.article-info > h3 > a')
    pics = soup.select('ul > li > img')
    descs = soup.select('ul > li > div.article-info > p.description')
    rates = soup.select('ul > li > div.rate > span')
    cates = soup.select('ul > li > div.article-info > p.meta-info')
    # Debug: print(titles + pics + descs + rates + cates)

    # Combine the fields of each article into one dictionary.
    for title, pic, desc, rate, cate in zip(titles, pics, descs, rates, cates):
        info = {
            'title': title.get_text(),
            'pic': pic.get('src'),
            'descs': desc.get_text(),
            'rate': rate.get_text(),
            'cate': list(cate.stripped_strings),
        }
        data.append(info)

    # Print only the highly rated articles. Compare the numeric rating,
    # not the length of its text.
    for i in data:
        if float(i['rate']) >= 3:
            print(i['title'], i['cate'])
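
The filtering step can be tried on its own, without the local HTML file. The sketch below assumes ratings arrive as text such as '4.5'; the `high_rated` helper and the sample records are hypothetical.

```python
def high_rated(records, threshold=3.0):
    """Return the records whose 'rate' text parses to a number >= threshold."""
    selected = []
    for record in records:
        try:
            rate = float(record['rate'])
        except ValueError:
            continue  # skip ratings that are not numeric text
        if rate >= threshold:
            selected.append(record)
    return selected


# Hypothetical scraped records, for illustration only.
sample = [
    {'title': 'A', 'rate': '4.5'},
    {'title': 'B', 'rate': '2.0'},
    {'title': 'C', 'rate': 'n/a'},
]
print([r['title'] for r in high_rated(sample)])
```

Note that '2.0' and '4.5' both have three characters, so a check on the length of the rating text cannot tell them apart; converting to a number can.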

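To get the selector of an element, press Ctrl+Shift+C and click the element on the page, then right-click the highlighted node in the panel on the right and choose Copy > Copy selector. For example:
body > div.main-content > ul > li:nth-child(1) > div.article-info > h3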
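
A selector copied from the browser usually pins one specific element, here through `li:nth-child(1)`. The sketch below, run against a small made-up HTML string, suggests what happens when that part is dropped so the selector matches every list item. It uses the built-in 'html.parser' backend only so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the page the copied selector came from.
html = """<html><body><div class="main-content"><ul>
<li><div class="article-info"><h3>First</h3></div></li>
<li><div class="article-info"><h3>Second</h3></div></li>
</ul></div></body></html>"""

soup = BeautifulSoup(html, 'html.parser')

copied = 'body > div.main-content > ul > li:nth-child(1) > div.article-info > h3'
general = 'body > div.main-content > ul > li > div.article-info > h3'

# The copied selector matches one item; the generalized one matches all.
first_only = [h.get_text() for h in soup.select(copied)]
all_items = [h.get_text() for h in soup.select(general)]
print(first_only)
print(all_items)
```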

Day02

1. Scraping data from TripAdvisor

Scrape the selected subheadings and save them to a txt file

    from bs4 import BeautifulSoup as BS
    import requests

    url = ''
    headers = {
        'User-Agent': '',  # the User-Agent your browser sends when it requests the URL
        'Cookie': '',
    }

    web_data = requests.get(url, headers=headers)
    # Check that the site was reached and the request succeeded.
    print('Status code: ' + str(web_data.status_code))

    with open('./data.txt', 'a', encoding='utf-8') as file:
        soup = BS(web_data.text, 'lxml')
        titles = soup.select('')  # put the selector of the elements you want here
        for title in titles:
            data = {
                'titles': title.get_text()
            }
            print(data)
            file.write(str(data) + '\n')
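
Writing `str(data)` produces lines that are easy to read but awkward to load again. As an alternative (a swapped-in technique, not what the notes above do), each record could be written as one JSON object per line. The helper names and sample records below are hypothetical.

```python
import json
import os
import tempfile


def append_records(path, records):
    """Append each record to the file as one JSON object per line."""
    with open(path, 'a', encoding='utf-8') as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII titles readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + '\n')


def load_records(path):
    """Read the file back into a list of dictionaries."""
    with open(path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Write to a temporary folder so the sketch leaves no file behind in the project.
out_path = os.path.join(tempfile.mkdtemp(), 'data.txt')
append_records(out_path, [{'titles': 'Central Park'}, {'titles': '中央公园'}])
print(load_records(out_path))
```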

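Summary:
Scraped a real website, TripAdvisor
Understood how Request and Response work
Learned how to use the get method of the requests library
Locating elements on a real page: look for a unique feature
Using headers, scraping several pages in a row, and trying the mobile version of the site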
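
The points about headers and multi-page scraping can be sketched without touching the network. The example below builds one GET request per page with the standard library's `urllib.request.Request` and inspects it instead of sending it. The URL pattern, the offset step, and the User-Agent value are placeholders, not TripAdvisor's real ones.

```python
from urllib.request import Request

# Hypothetical URL pattern and header values, for illustration only.
BASE = 'https://example.com/attractions?offset={}'
HEADERS = {'User-Agent': 'Mozilla/5.0 (example)'}


def build_requests(first, last, step):
    """Build one GET request per page offset, each carrying the same headers."""
    return [Request(BASE.format(offset), headers=HEADERS)
            for offset in range(first, last + 1, step)]


reqs = build_requests(0, 90, 30)
for req in reqs:
    # urllib stores header names capitalized, hence 'User-agent'.
    print(req.get_method(), req.full_url, req.get_header('User-agent'))
```

With the requests library, the same list of URLs would simply be passed one by one to `requests.get(url, headers=headers)`.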

    import time

    import requests
    from bs4 import BeautifulSoup as BS

    # url, urls and headers are left empty here; fill them in as in the previous example.
    url = ''
    urls = []
    headers = {'User-Agent': '', 'Cookie': ''}

    def get_attractions(url, data=None):
        my_data = requests.get(url, headers=headers)
        print(my_data.status_code)
        time.sleep(3)  # pause between requests
        with open('./data_cate.txt', 'a', encoding='utf-8') as file:
            soup = BS(my_data.text, 'lxml')
            # titles = soup.select('div.fdltM > div.WlYyy.cPsXC.biNiR.cKUMi.dpKLb.eYhTT.cWWWn.fmARL.fPuGtitlea')
            titles = soup.select('div.eHyiI > span > div')
            cates = soup.select('article > div.eLWnh.P0 > div.IcpoT.P0 > div.bTLYC.P0 > div > div')
            imgs = soup.select('li.bBdQR._A.bxQEm > picture > img')
            for title, cate, img in zip(titles, cates, imgs):
                data = {
                    'title': title.get_text(),
                    'cate': list(cate.stripped_strings),
                    'img': img.get('src'),
                }
                print(data)
                file.write(str(data) + '\n')

    # Scrape the first page, then every further page.
    get_attractions(url)
    for tmp_url in urls:
        get_attractions(tmp_url)
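
One thing to watch with `zip(titles, cates, imgs)`: `zip` stops at the shortest list, so if one selector misses an element the remaining fields are silently dropped or paired with the wrong item. A minimal sketch of a length check before zipping follows; `zip_checked` is a hypothetical helper, not part of BeautifulSoup.

```python
def zip_checked(**fields):
    """Zip named lists into dictionaries, refusing lists of unequal length."""
    lengths = {name: len(values) for name, values in fields.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('selector results differ in length: {}'.format(lengths))
    names = list(fields)
    return [dict(zip(names, row)) for row in zip(*fields.values())]


# Equal lengths: one dictionary per item.
rows = zip_checked(title=['A', 'B'], img=['a.jpg', 'b.jpg'])
print(rows)

# Unequal lengths: the mismatch is reported instead of hidden.
try:
    zip_checked(title=['A', 'B'], img=['a.jpg'])
except ValueError as err:
    print(err)
```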


2. A single-page mini shop built with Bootstrap

    from bs4 import BeautifulSoup as BS

    path = './homepage/index.html'

    with open(path, 'r', encoding='utf-8') as web_data:
        soup = BS(web_data, 'lxml')

    titles = soup.select('body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div > div > div.caption > h4:nth-child(2) > a')
    images = soup.select('body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div > div > img')
    reviews = soup.select('body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div > div > div.ratings > p.pull-right')
    prices = soup.select('body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div > div > div.caption > h4.pull-right')
    stars = soup.select('body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div > div > div.ratings > p:nth-child(2)')
    # print(titles, images, reviews, prices, stars, sep='\n---\n')

    with open('ans.txt', 'a', encoding='utf-8') as ans:
        for title, image, review, price, star in zip(titles, images, reviews, prices, stars):
            data = {
                'title': title.get_text(),
                'image': image.get('src'),
                'review': review.get_text(),
                'price': price.get_text(),
                # The star rating is the number of filled star icons.
                'star': len(star.find_all('span', class_='glyphicon glyphicon-star')),
            }
            ans.write(str(data) + '\n')
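
The star count relies on telling filled icons from empty ones by their class. A small sketch on made-up markup, assuming empty stars carry the class `glyphicon-star-empty` as in Bootstrap 3 templates: here the count uses a CSS class selector, which matches the class token regardless of the order of the classes.

```python
from bs4 import BeautifulSoup

# Hypothetical ratings block: four filled stars and one empty star.
snippet = """<div class="ratings">
<p class="pull-right">15 reviews</p>
<p>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star-empty"></span>
</p>
</div>"""

soup = BeautifulSoup(snippet, 'html.parser')
star_block = soup.select('div.ratings > p:nth-child(2)')[0]

# '.glyphicon-star' matches that exact class token, not 'glyphicon-star-empty'.
filled = len(star_block.select('span.glyphicon-star'))
empty = len(star_block.select('span.glyphicon-star-empty'))
print(filled, empty)
```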


3. Scraping listing information from Xiaozhu (小猪) rentals (the app has anti-scraping measures)

I could not get this one to work.

Day03

1. Scraping weheartit (it has anti-scraping measures!)

View the page source in the browser and save a copy locally
Images can be downloaded with urllib.request.urlretrieve

    import time
    import urllib.request

    from bs4 import BeautifulSoup

    path = ''  # folder prefix for downloaded images; left empty here

    def get_url(full_url):
        # full_url is the path of the page source saved locally.
        img_urls = []
        with open(full_url, 'r', encoding='utf-8') as web_data:
            soup = BeautifulSoup(web_data, 'lxml')
            imgs = soup.select('#main-container > div.grid-responsive > div.col.span-content > div > div > div > div > div > a > img')
            for i in imgs:
                print(i.get('src'))
                img_urls.append(i.get('src'))
        print(len(img_urls), 'images shall be downloaded!')
        return img_urls

    def dl_image(url):
        # Build the file name from two path segments of the image URL.
        urllib.request.urlretrieve(url, path + url.split('/')[-3] + url.split('/')[-2] + '.jpg')
        time.sleep(3)  # pause between downloads
        print('Done')
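
The file name passed to `urlretrieve` is assembled from two segments of the image URL. That step can be checked separately, without downloading anything. The example URL below is hypothetical and only mimics a `<host>/images/<id>/<variant>/<file>` layout.

```python
from urllib.parse import urlparse


def image_filename(url):
    """Join the last two directory segments of the URL path into a .jpg name."""
    segments = urlparse(url).path.split('/')
    return segments[-3] + segments[-2] + '.jpg'


# Hypothetical image URL, for illustration only.
example = 'https://example.com/images/12345/superthumb/photo.jpg'
print(image_filename(example))
```

Parsing the URL first means a query string that happens to contain slashes does not shift the segment positions.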

2. Scraping second-hand laptop listings from 58.com (58同城)

It looks like my IP got banned

    import requests

    headers = {'User-Agent': '', 'Cookie': ''}  # fill in as in the Day02 example

    urlt = 'http://bj.58.com/pingbandiannao/24604629984324x.shtml'
    url = 'https://bj.58.com/shouji/46176488358684x.shtml'

    wb_data = requests.get(url, headers=headers)
    # Save the raw page so it can be inspected offline.
    with open('./bj.58.com_spider/58_phone.html', 'a', encoding='utf-8') as book:
        book.write(wb_data.text)
    print('slider status:' + str(wb_data.status_code))
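
When a site starts refusing requests, the response body may be an error page rather than the listing, so it can help to look at the status code before saving anything. A minimal sketch, with a hypothetical `save_page` helper and made-up status codes and content:

```python
import os
import tempfile


def save_page(status_code, text, path):
    """Write the page to disk only for a 200 response; return whether it was saved."""
    if status_code != 200:
        return False
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return True


# Use a temporary folder so the sketch does not touch the project files.
target = os.path.join(tempfile.mkdtemp(), '58_phone.html')
print(save_page(403, 'blocked', target), os.path.exists(target))
print(save_page(200, '<html></html>', target), os.path.exists(target))
```

A blocked client may still receive a 200 response containing a verification page, so this check is a first filter rather than proof that the real page was saved.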