[Python Notes] Simple hands-on Python web scraping


Scraping the Maoyan TOP100 with requests + BeautifulSoup + SQLAlchemy + pymysql, writing the results to a database and a txt file

I needed a scraper for some coursework, so this was a good excuse to review a few things: scrape the Maoyan TOP100 movie board, write the records to a database with SQLAlchemy, and also dump them to a txt file.

First, set up the database connection

```python
from sqlalchemy import create_engine, Column, Integer, String, Text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

HOSTNAME = '127.0.0.1'
DATABASE = 'movies'
PORT = '3306'
USERNAME = 'root'
PASSWORD = 'root'
DB_URL = "mysql+pymysql://{username}:{password}@{host}:{port}/{database}?charset=utf8mb4".format(
    username=USERNAME, password=PASSWORD, host=HOSTNAME, port=PORT, database=DATABASE)

engine = create_engine(DB_URL)
conn = engine.connect()
Base = declarative_base()
Session = sessionmaker(engine)()
```
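As a sanity check before connecting, the URL-building step can be run on its own. This minimal sketch uses the same placeholder credentials as above and just prints the connection string SQLAlchemy will receive:

```python
# Placeholder credentials, matching the config above (not real secrets)
USERNAME = 'root'
PASSWORD = 'root'
HOSTNAME = '127.0.0.1'
PORT = '3306'
DATABASE = 'movies'

# str.format assembles the dialect+driver URL that create_engine() expects
DB_URL = "mysql+pymysql://{username}:{password}@{host}:{port}/{database}?charset=utf8mb4".format(
    username=USERNAME, password=PASSWORD, host=HOSTNAME, port=PORT, database=DATABASE)
print(DB_URL)
```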

Create the table

```python
class Movies(Base):
    __tablename__ = 'movies'
    index = Column(Integer, primary_key=True, autoincrement=True)
    src = Column(Text, nullable=False)
    name = Column(String(50), nullable=False)
    actor = Column(String(50), nullable=False)
    time = Column(String(50), nullable=False)
    score = Column(String(50), nullable=False)

Base.metadata.create_all(engine)
alter = 'alter table movies convert to character set utf8mb4;'
conn.execute(alter)
```

Note that you must execute the statement that converts the character set, otherwise the Chinese text cannot be written.

Analyze the page structure

```python
from bs4 import BeautifulSoup
import requests
import re

url = 'https://maoyan.com/board/4?offset={}'

def main(index):
    req = requests.get(url.format(str(index)))
    soup = BeautifulSoup(req.text, "html5lib")
    for item in soup.select('dd'):
        pass
```

Inspecting the page shows that each movie sits inside its own `<dd>` tag; once you grab that tag, searching downward within it yields all the data you need.
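To illustrate the one-record-per-`<dd>` layout, here is the same idea on a made-up fragment (the markup below is a simplified stand-in for the real Maoyan page, not its actual HTML):

```python
import re

# Simplified stand-in for the board page: each film lives in its own <dd>
sample_html = '''
<dl class="board-wrapper">
  <dd><i class="board-index">1</i><p class="name">Movie A</p></dd>
  <dd><i class="board-index">2</i><p class="name">Movie B</p></dd>
</dl>
'''

# Grabbing every <dd>...</dd> span gives one chunk per movie,
# which can then be searched for the individual fields
movies = re.findall(r'<dd>(.*?)</dd>', sample_html, re.S)
print(len(movies))  # 2
```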

Scrape the data

```python
def get_index(item):
    index = item.select_one("i").text
    return index

def get_src(item):
    img_src = item.select("img")[1]
    template = re.compile('data-src="(.*?)"')
    img_src = template.findall(str(img_src))[0]
    return img_src

def get_name(item):
    name = item.select(".name")[0].text
    return name

def get_actor(item):
    actor = item.select(".star")[0].text.split(':')[1]
    return actor

def get_time(item):
    time = item.select(".releasetime")[0].text.split(':')[1]
    return time

def get_score(item):
    score = item.select('.integer')[0].text + item.select('.fraction')[0].text
    return score
```

These extract the fields we need. Because the poster URL lives in the lazy-load attribute `data-src` rather than `src`, I use a regular expression to pull it out.
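A standalone sketch of that regex, run against a hypothetical `<img>` tag (the URL below is made up, not a real Maoyan address):

```python
import re

# Hypothetical lazy-loaded image tag: the real URL is in data-src, not src
img_tag = '<img data-src="https://example.com/poster.jpg" alt="poster"/>'

# Non-greedy capture of everything between the data-src quotes
template = re.compile(r'data-src="(.*?)"')
src = template.findall(img_tag)[0]
print(src)  # https://example.com/poster.jpg
```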

Build a dict

```python
def get_dict(item):
    index = int(get_index(item))
    src = get_src(item)
    name = get_name(item)
    actor = get_actor(item)
    time = get_time(item)
    score = get_score(item)
    movies_dict = {'index': index, 'src': src, 'name': name,
                   'actor': actor, 'time': time, 'score': score}
    return movies_dict
```

Collect the scraped fields into a dict (in hindsight, this step wasn't really necessary).

Write to a txt file

```python
import json

def write_file(content):
    content = json.dumps(content, ensure_ascii=False)
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(content + '\n')
```

The dict has to be encoded into a JSON string with `json.dumps` first, otherwise it cannot be written to the file.
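A quick illustration of why `ensure_ascii=False` matters when the record contains Chinese text (the sample record below is made up):

```python
import json

movie = {'name': '霸王别姬', 'score': '9.5'}

# With the default ensure_ascii=True the Chinese characters are written
# as \uXXXX escapes; with ensure_ascii=False they stay readable
escaped = json.dumps(movie)
readable = json.dumps(movie, ensure_ascii=False)
print(escaped)
print(readable)
```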

Write to the database

```python
def write_to_mysql(content):
    src = content['src']
    name = content['name']
    actor = content['actor'].split('\n')[0]
    time = content['time']
    score = content['score']
    data = Movies(src=src, name=name, actor=actor, time=time, score=score)
    Session.add(data)
    Session.commit()
```

Call it all from the main function

```python
def main(index):
    req = requests.get(url.format(str(index)))
    soup = BeautifulSoup(req.text, "html5lib")
    for item in soup.select('dd'):
        movies_dict = get_dict(item)
        write_to_mysql(movies_dict)
        write_file(movies_dict)
```

Scrape every page

```python
url = 'https://maoyan.com/board/4?offset={}'
for i in range(10):
    main(i * 10)
```
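The board paginates via the `offset` query parameter in steps of 10, so the loop above walks ten pages. A quick sketch that just prints the URLs it would request:

```python
url = 'https://maoyan.com/board/4?offset={}'

# Ten pages of ten movies each: offsets 0, 10, 20, ..., 90
pages = [url.format(i * 10) for i in range(10)]
print(pages[0])   # https://maoyan.com/board/4?offset=0
print(pages[-1])  # https://maoyan.com/board/4?offset=90
```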

Full code

```python
from bs4 import BeautifulSoup
from sqlalchemy import create_engine, Column, Integer, String, Text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
import requests
import re
import json

HOSTNAME = '127.0.0.1'
DATABASE = 'movies'
PORT = '3306'
USERNAME = 'root'
PASSWORD = 'root'
DB_URL = "mysql+pymysql://{username}:{password}@{host}:{port}/{database}?charset=utf8mb4".format(
    username=USERNAME, password=PASSWORD, host=HOSTNAME, port=PORT, database=DATABASE)

engine = create_engine(DB_URL)
conn = engine.connect()
Base = declarative_base()
Session = sessionmaker(engine)()

class Movies(Base):
    __tablename__ = 'movies'
    index = Column(Integer, primary_key=True, autoincrement=True)
    src = Column(Text, nullable=False)
    name = Column(String(50), nullable=False)
    actor = Column(String(50), nullable=False)
    time = Column(String(50), nullable=False)
    score = Column(String(50), nullable=False)

Base.metadata.create_all(engine)
alter = 'alter table movies convert to character set utf8mb4;'
conn.execute(alter)

def get_index(item):
    index = item.select_one("i").text
    return index

def get_src(item):
    img_src = item.select("img")[1]
    template = re.compile('data-src="(.*?)"')
    img_src = template.findall(str(img_src))[0]
    return img_src

def get_name(item):
    name = item.select(".name")[0].text
    return name

def get_actor(item):
    actor = item.select(".star")[0].text.split(':')[1]
    return actor

def get_time(item):
    time = item.select(".releasetime")[0].text.split(':')[1]
    return time

def get_score(item):
    score = item.select('.integer')[0].text + item.select('.fraction')[0].text
    return score

def get_dict(item):
    index = int(get_index(item))
    src = get_src(item)
    name = get_name(item)
    actor = get_actor(item)
    time = get_time(item)
    score = get_score(item)
    movies_dict = {'index': index, 'src': src, 'name': name,
                   'actor': actor, 'time': time, 'score': score}
    return movies_dict

def write_file(content):
    content = json.dumps(content, ensure_ascii=False)
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(content + '\n')

def write_to_mysql(content):
    src = content['src']
    name = content['name']
    actor = content['actor'].split('\n')[0]
    time = content['time']
    score = content['score']
    data = Movies(src=src, name=name, actor=actor, time=time, score=score)
    Session.add(data)
    Session.commit()

def main(index):
    req = requests.get(url.format(str(index)))
    soup = BeautifulSoup(req.text, "html5lib")
    for item in soup.select('dd'):
        movies_dict = get_dict(item)
        write_to_mysql(movies_dict)
        write_file(movies_dict)

url = 'https://maoyan.com/board/4?offset={}'
for i in range(10):
    main(i * 10)
```

Scraping Qzone posts with Selenium

Configure the driver and simulate login

```python
from selenium import webdriver
import time

qq = input("Enter a QQ number: ")
ss_url = 'https://user.qzone.qq.com/{}/311'.format(qq)
driver = webdriver.Chrome("chromedriver.exe")
driver.maximize_window()
driver.get(ss_url)
driver.switch_to.frame('login_frame')
driver.find_element_by_class_name('face').click()
next_page = 'page'
page = 1
```

Grab the posts

```python
while next_page:
    time.sleep(2)
    # driver.implicitly_wait(100)
    driver.switch_to.frame('app_canvas_frame')
    content = driver.find_elements_by_css_selector('.content')
    stime = driver.find_elements_by_css_selector('.c_tx.c_tx3.goDetail')
    print('Scraping page %s' % page)
    for con, sti in zip(content, stime):
        data = {
            'time': sti.text,
            'shuos': con.text
        }
        print(data)
        time.sleep(1)
```

`zip` pairs each post's timestamp element with its body element so they can be traversed together.
`time.sleep()` waits for the page to load (I hadn't yet worked out implicit vs. explicit waits, so I fell back on a hard sleep).
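A minimal sketch of the `zip` pairing, using plain strings in place of Selenium WebElements (the sample data is made up):

```python
# Stand-ins for the .goDetail timestamp elements and .content body elements
stimes = ['2020-01-01', '2020-01-02']
contents = ['first post', 'second post']

# zip walks both lists in lockstep, yielding (timestamp, body) tuples,
# so each record combines the two matching elements
records = [{'time': t, 'shuos': c} for t, c in zip(stimes, contents)]
print(records[0])  # {'time': '2020-01-01', 'shuos': 'first post'}
```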

Turn the page

```python
    # still inside the while loop
    next_page = driver.find_element_by_link_text('下一頁')
    page = page + 1
    next_page.click()
    driver.switch_to.parent_frame()
```

After turning the page you must call `driver.switch_to.parent_frame()` to return to the parent frame, otherwise the elements on the next page cannot be located.

Full code

```python
from selenium import webdriver
import time

qq = input("Enter a QQ number: ")
ss_url = 'https://user.qzone.qq.com/{}/311'.format(qq)
driver = webdriver.Chrome("chromedriver.exe")
driver.maximize_window()
driver.get(ss_url)
driver.switch_to.frame('login_frame')
driver.find_element_by_class_name('face').click()
next_page = 'page'
page = 1
while next_page:
    time.sleep(2)
    # driver.implicitly_wait(100)
    driver.switch_to.frame('app_canvas_frame')
    content = driver.find_elements_by_css_selector('.content')
    stime = driver.find_elements_by_css_selector('.c_tx.c_tx3.goDetail')
    print('Scraping page %s' % page)
    for con, sti in zip(content, stime):
        data = {
            'time': sti.text,
            'shuos': con.text
        }
        print(data)
        time.sleep(1)
    next_page = driver.find_element_by_link_text('下一頁')
    page = page + 1
    next_page.click()
    driver.switch_to.parent_frame()
```