tags: python
This article assumes that Python 3.x and Scrapy are already installed and working in a Windows environment. Installation tutorials for both are easy to find; Scrapy itself is installed through pip (pip install Scrapy).
Before crawling, create a new Scrapy project. Press Win+R, open the cmd command window, and run the following commands:
cd desktop
scrapy startproject huya
The huya project folder now exists on the desktop. The files in the folder are as follows:
scrapy.cfg: the project configuration file
huya/: the project's Python module; code is imported from here
huya/items.py: the project's item definitions (the fields to scrape)
huya/pipelines.py: the project's pipeline file
huya/settings.py: the project's settings file
huya/spiders/: the directory that holds the spider code
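To confirm that startproject produced the layout listed above, a small check like the following can help. This is only a sketch: the helper name missing_project_files is hypothetical, and the list of expected paths is copied from the description above rather than from Scrapy itself.

```python
from pathlib import Path

# Paths relative to the project root, taken from the list above.
EXPECTED = [
    "scrapy.cfg",
    "huya/items.py",
    "huya/pipelines.py",
    "huya/settings.py",
    "huya/spiders",
]


def missing_project_files(project_root):
    """Return the expected paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Calling missing_project_files with the path of the huya folder on the desktop should return an empty list if the project was created as described.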
Next, change into the spiders directory (huya/huya/spiders/ relative to the desktop) with the following commands:
cd huya
cd huya
cd spiders
Run the following command in that directory. It creates the file huyaspider.py in the spiders directory and restricts crawling to the domain huya.com:
scrapy genspider huyaspider "huya.com"
Open the file items.py in the huya/huya/ directory and edit it as follows:
# -*- coding: utf-8 -*-
import scrapy


class HuyaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Room theme
    nickname = scrapy.Field()
    # Link
    imagelink = scrapy.Field()
    # Storage path
    imagePath = scrapy.Field()
Open the file huyaspider.py in the huya/huya/spiders/ directory and modify the code as follows:
# -*- coding: utf-8 -*-
import scrapy
from huya.items import HuyaItem


class HuyaspiderSpider(scrapy.Spider):
    name = 'huyaspider'
    allowed_domains = ['huya.com']
    start_urls = ['https://www.huya.com/g/2168']

    def parse(self, response):
        # Select every <li> node in the listing with Scrapy's built-in XPath support
        image_list = response.xpath('//div[@class="box-bd"]/ul/li')
        for img_each in image_list:
            huyaItem = HuyaItem()
            # The image's title attribute is used as the nickname
            huyaItem["nickname"] = img_each.xpath("./a/img[@class='pic']/@title").extract()[0]
            # The data-original attribute holds the image URL
            huyaItem["imagelink"] = img_each.xpath("./a/img[@class='pic']/@data-original").extract()[0]
            yield huyaItem
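The two relative XPath expressions in parse() can be tried without running a crawl. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors. The SAMPLE markup is invented to mirror the structure the spider expects and is not copied from the real page; extract_rooms is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped like the listing the spider targets.
SAMPLE = """
<html><body>
<div class="box-bd"><ul>
  <li><a href="#"><img class="pic" title="room-one" data-original="https://example.com/1.jpg" /></a></li>
  <li><a href="#"><img class="pic" title="room-two" data-original="https://example.com/2.jpg" /></a></li>
  <li><a href="#"><span>no image here</span></a></li>
</ul></div>
</body></html>
"""


def extract_rooms(markup):
    """Collect nickname and image link from every <li> that contains a pic image."""
    root = ET.fromstring(markup.strip())
    rooms = []
    for li in root.findall(".//div[@class='box-bd']/ul/li"):
        img = li.find("./a/img[@class='pic']")
        if img is None:
            # Skip entries without an image instead of failing on them
            continue
        rooms.append({
            "nickname": img.get("title"),
            "imagelink": img.get("data-original"),
        })
    return rooms
```

The third list entry shows why a guard is useful: in the spider, .extract()[0] raises an IndexError when nothing matches, whereas Scrapy's .extract_first() (or .get()) returns None in that case.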
Open the file pipelines.py in the huya/huya/ directory and modify the code as follows:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import os

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings


class HuyaPipeline(ImagesPipeline):
    # Read the values set in the settings file
    IMAGES_STORE = get_project_settings().get("IMAGES_STORE")
    headers = get_project_settings().get("DEFAULT_REQUEST_HEADERS")

    def get_media_requests(self, item, info):
        image_url = item["imagelink"]
        # Send the configured request headers, mainly to get past anti-crawler checks
        yield scrapy.Request(image_url, headers=self.headers)

    def item_completed(self, result, item, info):
        # Keep the storage paths of the images that downloaded successfully
        image_path = [x["path"] for ok, x in result if ok]
        # Create the storage directory if it does not exist
        if not os.path.exists(self.IMAGES_STORE):
            os.makedirs(self.IMAGES_STORE)
        # Rename the downloaded file after the item's nickname
        os.rename(self.IMAGES_STORE + "/" + image_path[0], self.IMAGES_STORE + "/" + item["nickname"] + ".jpg")
        item["imagePath"] = self.IMAGES_STORE + "/" + item["nickname"]
        return item
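The renaming step in item_completed can be looked at in isolation. Scrapy passes the results as a list of (success, info) tuples, where info is a dict containing a "path" key for successful downloads. The sketch below is one possible standalone version of that logic; the function name rename_downloaded is hypothetical. Unlike the pipeline above, it returns None when no download succeeded, and it uses os.replace, which overwrites an existing target (on Windows, os.rename raises an error in that case).

```python
import os


def rename_downloaded(store, results, nickname):
    """Rename the first successfully downloaded file to <nickname>.jpg.

    Returns the new path, or None when no download succeeded.
    """
    # Keep only the storage paths of successful downloads
    paths = [info["path"] for ok, info in results if ok]
    if not paths:
        return None
    source = os.path.join(store, paths[0])
    target = os.path.join(store, nickname + ".jpg")
    # os.replace overwrites an existing file with the same name
    os.replace(source, target)
    return target
```

This sketch assumes the nickname is a valid file name; titles that contain characters such as / or ? would need to be cleaned first.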
Open the file settings.py in the huya/huya/ directory and add the following line:
IMAGES_STORE = "C:/Users/**Username**/Desktop/huya/Images"
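Because settings.py is ordinary Python, the path can also be computed rather than typed with a fixed user name. The following is a sketch under the assumption that the project sits in a huya folder on the current user's Desktop, and that the Desktop folder is located directly under the home directory.

```python
from pathlib import Path

# Assumption: <home>/Desktop/huya is where the project was created.
# as_posix() keeps forward slashes, matching the hand-written path above.
IMAGES_STORE = (Path.home() / "Desktop" / "huya" / "Images").as_posix()
```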
In the same settings.py file, modify the following settings:
# Do not obey robots.txt rules (the generated settings file sets this to True)
ROBOTSTXT_OBEY = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'huya.pipelines.HuyaPipeline': 300,
}
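The number assigned to each entry in ITEM_PIPELINES is a priority: items pass through the pipelines in order from the lowest value to the highest, and values are customarily chosen between 0 and 1000. The sketch below only illustrates that ordering and is not meant to be pasted into settings.py. The two extra pipeline paths and the pipeline_order helper are hypothetical.

```python
# Illustration only: HuyaPipeline is from this project, the other two are made up.
ITEM_PIPELINES = {
    "huya.pipelines.HuyaPipeline": 300,
    "huya.pipelines.ExportPipeline": 800,   # hypothetical
    "huya.pipelines.CleanupPipeline": 100,  # hypothetical
}


def pipeline_order(pipelines):
    """Return the pipeline paths in the order items would pass through them."""
    return [path for path, priority in sorted(pipelines.items(), key=lambda kv: kv[1])]
```

With a single pipeline, as in this project, the exact value does not affect the result.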
Run the spider from the cmd command window:
scrapy crawl huyaspider
To run the spider from PyCharm instead, create a start_huya.py file in the huya/ directory and add the following code:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from scrapy import cmdline
cmdline.execute("scrapy crawl huyaspider".split())
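cmdline.execute takes the same argument list that would be typed on the command line, which is why the script splits the command string. A small helper can build that list and optionally add Scrapy's -o option to export the scraped items to a file. This is a sketch: build_crawl_args is a hypothetical name and rooms.json is an example file name.

```python
def build_crawl_args(spider_name, output_file=None):
    """Build the argument list for a 'scrapy crawl' command."""
    args = ["scrapy", "crawl", spider_name]
    if output_file:
        # -o writes the scraped items to the given file
        args += ["-o", output_file]
    return args


# In start_huya.py this could be used as:
# cmdline.execute(build_crawl_args("huyaspider", "rooms.json"))
```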

