
Scrapyd mining

Publish: 2021-04-26 06:36:30
1. I have several plans:
1. Each website uses its own scrapy project
2. All websites share a single scrapy project: the items are defined in items.py, and each website gets its own spider
3. All websites share a single scrapy project, the items are defined in items.py, and all websites share one spider
I prefer the second one.
In addition, I use scrapyd and supervisor to manage and monitor the crawlers!
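The second plan can be pictured as the following project layout (the project and spider names here are made up for illustration): one shared items.py, one spider module per website:

```
stockproject/              # hypothetical project name
    scrapy.cfg
    stockproject/
        items.py           # shared Item definitions for every site
        pipelines.py
        settings.py
        spiders/
            site_a.py      # one spider per website
            site_b.py
```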
2. You can use a Linux scheduled task (cron) to execute the Python program.
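For example, a crontab entry along these lines would run the crawler once a day (the script path, schedule, and log path are made up for illustration):

```
# m h dom mon dow  command -- run the spider script daily at 03:00
0 3 * * * /usr/bin/python /home/user/run_spider.py >> /var/log/spider.log 2>&1
```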
3. 1. Upload tool:
scrapyd-client

2. Installation method:
pip install scrapyd-client

3. Upload method:
python D:\Python27\Scripts\scrapyd-deploy target -p project
Note:
target -- the deploy target name, e.g. localhost
project -- the project name, e.g. stock_uc

4. Prerequisites:
1. Assume Python is installed at D:\Python27\
2. Enter the project directory before executing the upload command
3. Optional parameter:
--version r03, that is:
python D:\Python27\Scripts\scrapyd-deploy target -p project --version r03
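For scrapyd-deploy to find the target, scrapy.cfg in the project directory needs a matching deploy section; a minimal sketch using the localhost / stock_uc names from the note above:

```
[deploy:localhost]
url = http://localhost:6800/
project = stock_uc
```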
4. For a crawler that only reads the updated content of a few websites, there is no need to implement incremental logic in the Python code. Just add a url field to the item:
item['url'] = response.url

Then set the column storing the URL to UNIQUE on the database side.
After that, catch the exception raised by the database commit in the Python code, and either ignore it or write it to a log.
It's said online that incremental crawling is supported. After reading the code and actually testing it, I still think it isn't incremental...

My approach: in the pipeline's open_spider, read the URLs of all stored items to build a parsed_urls list, then use it in the Rule's process_links to filter out already-downloaded URLs. If necessary, you can add a last_notify property to the item to extend this further.
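The unique-column approach above can be sketched with sqlite (a minimal illustration, not the poster's actual code; table and field names are made up): the INSERT fails on a duplicate URL, and the pipeline catches and swallows the error.

```python
import sqlite3

# In-memory database with a UNIQUE constraint on the url column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (url TEXT UNIQUE, title TEXT)")

def store_item(item):
    """Insert an item; return False (ignore / log) if the url was seen before."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO items (url, title) VALUES (?, ?)",
                         (item["url"], item["title"]))
        return True
    except sqlite3.IntegrityError:
        # Duplicate url: ignore it or send it to a log, as described above.
        return False

print(store_item({"url": "http://example.com/a", "title": "A"}))   # True
print(store_item({"url": "http://example.com/a", "title": "A2"}))  # False (duplicate)
```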
5. It costs about 10000 yuan, a license and a business license
6. Scrapy, whose name evokes "scrape", is the name of the well-known framework in the web-crawling world.
Using this framework, you can easily collect ordinary web pages. It also supports large architectures: combined with redis (scrapy-redis), it supports distributed crawling, and with scrapyd you can publish it as a service.
It's a must-learn in the web-crawling field!
7. I took a look at the scrapyd API, and it seems it doesn't support this kind of requirement: not only executing a task, but also immediately returning the data obtained by that task.
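To see why, a small sketch of what a call to scrapyd's schedule.json endpoint looks like (host, project, and spider names here are made up): the endpoint only returns a job id, not the scraped items, so the data has to be fetched later from wherever the pipeline stored it.

```python
from urllib.parse import urlencode

SCRAPYD = "http://localhost:6800"  # hypothetical scrapyd host

def schedule_request(project, spider):
    """Build the URL and form body for POSTing to scrapyd's schedule.json.
    The response to this POST contains a job id, not the crawled data."""
    return SCRAPYD + "/schedule.json", urlencode({"project": project,
                                                  "spider": spider})

url, body = schedule_request("stock_uc", "quotes")
print(url)   # http://localhost:6800/schedule.json
print(body)  # project=stock_uc&spider=quotes
```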
9. 1. When crawling data, we sometimes get blocked by the website and the response status code is 403. At that point we want to throw a
CloseSpider exception.
2. However, as the scrapy documentation says, scrapy's default setting filters out problematic HTTP responses (those whose status code is not between 200 and 300). Therefore 403 is ignored, which means that instead of the response of the URL request being processed, it is simply dropped. In other words, checking response.status == 403 doesn't work, because only responses with status between 200 and 300 are processed.
3. If we want to capture or process 403, or other statuses such as 404 or 500, we put 403 into handle_httpstatus_list in the spider class, as follows:
class MySpider(CrawlSpider):
    handle_httpstatus_list = [403]
Or put 403 into the HTTPERROR_ALLOWED_CODES setting,
that is, add HTTPERROR_ALLOWED_CODES = [403] in settings.py; the default value of HTTPERROR_ALLOWED_CODES is []
https://doc.scrapy.org/en/1.0/topics/spider-middleware.html#httperror-allowed-codes
4. After setting handle_httpstatus_list or HTTPERROR_ALLOWED_CODES, we can check response.status == 403 in the callback and throw the CloseSpider exception to end the crawl.
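The whole flow can be sketched as a small pure-Python simulation (scrapy itself is not imported here; the filtering function below is a stand-in for the HttpError middleware behaviour described above):

```python
# settings.py equivalent: allow 403 responses through to the callback.
HTTPERROR_ALLOWED_CODES = [403]

def reaches_callback(status):
    """Mimic scrapy's default filtering: only 2xx responses, plus any
    explicitly allowed status codes, reach the spider callback."""
    return 200 <= status < 300 or status in HTTPERROR_ALLOWED_CODES

def handle(status):
    """Mimic the spider callback: end the crawl (CloseSpider) on an allowed 403."""
    if not reaches_callback(status):
        return "dropped by HttpError middleware"
    if status == 403:
        return "raise CloseSpider('blocked by site')"
    return "parse normally"

print(handle(200))  # parse normally
print(handle(403))  # raise CloseSpider('blocked by site')
print(handle(404))  # dropped by HttpError middleware
```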