Author: Gao Dashan
Time: 2017/4/27
This project implements simple web crawlers that crawl article data from "www.cnblogs.com" and store it in JSON files. Four versions were implemented, as follows:
V1: This crawler crawls data from the main page of "www.cnblogs.com", writes it to a JSON file named 'blogs_data.json', and also prints the data to the console.
Libraries used: urllib, BeautifulSoup, re, json
Each article has a number as its key in the JSON file. Each article's data is organized in the form:
```
'key_number': {
    'view': view[i],
    'title': title[i],
    'summary': summary[i],
    'author': name[i],
    'comment': comment[i],
    'time': time[i]
}
```
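Assembling such a file from parallel lists can be sketched as follows (the field names match the structure above, but the sample values and list names are illustrative, not the project's actual data):

```python
import json

# Hypothetical parallel lists, as collected while parsing one page
titles = ['Post A', 'Post B']
views = ['12', '34']

# Build one entry per article, keyed by its number
data = {}
for i in range(len(titles)):
    data[str(i)] = {'title': titles[i], 'view': views[i]}

# Write the collected data to the JSON file (indent for readability)
with open('blogs_data.json', 'w') as f:
    json.dump(data, f, indent=4)
```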
V2: In order to access the next page from the main page, I initially chose to simulate the click behavior using a webdriver from selenium. However, rendering the web page became a bottleneck. Although a webdriver that does not display the page is available, I switched to another approach after finding that appending the tab's href attribute to the main-page URL also works.
Libraries used: BeautifulSoup, re, json, selenium
V3: In this program the urllib library comes back. It directly accesses any page by:
```python
target_url = 'http://www.cnblogs.com/sitehome/p/' + str(page_index + 1)  # page_index+1: webpage number
res = urllib.urlopen(target_url)         # Python 2 urllib API
soup = BeautifulSoup.BeautifulSoup(res)  # BeautifulSoup 3 constructor
process_web_page(soup, page_index + 1)
```
Libraries used: urllib, BeautifulSoup, re, json
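The snippet above uses the Python 2 urllib and BeautifulSoup 3 APIs. A hedged Python 3 sketch of the same page-by-URL access (the `fetch_page` helper name is mine; parsing and `process_web_page` happen downstream):

```python
from urllib.request import urlopen


def build_page_url(page_index):
    # page_index + 1 is the 1-based page number appended to the listing URL,
    # mirroring the original snippet
    return 'http://www.cnblogs.com/sitehome/p/' + str(page_index + 1)


def fetch_page(page_index):
    # Returns the raw HTML bytes; feed them to BeautifulSoup afterwards
    return urlopen(build_page_url(page_index)).read()
```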
V4: Based on V3, multithreading is added, which greatly improves performance. The "V4_data_MultiThread_crawled.json" file contains data crawled from 200 pages and is larger than 3 MB.
Libraries used: urllib, BeautifulSoup, re, json, threading, multiprocessing, Queue, time
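A sketch of how worker threads can pull page numbers from a queue (the thread count, sentinel scheme, and placeholder `crawl_page` are my assumptions, not the project's exact code; in the real crawler the worker would fetch and parse a page):

```python
import threading
from queue import Queue  # the `Queue` module in Python 2


def crawl_page(page_index, results):
    # Placeholder for the real fetch-and-parse step; here it just records the index
    results.append(page_index)


def worker(tasks, results):
    while True:
        page_index = tasks.get()
        if page_index is None:  # sentinel: no more work for this thread
            tasks.task_done()
            break
        crawl_page(page_index, results)
        tasks.task_done()


def crawl_all(num_pages, num_threads=4):
    tasks = Queue()
    results = []  # list.append is thread-safe under the GIL
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for i in range(num_pages):
        tasks.put(i)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()
    for t in threads:
        t.join()
    return results
```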
Console output demo:
| JSON | Content | Python code |
|---|---|---|
| V4_data_Multithread_crawler.json | all 200 pages of data | V4_multithread_crawler.py |
| V3_data_Multipage_crawler.json | 20 pages of data | V3_multipage_crawler.py |
| V2_data_multipage_low_performance.json | 5 pages of data | V2_multipage_crawler_low_performance.py |
| V1_data_Single_page_crawler.json | data from the main page | V1_single_page_crawler.py |
Key format: `pageNumber_postNumber`, e.g. `121_14` means page 121, 15th post (post numbers start at 0).
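The key described above can be built with a one-liner like this (a sketch; the helper name is mine):

```python
def make_key(page_number, post_number):
    # e.g. page 121, post index 14 -> '121_14' (the page's 15th post)
    return str(page_number) + '_' + str(post_number)
```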