Author: Gao Dashan
Time: 2017/4/27
This project implements simple web crawlers that crawl article data from "www.cnblogs.com" and store it in JSON files. Four versions were implemented, as follows:
V1: This crawler crawls data from the main page of "www.cnblogs.com", writes it to a JSON file named 'blogs_data.json', and also prints the data to the console.
Libraries used: urllib, BeautifulSoup, re, json
Each article has a number as its key in the JSON file. Each article's data is organized in the form:
```
'key_number': {
    'view': view[i],
    'title': title[i],
    'summary': summary[i],
    'author': name[i],
    'comment': comment[i],
    'time': time[i]
}
```
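Assembling such a file from parallel lists can be sketched as follows (the field names match the structure above, but the sample values and list names are illustrative, not the project's actual data):

```python
import json

# Hypothetical parallel lists, as collected while parsing one page
titles = ['Post A', 'Post B']
views = ['12', '34']

# Build one entry per article, keyed by its number
data = {}
for i in range(len(titles)):
    data[str(i)] = {'title': titles[i], 'view': views[i]}

# Write the collected data to the JSON file (indent for readability)
with open('blogs_data.json', 'w') as f:
    json.dump(data, f, indent=4)
```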
V2: In order to access the next page from the main page, I initially chose to simulate the click behavior using a webdriver from selenium. However, rendering the web page became a bottleneck. Although a webdriver that does not display the page is available, I switched to another approach after finding that appending the tab's href attribute to the main-page URL also works.
Libraries used: BeautifulSoup, re, json, selenium
V3: In this program the urllib library comes back. It directly accesses any page by:
```python
target_url = 'http://www.cnblogs.com/sitehome/p/' + str(page_index + 1)  # page_index+1: webpage number
res = urllib.urlopen(target_url)         # Python 2 urllib API
soup = BeautifulSoup.BeautifulSoup(res)  # BeautifulSoup 3 constructor
process_web_page(soup, page_index + 1)
```
Libraries used: urllib, BeautifulSoup, re, json
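The snippet above uses the Python 2 urllib and BeautifulSoup 3 APIs. A hedged Python 3 sketch of the same page-by-URL access (the `fetch_page` helper name is mine; parsing and `process_web_page` happen downstream):

```python
from urllib.request import urlopen


def build_page_url(page_index):
    # page_index + 1 is the 1-based page number appended to the listing URL,
    # mirroring the original snippet
    return 'http://www.cnblogs.com/sitehome/p/' + str(page_index + 1)


def fetch_page(page_index):
    # Returns the raw HTML bytes; feed them to BeautifulSoup afterwards
    return urlopen(build_page_url(page_index)).read()
```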
V4: Based on V3, multithreading is added, which greatly improves performance. The "V4_data_MultiThread_crawled.json" file contains data crawled from 200 pages and is larger than 3 MB.
Libraries used: urllib, BeautifulSoup, re, json, threading, multiprocessing, Queue, time
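A sketch of how worker threads can pull page numbers from a queue (the thread count, sentinel scheme, and placeholder `crawl_page` are my assumptions, not the project's exact code; in the real crawler the worker would fetch and parse a page):

```python
import threading
from queue import Queue  # the `Queue` module in Python 2


def crawl_page(page_index, results):
    # Placeholder for the real fetch-and-parse step; here it just records the index
    results.append(page_index)


def worker(tasks, results):
    while True:
        page_index = tasks.get()
        if page_index is None:  # sentinel: no more work for this thread
            tasks.task_done()
            break
        crawl_page(page_index, results)
        tasks.task_done()


def crawl_all(num_pages, num_threads=4):
    tasks = Queue()
    results = []  # list.append is thread-safe under the GIL
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for i in range(num_pages):
        tasks.put(i)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()
    for t in threads:
        t.join()
    return results
```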
Console output demo:
| JSON | Content | Python code |
|---|---|---|
| V4_data_Multithread_crawler.json | all 200 pages of data | V4_multithread_crawler.py |
| V3_data_Multipage_crawler.json | 20 pages of data | V3_multipage_crawler.py |
| V2_data_multipage_low_performance.json | 5 pages of data | V2_multipage_crawler_low_performance.py |
| V1_data_Single_page_crawler.json | data from the main page | V1_single_page_crawler.py |
Key format: `pageNumber_postNumber`, e.g. `121_14` means page 121, 15th post (post numbers start at 0).
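The key described above can be built with a one-liner like this (a sketch; the helper name is mine):

```python
def make_key(page_number, post_number):
    # e.g. page 121, post index 14 -> '121_14' (the page's 15th post)
    return str(page_number) + '_' + str(post_number)
```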