Wikipedia Scraper

Here is a snippet from [YACG] (Yet Another Content Generator) that scrapes Wikipedia articles. It is great for content generation and arbitrage.

Usage:

<?php wikipedia("http://en.wikipedia.org/wiki/Google"); ?>

Code:

<?php
function wikipedia($article)	{
	$pattern[0] = '/<a href="(.*?)">(.*?)<\\/a>/';
	$replace[0] = '$2';
	$pattern[1] = '/<h3 id=\"siteSub\">From Wikipedia, the free encyclopedia<\/h3>/';
	$replace[1] = '';
	$pattern[2] = '/<div id=\"contentSub\">(.*?)<\/div><div id=\"jump-to-nav\">Jump to: navigation, search<\/div>/';
	$replace[2] = '';
	$pattern[3] = '/<div class=\"messagebox cleanup metadata\">(.*?)<p><br \/><\/p>/';
	$replace[3] = '';
	$pattern[4] = '/<table class=\"messagebox\" (.*?)>(.*?)<\/table>/';
	$replace[4] = '';
	$pattern[5] = '/<dl>(.*?)<\/dl>/';
	$replace[5] = '';
	$pattern[6] = '/<h1 class=\"firstHeading"\>(.*?)<\/h1>/';
	$replace[6] = '<h3>$1</h3>';
	$pattern[7] = '/<table class=\"messagebox protected\" style=\"border: 1px solid #8888aa; padding: 0px; font-size:9pt;\">(.*?)<\/table>/';
	$replace[7] = '';
	$pattern[8] = '/<div class=\"infobox sisterproject\">(.*?)<\/div><\/div>/';
	$replace[8] = '';
	$pattern[9] = '/<sup (.*?)>(.*?)<\/sup>/';
	$replace[9] = '';
	$pattern[10] = '/<table style=\"background: transparent;\" width=\"0\">(.*?)<\/table>/';
	$replace[10] = '';
	$pattern[11] = '/<table class=\"messagebox current\" style=\"font-size:	normal;\">(.*?)<\/table>/';
	$replace[11] = '';
	$pattern[12] = '/<table class=\"toccolours\" align=\"center\" width=\"55%\" cellpadding=\"0\" cellspacing=\"0\">(.*?)<\/table>/';
	$replace[12] = '';
	$pattern[13] = '/<div class=\"editsection\"(.*?)>(.*?)<\/div>/';
	$replace[13] = '';
	$pattern[14] = '/<div id=\"bodyContent\">/';
	$replace[14] = '<div>';
	$pattern[15] = '/<dd>(.*?)<\/dd>/';
	$replace[15] = '';
	$pattern[16] = '/<div class=\"messagebox cleanup metadata\">(.*?)<\/div>/';
	$replace[16] = '';
	$pattern[17] = '/<div class=\"thumbcaption\">(.*?)<\/div><\/div>/';
	$replace[17] = '';
	$pattern[18] = '/<div class=\"thumb tright\">/';
	$replace[18] = '';
	$pattern[19] = '/\[(.*?)\]/';
	$replace[19] = '';
	$pattern[20] = '/<table class="messagebox protected" (.*?)>(.*?)<\/table>/';
	$replace[20] = '';
	$pattern[21] = '/<div style="position:absolute; z-index:100; right:20px; top:10px; height:10px; width:300px;"><\/div>/';
	$replace[21] = '';
	$pattern[22] = '/<div style="position:absolute; z-index:100; right:10px; top:10px;" class="metadata" id="administrator">(.*?)<\/div><\/div>/';
	$replace[22] = '';
	$pattern[23] = '/<table class="messagebox current"(.*?)>(.*?)<\/table>/';
	$replace[23] = '';
	$pattern[24] = '/<table class="messagebox current" style="width: auto;">(.*?)<\/table>/';
	$replace[24] = '';
	$pattern[25] = '/<div class="dablink">(.*?)<\/div>/';
	$replace[25] = '';
	$pattern[26] = '/<b>/';
	$replace[26] = '<strong>';
	$pattern[27] = '/<\/b>/';
	$replace[27] = '</strong>';
	$pattern[28] = '/<div(.*?)>/';
	$replace[28] = '';
	$pattern[29] = '/<\/div>/';
	$replace[29] = '';
	$pattern[30] = '/<map(.*?)>(.*?)<\/map>/';
	$replace[30] = '';
	$pattern[31] = '/<img src="(.*?)" alt="This page is semi-protected." width="18" (.*?)\/>/';
	$replace[31] = '';
	$pattern[32] = '/<table style="width:100%;background:none">(.*?)<\/table>/';
	$replace[32] = '';
	$pattern[33] = '/<div class="messagebox merge metadata">(.*?)<\/div>/';
	$replace[33] = '';
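	// Fetch the page as a string. fopen() only returns a stream handle,
	// which preg_replace() cannot work on. Needs allow_url_fopen enabled.
	$wikipedia = file_get_contents($article);
	if ($wikipedia === false) {
		return;
	}
	// Cut out the lead section first: the cleanup patterns strip the <div>
	// tags that the last two end markers rely on.
	if (preg_match("/<\!-- start content --\>(.*)<table id=\"toc\" class=\"toc\" summary=\"(.*)\">/is", $wikipedia, $w)) {
		$wikipedia = $w[1];
	} elseif (preg_match("/<\!-- start content --\>(.*)<a name=\"(.*)\">/is", $wikipedia, $w)) {
		$wikipedia = $w[1];
	} elseif (preg_match("/<\!-- start content --\>(.*)<div class=\"boilerplate metadata\" id=\"stub\">/is", $wikipedia, $w)) {
		$wikipedia = $w[1];
	} elseif (preg_match("/<\!-- start content --\>(.*)<div class=\"printfooter\">/is", $wikipedia, $w)) {
		$wikipedia = $w[1];
	}
	// Apply the cleanup patterns to the extracted fragment.
	$wikipedia = preg_replace($pattern, $replace, $wikipedia);
	print $wikipedia;
}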
?>
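
One likely reason the cleanup misses things: the patterns above do not use the `s` modifier, so `.` stops at a newline and any block that spans several lines is left in place. A minimal sketch of the difference, run on a made-up fragment (the `$html` string is an illustration, not real Wikipedia markup):

```php
<?php
// Hypothetical fragment: the <dl> block spans two lines.
$html = "<p>Intro</p><dl>\n<dd>See also</dd></dl><p>Body</p>";

// Without the s modifier "." does not match the newline, so nothing is removed.
$without = preg_replace('/<dl>(.*?)<\/dl>/', '', $html);

// With the s modifier "." also matches newlines and the whole block is removed.
$with = preg_replace('/<dl>(.*?)<\/dl>/s', '', $html);

var_dump($without === $html); // bool(true): the input came back unchanged
echo $with, "\n";             // <p>Intro</p><p>Body</p>
```

Adding `s` to the block-level patterns would probably catch more of the markup, at the cost of occasionally removing more than intended when a closing tag is missing.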

The regexes that strip all the clutter Wikipedia adds to its articles are pretty crude, so I’m looking for someone to help me improve them. Interested? Drop me a line!
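
One possible direction, instead of regexes, is to parse the page and remove unwanted nodes with PHP's DOM extension. The sketch below is not part of YACG: `strip_nodes()`, the `wiki-root` wrapper id, the XPath queries and the sample markup are all assumptions made for illustration, and character-encoding handling is left out.

```php
<?php
// Hypothetical helper (not from YACG): remove every node matched by the
// given XPath queries and return the remaining HTML.
function strip_nodes($html, array $queries) {
	// Collect parser warnings internally instead of printing them.
	libxml_use_internal_errors(true);
	$doc = new DOMDocument();
	// Wrap the fragment so there is one known container to serialize later.
	$doc->loadHTML('<div id="wiki-root">' . $html . '</div>');
	libxml_clear_errors();

	$xpath = new DOMXPath($doc);
	foreach ($queries as $query) {
		// Copy the matches into an array before removing anything.
		foreach (iterator_to_array($xpath->query($query)) as $node) {
			$node->parentNode->removeChild($node);
		}
	}

	// Serialize only the children of the wrapper element.
	$root = $xpath->query('//div[@id="wiki-root"]')->item(0);
	$out = '';
	foreach ($root->childNodes as $child) {
		$out .= $doc->saveHTML($child);
	}
	return $out;
}

// Made-up sample: a disambiguation note and a footnote marker to drop.
$sample = '<div class="dablink">For other uses, see Google (disambiguation).</div>'
	. '<p>Google is a search company.<sup class="reference">[1]</sup></p>';
$clean = strip_nodes($sample, array('//div[@class="dablink"]', '//sup'));
echo $clean, "\n";
```

Each regex in the scraper would map to one XPath query here, and nested or multi-line markup is no longer a problem because the parser has already resolved the structure.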
