Wikipedia Scraper

Here is a snippet from [YACG] (Yet Another Content Generator) that scrapes Wikipedia articles. It is great for content generation and arbitrage.
Usage:

<?php wikipedia("http://en.wikipedia.org/wiki/Google"); ?>

Code:

<?php
function wikipedia($article)	{
	$pattern[0] = '/<a href="(.*?)">(.*?)<\\/a>/';
	$replace[0] = '$2';
	$pattern[1] = '/<h3 id=\"siteSub\">From Wikipedia, the free encyclopedia<\/h3>/';
	$replace[1] = '';
	$pattern[2] = '/<div id=\"contentSub\">(.*?)<\/div><div id=\"jump-to-nav\">Jump to: navigation, search<\/div>/';
	$replace[2] = '';
	$pattern[3] = '/<div class=\"messagebox cleanup metadata\">(.*?)<p><br \/><\/p>/';
	$replace[3] = '';
	$pattern[4] = '/<table class=\"messagebox\" (.*?)>(.*?)<\/table>/';
	$replace[4] = '';
	$pattern[5] = '/<dl>(.*?)<\/dl>/';
	$replace[5] = '';
	$pattern[6] = '/<h1 class=\"firstHeading"\>(.*?)<\/h1>/';
	$replace[6] = '<h3>$1</h3>';
	$pattern[7] = '/<table class=\"messagebox protected\" style=\"border: 1px solid #8888aa; padding: 0px; font-size:9pt;\">(.*?)<\/table>/';
	$replace[7] = '';
	$pattern[8] = '/<div class=\"infobox sisterproject\">(.*?)<\/div><\/div>/';
	$replace[8] = '';
	$pattern[9] = '/<sup (.*?)>(.*?)<\/sup>/';
	$replace[9] = '';
	$pattern[10] = '/<table style=\"background: transparent;\" width=\"0\">(.*?)<\/table>/';
	$replace[10] = '';
	$pattern[11] = '/<table class=\"messagebox current\" style=\"font-size:	normal;\">(.*?)<\/table>/';
	$replace[11] = '';
	$pattern[12] = '/<table class=\"toccolours\" align=\"center\" width=\"55%\" cellpadding=\"0\" cellspacing=\"0\">(.*?)<\/table>/';
	$replace[12] = '';
	$pattern[13] = '/<div class=\"editsection\"(.*?)>(.*?)<\/div>/';
	$replace[13] = '';
	$pattern[14] = '/<div id=\"bodyContent\">/';
	$replace[14] = '<div>';
	$pattern[15] = '/<dd>(.*?)<\/dd>/';
	$replace[15] = '';
	$pattern[16] = '/<div class=\"messagebox cleanup metadata\">(.*?)<\/div>/';
	$replace[16] = '';
	$pattern[17] = '/<div class=\"thumbcaption\">(.*?)<\/div><\/div>/';
	$replace[17] = '';
	$pattern[18] = '/<div class=\"thumb tright\">/';
	$replace[18] = '';
	$pattern[19] = '/\[(.*?)\]/';
	$replace[19] = '';
	$pattern[20] = '/<table class="messagebox protected" (.*?)>(.*?)<\/table>/';
	$replace[20] = '';
	$pattern[21] = '/<div style="position:absolute; z-index:100; right:20px; top:10px; height:10px; width:300px;"><\/div>/';
	$replace[21] = '';
	$pattern[22] = '/<div style="position:absolute; z-index:100; right:10px; top:10px;" class="metadata" id="administrator">(.*?)<\/div><\/div>/';
	$replace[22] = '';
	$pattern[23] = '/<table class="messagebox current"(.*?)>(.*?)<\/table>/';
	$replace[23] = '';
	$pattern[24] = '/<table class="messagebox current" style="width: auto;">(.*?)<\/table>/';
	$replace[24] = '';
	$pattern[25] = '/<div class="dablink">(.*?)<\/div>/';
	$replace[25] = '';
	$pattern[26] = '/<b>/';
	$replace[26] = '<strong>';
	$pattern[27] = '/<\/b>/';
	$replace[27] = '</strong>';
	$pattern[28] = '/<div(.*?)>/';
	$replace[28] = '';
	$pattern[29] = '/<\/div>/';
	$replace[29] = '';
	$pattern[30] = '/<map(.*?)>(.*?)<\/map>/';
	$replace[30] = '';
	$pattern[31] = '/<img src="(.*?)" alt="This page is semi-protected." width="18" (.*?)\/>/';
	$replace[31] = '';
	$pattern[32] = '/<table style="width:100%;background:none">(.*?)<\/table>/';
	$replace[32] = '';
	$pattern[33] = '/<div class="messagebox merge metadata">(.*?)<\/div>/';
	$replace[33] = '';
	// fopen() returns a stream handle, not the page text; file_get_contents() returns the body.
	$wikipedia = file_get_contents($article);
	if ($wikipedia === false) {
		return; // the request failed, so there is nothing to clean
	}
	$wikipedia = preg_replace($pattern, $replace, $wikipedia);
	if (preg_match("/<\!-- start content --\>(.*)<table id=\"toc\" class=\"toc\" summary=\"(.*)\">/", $wikipedia, $w)) {
		$wikipedia = $w[1];
	} elseif (preg_match("/<\!-- start content --\>(.*)<a name=\"(.*)\">/is", $wikipedia, $w)) {
		$wikipedia = $w[1];
	} elseif (preg_match("/<\!-- start content --\>(.*)<div class=\"boilerplate metadata\" id=\"stub\">/is", $wikipedia, $w)) {
		$wikipedia = $w[1];
	} elseif (preg_match("/<\!-- start content --\>(.*)<div class=\"printfooter\">/is", $wikipedia, $w)) {
		$wikipedia = $w[1];
	}
	print $wikipedia;
}
?>
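
The function has to fetch the page over HTTP before it can clean anything, and a failed request (an error page, an empty body) should not end up as "content". Here is a rough sketch of a fetch helper, assuming the cURL extension is available. The names fetch_article and is_usable_response, the 10 second timeout, and the user agent string are all made up for illustration; none of this comes from YACG.

```php
<?php
// Hypothetical helper: decides whether a response is worth cleaning.
// Only a 200 status with a non-empty string body counts as usable.
function is_usable_response(int $status, $body): bool {
    return $status === 200 && is_string($body) && $body !== '';
}

// Hypothetical helper: returns the page body, or null on any failure.
function fetch_article(string $url, int $timeout = 10): ?string {
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);    // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);    // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // seconds allowed for connecting
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);        // seconds allowed for the whole request
    curl_setopt($ch, CURLOPT_USERAGENT, 'ExampleScraper/0.1'); // placeholder value
    $body = curl_exec($ch);                              // false on transport errors
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return is_usable_response($status, $body) ? $body : null;
}
```

With something like this in place, the cleaning code would only run when fetch_article() returns a string, and the caller can decide what to do with null.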

The regex to remove all the trash that wikipedia adds to the articles sucks, so I’m looking for someone to help me with it. Interested? Drop me a line!
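
One direction I have been thinking about, instead of growing the regex list: parse the page with PHP's DOM extension and pick out the parts you want. The sketch below is only that, a sketch. The function name extract_paragraphs is made up, and the element choices (dropping sup, table and anything with an "editsection" class, then reading p elements under the "bodyContent" div) are assumptions taken from the markup the patterns above target.

```php
<?php
// Hypothetical helper: returns the plain text of each paragraph inside the
// container div, after removing footnote markers, tables and edit links.
function extract_paragraphs(string $html, string $containerId = 'bodyContent'): array {
    if ($html === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Copy the matches into an array first, then detach each unwanted node.
    $unwanted = $xpath->query('//sup | //table | //*[contains(@class, "editsection")]');
    foreach (iterator_to_array($unwanted) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Assumes $containerId is a simple id without quote characters.
    $paragraphs = [];
    foreach ($xpath->query('//div[@id="' . $containerId . '"]//p') as $p) {
        $text = trim(preg_replace('/\s+/', ' ', $p->textContent)); // collapse whitespace
        if ($text !== '') {
            $paragraphs[] = $text;
        }
    }
    return $paragraphs;
}
```

This returns an array of strings rather than printing HTML, so links and bold markup are dropped along with the clutter. Whether that trade is acceptable depends on what you feed the output into.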

13 Responses to “Wikipedia Scraper”


  1. 1 Sinned

    I don’t know much about BH stuff but I’m willing to learn. My question is: do you save the code as wikipedia.php? Then you use this

    to call it whenever you need some article or content from wikipedia?

  2. 2 busin3ss

    Yes, you save it as wikipedia.php, or any other filename and whenever you need a clean wikipedia article, you just call the function.

  3. 3 Sinned

    sweeet!

  4. 4 Sinned

    One more question… where do you input the “keyword” or “topic” you want the script to scrape for?

  5. 5 vbignacio

    i tried it and this is what came up:

    Resource id #1

  6. 6 killertux

    Interesting, but their demo looks a little rough. I have created one myself at http://www.killertux.com/node/44 . It features caching, the option to not display images, and a text-only version with no links. It is available with source code on my site.

  7. 7 cashflowrusty

    One note: I am using PHP5 and I get the same “Resource id #1” error with the script as it currently is. I changed the line:

    $wikipedia = fopen($article, "r");

    to:

    $wikipedia = file_get_contents($article);

    It works, but if you do an inexact term search you could get a 403 error page as ‘content’ so there is still an issue doing it this way…

  8. 8 masterofpuppets

    code giving that error.

    Parse error: syntax error, unexpected ‘}’ in wikipedia.php on line 85

  9. 9 masterofpuppets

    any standalone flickr scraper hook available, like youtube example here. thanks for contributing.

  10. 10 Dan

    This doesn’t really work….

  11. 11 handsomemans

    Thank you so much

  12. 12 Paulo

    Some web hosts (such as dreamhost) disable fopen and file_get_contents on their servers. To get around that, use curl instead:

    REPLACE:
    $wikipedia = fopen($article, "r");

    WITH:
    $ch = curl_init();
    $timeout = 0;
    curl_setopt ($ch, CURLOPT_URL, $article);
    curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $wikipedia = curl_exec($ch);
    curl_close($ch);

  13. 13 Tom

    Nice function. I played with it for a while to get what I needed out of it. My version still isn’t perfect, but it strips out a lot of stuff that I didn’t want.

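    function wikipedia($article) {
    $article=explode(' ',$article);
    $article=implode('_',$article);
    $article="http://en.wikipedia.org/wiki/".urlencode($article);
    $pattern[0] = '/<a href="(.*?)">(.*?)<\\/a>/';
    $replace[0] = '$2';
    $pattern[1] = '/<h3 id=\"siteSub\">From Wikipedia, the free encyclopedia<\/h3>/';
    $replace[1] = '';
    $pattern[2] = '/<div id=\"contentSub\">(.*?)<\/div><div id=\"jump-to-nav\">Jump to: navigation, search<\/div>/';
    $replace[2] = '';
    $pattern[3] = '/<div class=\"messagebox cleanup metadata\">(.*?)<p><br \/><\/p>/';
    $replace[3] = '';
    $pattern[4] = '/<table class=\"messagebox\" (.*?)>(.*?)<\/table>/';
    $replace[4] = '';
    $pattern[5] = '/<dl>(.*?)<\/dl>/';
    $replace[5] = '';
    $pattern[6] = '/<h1 class=\"firstHeading"\>(.*?)<\/h1>/';
    $replace[6] = '<h3>$1</h3>';
    $pattern[7] = '/<table class=\"messagebox protected\" style=\"border: 1px solid #8888aa; padding: 0px; font-size:9pt;\">(.*?)<\/table>/';
    $replace[7] = '';
    $pattern[8] = '/<div class=\"infobox sisterproject\">(.*?)<\/div><\/div>/';
    $replace[8] = '';
    $pattern[9] = '/<sup (.*?)>(.*?)<\/sup>/';
    $replace[9] = '';
    $pattern[10] = '/<table style=\"background: transparent;\" width=\"0\">(.*?)<\/table>/';
    $replace[10] = '';
    $pattern[11] = '/<table class=\"messagebox current\" style=\"font-size:	normal;\">(.*?)<\/table>/';
    $replace[11] = '';
    $pattern[12] = '/<table class=\"toccolours\" align=\"center\" width=\"55%\" cellpadding=\"0\" cellspacing=\"0\">(.*?)<\/table>/';
    $replace[12] = '';
    $pattern[13] = '/<div class=\"editsection\"(.*?)>(.*?)<\/div>/';
    $replace[13] = '';
    $pattern[14] = '/<div id=\"bodyContent\">/';
    $replace[14] = '<div>';
    $pattern[15] = '/<dd>(.*?)<\/dd>/';
    $replace[15] = '';
    $pattern[16] = '/<div class=\"messagebox cleanup metadata\">(.*?)<\/div>/';
    $replace[16] = '';
    $pattern[17] = '/<div class=\"thumbcaption\">(.*?)<\/div><\/div>/';
    $replace[17] = '';
    $pattern[18] = '/<div class=\"thumb tright\">/';
    $replace[18] = '';
    $pattern[19] = '/\[(.*?)\]/';
    $replace[19] = '';
    $pattern[20] = '/<table class="messagebox protected" (.*?)>(.*?)<\/table>/';
    $replace[20] = '';
    $pattern[21] = '/<div style="position:absolute; z-index:100; right:20px; top:10px; height:10px; width:300px;"><\/div>/';
    $replace[21] = '';
    $pattern[22] = '/<div style="position:absolute; z-index:100; right:10px; top:10px;" class="metadata" id="administrator">(.*?)<\/div><\/div>/';
    $replace[22] = '';
    $pattern[23] = '/<table class="messagebox current"(.*?)>(.*?)<\/table>/';
    $replace[23] = '';
    $pattern[24] = '/<table class="messagebox current" style="width: auto;">(.*?)<\/table>/';
    $replace[24] = '';
    $pattern[25] = '/<div class="dablink">(.*?)<\/div>/';
    $replace[25] = '';
    $pattern[26] = '/<b>/';
    $replace[26] = '<strong>';
    $pattern[27] = '/<\/b>/';
    $replace[27] = '</strong>';
    $pattern[28] = '/<div(.*?)>/';
    $replace[28] = '';
    $pattern[29] = '/<\/div>/';
    $replace[29] = '';
    $pattern[30] = '/<map(.*?)>(.*?)<\/map>/';
    $replace[30] = '';
    $pattern[31] = '/<img src="(.*?)" alt="This page is semi-protected." width="18" (.*?)\/>/';
    $replace[31] = '';
    $pattern[32] = '/<table style="width:100%;background:none">(.*?)<\/table>/';
    $replace[32] = '';
    $pattern[33] = '/<div class="messagebox merge metadata">(.*?)<\/div>/';
    $replace[33] = '';
    $wikipedia = file_get_contents($article);
    $wikipedia = preg_replace($pattern, $replace, $wikipedia);
    if (preg_match("/<\!-- start content --\>(.*)<table id=\"toc\" class=\"toc\" summary=\"(.*)\">/", $wikipedia, $w)) {
    $wikipedia = $w[1];
    } elseif (preg_match("/<\!-- start content --\>(.*)<a name=\"(.*)\">/is", $wikipedia, $w)) {
    $wikipedia = $w[1];
    } elseif (preg_match("/<\!-- start content --\>(.*)<div class=\"boilerplate metadata\" id=\"stub\">/is", $wikipedia, $w)) {
    $wikipedia = $w[1];
    } elseif (preg_match("/<\!-- start content --\>(.*)<div class=\"printfooter\">/is", $wikipedia, $w)) {
    $wikipedia = $w[1];
    }
    $wikipedia=explode('name="References"',$wikipedia);
    $wikipedia=$wikipedia[0];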
    $wikipedia=preg_replace(’@]*?>.*?@si’,”,$wikipedia);
    $me=$wikipedia;
    $tags = 'applet|bgsound|blink|body|button|form|frame|frameset|head|table|tr|td|div|img|h2|h1|ul|a|h3|h4|span|ol|li|dl|strong|i|b|';
    $tags .= 'html|iframe|ilayer|input|keygen|label|layer|link|object|optgroup|option|marquee|';
    $tags .= 'meta|noframes|nolayer|noscript|param|select';
    $attribs = 'onclick|ondblclick|onmousedown|onmouseup|onmouseover|ondragdrop|';
    $attribs .= 'onmousemove|onmouseout|onkeypress|onkeydown|onkeyup|onabort';
    $regex = array(’@]*?>.*?@si’, ‘@]*?>.*?@si’, “@]*>@i”);
    $me = preg_replace($regex, ”, $me);
    $regex = “@(’\”\s]*)?(.*?/?>)@i”;
    $me = preg_replace($regex, ”, $me);
    $regex = “@(’\”\s]*)?(.*?/?>)@i”;
    while ( preg_match($regex, $me) )
    {
    $me = preg_replace($regex, ‘$1$2′, $me);
    }
    $regex = ‘@(\’”\s]*javascript:[^>\'"\s]*|\’[^\']*javascript:[^\']*\’|”[^"]*javascript:[^"]*”))(.*?/?>)@i’;
    while ( preg_match($regex, $me) )
    {
    $me = preg_replace($regex, ‘$1$2′, $me);
    }
    return $me;
    }

    Call it as shown below. If you connect this to a db, make sure to add a sleep(1); into your loop so Wikipedia doesn’t block you.

    $myvar=wikipedia("My Term");
    print $myvar;
