Here is a snippet from [YACG] Yet Another Content Generator that scrapes Wikipedia articles. It is great for content generation and arbitrage.
Usage:
<?php wikipedia("http://en.wikipedia.org/wiki/Google"); ?>
Code:
<?php
// Fetches a Wikipedia article by URL, strips site chrome with a list of
// regex replacements, and prints what is left.
function wikipedia($article) {
    // $pattern[n] is replaced with $replace[n]; an empty replacement deletes the match.
    $pattern[0] = '/<a href="(.*?)">(.*?)<\\/a>/';
    $replace[0] = '$2';
    $pattern[1] = '/<h3 id=\"siteSub\">From Wikipedia, the free encyclopedia<\/h3>/';
    $replace[1] = '';
    $pattern[2] = '/<div id=\"contentSub\">(.*?)<\/div><div id=\"jump-to-nav\">Jump to: navigation, search<\/div>/';
    $replace[2] = '';
    $pattern[3] = '/<div class=\"messagebox cleanup metadata\">(.*?)<p><br \/><\/p>/';
    $replace[3] = '';
    $pattern[4] = '/<table class=\"messagebox\" (.*?)>(.*?)<\/table>/';
    $replace[4] = '';
    $pattern[5] = '/<dl>(.*?)<\/dl>/';
    $replace[5] = '';
    $pattern[6] = '/<h1 class=\"firstHeading"\>(.*?)<\/h1>/';
    $replace[6] = '<h3>$1</h3>';
    $pattern[7] = '/<table class=\"messagebox protected\" style=\"border: 1px solid #8888aa; padding: 0px; font-size:9pt;\">(.*?)<\/table>/';
    $replace[7] = '';
    $pattern[8] = '/<div class=\"infobox sisterproject\">(.*?)<\/div><\/div>/';
    $replace[8] = '';
    $pattern[9] = '/<sup (.*?)>(.*?)<\/sup>/';
    $replace[9] = '';
    $pattern[10] = '/<table style=\"background: transparent;\" width=\"0\">(.*?)<\/table>/';
    $replace[10] = '';
    $pattern[11] = '/<table class=\"messagebox current\" style=\"font-size: normal;\">(.*?)<\/table>/';
    $replace[11] = '';
    $pattern[12] = '/<table class=\"toccolours\" align=\"center\" width=\"55%\" cellpadding=\"0\" cellspacing=\"0\">(.*?)<\/table>/';
    $replace[12] = '';
    $pattern[13] = '/<div class=\"editsection\"(.*?)>(.*?)<\/div>/';
    $replace[13] = '';
    $pattern[14] = '/<div id=\"bodyContent\">/';
    $replace[14] = '<div>';
    $pattern[15] = '/<dd>(.*?)<\/dd>/';
    $replace[15] = '';
    $pattern[16] = '/<div class=\"messagebox cleanup metadata\">(.*?)<\/div>/';
    $replace[16] = '';
    $pattern[17] = '/<div class=\"thumbcaption\">(.*?)<\/div><\/div>/';
    $replace[17] = '';
    $pattern[18] = '/<div class=\"thumb tright\">/';
    $replace[18] = '';
    $pattern[19] = '/\[(.*?)\]/';
    $replace[19] = '';
    $pattern[20] = '/<table class="messagebox protected" (.*?)>(.*?)<\/table>/';
    $replace[20] = '';
    $pattern[21] = '/<div style="position:absolute; z-index:100; right:20px; top:10px; height:10px; width:300px;"><\/div>/';
    $replace[21] = '';
    $pattern[22] = '/<div style="position:absolute; z-index:100; right:10px; top:10px;" class="metadata" id="administrator">(.*?)<\/div><\/div>/';
    $replace[22] = '';
    $pattern[23] = '/<table class="messagebox current"(.*?)>(.*?)<\/table>/';
    $replace[23] = '';
    $pattern[24] = '/<table class="messagebox current" style="width: auto;">(.*?)<\/table>/';
    $replace[24] = '';
    $pattern[25] = '/<div class="dablink">(.*?)<\/div>/';
    $replace[25] = '';
    $pattern[26] = '/<b>/';
    $replace[26] = '<strong>';
    $pattern[27] = '/<\/b>/';
    $replace[27] = '</strong>';
    $pattern[28] = '/<div(.*?)>/';
    $replace[28] = '';
    $pattern[29] = '/<\/div>/';
    $replace[29] = '';
    $pattern[30] = '/<map(.*?)>(.*?)<\/map>/';
    $replace[30] = '';
    $pattern[31] = '/<img src="(.*?)" alt="This page is semi-protected." width="18" (.*?)\/>/';
    $replace[31] = '';
    $pattern[32] = '/<table style="width:100%;background:none">(.*?)<\/table>/';
    $replace[32] = '';
    $pattern[33] = '/<div class="messagebox merge metadata">(.*?)<\/div>/';
    $replace[33] = '';

    // Fetch the page as a string. (fopen() would only return a stream handle,
    // which prints as "Resource id #1", not the page text.)
    $wikipedia = file_get_contents($article);
    if ($wikipedia === false) {
        return; // download failed, nothing to print
    }
    $wikipedia = preg_replace($pattern, $replace, $wikipedia);

    // If a known landmark follows the start-of-content marker, keep only the
    // text between the two; otherwise the whole cleaned page is printed.
    if (preg_match("/<\!-- start content --\>(.*)<table id=\"toc\" class=\"toc\" summary=\"(.*)\">/", $wikipedia, $w)) {
        $wikipedia = $w[1];
    } elseif (preg_match("/<\!-- start content --\>(.*)<a name=\"(.*)\">/is", $wikipedia, $w)) {
        $wikipedia = $w[1];
    } elseif (preg_match("/<\!-- start content --\>(.*)<div class=\"boilerplate metadata\" id=\"stub\">/is", $wikipedia, $w)) {
        $wikipedia = $w[1];
    } elseif (preg_match("/<\!-- start content --\>(.*)<div class=\"printfooter\">/is", $wikipedia, $w)) {
        $wikipedia = $w[1];
    }
    print $wikipedia;
}
?>
The regex list that removes all the trash Wikipedia adds to the articles sucks, so I’m looking for someone to help me with it. Interested? Drop me a line!
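One possible direction for replacing the regex list, sketched here with PHP's DOM extension rather than taken from YACG: parse the page and remove unwanted elements by XPath query. The helper name `strip_elements()` and the two queries in the usage note are assumptions for illustration, and the sketch assumes the `dom` extension is loaded.

```php
<?php
// Hypothetical helper, not part of YACG: remove every element matched by the
// given XPath queries and return the remaining markup.
function strip_elements($html, array $xpathQueries) {
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    // Declare UTF-8 for the parser and wrap the input in one known container.
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '<div id="strip-wrap">' . $html . '</div>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpathQueries as $query) {
        // Copy the matches to an array first so that removing nodes
        // does not disturb the list being iterated.
        foreach (iterator_to_array($xpath->query($query)) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Serialize only the children of the wrapper, not the wrapper itself.
    $out = '';
    $wrap = $xpath->query('//div[@id="strip-wrap"]')->item(0);
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

For example, `strip_elements($page, array('//div[@class="dablink"]', '//sup'))` would drop disambiguation notes and footnote markers in one pass. Unlike a non-greedy `(.*?)` pattern, a DOM query removes a whole element even when it spans several lines or contains nested tags.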
I don’t know much about BH stuff, but I’m willing to learn. My question is: do you save the code as wikipedia.php, and then use the usage line to call it whenever you need some article or content from Wikipedia?
Yes, you save it as wikipedia.php (or any other filename), and whenever you need a clean Wikipedia article, you just call the function.
sweeet!
One more question… where do you input the “keyword” or “topic” you want the script to scrape for?
I tried it and this is what came up:
Resource id #1
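On the keyword question: the function takes a full article URL rather than a keyword, so the topic has to be turned into a URL first. A minimal sketch, assuming the topic is already the exact article title; the helper name `wikipedia_url()` is made up for illustration and is not part of the snippet.

```php
<?php
// Hypothetical helper: build an article URL from a topic string.
function wikipedia_url($topic, $lang = 'en') {
    // Wikipedia article titles use underscores in place of spaces.
    $title = str_replace(' ', '_', trim($topic));
    // Percent-encode whatever else is not safe in a URL path.
    return "http://{$lang}.wikipedia.org/wiki/" . rawurlencode($title);
}

// Usage, assuming wikipedia() from the post is already loaded:
// wikipedia(wikipedia_url('Search engine'));
```

This does no searching: a topic that is not an exact article title still ends up on an error or redirect page.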
Interesting, but their demo looks a little rough. I have created one myself at http://www.killertux.com/node/44. It features caching, the option to not display images, and a text-only version with no links. It is available with source code on my site.
One note: I am using PHP5 and I get the same "Resource id #1" error with the script as it currently is. I changed the line:
$wikipedia = fopen($article, "r");
to:
$wikipedia = file_get_contents($article);
It works, but if you do an inexact term search you could get a 403 error page as ‘content’ so there is still an issue doing it this way…
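A rough sketch of one way to avoid treating an error page as content: read the HTTP status after `file_get_contents()` and reject anything other than 200. The function names, the 10-second timeout and the User-Agent string are placeholders chosen for this example. Wikimedia asks clients to send an identifying User-Agent, so a missing one may be one cause of 403 responses.

```php
<?php
// Hypothetical helper: return the last HTTP status code found in a list of
// response header lines, or null if there is none (e.g. for local files).
function status_from_headers(array $headers) {
    // After redirects there are several status lines; the last one is final.
    foreach (array_reverse($headers) as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            return (int) $m[1];
        }
    }
    return null;
}

// Hypothetical helper: fetch a page, returning false unless the status is 200.
function fetch_article($url, $userAgent = 'ExampleScraper/0.1 (placeholder)') {
    $context = stream_context_create(array('http' => array(
        'user_agent'    => $userAgent,
        'timeout'       => 10,
        'ignore_errors' => true, // still read the body on 4xx/5xx so the status can be checked
    )));
    $body = @file_get_contents($url, false, $context);
    if ($body === false) {
        return false;
    }
    // PHP fills $http_response_header in the local scope after an HTTP request.
    $headers = isset($http_response_header) ? $http_response_header : array();
    $status = status_from_headers($headers);
    return ($status === null || $status === 200) ? $body : false;
}
```

Inside `wikipedia()`, the fetch line could then become `$wikipedia = fetch_article($article);` followed by an early return when the result is `false`.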
The code is giving this error:
Parse error: syntax error, unexpected ‘}’ in wikipedia.php on line 85
Is there a standalone Flickr scraper hook available, like the YouTube example here? Thanks for contributing.
This doesn’t really work….
Some web hosts (such as DreamHost) disable remote URL access for fopen and file_get_contents (the allow_url_fopen setting) on their servers. To get around that, use cURL instead:
REPLACE (the line that fetches the page):
$wikipedia = fopen($article, "r");
WITH:
$ch = curl_init();
$timeout = 0;
curl_setopt ($ch, CURLOPT_URL, $article);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$wikipedia = curl_exec($ch);
curl_close($ch);
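The same cURL calls can also be wrapped in a function that reports failure, sketched below under a few assumptions: the function name, the timeout values and the User-Agent string are placeholders, and any status other than 200 is treated as an error. Note that a connect timeout of 0 tells cURL to use its default wait, so this version sets explicit limits.

```php
<?php
// Hypothetical helper: fetch a URL with cURL, returning the body as a string
// or false on any transport error or non-200 response.
function fetch_with_curl($url, $timeoutSeconds = 10) {
    if (!function_exists('curl_init')) {
        return false; // cURL extension is not loaded on this host
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeoutSeconds);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds * 3);
    curl_setopt($ch, CURLOPT_USERAGENT, 'ExampleScraper/0.1 (placeholder)');

    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // The handle is released when $ch goes out of scope.

    if ($body === false || $status !== 200) {
        return false;
    }
    return $body;
}
```

With this in place, the fetch line in `wikipedia()` would read `$wikipedia = fetch_with_curl($article);`, and a `false` result can be handled before the regex step.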