Using php to download a page and grab some content
#1
Posted 12 November 2008 - 02:30 AM
Can anyone suggest how I can do this?
Given a website address entered in a form (say), I want my php to fetch that webpage and output some information from it to the screen.
I have a mad idea to try out but in essence, I have to be able to get my php to go to a webpage and pick out the links. I am not intending to spam or create crappy auto generated content.
There are sites that I can input a website address and it displays all the links from that page. I want to do that but am going to do something different (hence the mad idea) with the data.
Could anyone give me a suggestion of how I could achieve this using php?
Any advice welcomed,
Paul.
#2
Posted 12 November 2008 - 04:13 AM
The question is, how much effort are you willing to put into this versus how much you expect to be handed to you...??? We're willing to help you learn how to achieve your goal if you're willing to attempt to learn... Otherwise, you might as well pay someone to write the code... I know I can write the code, but I'd rather help you learn how to do it yourself...
EDIT: Sorry, Paul... Didn't realize it was you I was replying to... Didn't mean to come across gruff in my reply... I'd just come from another site and had gotten all worked up over people not doing their part researching before asking for help (as in "do it for me")...
#3
Posted 16 November 2008 - 03:34 PM
I too am amazed at the forums where people ask questions like 'how do I do x' when in the documentation the first point is 'How to do x' fully explained and documented.
Thank you though, now I have two expressions to google for, SGREP and Screen Scrape. I have searched for sgrep and I can see I have my work cut out for me.
sgrep [-aCcDdhilNnPqSsTtV] [-O filename] [-o "format"] [ -p preprocessor] -f filename [-e expression] [filename ...]
OMG! I am sure that when I get to grips with it all this stuff will become more intelligible to a novice like me!
Am still totally in the dark. Will post my progress if anyone else is interested. I am quite sure I can search a text string for the relevant patterns; the question that remains is how to get PHP to request and receive a webpage, and turn the resulting stream into a text file, so I can search it. From preliminary reading, it seems I might be able to search the stream directly without converting to a text file first.
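A minimal sketch of the "search it directly" idea, assuming the server has allow_url_fopen enabled (many shared hosts do not, in which case cURL is the usual fallback). The helper name count_pattern_in_source() is made up for illustration. It reads the page into an ordinary string, so no intermediate text file is needed:

```php
<?php
// count_pattern_in_source() is a hypothetical helper name, used only for this sketch.
// It reads a source (a local file path, or an http:// URL when allow_url_fopen is on)
// into a string and counts how many times a regular expression matches in it.
function count_pattern_in_source($source, $pattern)
{
    // file_get_contents() returns the whole body as one string, or false on failure.
    $html = @file_get_contents($source);
    if ($html === false) {
        return false; // the source could not be read
    }
    // preg_match_all() returns the number of full pattern matches.
    return preg_match_all($pattern, $html, $matches);
}
```

Calling count_pattern_in_source('http://www.example.com', '/<a\s/i') would then give a rough count of the anchor tags on that page.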
I only do this in my spare time now due to the credit crunch. My customers have all dried up and I am now temping for Disney! Yuch!
Hope everyone else is faring better than I did!
Paul.
PS Have just realised that this is different from regexp. Just ignore all my comments as I still do not know what I am talking about. Will spend a couple of evenings reading about this I think. My so called 'mad idea' is really cool (growing on me all the time) and as soon as I get anything working will post for your comments! Like most ideas I have though, I can't really see a use for it, other than being 'quite interesting'.
Edited by PaulD, 16 November 2008 - 03:37 PM.
#4
Posted 16 November 2008 - 08:43 PM
I've been playing with regexp for data sanitation and if I stumble across the correct code I'll be sure to let you know... I have to re-learn regexp every time I need to use it because it's got a lot of complexities that you forget if you don't use regexp continuously...
#5
Posted 17 November 2008 - 12:12 AM
Firstly I had to get the server to download a page. I had great trouble with cURL and couldn't get a connection anywhere. I also tried fopen with a url, hoping it would deal with all the parameters automatically.
Finally got something working with help from this website which was very useful.
You can see my results here but this is only a test page using a blank template for speed and clarity (won't be available forever).
Here is the code I am using in a snippet, just have to call it from a page (this is exactly as it is in my test page, perhaps someone will find it useful). When you input a webpage the snippet calls the webpage and outputs it. This was only to test I could get the info and that cURL was working properly on my server.
$url = (isset($_POST['scrape'])) ? $_POST['scrape'] : 'http://www.example.com';
$fetch = (isset($_POST['fetchresults'])) ? 1 : 0;
// escape the url before echoing it back into the page, so odd characters cannot break the markup
$safeurl = htmlspecialchars($url, ENT_QUOTES);
// start these off empty so the output section at the bottom always has something to append
$messages = '';
$output2 = '';
// INPUT BOX
// sets up the input box for telling us the url. If the url is set it will be shown in here automatically
$inputbox='
<form name="input" action="index.php?id=322" method="post">
Webpage :
<input type="text" name="scrape" value="'.$safeurl.'" size="40">
<input type="hidden" name="fetchresults" value="yes">
<input type="submit" value="Submit">
</form> <small>eg http://www.example.com </small>
';
// Results output
if ($fetch==0) {
    // NO RESULTS YET (first visit, so no data is fetched)
    $results="<h3>Results</h3><p>Please enter a URL above and click the submit button.</p>";
}
else {
    // RESULTS FROM URL
    $results="<br /><h3>Results for ".$safeurl."</h3>";
    $results .= '<a href="'.$safeurl.'">'.$safeurl.'</a><br />';
    // DOWNLOADING PAGE BIT
    // is curl installed?
    if (!function_exists('curl_init')){
        return "<h1>cURL is not installed</h1>";
    }
    else {
        $messages .= "<p>inside curl branch so it is installed</p>";
        // create a new curl resource
        $ch = curl_init();
        // set URL to download
        curl_setopt($ch, CURLOPT_URL, $url);
        // set referer:
        curl_setopt($ch, CURLOPT_REFERER, "http://www.independentwebadvice.co.uk/");
        // user agent:
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101");
        // include the response headers in the output? 0 = no, 1 = yes
        curl_setopt($ch, CURLOPT_HEADER, 0);
        // should curl return or print the data? true = return, false = print
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        // timeout in seconds
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $messages .= "<p>Finished setting all the options</p>";
        // download the given URL, and return output
        $output2 = curl_exec($ch);
        // curl_exec returns false on failure, so report the reason rather than outputting nothing
        if ($output2 === false) {
            $messages .= "<p>cURL error: ".htmlspecialchars(curl_error($ch))."</p>";
            $output2 = '';
        }
        else {
            $messages .= "<p>finished downloading url</p>";
        }
        // close the curl resource, and free system resources
        curl_close($ch);
    } // closes if cURL allowed bracket
    // END OF DOWNLOADING PAGE BIT
} // closes if fetch is = 1
// Output page bits
$output = $inputbox;
$output .= $results;
$output .= $messages;
$output .= $output2;
return $output;
Now that is working (to a certain extent; I can't fetch my own site, which I presume is timing out), I can now try and extract the urls.
My idea may fall down though, as it seems to take an age to download the pages, and I would need to do about ten at least. I am hoping that it is only because I am sending the data (html head and rel links and all) straight to the page. Perhaps when instead I am manipulating it and sending only a link or two that it will work a lot faster.
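One hedged way to find out whether a fetch is timing out or just slow is to ask cURL for its own error message and timing. The sketch below is not the snippet above; fetch_with_diagnostics() is a made-up helper name and the 5 second connect timeout is an arbitrary assumption. It only uses curl_error() and curl_getinfo(), which are part of PHP's cURL extension:

```php
<?php
// fetch_with_diagnostics() is a hypothetical helper, used only for this sketch.
// It returns the body together with cURL's error text and the total transfer time,
// which helps tell "timed out" apart from "slow but working".
function fetch_with_diagnostics($url, $timeoutSeconds = 10)
{
    if (!function_exists('curl_init')) {
        return array('ok' => false, 'body' => '', 'error' => 'cURL is not installed', 'seconds' => 0.0);
    }

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the page instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);         // assumed value: give up connecting after 5s
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);  // overall limit for the whole transfer

    $body = curl_exec($ch);
    // curl_exec() returns false on failure; curl_error() then holds the reason.
    $error = ($body === false) ? curl_error($ch) : '';
    // Total time for the transfer in seconds, as measured by cURL itself.
    $seconds = (float) curl_getinfo($ch, CURLINFO_TOTAL_TIME);

    // The handle is released when $ch goes out of scope at the end of the function.
    return array(
        'ok'      => ($body !== false),
        'body'    => ($body === false) ? '' : $body,
        'error'   => $error,
        'seconds' => $seconds,
    );
}
```

If 'seconds' comes back close to the timeout and 'error' mentions a timeout, the remote server is the bottleneck; if the fetch is quick, the delay is more likely in sending the whole page back to the browser.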
Just posting all this stuff really because I am really pleased to get it working, at least this first bit.
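Since the snippet above passes whatever is typed into the form straight to cURL, it may be worth checking the input first. A minimal sketch, assuming PHP 5.2 or later for filter_var(); is_fetchable_url() is a made-up name, and allowing only http and https is an assumption about what the page should accept:

```php
<?php
// is_fetchable_url() is a hypothetical helper, used only for this sketch.
// It accepts a value only if it is a well-formed URL with an http or https scheme,
// so things like file:// paths or javascript: links are never handed to cURL.
function is_fetchable_url($url)
{
    if (!is_string($url) || filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return $scheme === 'http' || $scheme === 'https';
}
```

In the snippet this could guard the download step, for example by only entering the cURL branch when is_fetchable_url($url) is true.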
Thanks for the help again,
Paul.
#6
Posted 17 November 2008 - 04:54 AM
<?php
// hyperlink_finder.php
$regexp =
<<<REGEXP
/<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>/siu
REGEXP;
$markup =
<<<CODE
Visit the Etomite <a href="http://www.etomite.com/" title="">Forums</a> for support information.<br/>
And the Etomite <a class="links" href="http://docs.etomite.slyip.com">Documentation</a> site too...<br/>
<hr/>
CODE;
preg_match_all($regexp, $markup, $matches);
echo $markup;
echo "<pre>";
print_r($matches);
echo "</pre>";
foreach($matches[1] as $match)
{
echo $match."<br/>";
}
?>
EDIT: An alternative regexp which only looks for the href would be:
$regexp =
<<<REGEXP
/href[\s]?=[\s\"\']+(.*?)[\"\']/siu
REGEXP;
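For comparison, here is a sketch of a different technique from the regexp above: letting PHP's DOM extension parse the markup and then reading the href attributes. This assumes the dom extension is available (it is in most default builds); extract_hrefs() is a made-up helper name.

```php
<?php
// extract_hrefs() is a hypothetical helper, used only for this sketch.
// It swaps the regexp for PHP's DOM parser and returns every href found on <a> tags.
function extract_hrefs($html)
{
    if (trim($html) === '') {
        return array(); // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $hrefs[] = $anchor->getAttribute('href');
        }
    }
    return $hrefs;
}
```

The trade-off is roughly this: the regexp is quick to write and works on fragments, while the parser copes better with attributes in unusual orders or unquoted values.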
#7
Posted 17 November 2008 - 03:02 PM
I was up until the wee hours last night toying with regexp and preg_match_all - what a nightmare. You should have seen some of the results I got (when I actually got some) from the pattern masks! Really complicated to get my head around but I must admit it was really good fun trying. Here is where I got up to:
// Manipulate data with regular expression
// <a\shref(.)*/a>
// preg_match_all("\<a\shref(.)*a\>", $html, $matches, PREG_SET_ORDER);
// foreach ($matches as $val) {
// $output2 .= $val[0] . "\n";
// }
// patterns that work but incorrectly
// ([^\<a\shref](.*)[/a\>$])
// ([^\<a\shref](.)*[\/a\>$])
// ([^\<](.)*[\/a\>$])
preg_match_all("(^\<][\>$])", $html, $matches, PREG_SET_ORDER);
foreach ($matches as $val) {
$output2 .= $val[0] . "<br />";
}
None of them worked of course and at about 4 in the morning I had to give up!
Was thinking about it today when I logged in to see your post! Fantastic - just what I needed. I must admit I had thought about asking for help with the expression but thought I would plough on for a bit longer. I used many websites but none seemed to give an overview or explanation that made any real sense to me. Again, this is probably because I am self taught and have massive gaps in my knowledge. Was just starting to get the hang of it but could not see how to get the links.
Can't wait to try your suggestions above, although that will have to wait until later. Regular expressions are amazingly complicated and the ? check ahead I don't get at all yet (in principle I do but not in practice). I think I got sidelined for a while with javascript expressions, which are a bit different I think.
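On the ? in patterns like (.*?): when it follows a quantifier such as * it is not a look-ahead, it makes the quantifier "lazy" so that it matches as little as possible. A small sketch with made-up sample markup shows the difference:

```php
<?php
// Sample markup invented for this illustration.
$html = '<a href="one.html">One</a> and <a href="two.html">Two</a>';

// Greedy: .* grabs as much as it can, so the capture runs up to the LAST double quote.
preg_match('/href="(.*)"/', $html, $greedy);

// Lazy: .*? grabs as little as it can, so the capture stops at the FIRST closing quote.
preg_match('/href="(.*?)"/', $html, $lazy);

echo $greedy[1], "\n"; // one.html">One</a> and <a href="two.html
echo $lazy[1], "\n";   // one.html
```

That is why the link patterns in this thread use .*? inside the quotes: without the ?, one match would swallow everything between the first and last link on the page.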
Thank you again Ralph, you are amazing!
Paul.
#9
Posted 17 November 2008 - 05:51 PM
#10
Posted 17 November 2008 - 08:31 PM
Quote: "It worked! You can see it here"
very cool PaulD
#11
Posted 18 November 2008 - 12:25 AM
@Chris.D
Hey Chris, thank you! Wait till my mad idea is working! I can't see a use for it apart from it might be Quite Interesting.
I have here a page that now lets you get only the external links. Doing etomite.com is interesting in itself with a weird one as well. I was really pleased with this but it is not perfect, but will do for my purposes for now.
Take a look now here and do etomite.com
What it does is scrape etomite.com for the list of urls, ignoring all the internal links, which I managed with this:
foreach ($matches as $val) {
    // before outputting need to check that it is not an internal link
    // will test for the existence of http first, if not there it must be an internal link and should not be displayed.
    $pos = strpos($val[0], "http");
    if ($pos === false) {
        $output .= "The string HTTP was not found in the teststring";
    }
    else {
        // now we will explode around the .
        // this assumes no extra dots will appear
        // we will check to see if any of pieces 1 to 3 matches $domainname (set above)
        // if not we can use it in our link list. If it does then forget it.
        $pieces = explode(".", $val[0]);
        // the idea here is to catch http://www.mydomain.co.uk as well as http://docs.blog.mydomain.co.uk
        // array_slice only returns the pieces that exist, so short links do not raise undefined index notices
        // note: a link with no subdomain (http://mydomain.co.uk) keeps the name in $pieces[0] along with http://, so it is not caught
        if (!in_array($domainname, array_slice($pieces, 1, 3))) {
            // keep the link if it obeys these criteria
            // $output2 .= $val[0] . "<br />";
            $pathlist[] = $val[0];
        } // closes if not internal
    } // closes else from 'if pos === false'
} // closes foreach
I was quite pleased with that.
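As a hedged alternative to exploding on dots, parse_url() can pull the host name out of a link so it can be compared directly. In this sketch is_external_link() is a made-up name, and $siteHost is assumed to be the full host without www (for example mydomain.co.uk), which is slightly different from the single-word $domainname used above:

```php
<?php
// is_external_link() is a hypothetical helper, used only for this sketch.
// $siteHost is assumed to be the bare host of the scraped site, e.g. "mydomain.co.uk".
function is_external_link($link, $siteHost)
{
    $host = parse_url($link, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false; // relative link or no host at all, so treat it as internal
    }

    $host = strtolower($host);
    $siteHost = strtolower($siteHost);

    // Internal if the host is the site itself or any subdomain of it.
    $suffix = '.' . $siteHost;
    $isSubdomain = strlen($host) > strlen($suffix)
        && substr($host, -strlen($suffix)) === $suffix;

    return !($host === $siteHost || $isSubdomain);
}
```

This treats http://mydomain.co.uk, http://www.mydomain.co.uk and http://docs.blog.mydomain.co.uk all as internal, however many subdomain levels there are.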
Then it adds all the external links into an array, and picks a random one of them to display:
// does it have multiple entries
elseif(count($pathlist)>=2) {
    // perform random pick of array key
    $randomlink = array_rand($pathlist);
    // output the link stored under the random key in the pathlist array
    $output2 .= $pathlist[$randomlink]."<br />";
} // closes pathlist >= 2
This is in amongst testing to see if there are no links, one link or more. If there is more than one link I do this array_rand thing, which is a super cool function (for silly things like this).
So now the page outputs a single random external link from the page scraped.
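The three cases (no links, one link, several links) can be folded into one small function, since array_rand() also works on a single-element array. A sketch, with pick_random_link() as a made-up name:

```php
<?php
// pick_random_link() is a hypothetical helper, used only for this sketch.
// It returns one random element of $links, or null when there is nothing to pick from.
function pick_random_link(array $links)
{
    if (count($links) === 0) {
        return null; // array_rand() cannot be called on an empty array
    }
    // array_rand() returns a random KEY, so look the value up afterwards.
    return $links[array_rand($links)];
}
```

The caller then only has to check for null before outputting the link.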
My mad idea is to now generate the next link in the path using the same process, but with the new external link. That means I will be making a virtual (albeit random at the moment) path through the web from the originally specified website. I am hoping that it will produce interesting results.
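A rough sketch of how such a loop might be managed, under stated assumptions: random_walk() and its parameters are made up for illustration, and the page fetching is passed in as a callable so that the hop limit and the "already visited" check can be tried without touching the network. The hop limit is what keeps the load on a shared server bounded.

```php
<?php
// random_walk() is a hypothetical sketch, not the thread author's code.
// $getExternalLinks is any callable that takes a URL and returns an array of
// external links found on that page (for example: fetch with cURL, then filter).
function random_walk($startUrl, $getExternalLinks, $maxHops = 10)
{
    $path = array($startUrl);
    $visited = array($startUrl => true);
    $current = $startUrl;

    for ($hop = 0; $hop < $maxHops; $hop++) {
        // Only consider links we have not been to yet, so the walk cannot loop forever.
        $candidates = array();
        foreach (call_user_func($getExternalLinks, $current) as $link) {
            if (!isset($visited[$link])) {
                $candidates[] = $link;
            }
        }
        if (count($candidates) === 0) {
            break; // dead end: no unvisited external links on this page
        }
        $current = $candidates[array_rand($candidates)];
        $visited[$current] = true;
        $path[] = $current;
    }
    return $path;
}
```

With about ten hops and a 10 second timeout per fetch, the worst case is well over a minute per request, so caching results or running the walk outside the page request may be worth considering.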
I am not sure if I will be putting excess load on the web server (shared) or even how I will manage the loop. But this is the most exciting thing I have done in ages! (How sad am I) I might even build an entire site around it and stick on some google ads or something similar. See if I can get the loop working first.
Paul.
PS When I get time of course.
#12
Posted 18 November 2008 - 12:45 AM
#13
Posted 18 November 2008 - 12:49 AM
#14
Posted 18 November 2008 - 12:56 AM
#15
Posted 18 November 2008 - 01:56 AM
Quote: "What'd be really awesome would be for it to follow links and output a list of all the links on a site.. rather than just the page chosen."
You talking about spidering the entire site, Dean...???
#16
Posted 18 November 2008 - 06:31 AM
#17
Posted 18 November 2008 - 08:10 AM
Quote: "But this is the most exciting thing I have done in ages! (How sad am I)"
Lol
I agree an option for your script to be blocked would be a good thing. How about naming it a bot, and respecting a Disallow in robots.txt.
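A deliberately simplified sketch of what respecting Disallow could look like. The function name is_path_disallowed() and the bot name LinkWalker in the usage note are made up; the parser only handles User-agent and Disallow lines with plain prefix matching, and ignores Allow rules, wildcards and the rule that a specific group should take precedence over the * group. It also assumes the robots.txt body has already been downloaded into a string.

```php
<?php
// is_path_disallowed() is a hypothetical helper, used only for this sketch.
// It returns true when any group that applies to $userAgent has a Disallow
// rule that is a prefix of $path. Simplified: no Allow, no wildcards.
function is_path_disallowed($robotsTxt, $userAgent, $path)
{
    $applies = false;      // are we inside a group that targets our bot?
    $inAgentLines = false; // consecutive User-agent lines share one group

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        list($field, $value) = explode(':', $line, 2);
        $field = strtolower(trim($field));
        $value = trim($value);

        if ($field === 'user-agent') {
            if (!$inAgentLines) {
                $applies = false; // a new group starts here
            }
            $inAgentLines = true;
            if ($value === '*' || ($value !== '' && stripos($userAgent, $value) !== false)) {
                $applies = true;
            }
        } else {
            $inAgentLines = false;
            if ($field === 'disallow' && $applies && $value !== '' && strpos($path, $value) === 0) {
                return true;
            }
        }
    }
    return false;
}
```

Before fetching a page, the script would download /robots.txt from the same host once and skip the page when is_path_disallowed($robotsTxt, 'LinkWalker', $path) returns true.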
#18
Posted 18 November 2008 - 08:35 AM
Yeah, an Entire Site... It'd be useful from a developer's point of view (if it also passed the properties of <a> tags through), to make sure that the properties were all the same throughout the site.
I think they invented search engines for that purpose Dean...
Also, what about listing the pages that they are on?
Search engine spiders are only as useful as the age of the information that they contain...
#19
Posted 18 November 2008 - 11:16 AM
#20
Posted 18 November 2008 - 02:51 PM
Quote: "Won't it put excess load on the server and the server on the receiving end? Also, it is possible to block it and manipulate the results"
I think it will but I will wait for my host to complain.
Quote: "What'd be really awesome would be for it to follow links and output a list of all the links on a site.. rather than just the page chosen."
That could be done but would mean parsing hundreds of pages, not just a handful, so probably a bit beyond me really and my humble hosting facilities. Imagine the delays involved before the page returned!
@fishnchips
Quote: "I agree an option for your script to be blocked would be a good thing. How about naming it a bot, and respecting a Disallow in robots.txt."
Really good point. I do not know how to do that but will look into it and add that. I did have notes to exclude sites like google etc that don't allow this in their terms. I should do this before making it public (although there are a hundred things to do before then with it. Just glad to get it working at all really.)
@Cris D.
Quote: "I reckon you have the beginning of a site aggregator"
I am so pleased with this myself that I am already thinking of a hundred different things to do with it. I am also waiting for my host to complain!
Now this bit is done all the really hard stuff starts - the devil is in the detail as they say.
Will post anything I get working here just in case anyone is interested. Thanks for all the input though. Really appreciated.
Paul.