Using php to download a page and grab some content


19 replies to this topic

#1 PaulD

PaulD

    Likes Etomite Forums!

  • Developers
  • PipPip
  • 413 posts

Posted 12 November 2008 - 02:30 AM

Hi all,

Can anyone suggest how I can do this?

Following the input of a website address (from a form, say), I want my PHP to fetch that webpage, pick some information out of it, and output that information to the screen.

I have a mad idea to try out but in essence, I have to be able to get my php to go to a webpage and pick out the links. I am not intending to spam or create crappy auto generated content.

There are sites that I can input a website address and it displays all the links from that page. I want to do that but am going to do something different (hence the mad idea) with the data.

Could anyone give me a suggestion of how I could achieve this using php?

Any advice welcomed,

Paul.

#2 Ralph

Ralph

    Loves Etomite Forums!

  • Admin
  • 6,539 posts

Posted 12 November 2008 - 04:13 AM

So you want to screen scrape hyperlinks by using regexp to grab all <a href=>text</a> components from the external pages... Shouldn't be that hard... Most regexp sites have code to accomplish the task... You just need to manage the links the regexp grabs...

The question is, how much effort are you willing to put into this versus how much you're expected to be handed to you...??? We're willing to help you learn how to achieve your goal if you're willing to attempt to learn... Otherwise, you might as well pay someone to write the code... I know I can write the code, but I'd rather help you learn how to do it yourself...

EDIT: Sorry, Paul... Didn't realize it was you I was replying to... Didn't mean to come across gruff in my reply... I'd just come from another site and had gotten all worked up over people not doing their part researching before asking for help (as in "do it for me")...

#3 PaulD

PaulD

    Likes Etomite Forums!

  • Developers
  • PipPip
  • 413 posts

Posted 16 November 2008 - 03:34 PM

Hi Ralph,

I too am amazed at the forums where people ask questions like 'how do I do x' when in the documentation the first point is 'How to do x' fully explained and documented.

Thank you though, now I have two expressions to google for: SGREP and Screen Scrape. I have searched for sgrep and I can see I have my reading cut out for me.

sgrep [-aCcDdhilNnPqSsTtV] [-O filename] [-o "format"] [ -p preprocessor] -f filename [-e  expression] [filename ...]

OMG! I am sure that when I get to grips with it all this stuff will become more intelligible to a novice like me!

Am still totally in the dark. Will post my progress if anyone else is interested. I am quite sure I can search a text string for the relevant patterns; the question that remains is how to get PHP to request and receive a webpage, and turn the resulting stream into a text file, so I can search it. From preliminary reading, it seems I might be able to search the stream directly without converting to a text file first.
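For anyone following along, here is a minimal sketch of the "search it directly" idea: once the page is in a PHP string there is no need for a text file. The function names fetch_page() and page_contains() are made up for illustration, and this assumes allow_url_fopen is enabled on the server (many shared hosts turn it off, in which case cURL is the usual fallback).

```php
<?php
// Hypothetical helper: fetch a URL (or a local path) into a string.
// Returns null when the download fails.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    // The 'http' options only take effect for http:// and https:// URLs.
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds],
    ]);

    // @ hides the warning on failure; the return value is checked instead.
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}

// The page is now an ordinary string, so it can be searched in memory.
function page_contains(string $html, string $needle): bool
{
    return stripos($html, $needle) !== false;   // case-insensitive search
}
```

Usage would be something like $html = fetch_page('http://www.example.com'); followed by page_contains($html, '<a href') once $html is known not to be null.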

I only do this in my spare time now due to the credit crunch. My customers have all dried up and I am now temping for Disney! Yuch!

Hope everyone else is faring better than I did!

Paul.

PS Have just realised that this is different from regexp. Just ignore all my comments as I still do not know what I am talking about. Will spend a couple of evenings reading about this I think. My so called 'mad idea' is really cool (growing on me all the time) and as soon as I get anything working will post for your comments! Like most ideas I have though, I can't really see a use for it, other than being 'quite interesting'.

Edited by PaulD, 16 November 2008 - 03:37 PM.


#4 Ralph

Ralph

    Loves Etomite Forums!

  • Admin
  • 6,539 posts

Posted 16 November 2008 - 08:43 PM

Yes, regexp is totally different... The term regexp refers to the broad scope of Regular Expressions... The PHP command that would most suit your needs is preg_match_all()... With the correct regexp code in place you should be able to end up with an array of hyperlinks which have been extracted from <a></a> tag pairs... The trick to the regexp is to not only narrow down to the <a></a> tag pairs but to also strip only the contents of the href attribute contents...

I've been playing with regexp for data sanitation and if I stumble across the correct code I'll be sure to let you know... I have to re-learn regexp every time I need to use it because it's got a lot of complexities that you forget if you don't use regexp continuously...
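To illustrate what "an array of hyperlinks" means here, this is a small sketch of how preg_match_all() fills its third argument. The pattern is deliberately simplified (it only handles double-quoted href values) and the sample markup is made up; it is not meant as the finished expression.

```php
<?php
$html = '<p><a href="http://www.etomite.com/">Forums</a> and '
      . '<a class="x" href="/docs/">Docs</a></p>';

// One capture group: whatever sits between the double quotes after href=.
$count = preg_match_all('/<a\s[^>]*href="([^"]*)"/i', $html, $matches);

// $matches[0] holds each full match, $matches[1] holds each captured href.
$links = $matches[1];

print_r($links);
```

The return value ($count) is the number of full matches found, so it can double as a quick "were there any links at all" test.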

#5 PaulD

PaulD

    Likes Etomite Forums!

  • Developers
  • PipPip
  • 413 posts

Posted 17 November 2008 - 12:12 AM

You are quite some steps ahead of me!

Firstly I had to get the server to download a page. I had great trouble with cURL and couldn't get a connection anywhere. I also tried fopen with a url, hoping it would deal with all the parameters automatically.

Finally got something working with help from this website which was very useful.

You can see my results here but this is only a test page using a blank template for speed and clarity (won't be available forever).

Here is the code I am using in a snippet, just have to call it from a page (this is exactly as it is in my test page, perhaps someone will find it useful). When you input a webpage the snippet calls the webpage and outputs it. This was only to test I could get the info and that cURL was working properly on my server.

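$url = (isset($_POST['scrape']))? $_POST['scrape'] : 'http://www.example.com';
$fetch = (isset($_POST['fetchresults']))? 1 : 0;

// start with empty strings so the later .= appends and the final output never use undefined variables
$messages = '';
$output2 = '';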


// INPUT BOX
// sets up the input box for telling us the url. If the url is set it will be shown in here automatically
   $inputbox='
	  <form name="input" action="index.php?id=322" method="post">
			Webpage : 
			<input type="text" name="scrape" value="'.htmlspecialchars($url, ENT_QUOTES).'" size="40">
			<input type="hidden" name="fetchresults" value="yes">
			<input type="submit" value="Submit">
	  </form> <small>eg http://www.example.com </small>
	 ';

// Results output

// if first visit so no data is fetched
	 if ($fetch==0) {

// NO RESULTS YET
		  $results="<h3>Results</h3><p>Please enter a URL above and click the submit button.</p>"; 
	  }

	  else {

// RESULTS FROM URL

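		  // escape the user-supplied url before echoing it back into the page
		  $results="<br /><h3>Results for ".htmlspecialchars($url)."</h3>";
		  $results .= '<a href="'.htmlspecialchars($url).'">'.htmlspecialchars($url).'</a><br />';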

// DOWNLOADING PAGE BIT

	// is curl installed?
	if (!function_exists('curl_init')){ 
		return "<h1>cURL is not installed</h1>";
	}
	else {

		$messages .= "<p>inside curl loop so it is installed</p>";

	   // create a new curl resource
	   $ch = curl_init();

	   // set URL to download
	   curl_setopt($ch, CURLOPT_URL, $url);
 
	   // set referer:
	   curl_setopt($ch, CURLOPT_REFERER, "http://www.independentwebadvice.co.uk/");
 
	   // user agent:
	   curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101");
 
	   // remove header? 0 = yes, 1 = no
	   curl_setopt($ch, CURLOPT_HEADER, 0);
 
	   // should curl return or print the data? true = return, false = print
	   curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
 
	   // timeout in seconds
	   curl_setopt($ch, CURLOPT_TIMEOUT, 10);

		// append (.=) so the earlier messages are not overwritten
		$messages .= "<p>Finished setting all the options</p>";

 
	   // download the given URL, and return output
	   $output2 = curl_exec($ch);
		 $messages .= "<p>finished downloading url</p>";

	   // close the curl resource, and free system resources
	   curl_close($ch);

	  }  // closes if cURL allowed bracket


// END OF DOWNLOADING PAGE BIT

	   }  // closes if fetch is = 1 


// Output page bits
	   $output = $inputbox;
	   $output .= $results;
	   $output .= $messages;
	   $output .= $output2;

	   return $output;

Now that is working (to a certain extent, I can't do my own site - I presume it is timing out), I can now try and extract the urls.
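On the "I presume it is timing out" point, cURL can report why a download failed rather than leaving it to guesswork. Below is a rough sketch using curl_errno() and curl_error(); the wrapper name fetch_with_curl() and the shape of the array it returns are made up for illustration, and the timeout values are only examples.

```php
<?php
// Hypothetical wrapper: returns ['ok' => bool, 'body' => string, 'error' => string].
function fetch_with_curl(string $url, int $timeoutSeconds = 10): array
{
    if (!function_exists('curl_init')) {
        return ['ok' => false, 'body' => '', 'error' => 'cURL extension is not installed'];
    }

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the page instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);         // seconds allowed for connecting
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);  // seconds allowed for the whole transfer

    $body  = curl_exec($ch);
    $errno = curl_errno($ch);   // 0 means no error
    $error = curl_error($ch);   // human-readable description of the last error
    // The handle is released automatically when $ch goes out of scope.

    if ($body === false || $errno !== 0) {
        return ['ok' => false, 'body' => '', 'error' => "cURL error $errno: $error"];
    }
    return ['ok' => true, 'body' => $body, 'error' => ''];
}
```

Printing the 'error' entry when 'ok' is false should show whether the request really timed out or failed for some other reason (DNS, refused connection and so on).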

My idea may fall down though, as it seems to take an age to download the pages, and I would need to do about ten at least. I am hoping that it is only because I am sending the data (html head and rel links and all) straight to the page. Perhaps when instead I am manipulating it and sending only a link or two that it will work a lot faster.

Just posting all this stuff really because I am really pleased to get it working, at least this first bit.

Thanks for the help again,

Paul.

#6 Ralph

Ralph

    Loves Etomite Forums!

  • Admin
  • 6,539 posts

Posted 17 November 2008 - 04:54 AM

Paul, try this sample regexp code to see if it will work for you... Just trim what you need out of this stand-alone working example... This may be the piece of the puzzle you are missing...

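<?php
// hyperlink_finder.php

// Capture group 1 is the href value; capture group 2 is the link text.
// A nowdoc (<<<'REGEXP') keeps the backslashes exactly as written, and the
// closing marker sits at the start of the line so older PHP versions accept it.
$regexp =
<<<'REGEXP'
/<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>/siu
REGEXP;

// Sample markup to search.
$markup =
<<<'CODE'
Visit the Etomite <a href="http://www.etomite.com/" title="">Forums</a> for support information.<br/>
And the Etomite <a class="links" href="http://docs.etomite.slyip.com">Documentation</a> site too...<br/>
<hr/>
CODE;

preg_match_all($regexp, $markup, $matches);

echo $markup;

// Dump the full match array for inspection.
echo "<pre>";
print_r($matches);
echo "</pre>";

// $matches[1] holds the captured href values.
foreach ($matches[1] as $match)
{
    echo $match . "<br/>";
}

?>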

EDIT: An alternative regexp which only looks for the href would be:
$regexp =
<<<'REGEXP'
/href[\s]?=[\s\"\']+(.*?)[\"\']/siu
REGEXP;
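A different technique from the regexp approach above, offered only as an alternative sketch: PHP's DOM extension can parse the markup and hand back the <a> elements, which tends to cope better with odd attribute order and quoting. This assumes the dom extension is available (it usually is); the helper name extract_links() is made up.

```php
<?php
// Hypothetical helper: list href values using the DOM extension instead of a regexp.
function extract_links(string $html): array
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid, so collect parser warnings quietly
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

$markup = 'Visit the Etomite <a href="http://www.etomite.com/" title="">Forums</a> '
        . 'and the <a class="links" href="http://docs.etomite.slyip.com">Documentation</a> site.';

print_r(extract_links($markup));
```

The other attributes (class, title and so on) are available the same way through getAttribute(), should they be needed later.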


#7 PaulD

PaulD

    Likes Etomite Forums!

  • Developers
  • PipPip
  • 413 posts

Posted 17 November 2008 - 03:02 PM

Wow - thank you. That is perfect for me.

I was up until the wee hours last night toying with regexp and preg_match_all - what a nightmare. You should have seen some of the results I got (when I actually got some) from the pattern masks! Really complicated to get my head around but I must admit it was really good fun trying. Here was where I was up to

// Manipulate data with regular expression
// <a\shref(.)*/a>
// preg_match_all("\<a\shref(.)*a\>", $html, $matches, PREG_SET_ORDER);
// foreach ($matches as $val) {
//	 $output2 .= $val[0] . "\n";
// }

// patterns that work but incorrectly
// ([^\<a\shref](.*)[/a\>$])
// ([^\<a\shref](.)*[\/a\>$])
// ([^\<](.)*[\/a\>$])

preg_match_all("(^\<][\>$])", $html, $matches, PREG_SET_ORDER);

 foreach ($matches as $val) {
	 $output2 .= $val[0] . "<br />";
 }

None of them worked of course and at about 4 in the morning I had to give up!

Was thinking about it today when I logged in to see your post! Fantastic - just what I needed. I must admit I had thought about asking for help with the expression but thought I would plough on for a bit longer. I used many websites but none seemed to give an overview or explanation that made any real sense to me. Again, this is probably because I am self taught and have massive gaps in my knowledge. Was just starting to get the hang of it but could not see how to get the links.

Can't wait to try your suggestions above, although that will have to wait until later. Regular expressions are amazingly complicated and the ? check ahead I don't get at all yet (in principle I do but not in practice). I think I got sidelined for a while with javascript expressions, which are a bit different I think.
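Assuming the puzzling ? is the one that follows * in patterns like (.*?), it makes the quantifier "lazy": match as little as possible instead of as much as possible. A small sketch with made-up markup shows the difference.

```php
<?php
$html = '<a href="one.html">One</a> <a href="two.html">Two</a>';

// Greedy: .* grabs as much as it can, so the match runs on to the LAST quote.
preg_match('/href="(.*)"/', $html, $greedy);

// Lazy: .*? grabs as little as it can, so the match stops at the FIRST closing quote.
preg_match('/href="(.*?)"/', $html, $lazy);

echo $greedy[1], "\n";   // one.html">One</a> <a href="two.html
echo $lazy[1], "\n";     // one.html
```

That is why the link-grabbing expressions use .*? rather than .* for the href value.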

Thank you again Ralph, you are amazing!

Paul.

#8 PaulD

PaulD

    Likes Etomite Forums!

  • Developers
  • PipPip
  • 413 posts

Posted 17 November 2008 - 03:08 PM

:D

It worked!

You can see it here

I couldn't resist just trying out a cut and paste of the regular expression. I have no idea how it is working yet but will try to decode it tonight.

I am soooooo excited! Childishly so - Thank you again

Paul.

#9 Ralph

Ralph

    Loves Etomite Forums!

  • Admin
  • 6,539 posts

Posted 17 November 2008 - 05:51 PM

You're welcome, Paul... I'm right there with you as far as not easily finding the correct regexp for this task... What helped was some of the other code I was working on that can be used to strip specific tags, along with their contents, from markup... Slight adaptations were all that was required in order to get it to return the array of links... Hope you find it useful...

#10 Cris D.

Cris D.

    Loves Etomite Forums!

  • Developers
  • PipPipPipPip
  • 1,104 posts

Posted 17 November 2008 - 08:31 PM

I haven't seen much in the way of cross-site interactivity that works...

"It worked! You can see it here"

very cool PaulD :)

#11 PaulD

PaulD

    Likes Etomite Forums!

  • Developers
  • PipPip
  • 413 posts

Posted 18 November 2008 - 12:25 AM

Sorry to keep posting about this but it is going really well - thank you Ralph. RegExp still escapes me. I just can't 'get it'! That expression you posted still baffles me, I would never have got there myself.

@Cris D.
Hey Cris, thank you! Wait till my mad idea is working! I can't see a use for it apart from it might be Quite Interesting.

I have here a page that now lets you get only the external links. Doing etomite.com is interesting in itself with a weird one as well. I was really pleased with this but it is not perfect, but will do for my purposes for now.

Take a look now here and do etomite.com

What it does is scrape etomite.com for the list of urls, ignoring all the internal links, which I managed with this:

foreach ($matches as $val) {

			// before outputting need to check that it is not an internal link
			// will test for the existence of http first; if it is not there the link is treated as internal and should not be displayed.

			$pos = strpos($val[0], "http");
			if ($pos === false) {
 				$output .= "The string HTTP was not found in the teststring";
			  } 
			else {
				// now we will explode again around the . again
				// this assumes no extra dots will appear
				// we will check to see if the first three pieces matches $domainname (set above) 
				// if not we can use it in our link list. If it does then forget it.

				// explode the val[0] on .
				$pieces = explode(".", $val[0]);

				// test pieces 1, 2 and 3 to see that none of them equals domainname
				// the idea here is to catch http://www.mydomain.co.uk as well as http://docs.blog.mydomain.co.uk
				// I am assuming that 3 is enough; a missing piece is treated as null, which does not equal domainname
				if ((($pieces[1] ?? null)!=$domainname)&&(($pieces[2] ?? null)!=$domainname)&&(($pieces[3] ?? null)!=$domainname)) {

					// output domain if it obeys these criteria
					// $output2 .= $val[0] . "<br />";
					$pathlist[] = $val[0];

				} // closes if piece1 = domain 

			  } // closes else from 'if pos === false'

		 }  // closes foreach


I was quite pleased with that.
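As a possible alternative to exploding on the dots, parse_url() can pull out the host name, which can then be compared with the site's own domain. This is only a sketch: the helper name is_external_link() is made up, and it assumes $domainname holds the full domain (for example "mydomain.co.uk") rather than a single piece of it.

```php
<?php
// Hypothetical helper: true when $link points at a host outside $domainname.
function is_external_link(string $link, string $domainname): bool
{
    $host = parse_url($link, PHP_URL_HOST);

    // Relative links ("/about.html", "page.php") have no host, so they are internal.
    if (!is_string($host) || $host === '') {
        return false;
    }

    $host   = strtolower($host);
    $domain = strtolower($domainname);

    // Internal if the host is the domain itself or any subdomain of it.
    $suffix = '.' . $domain;
    $isSubdomain = strlen($host) > strlen($suffix)
        && substr($host, -strlen($suffix)) === $suffix;

    return !($host === $domain || $isSubdomain);
}
```

With this, http://www.mydomain.co.uk and http://docs.blog.mydomain.co.uk both count as internal however many dots they contain, while links such as mailto: addresses (which have no host) are simply skipped.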

Then it adds all the external links into an array, and picks a random element to display of the external links only

// does it have multiple entries
		elseif(count($pathlist)>=2) {

			// perform random pick of array key
			$randomlink = array_rand($pathlist);

			// output the random key from the original pathlist array
			$output2 .= $pathlist[$randomlink]."<br />";

		} // closes pathlist >2

This is in amongst testing to see if there are no links, one link or more. If there is more than 1 link I do this array_rand thing which is a super cool function (for silly things like this).
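The none / one / many test can probably be folded into one small function, since array_rand() is happy with a single-element array and only the empty case needs its own branch. A sketch, with the made-up name pick_random_link():

```php
<?php
// Hypothetical helper: returns one link from the list, or null when the list is empty.
function pick_random_link(array $pathlist): ?string
{
    if (count($pathlist) === 0) {
        return null;   // nothing to pick; array_rand() does not accept an empty array
    }

    // array_rand() returns a random KEY, not the value, so look the value up.
    return $pathlist[array_rand($pathlist)];
}
```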

So now the page outputs a single random external link from the page scraped.

My mad idea is to now generate the next link in the path using the same process, but with the new external link. That means I will be making a virtual (albeit random at the moment) path through the web from the originally specified website. I am hoping that it will produce interesting results.

I am not sure if I will be putting excess load on the web server (shared) or even how I will manage the loop. But this is the most exciting thing I have done in ages! (How sad am I) I might even build an entire site around it and stick on some google ads or something similar. See if I can get the loop working first.
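One way the loop might be managed, as a sketch only: cap the number of hops, remember which pages are already on the path so it cannot go round in circles, and optionally pause between downloads to keep the load down. The function name random_walk() and its parameters are made up, and the link-fetching step is passed in as a callable so the example runs against a tiny fake "web" without downloading anything.

```php
<?php
// Hypothetical sketch: $getExternalLinks is any callable that takes a URL and
// returns an array of the external links found on that page.
function random_walk(string $startUrl, callable $getExternalLinks, int $maxSteps = 10, int $delaySeconds = 0): array
{
    $path    = [$startUrl];
    $visited = [$startUrl => true];
    $current = $startUrl;

    for ($step = 0; $step < $maxSteps; $step++) {
        // Skip pages already on the path so the walk cannot loop forever.
        $candidates = [];
        foreach ($getExternalLinks($current) as $link) {
            if (!isset($visited[$link])) {
                $candidates[] = $link;
            }
        }
        if (count($candidates) === 0) {
            break;   // dead end: no unvisited external links
        }

        $current = $candidates[array_rand($candidates)];
        $visited[$current] = true;
        $path[] = $current;

        if ($delaySeconds > 0) {
            sleep($delaySeconds);   // be gentle with both servers
        }
    }
    return $path;
}

// Usage with a fake "web" (made-up addresses):
$fakeWeb = [
    'http://a.example/' => ['http://b.example/'],
    'http://b.example/' => ['http://a.example/', 'http://c.example/'],
    'http://c.example/' => [],
];
$path = random_walk('http://a.example/', function ($url) use ($fakeWeb) {
    return isset($fakeWeb[$url]) ? $fakeWeb[$url] : [];
});
print_r($path);
```

In the real thing the callable would download the page and return the filtered external links; $maxSteps then puts a hard limit on how many pages a single request can trigger.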

Paul.

PS When I get time of course.

#12 Dean

Dean

    Loves Etomite Forums!

  • Admin
  • 4,787 posts

Posted 18 November 2008 - 12:45 AM

Won't it put excess load on the server and the server on the receiving end? Also, it is possible to block it and manipulate the results :)

#13 Ralph

Ralph

    Loves Etomite Forums!

  • Admin
  • 6,539 posts

Posted 18 November 2008 - 12:49 AM

Looking good, Paul... We all need some time in the mad science lab once in a while... :lol:

#14 Dean

Dean

    Loves Etomite Forums!

  • Admin
  • 4,787 posts

Posted 18 November 2008 - 12:56 AM

What'd be really awesome would be for it to follow links and output a list of all the links on a site.. rather than just the page chosen.

#15 Ralph

Ralph

    Loves Etomite Forums!

  • Admin
  • 6,539 posts

Posted 18 November 2008 - 01:56 AM

"What'd be really awesome would be for it to follow links and output a list of all the links on a site.. rather than just the page chosen."

You talking about spidering the entire site, Dean...???

#16 Jelmer

Jelmer

    Loves Etomite Forums!

  • Member
  • PipPipPipPip
  • 1,173 posts

Posted 18 November 2008 - 06:31 AM

I think they invented search engines for that purpose Dean...

#17 fishnchips

fishnchips

    Etomite Forum Fan

  • Member
  • Pip
  • 65 posts

Posted 18 November 2008 - 08:10 AM

"But this is the most exciting thing I have done in ages! (How sad am I)"

Lol :)

I agree an option for your script to be blocked would be a good thing. How about naming it a bot, and respecting a Disallow in robots.txt.
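A rough sketch of what respecting a Disallow could look like. It is deliberately simplified: it only understands User-agent and Disallow lines, uses plain prefix matching (no wildcards, no Allow lines), and applies the rules from the * group as well as any group naming the bot, which is stricter than necessary. The function name is_path_allowed() and the bot name "LinkWalker" are made up for illustration.

```php
<?php
// Hypothetical helper: decide whether $path may be fetched, given the text of
// a site's robots.txt and the name the script announces in its user agent.
function is_path_allowed(string $robotsTxt, string $botName, string $path): bool
{
    $applies    = false;   // does the current group apply to this bot?
    $inAgents   = false;   // still reading consecutive User-agent lines?
    $disallowed = [];

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line));   // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if (!$inAgents) {
                $applies = false;   // a new group starts here
            }
            $inAgents = true;
            if ($value === '*' || ($value !== '' && stripos($botName, $value) !== false)) {
                $applies = true;
            }
        } else {
            $inAgents = false;
            // An empty Disallow means "nothing is disallowed", so it is ignored.
            if ($field === 'disallow' && $applies && $value !== '') {
                $disallowed[] = $value;
            }
        }
    }

    foreach ($disallowed as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;   // path starts with a disallowed prefix
        }
    }
    return true;
}
```

The script would download /robots.txt from the target site first, pass its contents in along with the path it wants, and skip the page when the answer is false.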

#18 Dean

Dean

    Loves Etomite Forums!

  • Admin
  • 4,787 posts

Posted 18 November 2008 - 08:35 AM

"I think they invented search engines for that purpose Dean..."

Yeah, an Entire Site... It'd be useful from a developer's point of view (if it also passed the properties of <a> tags through), to make sure that the properties were all the same throughout the site.
Also, what about listing the pages that they are on?

Search engine spiders are only as useful as the age of the information that they contain...

#19 Cris D.

Cris D.

    Loves Etomite Forums!

  • Developers
  • PipPipPipPip
  • 1,104 posts

Posted 18 November 2008 - 11:16 AM

I reckon you have the beginning of a site aggregator... crawl a site to return all available feeds by searching for <rss version="2.0" ... <channel>..., users select the feeds they want from the site, then display them in your etomite site. Now THAT would increase a server's load with a range of users all embedding live feeds...

#20 PaulD

PaulD

    Likes Etomite Forums!

  • Developers
  • PipPip
  • 413 posts

Posted 18 November 2008 - 02:51 PM

@Dean

"Won't it put excess load on the server and the server on the receiving end? Also, it is possible to block it and manipulate the results"

I think it will but I will wait for my host to complain.

"What'd be really awesome would be for it to follow links and output a list of all the links on a site.. rather than just the page chosen."

That could be done but would mean parsing hundreds of pages, not just a handful, so probably a bit beyond me really and my humble hosting facilities. Imagine the delays involved before the page returned!

@fishnchips

"I agree an option for your script to be blocked would be a good thing. How about naming it a bot, and respecting a Disallow in robots.txt."

Really good point. I do not know how to do that but will look into it and add that. I did have notes to exclude sites like google etc that don't allow this in their terms. I should do this before making it public (although there are a hundred things to do before then with it. Just glad to get it working at all really.)

@Cris D.

"I reckon you have the beginning of a site aggregator"

I am so pleased with this myself that I am already thinking of a hundred different things to do with it. I am also waiting for my host to complain!


Now this bit is done all the really hard stuff starts - the devil is in the detail as they say.

Will post anything I get working here just in case anyone is interested. Thanks for all the input though. Really appreciated.

Paul.



