

I want a crawler and extraction using regular expressions

  #1 (permalink)  
Old 10-28-04, 10:12 PM
oiranoinu oiranoinu is offline
Newbie Coder
 
Join Date: Jul 2004
Posts: 13
Thanks: 0
Thanked 0 Times in 0 Posts
I want a crawler and extraction using regular expressions

Sorry, my English is not very good.

I want a crawler that works like this.

1. Crawl the pages, starting from the URLs given in a URL list.
ex)
http://www.example.com/data/123.html
http://www.example.com/data/124.html
http://www.example.com/data/125.html
http://www.example.com/data/126.html
http://www.example.com/data/127.html
......
2. Use a regular expression to extract the text between one tag and another on the crawled pages.
Ideally it can also handle several patterns per page.
ex)
<tr><td width=100>aaaaaaaaaaaaaaaa</td></tr>
   this extracts "aaaaaaaaaaaaaaaa"
With several patterns...
<font size=8>zzzzzz</font>

3. Write the results to a text file or an Excel file, like this.
ex)
http://www.example.com/data/123.html, aaaaaaaaaaaaaaaa;
http://www.example.com/data/124.html, bbbbbbbbbb;
http://www.example.com/data/125.html, ccccccccccccc;
http://www.example.com/data/126.html, dddddddd;
http://www.example.com/data/127.html, eeeeeee;
.....

With several patterns...
ex)
http://www.example.com/data/123.html, aaaaaaaaaaaaaaaa, zzzzzz;
http://www.example.com/data/124.html, bbbbbbbbbb, yyyy;
http://www.example.com/data/125.html, ccccccccccccc, xxxxxx;
http://www.example.com/data/126.html, dddddddd, wwwwww;
http://www.example.com/data/127.html, eeeeeee, vvv;
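
The three steps in this request can be sketched in Python roughly as follows. This is only a minimal sketch, not a finished tool: the function names (`fetch`, `extract_row`, `crawl_to_csv`) and the two patterns are assumptions made for illustration, taken from the sample tags in the post, and real pages usually need more tolerant patterns.

```python
import csv
import re
import urllib.request

# Hypothetical patterns, one per value to pull out of each page (step 2).
PATTERNS = [
    re.compile(r"<tr><td width=100>(.*?)</td></tr>", re.S | re.I),
    re.compile(r"<font size=8>(.*?)</font>", re.S | re.I),
]

def fetch(url):
    """Download one page and return it as text (step 1)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_row(url, page, patterns=PATTERNS):
    """Return [url, value1, value2, ...]; a pattern with no match gives ""."""
    row = [url]
    for pattern in patterns:
        match = pattern.search(page)
        row.append(match.group(1).strip() if match else "")
    return row

def crawl_to_csv(urls, out_path, fetch_page=fetch):
    """Fetch every URL and write one CSV line per page (step 3)."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for url in urls:
            writer.writerow(extract_row(url, fetch_page(url)))
```

Calling `crawl_to_csv(list_of_urls, "result.csv")` would produce one comma-separated line per URL. The `csv` module does not add the trailing ";" shown in the example output, and the resulting file opens directly in Excel.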

Thanks!
. /\_/\
( ´ v `)
  #2 (permalink)  
Old 10-28-04, 11:12 PM
Sabu Sabu is offline
Junior Code Guru
 
Join Date: Sep 2004
Posts: 458
Thanks: 0
Thanked 0 Times in 0 Posts
You mean a crawler and extractor, huh? Perhaps you might like to explore:

http://hotscripts.com/PHP/Scripts_an...ing/index.html
  #3 (permalink)  
Old 11-05-04, 04:15 AM
oiranoinu oiranoinu is offline
Thanks!

I will look there, following your advice.

Thanks!
  #4 (permalink)  
Old 11-25-04, 10:35 PM
oiranoinu oiranoinu is offline
Thanks, but I can't find one!

Yes! I want a crawler and extractor,
but I can't find a suitable script in the category you suggested.

http://hotscripts.com/PHP/Scripts_an...ing/index.html

Hmm. I'm at a loss as to what to do.

For example...

[[[[Backpackers.com]]]]

1. Specify the URLs and the start and end tags.

2. Crawl the URLs I specified first.
http://www.backpackers.com/directory...acking_gear/A/
http://www.backpackers.com/directory...acking_gear/C/
http://www.backpackers.com/directory...acking_gear/X/
http://www.backpackers.com/directory...acking_gear/Z/
…There are many more URLs.

3. Extract the text from those pages using a regular expression.

4. Output the results to CSV or text files...

===========================================
The case of http://www.backpackers.com/directory...acking_gear/G/
I want to extract...

Code:
Gear Zone - Outdoor Equipment and Clothing (London)  	
London, England Phone: 01603 630 298,
Source
Code:
		    <tr> 
                      <td class="directory">
		      
		      <table width="100%" cellspacing=0 cellpadding=0 border=0>
		      <tr><td><a name="2634"></a><a href="directory_popup.html?id=2634"  onMouseOver="window.status='http://www.gear-zone.co.uk/';return true" onMouseOut="window.status='';return true">Gear Zone - Outdoor Equipment and Clothing</a> <font color="#FF9933" size="1">(London)</font></td>

		      <td align="right"><!--<a href="tell_a_friend.html?id="><img src=/images/email.gif width=16 height=16 border=0 align=top></a> <a href="tell_a_friend.html?id=">Email to a friend</a>--></td></tr>
		      </table>
		      
		      </td>
		    </tr>

	
		    <tr> 
                      <td>
		      <!--<font size="2">UK specialist shop for outdoor, camping and travel related clothing and equipment.<br></font><br> -->
		      <font size="1">			London, 									England<br>

			Phone: 01603 630 298, 						<p><span class="email_company"><a href="mail_company.html?id=2634" class="email_company">email</a></span>
						

		      </font></td>
                    </tr>
[Range indicated by tags]
 Start tag... the first occurrence of "<tr><td><a name="
 End tag... the first occurrence of "</font></td></tr>"
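
One way to express "from the first start tag to the first end tag" is sketched below in Python. It is an illustration under assumptions, not a known tool: the helper names (`marker_to_regex`, `extract_between`, `to_text`) are made up here. Because the real page has line breaks and tabs between tags, the sketch allows whitespace wherever two tags touch in the marker. It returns the markers together with the enclosed HTML so that no tag is cut in half, then strips comments and tags to leave plain text.

```python
import html
import re

def marker_to_regex(marker):
    """Escape a literal marker, allowing whitespace wherever two tags touch."""
    return r">\s*<".join(re.escape(part) for part in marker.split("><"))

def extract_between(page, start, end):
    """Return the HTML from the first `start` through the first `end`
    after it (markers included), or None if the range is not found."""
    pattern = marker_to_regex(start) + r".*?" + marker_to_regex(end)
    match = re.search(pattern, page, re.S | re.I)
    return match.group(0) if match else None

def to_text(fragment):
    """Drop comments and tags, decode entities, and collapse whitespace."""
    fragment = re.sub(r"<!--.*?-->", " ", fragment, flags=re.S)
    fragment = re.sub(r"<[^>]*>", " ", fragment)
    return " ".join(html.unescape(fragment).split())
```

With the markers from this post, `to_text(extract_between(page, '<tr><td><a name=', '</font></td></tr>'))` would give the shop name, city, address and phone on a single line.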

===========================================
The case of http://www.backpackers.com/directory...acking_gear/H/
I want to extract...

Code:
Hike-Lite (Brighton)  	
PO Box 2085 Shoreham By Sea W. Sussex, Brighton, BN43 5XT, England
Phone: 01273 269789, Fax: 01273 381895,email
Source
Code:
		    <tr> 
                      <td class="directory">
		      
		      <table width="100%" cellspacing=0 cellpadding=0 border=0>
		      <tr><td><a name="3027"></a><a href="directory_popup.html?id=3027"  onMouseOver="window.status='http://www.hike-lite.co.uk';return true" onMouseOut="window.status='';return true">Hike-Lite</a> <font color="#FF9933" size="1">(Brighton)</font></td>

		      <td align="right"><!--<a href="tell_a_friend.html?id="><img src=/images/email.gif width=16 height=16 border=0 align=top></a> <a href="tell_a_friend.html?id=">Email to a friend</a>--></td></tr>
		      </table>
		      
		      </td>
		    </tr>

	
		    <tr> 
                      <td>
		      <!--<font size="2">Gear and advice for lighweight hikers and backpackers<br></font><br> -->
		      <font size="1">PO Box 2085 
Shoreham By Sea
W. Sussex, 			Brighton, 			BN43 5XT, 						England<br>

			Phone: 01273 269789, 			Fax: 01273 381895, 			<p><span class="email_company"><a href="mail_company.html?id=3027" class="email_company">email</a></span>
						

		      </font></td>
                    </tr>
[Range indicated by tags]
 Start tag... the first occurrence of "<tr><td><a name="
 End tag... the first occurrence of "</font></td></tr>"
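
Since each directory page lists many shops, a pattern with named groups can pull every listing at once. The Python sketch below is a guess based only on the two HTML samples in this post: the `RECORD` pattern and the names `clean` and `parse_listings` are assumptions, and a listing laid out differently (for example one without the `<font size="1">` details cell) would not be matched correctly.

```python
import re

# Assumed layout, taken from the two samples: an <a name="id"> anchor,
# a link holding the shop name, "(City)" in a <font> tag, and later a
# <font size="1"> cell holding the address and phone details.
RECORD = re.compile(
    r'<tr>\s*<td>\s*<a name="(?P<id>\d+)">\s*</a>'
    r'\s*<a [^>]*>(?P<name>.*?)</a>'
    r'\s*<font [^>]*>\((?P<city>.*?)\)</font>'
    r'.*?<font size="1">(?P<details>.*?)</font>\s*</td>\s*</tr>',
    re.S | re.I,
)

def clean(fragment):
    """Remove tags and collapse runs of whitespace into single spaces."""
    return " ".join(re.sub(r"<[^>]*>", " ", fragment).split())

def parse_listings(page):
    """Return one (name, city, details) tuple per listing found on the page."""
    return [
        (clean(m.group("name")), clean(m.group("city")), clean(m.group("details")))
        for m in RECORD.finditer(page)
    ]
```

Each tuple can then be written as one row with `csv.writer`, which gives the CSV output asked for in step 4.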
===========================================



[[[[ebay.com]]]]

1. Crawl the URLs I specified first.
http://cgi.ebay.com/ws/eBayISAPI.dll...507550207&rd=1
http://cgi.ebay.com/ws/eBayISAPI.dll...507550184&rd=1
…There are many more URLs.

2. Extract the text from those pages using a regular expression.

3. Output the results to CSV or text files...
===========================================
The case of http://cgi.ebay.com/ws/eBayISAPI.dll...507550207&rd=1
I want to extract...

Code:
This is Cosmos by Carl Sagan. The con....mbine shipping.
Source
Code:
<!-- Begin Description -->

<TABLE CELLSPACING="28" CELLPADDING="0" WIDTH="100%">
						<TR>
							<TD VALIGN="top">
				This is Cosmos by Carl Sagan. The con....mbine shipping.

							</TD>
						</TR>
					</TABLE>
[Range indicated by tags]
 Start tag... the first occurrence of "<TABLE CELLSPACING="28" CELLPADDING="0" WIDTH="100%"><TR><TD VALIGN="top">"
 End tag... the first occurrence of "</TD></TR></TABLE>"
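
For the description block, a regular expression written in verbose mode keeps each part of the range readable. The Python sketch below is an assumption-laden illustration: the name `extract_description` is made up, and the pattern anchors on the "Begin Description" comment seen in the sample rather than on the exact table attributes.

```python
import re

# Assumed pattern, based on the sample source in this post.
DESCRIPTION = re.compile(
    r"""
    <!--\s*Begin\s+Description\s*-->   # comment that precedes the description
    \s*<table[^>]*>                    # opening table tag, any attributes
    \s*<tr[^>]*>\s*<td[^>]*>           # first row, first cell
    (?P<text>.*?)                      # the description itself (non-greedy)
    </td>\s*</tr>\s*</table>           # first closing sequence after it
    """,
    re.S | re.I | re.X,
)

def extract_description(page):
    """Return the description with whitespace collapsed, or None if absent."""
    match = DESCRIPTION.search(page)
    if match is None:
        return None
    return " ".join(match.group("text").split())
```

Because the match stops at the first closing sequence, a description that itself contains a nested table would be cut short; that case needs an HTML parser rather than a regular expression.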

===========================================
The case of http://cgi.ebay.com/ws/eBayISAPI.dll...507550207&rd=1
I want to extract...

Code:
 This is a lot of 11 Sword and Sorcery paperbacks ....ere.Happy Holidays
Source
Code:
<!-- Begin Description -->

<TABLE CELLSPACING="28" CELLPADDING="0" WIDTH="100%">
						<TR>
							<TD VALIGN="top">
				 This is a lot of 11 Sword and Sorcery paperbacks ...ere.Happy Holidays

							</TD>
						</TR>
					</TABLE>
[Range indicated by tags]
 Start tag... the first occurrence of "<TABLE CELLSPACING="28" CELLPADDING="0" WIDTH="100%"><TR><TD VALIGN="top">"
 End tag... the first occurrence of "</TD></TR></TABLE>"

===========================================

Sorry for the long post. If you know of anything, please advise me!
Thanks!
  #5 (permalink)  
Old 11-26-04, 12:31 AM
oiranoinu oiranoinu is offline
I found something, but...

I found Web Data Extractor...
http://www.webextractor.com/

But it can't be told which start and end tags to use...
Is there any similar software?