Perl LWP loses html code?!

09-29-08, 01:55 PM
Newbie Coder
Join Date: Sep 2008
Posts: 5
Hi all,
I'm trying to write a script to download some webpages using LWP.
The problem is that the responses I'm getting are incomplete webpages: they contain only some of what I see in my normal browser, omitting seemingly random pieces, including comments, JavaScript, and even forms. Even a simple 'get' call shows the issue. I've tried the ->as_string and ->content methods and the ':content_file' option, and all of them have the missing-code problem.
I've tried it with other websites and it seems to work - is this caused by the website I'm trying to download from? My code? How can I get around it?
Any ideas?? Thanks!!
Here's the code:
-----------------
use strict;
use warnings;
use LWP::UserAgent;
my $ua  = LWP::UserAgent->new;
# Save the response body straight to test.htm
my $res = $ua->get("https://blah.html", ':content_file' => 'test.htm');
-----------------
As an example of the lost code, here's the code from going to the site and using "Save as" from the browser:
<script language="JavaScript"> FirstField="case_num";</script> <form enctype="multipart/form-data" method="post" action="/cgi-bin/iquery.pl?109027233035598-L_758_0-1">
<!--ShowPage(iquery.htm)--> <!-- rcsid="$Header: /usr/local/cvsroot/bankruptcy/web/html/iquery.htm,v 3.6 2005/02/07 20:00:34 gamores Exp $" -->
Here's what I get from the saved content file from "get":
<SCRIPT LANGUAGE="JavaScript"> FirstField="case_num";</SCRIPT><!-ShowPage(iquery.htm)-> <!-- rcsid="$Header: /usr/local/cvsroot/bankruptcy/web/html/iquery.htm,v 3.6 2005/02/07 20:00:34 gamores Exp $" -->
Notice that the form is gone and weirdly, the capitalization is also different. Does "get" reformat the code?
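One way to narrow a problem like this down is to check whether the response LWP received looks truncated before suspecting the parsing or saving step. The following is a minimal sketch, not a definitive diagnosis: the helper name `response_problems` and the script name are hypothetical, and it assumes the body is kept in memory rather than written with ':content_file' (in which case ->content would be empty).

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: list reasons to distrust the body of a response.
sub response_problems {
    my ($res) = @_;
    my @problems;

    push @problems, 'request failed: ' . $res->status_line
        unless $res->is_success;

    # LWP adds an X-Died header when a content handler dies mid-transfer.
    my $died = $res->header('X-Died');
    push @problems, "transfer aborted: $died" if defined $died;

    # Content-Length counts the raw bytes on the wire, which is what
    # ->content holds; the header is absent for chunked responses.
    my $expected = $res->header('Content-Length');
    my $received = length $res->content;
    push @problems, "expected $expected bytes, received $received"
        if defined $expected && $expected != $received;

    return @problems;
}

# Usage: perl check_response.pl URL
if (@ARGV) {
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get($ARGV[0]);    # keep the body in memory
    my @problems = response_problems($res);
    print @problems
        ? join("\n", @problems) . "\n"
        : "response looks complete\n";
}
```

A clean result only shows that the transfer itself was consistent. Anything between the script and the site that rewrites the page could also rewrite Content-Length, so this check cannot rule that out.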

09-30-08, 01:20 AM
Junior Code Guru
Join Date: May 2006
Posts: 555
Quote:
Originally Posted by jialanw
Hi all,
I'm trying to write a script to download some webpages using LWP.
|
Umm... plain LWP is rarely used directly for this these days; WWW::Mechanize (which is built on top of LWP) is more favored. Also, what you posted wouldn't help Larry Wall himself debug your problem. PLEASE post the real code (all of the relevant code), not just the end result of the problem.
It would also help if you posted the URL of the web site you are attempting to COPY content from.
I strongly urge you to try WWW::Mechanize, as you will have a warm fuzzy feeling inside from quickly accomplishing your tasks with it.
__________________
Whatever you decide, you should make sure best security methods are used and practiced. Should you really need more help, PM me.

09-30-08, 10:10 AM
Newbie Coder
Join Date: Sep 2008
Posts: 5
Thanks for your reply. I tried using "get" from WWW::Mechanize and it has the same problem.
That IS the relevant code. I have code before it which logs in, but that seems to work fine. This is a US government site I'm downloading data from - it doesn't say anything about not allowing scraping.
If it somehow helps, here's the login code as well, using Mechanize. I have now tried everything I can think of, but there is still missing code. As I repeatedly run the exact same code, the results in the content_file actually change, dropping different bits of the site content!
For example, here is the same section of code from two different runs in the program. The form displayed on the browser should have "last_name", "first_name", then "middle_name". On the two runs, one dropped "first_name" and one dropped "last_name". Bizarre?
----------------
javascript Code:
for (var i = 0; i < document.forms[0].elements.length; i++) { if (document.forms[0].elements[i].name == "last_name" || document.forms[0].elements[i].name == "middle_name")
---------------
for (var i = 0; i < document.forms[0].elements.length; i++) { document.forms[0].elements[i].name == "first_name" || document.forms[0].elements[i].name == "middle_name")
---------------
Code:
-------------
-----------------
Thanks
Last edited by Nico; 10-01-08 at 06:00 AM.
Reason: Wrappers.

10-01-08, 04:47 AM
Junior Code Guru
Join Date: May 2006
Posts: 555
Quote:
Originally Posted by jialanw
Thanks for your reply. I tried using "get" from WWW:Mechanize and it has the same problem.
|
1) Do you have everything needed for WWW::Mechanize to set up an HTTPS session? Have you read the Mechanize documentation on HTTPS requests?
2) Your example code shows me that you must be fairly new to WWW::Mechanize, as YOUR CODE tries to complete a virtual form using the get method. What I would do is first have mech visit the main page, FILL OUT THE FORM, then copy the results of THAT PAGE. From that page, issue another query (form) and go from there. If you feel like paying for my time (not paying for the code), contact me.

10-01-08, 06:54 PM
Newbie Coder
Join Date: Sep 2008
Posts: 5
1) Yes, I have read the documentation for Mechanize and have installed Crypt::SSLeay and IO::Socket::SSL. From what I've read a simple https Get should work just fine with Mech.
2) It may seem weird to save a form page, but I actually do need to do that because some data I'm interested in is in the javascript of the page. The main point is that Get is dropping pieces of the page content from this site (whether it's a form or plain content), and I cannot for the life of me figure out why. Furthermore, it drops different bits of content on different runs of the exact same code.
Thanks,

10-02-08, 03:23 AM
Junior Code Guru
Join Date: May 2006
Posts: 555
Quote:
Originally Posted by jialanw
1) Yes, I have read the documentation for Mechanize and have installed Crypt::SSLeay and IO::Socket::SSL.
|
You are still not getting what I am saying. RETRIEVE THE PAGE (with WWW::Mechanize), THEN HAVE MECH COMPLETE THE FORM!!!!!!
Quote:
Originally Posted by jialanw
2) It may seem weird to save a form page, but I actually do need to do that because some data I'm interested in is in the javascript of the page. The main point is that Get is dropping pieces of the page content from this site (whether it's a form or plain content), and I cannot for the life of me figure out why. Furthermore, it drops different bits of content on different runs of the exact same code.
Thanks,
|
No. You do not get what I am saying. This tells me that you must still be new to grabbing content from remote sites. If you tell mech to grab the page source (that is what Mechanize and LWP do), you can then tell it what to do with the contents of that page.

10-03-08, 01:30 AM
Newbie Coder
Join Date: Sep 2008
Posts: 5
Right, what I'm saying is that I'm trying to get the contents of a page I'm interested in and then save those contents to a file. The problem is that the contents are not the full contents of the page - random bits of content are dropped.
Forget the form. Suppose I'm just trying to get a regular page. All I do is
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
# Save the response body straight to temp.html
my $r = $mech->get("https://ecf.nysb.uscourts.gov/cgi-bin/FilerQry.pl?158987", ':content_file' => 'temp.html');
But comparing "temp.html" to the html displayed in a browser shows that random bits of html are missing. Moreover, different pieces of html are missing when I run the script successively.
Does that make sense?

10-08-08, 08:00 PM
New Member
Join Date: Oct 2008
Posts: 1
I am facing the same problem
Hi,
I am facing the same issue. I am trying to get the contents of a webpage using Mechanize, but the output is incomplete compared to what I see in the source of that webpage in a browser. Here is my code:
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;
my $user = 'my user id';
my $pass = '<some password>';
my $url  = 'https://login.postini.com/exec/login';
my $m = WWW::Mechanize->new();
$m->cookie_jar(HTTP::Cookies->new);
$m->get($url);
$m->form_name("login");
$m->set_fields(emailid => $user, password => $pass);
$m->submit();
$m->follow_link( text => 'System' );
# Here I am trying to go to the Reports link on the current webpage
$m->follow_link( text => 'Reports' );
# When I am on the Reports page, this does not show the full contents of the webpage
print $m->content();
The script is working as intended. It is just that, since I am not getting the full contents, I am not able to proceed further with what I want to achieve.
Can anybody assist?
Thanks
Saurya

10-08-08, 08:05 PM
Newbie Coder
Join Date: Sep 2008
Posts: 5
I solved it - I am now running the script on a server and am now capturing the full content of the webpages.
I still don't know the actual cause - must be something on the back end of my network (I'm on a university network). I've already tried different machines, wired vs. wireless, etc, but the server is the only thing that has worked for me.
So it seems that if you are having this problem, try running your script on different networks and hope that one works.
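To check whether a given network really delivers a different page, one option is to save the same URL from two networks (or from two runs) and compare the files byte for byte. This is a rough sketch under stated assumptions: the helper names `slurp_raw` and `first_difference` and the script name are hypothetical, and it only reports where the first mismatch is, not why it happened.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Hypothetical helper: read a saved page as raw bytes.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                       # slurp mode: read the whole file
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Return undef if the two bodies are identical, otherwise the
# 1-based number of the first line at which they differ.
sub first_difference {
    my ($x, $y) = @_;
    return if sha256_hex($x) eq sha256_hex($y);

    my @x = split /\n/, $x, -1;     # -1 keeps trailing empty lines
    my @y = split /\n/, $y, -1;
    my $i = 0;
    $i++ while $i < @x && $i < @y && $x[$i] eq $y[$i];
    return $i + 1;
}

# Usage: perl compare_pages.pl run1.html run2.html
if (@ARGV == 2) {
    my $line = first_difference(map { slurp_raw($_) } @ARGV);
    print defined $line
        ? "Files differ starting at line $line\n"
        : "Files are identical\n";
}
```

Pages with dynamic content (session IDs, timestamps) will differ on every fetch, so a mismatch is only meaningful if the differing line is one that should be stable.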

10-09-08, 12:50 AM
Junior Code Guru
Join Date: May 2006
Posts: 555
Quote:
Originally Posted by jialanw
I still don't know the actual cause - must be something on the back end of my network (I'm on a university network). I've already tried different machines, wired vs. wireless, etc, but the server is the only thing that has worked for me.
|
Of course! A university WILL filter content coming through their network!!! No wonder you had problems.