Perl LWP loses html code?!

#1, posted by jialanw on 09-29-08, 01:55 PM

Hi all,

I'm trying to write a script to download some webpages using LWP.

The problem is that the responses I'm getting are incomplete webpages: they contain only some of what I see in my normal browser, and seemingly random pieces are missing, including comments, javascript and even forms. Even a simple 'get' call shows the issue. I've tried the ->as_string and ->content methods and the ':content_file' option, but all of them have the missing code problem.

I've tried it with other websites and it seems to work - is this caused by the website I'm trying to download from? My code? How can I get around it?

Any ideas?? Thanks!!

Here's the code:
-----------------
use LWP;

$ua =LWP::UserAgent->new;
$res = $ua->get("https://blah.html", ':content_file' => 'test.htm');
-----------------

As an example of the lost code, here's the code from going to the site and using "Save as" from the browser:

<script language="JavaScript"> FirstField="case_num";</script> <form enctype="multipart/form-data" method="post" action="/cgi-bin/iquery.pl?109027233035598-L_758_0-1">
<!--ShowPage(iquery.htm)--> <!-- rcsid="$Header: /usr/local/cvsroot/bankruptcy/web/html/iquery.htm,v 3.6 2005/02/07 20:00:34 gamores Exp $" -->


Here's what I get from the saved content file from "get":

<SCRIPT LANGUAGE="JavaScript"> FirstField="case_num";</SCRIPT><!-ShowPage(iquery.htm)-> <!-- rcsid="$Header: /usr/local/cvsroot/bankruptcy/web/html/iquery.htm,v 3.6 2005/02/07 20:00:34 gamores Exp $" -->


Notice that the form is gone and weirdly, the capitalization is also different. Does "get" reformat the code?
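
One way to check whether the transfer itself was cut short is to look at the response object rather than only the saved file. The sketch below is a minimal example under stated assumptions, not a diagnosis of this particular site: `truncation_clues` is a hypothetical helper name, it relies on the `Client-Aborted` and `X-Died` headers that LWP adds when it stops reading a body early, and it compares the declared `Content-Length` with the number of bytes actually stored. The demonstration uses a hand-built response so it runs without network access.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper (not part of LWP): list reasons to suspect that a
# response body was cut short. $received is the number of bytes actually
# stored, e.g. the result of -s 'test.htm' when ':content_file' was used.
sub truncation_clues {
    my ($res, $received) = @_;
    my @clues;

    # LWP adds these headers itself when it stops reading the body early.
    for my $name ('Client-Aborted', 'X-Died') {
        my $value = $res->header($name);
        push @clues, "$name: $value" if defined $value;
    }

    # A declared length that differs from the stored size is suspicious.
    # Skip the check for compressed bodies, where the stored size may be
    # the decoded size and can legitimately differ.
    my $declared = $res->header('Content-Length');
    if (   defined $declared
        && defined $received
        && !defined $res->header('Content-Encoding')
        && $declared != $received)
    {
        push @clues, "Content-Length says $declared bytes, got $received";
    }

    return @clues;
}

# Offline demonstration with a hand-built response; no network access needed.
my $demo = HTTP::Response->new(200, 'OK', ['Content-Length' => 100], 'short body');
print "$_\n" for truncation_clues($demo, length($demo->content));
```

In the script above, the same helper could be called as `truncation_clues($res, -s 'test.htm')` right after the `get`. An empty list does not prove the page is complete; it only means LWP saw nothing wrong with the transfer.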

#2, posted by curbview.com on 09-30-08, 01:20 AM

Quote:
Originally posted by jialanw:
Hi all,
I'm trying to write a script to download some webpages using LWP.

Umm... LWP is outdated and WWW::Mechanize is more favored. Also, what you posted wouldn't help Larry Wall himself debug your problem. PLEASE post the real, entire relevant code, not just the end result of the problem.

It would also help if you posted the URL of the web site you are attempting to COPY content from.

I strongly urge you to try WWW::Mechanize, as you will have a warm fuzzy feeling inside from quickly accomplishing your tasks with it.
__________________
Whatever you decide, you should make sure best security methods are used and practiced. Should you really need more help, PM me.

#3, posted by jialanw on 09-30-08, 10:10 AM

Thanks for your reply. I tried using "get" from WWW::Mechanize and it has the same problem.

That IS the relevant code. I have code before it which logs in, but that seems to work fine. This is a US government site I'm downloading data from - it doesn't say anything about not allowing scraping.

If it somehow helps, here's the login code as well, using Mechanize. I have now tried everything I can think of, but there is still missing code. As I repeatedly run the exact same code, the results in the content_file actually change, dropping different bits of the site content!

For example, here is the same section of code from two different runs in the program. The form displayed on the browser should have "last_name", "first_name", then "middle_name". On the two runs, one dropped "first_name" and one dropped "last_name". Bizarre?

----------------
javascript code, first run:
    for (var i = 0; i < document.forms[0].elements.length; i++)
    {
        if (document.forms[0].elements[i].name == "last_name" ||
            document.forms[0].elements[i].name == "middle_name")
---------------
javascript code, second run:
    for (var i = 0; i < document.forms[0].elements.length; i++)
    {
            document.forms[0].elements[i].name == "first_name" ||
            document.forms[0].elements[i].name == "middle_name")
---------------
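Code:
-------------
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request::Common qw(GET POST);   # provides the GET and POST request constructors used below
use WWW::Mechanize;

# Assumption: the district code was not defined in the posted snippet;
# 'nysb' is the district that appears in a URL later in this thread.
my $district = 'nysb';

my $ua = WWW::Mechanize->new;

# Log in. simple_request does not follow redirects, so they are followed by hand below.
my $r = $ua->simple_request(POST 'https://pacer.login.uscourts.gov/cgi-bin/check-pacer-passwd.pl',
                           {
                            url => '',
                            loginid => 'login',
                            passwd => 'pwd'
                           });

while ($r->is_redirect) {
    my $u = $r->header('location') or die "missing location: ", $r->as_string;
    print "redirecting to $u\n";
    $r = $ua->simple_request(GET $u);
}

# Fetch the query page and save it to disk.
$r = $ua->get("https://ecf.$district.uscourts.gov/cgi-bin/iquery.pl", ':content_file' => 'test.html');
-----------------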

Thanks

Last edited by Nico; 10-01-08 at 06:00 AM. Reason: Wrappers.

#4, posted by curbview.com on 10-01-08, 04:47 AM

Quote:
Originally posted by jialanw:
Thanks for your reply. I tried using "get" from WWW::Mechanize and it has the same problem.

1) Do you have everything needed for WWW::Mechanize to set up an HTTPS session? Have you read the documentation on Mechanize and HTTPS requests?

2) Your example code suggests that you are fairly new to WWW::Mechanize: YOUR CODE tries to complete a virtual form using the get method. What I would do is first have mech visit the main page, FILL OUT THE FORM, then copy the results of THAT PAGE. From that page, issue another query (form) and go from there. If you feel like paying for my time (not paying for the code), contact me.

#5, posted by jialanw on 10-01-08, 06:54 PM

1) Yes, I have read the documentation for Mechanize and have installed Crypt::SSLeay and IO::Socket::SSL. From what I've read a simple https Get should work just fine with Mech.

2) It may seem weird to save a form page, but I actually do need to do that because some data I'm interested in is in the javascript of the page. The main point is that Get is dropping pieces of the page content from this site (whether it's a form or plain content), and I cannot for the life of me figure out why. Furthermore, it drops different bits of content on different runs of the exact same code.

Thanks,

#6, posted by curbview.com on 10-02-08, 03:23 AM

Quote:
Originally posted by jialanw:
1) Yes, I have read the documentation for Mechanize and have installed Crypt::SSLeay and IO::Socket::SSL.

You are still not getting what I am saying. RETRIEVE THE PAGE (with WWW::Mechanize), THEN HAVE MECH COMPLETE THE FORM!

Quote:
Originally posted by jialanw:
2) It may seem weird to save a form page, but I actually do need to do that because some data I'm interested in is in the javascript of the page. The main point is that Get is dropping pieces of the page content from this site (whether it's a form or plain content), and I cannot for the life of me figure out why. Furthermore, it drops different bits of content on different runs of the exact same code.

Thanks,

No. You do not get what I am saying. This tells me that you must still be new to grabbing content from remote sites. If you tell mech to grab the page source (that is what Mechanize and LWP do), you can then tell it what to do with the contents of that page.

#7, posted by jialanw on 10-03-08, 01:30 AM

Right, what I'm saying is that I'm trying to get the contents of a page I'm interested in and then save those contents to a file. The problem is that the contents are not the full contents of the page - random bits of content are dropped.

Forget the form. Suppose I'm just trying to get a regular page. All I do is

$mech =WWW::Mechanize->new;
$r = $mech->get("https://ecf.nysb.uscourts.gov/cgi-bin/FilerQry.pl?158987",':content_file' => 'temp.html');

But comparing "temp.html" to the html displayed in a browser shows that random bits of html are missing. Moreover, different pieces of html are missing when I run the script successively.

Does that make sense?

#8, posted by saurya1979 on 10-08-08, 08:00 PM

I am facing the same problem

Hi,

I am facing the same issue. I am trying to get the contents of a webpage using Mechanize, but the output is incomplete compared to what I see in the page source in a browser. Here is my code:

use WWW::Mechanize;
use HTTP::Cookies;

$user='my user id';
$pass='<some password>';

$url = 'https://login.postini.com/exec/login';

$m = WWW::Mechanize->new();
$m->cookie_jar(HTTP::Cookies->new);

# Fetch the login page, fill in the login form and submit it.
$m->get($url);
$m->form_name("login");
$m->set_fields(emailid => $user, password => $pass);
$m->submit();

$m->follow_link( text => 'System');
$m->follow_link( text => 'Reports' );   # here I am trying to go to the Reports link on the current page

print $m->content();   # on the Reports page, this does not show the full contents of the page

The script is working as intended. It is just that, since I am not getting the full contents, I am not able to proceed further with what I want to achieve.

Can anybody assist?

Thanks
Saurya

#9, posted by jialanw on 10-08-08, 08:05 PM

I solved it - I am now running the script on a server and am now capturing the full content of the webpages.

I still don't know the actual cause - must be something on the back end of my network (I'm on a university network). I've already tried different machines, wired vs. wireless, etc, but the server is the only thing that has worked for me.

So it seems that if you are having this problem, try running your script on different networks and hope that one works.
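
To compare networks more systematically than by eye, the bodies from several fetches of the same URL can be grouped by digest. The following is a rough sketch only: `group_by_digest` is a hypothetical helper, the sample strings stand in for the raw `$response->content` of each run, and the fetching itself is left out so the example runs offline. More than one group on a network means that the content is changing between runs there.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: group fetched bodies by MD5 digest so that
# run-to-run differences show up as more than one group.
sub group_by_digest {
    my (@bodies) = @_;
    my %runs_for;
    for my $i (0 .. $#bodies) {
        # Record the 1-based run number under the digest of that run's body.
        push @{ $runs_for{ md5_hex($bodies[$i]) } }, $i + 1;
    }
    return \%runs_for;
}

# Offline demonstration; in a real check each string would be the raw
# body returned by one fetch of the same URL.
my @bodies = (
    '<form>last_name first_name</form>',
    '<form>last_name</form>',
    '<form>last_name first_name</form>',
);
my $groups = group_by_digest(@bodies);
for my $digest (sort keys %$groups) {
    printf "%s  runs %s\n", $digest, join(',', @{ $groups->{$digest} });
}
print scalar(keys %$groups) > 1
    ? "bodies differ between runs\n"
    : "all runs identical\n";
```

Pages that embed a timestamp or session token will differ on every fetch even on a healthy network, so a difference is a prompt to diff the saved files, not proof of filtering.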

#10, posted by curbview.com on 10-09-08, 12:50 AM

Quote:
Originally posted by jialanw:
I still don't know the actual cause - must be something on the back end of my network (I'm on a university network). I've already tried different machines, wired vs. wireless, etc, but the server is the only thing that has worked for me.

Of course! A university WILL filter content coming through their network! No wonder you had problems.