Perl Parsing Script

  #1 (permalink)  
Old 12-11-03, 01:18 PM
DeerHunter DeerHunter is offline
New Member
 
Join Date: Dec 2003
Posts: 1
Thanks: 0
Thanked 0 Times in 0 Posts
Perl Parsing Script

I am a real Perl newbie and could use a little beginner's advice. I have this file that contains email addresses in the format email1@email.com; email2@email.com; email3@email.com;

I want to read in this file with a Perl script and replace the ; (and space) with a \n (new line) so that the above list would look like:

email1@email.com
email2@email.com
email3@email.com

The resulting file would be in this format. Can anyone out there give this newbie a hand with this?

Thanks

OH - I have installed ActiveState's ActivePerl on my local machine. It would be really nice to create a script file that could be used to do this on my Web Hosting account (Unix Box).
  #2 (permalink)  
Old 12-11-03, 09:03 PM
Millennium Millennium is offline
Wannabe Coder
 
Join Date: Nov 2003
Posts: 136
Thanks: 0
Thanked 0 Times in 0 Posts
There are not many differences between coding Perl scripts for Unix and Windows, thankfully, at least not for the common stuff, so it's almost never a big worry.
Now on to your question: I am unsure if you want the list printed to a file or just the screen, but here is an example of how it can be done:

Code:
#!/usr/bin/perl
use strict;
use warnings;
use CGI qw/:standard/;

my $yourfile = 'path/to/filetest.txt'; # file with the emails

open(my $in, '<', $yourfile) || die "Can't open '$yourfile': $!\n";
my @emails = <$in>;
close($in);

my @new_emails;
foreach (@emails) {
   chomp;
   s/;//g; # get rid of ;
   push @new_emails, split; # split on whitespace and put in the @new_emails array
}

# print to the file
open(my $out, '>', $yourfile) || die "Can't open '$yourfile': $!\n"; # overwrite the old file in the new format
print $out "$_\n" for @new_emails;
close($out);

# print something to the screen
print header();
print start_html();
print "Emails have been converted<br>\n";
print "$_<br>\n" for @new_emails;
print end_html();
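
If the conversion only needs to run from the command line (no CGI output), a minimal sketch could look like the one below. It assumes the whole file fits in memory, and it writes to the screen instead of overwriting the input. The helper name split_addresses and the file names in the usage line are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn "a@x.com; b@x.com;" into a list of addresses.
sub split_addresses {
    my ($text) = @_;
    $text =~ s/^\s+//;                               # drop leading whitespace
    return grep { length } split /\s*;\s*/, $text;   # split on ';' plus any whitespace around it
}

# When a file name is given, print one address per line to STDOUT.
if (@ARGV) {
    my $file = shift @ARGV;
    open(my $in, '<', $file) or die "Can't open '$file': $!\n";
    my $text = do { local $/; <$in> };               # slurp the whole file
    close($in);
    print "$_\n" for split_addresses($text);
}
```

Something like `perl split_emails.pl addresses.txt > addresses_new.txt` would then write the new list to a separate file, which leaves the original untouched if anything goes wrong.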

Last edited by Millennium; 12-11-03 at 09:07 PM.
  #3 (permalink)  
Old 04-11-05, 12:30 AM
kofcrazy kofcrazy is offline
Newbie Coder
 
Join Date: Apr 2005
Posts: 5
Thanks: 0
Thanked 0 Times in 0 Posts
I have a problem similar to this and I'm also new to Perl, so any suggestions would be appreciated.

I have tons of emails in this kind of format:

Return-Path: dkg@sparrow.spearhead.net
Received: from linus.vsource.com (root@linus.vsource.com [198.169.201.2]) by hal.qcc.sk.ca (8.8.0/8.7.3) with ESMTP id VAA08699 for <bguenter@hal.qcc.sk.ca>; Fri, 12 Dec 1997 21:47:07 -0600
From: dkg@sparrow.spearhead.net
Received: from sparrow.spearhead.net ([209.136.73.165]) by linus.vsource.com (8.8.0/8.6.9) with ESMTP id VAA02022 for <bguenter@gemprint.com>; Fri, 12 Dec 1997 21:46:56 -0600
Received: by sparrow.spearhead.net (8.8.4/8.8.4) with SMTP
id LAA23280; Sat, 13 Dec 1997 11:34:13 -0500
Date: Sat, 13 Dec 1997 11:34:13 -0500
Message-Id: <199712131634.LAA23280@sparrow.spearhead.net>
To: dkg@sparrow.spearhead.net
Subject: A Personal Message...

I need to extract the IP address and the date from these types of emails.
There are two IPs, but I only need the one from the first Received: line,
so in this case it would be 198.169.201.2 and the date is 12 Dec 1997.

Once I have these two things, I need them to be written to a file (ipdate) in this format:
198.169.201.2, 19971212

And then I would do the same procedure with another email stored in another file and append the IP and date from there to ipdate.

This is a lot to ask but I'm kinda clueless on where to begin. I tried to modify the code from above, but so far I have been unsuccessful.
  #4 (permalink)  
Old 04-11-05, 05:51 PM
Chas Chas is offline
Coding Addict
 
Join Date: Oct 2003
Location: California
Posts: 359
Thanks: 0
Thanked 0 Times in 0 Posts
I think you're going to have a hard time with this one. If I'm not mistaken, the IP address will not always be there and the exact format can vary. You'll have to tweak the regex to fit:

Code:
#!/usr/bin/perl
use strict;
use warnings;
use Time::Piece;
use CGI::Carp qw/fatalsToBrowser/;

print "Content-Type: text/html\n\n";
print "<pre>\n";
foreach my $header (<DATA>) {
  next unless $header =~ /^Received: from/;
  # Capture the bracketed IP and everything after ">; " (the timestamp)
  my ($ip, $date) = $header =~ /\[(.*)\]\).*>; (.*)$/;
  if ($ip) {
    # Parse only the "12 Dec 1997" part and print it zero-padded as YYYYMMDD
    my ($dmy) = $date =~ /(\d{1,2} \w{3} \d{4})/;
    my $t = Time::Piece->strptime($dmy, '%d %b %Y');
    print "$ip, ", $t->strftime('%Y%m%d'), "\n";
    last;
  }
}
print "</pre>\n";

__DATA__
Return-Path: dkg@sparrow.spearhead.net
Received: from linus.vsource.com (root@linus.vsource.com [198.169.201.2]) by hal.qcc.sk.ca (8.8.0/8.7.3) with ESMTP id VAA08699 for <bguenter@hal.qcc.sk.ca>; Fri, 12 Dec 1997 21:47:07 -0600
From: dkg@sparrow.spearhead.net
Received: from sparrow.spearhead.net ([209.136.73.165]) by linus.vsource.com (8.8.0/8.6.9) with ESMTP id VAA02022 for <bguenter@gemprint.com>; Fri, 12 Dec 1997 21:46:56 -0600
Received: by sparrow.spearhead.net (8.8.4/8.8.4) with SMTP
id LAA23280; Sat, 13 Dec 1997 11:34:13 -0500
Date: Sat, 13 Dec 1997 11:34:13 -0500
Message-Id: <199712131634.LAA23280@sparrow.spearhead.net>
To: dkg@sparrow.spearhead.net
Subject: A Personal Message...
You'll have to work out the looping through your messages/folders bit, but that is the easy part.

You may also want to look at the Mail::Box[1] suite of modules. That will give you a nice set of objects to manipulate the folders/messages/headers.

~Charlie

[1] http://search.cpan.org/~markov/Mail-...b/Mail/Box.pod
  #5 (permalink)  
Old 04-12-05, 01:34 AM
kofcrazy kofcrazy is offline
Newbie Coder
 
Join Date: Apr 2005
Posts: 5
Thanks: 0
Thanked 0 Times in 0 Posts
Thanks for the help, but I'm still a little lost. The emails I have are all stored in files. One email per file, and all of the emails are in this format:

Return-Path: dkg@sparrow.spearhead.net
Received: from linus.vsource.com (root@linus.vsource.com [198.169.201.2]) by hal.qcc.sk.ca (8.8.0/8.7.3) with ESMTP id VAA08699 for <bguenter@hal.qcc.sk.ca>; Fri, 12 Dec 1997 21:47:07 -0600
From: dkg@sparrow.spearhead.net
Received: from sparrow.spearhead.net ([209.136.73.165]) by linus.vsource.com (8.8.0/8.6.9) with ESMTP id VAA02022 for <bguenter@gemprint.com>; Fri, 12 Dec 1997 21:46:56 -0600
Received: by sparrow.spearhead.net (8.8.4/8.8.4) with SMTP
id LAA23280; Sat, 13 Dec 1997 11:34:13 -0500
Date: Sat, 13 Dec 1997 11:34:13 -0500
Message-Id: <199712131634.LAA23280@sparrow.spearhead.net>
To: dkg@sparrow.spearhead.net
Subject: A Personal Message...

So basically I want to write code that opens a given file, extracts the IP and date, and appends the information to another designated file.

Modifying the code from above, I have this:

#!/usr/bin/perl
use strict;
use warnings;
use Time::Piece;
use CGI::Carp qw/fatalsToBrowser/;

print "Content-Type: Text/HTML\n\n";
print "<pre>\n";

$yourfile= 'path/to/filetest.txt'; #file with the spam emails
open (DATA, "$yourfile") || die "Can't open '$yourfile': $!\n";

foreach my $header (<DATA>) {
next unless $header =~ /^Received: from/;
my ($ip, $date) = $header =~ /\[(.*)\]\).*>; (.*)$/;
if ($ip) {
my $t = Time::Piece->strptime($date);
my $date = $t->year . $t->mon . $t->mday;
print "$ip, $date\n";
last;
}
}
close (DATA);

#print to the file
open (FILE, ">>$ipdate") || die "Can't open '$ipdate': $!\n";#append the ip to ipdate
print FILE (my $ip, my $date);
close (FILE);

print "</pre>\n";

I'm not sure if I'm going about this the right way, and any comments would be greatly appreciated.
Also, can someone explain to me what this line does:
my ($ip, $date) = $header =~ /\[(.*)\]\).*>; (.*)$/;

Thanks
  #6 (permalink)  
Old 04-14-05, 03:20 PM
Chas Chas is offline
Coding Addict
 
Join Date: Oct 2003
Location: California
Posts: 359
Thanks: 0
Thanked 0 Times in 0 Posts
See if this helps:

Code:
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Time::Piece;

# Path to the directory containing the e-mails
my $maildir = '/home/public_html/cgi-bin/messages';
# Path to the file storing the IP & Date
my $dest    = '/home/public_html/cgi-bin/ipdate';

# Open up the file where you want to store the IP, Date (append mode)
open IP, '>>', $dest or die "Can't open '$dest': $!\n";
# Loop through all the messages in the directory
foreach my $msg (messages($maildir)) {
  print "Reading $msg...";
  # Open the message to parse
  open MSG, '<', $msg or die "Failed to open $msg: $!";
  # Loop through each line looking for the header
  foreach my $header (<MSG>) {
    # We don't want to parse if the line doesn't start with 'Received: from'
    next unless $header =~ /^Received: from/;
    # Get the IP and timestamp
    my ($ip, $ts) = $header =~ /\[(.*)\]\).*>; (.*)$/;
    # If we didn't filter out an IP we have the wrong Received: line, move along
    next unless $ip;
    print " found IP: $ip.  Inserting...\n";
    # At this point we have a good IP.  Let's write it to the DB*
    print IP "$ip, ", date($ts), "\n";
    # * We have to convert the timestamp to the date format that you want.
    # See the date sub below.
    # Stop at the first IP found and move on to the next e-mail
    last;
  }
  close MSG;
  print " Done.\n";
}
close IP;


#====================================================================
# Subs
#====================================================================
sub messages {
  # We're using File::Find here to read through the directory and find
  # files only.  You may need to add a filter if you have files other
  # than e-mails in this directory.
  my $dir = shift or return;

  my @messages;
  # Inside the callback $_ is the current file name; use return (not next)
  # to skip anything that is not a plain file.
  find(sub { return unless -f $_; push @messages, $File::Find::name }, $dir);

  return @messages;
}

sub date {
  # We're using this to reformat the timestamp into the format
  # you want (YYYYMMDD).  Only the "12 Dec 1997" part is parsed, so the
  # weekday, time and time zone are ignored.
  my ($ts) = @_;
  my ($dmy) = $ts =~ /(\d{1,2} \w{3} \d{4})/ or return '';
  return Time::Piece->strptime($dmy, '%d %b %Y')->strftime('%Y%m%d');
}

__END__
I tried to comment the code where I thought it might need explanation. Just let me know if you need me to explain some parts better.

That snippet of code you asked about is a regex to filter out the specific data you're looking for in the header. There's a lot to explain there and I really couldn't do it justice. Take a look at the Perl quick start on regular expressions: http://www.perldoc.com/perl5.8.4/pod/perlrequick.html.
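
As a rough sketch of how that line works, here is the same pattern spelled out with the /x modifier so each piece can carry a comment. The sub name parse_received is made up for this example; the header line is the one posted earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regex from the script above.
sub parse_received {
    my ($header) = @_;
    # In list context a match returns the text captured by each (...) group.
    my ($ip, $ts) = $header =~ m{
        \[ (.*) \] \)   # a literal [ , capture 1 (the IP), then a literal ])
        .*              # anything in between (host names, ESMTP id, ...)
        > ; [ ]         # the end of the "for <address>; " part
        (.*) $          # capture 2: the rest of the line (the timestamp)
    }x;
    return ($ip, $ts);
}

my $line = 'Received: from linus.vsource.com (root@linus.vsource.com [198.169.201.2]) by hal.qcc.sk.ca (8.8.0/8.7.3) with ESMTP id VAA08699 for <bguenter@hal.qcc.sk.ca>; Fri, 12 Dec 1997 21:47:07 -0600';
my ($ip, $ts) = parse_received($line);
print "$ip | $ts\n";   # prints: 198.169.201.2 | Fri, 12 Dec 1997 21:47:07 -0600
```

A header line that lacks the bracketed IP followed by ")" or the ">; " part does not match at all, so both variables come back undefined. That is why the script checks $ip before using it.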

~Charlie
  #7 (permalink)  
Old 04-19-05, 03:11 AM
kofcrazy kofcrazy is offline
Newbie Coder
 
Join Date: Apr 2005
Posts: 5
Thanks: 0
Thanked 0 Times in 0 Posts
Thanks for the help. Unfortunately I haven't had the chance to test it because I'm having trouble installing the DateTime module. At the command prompt, I type C:\perl\lib>perl -MCPAN -e "install DateTime"
and then it gives me this:

...
# running Build.PL
C:\Perl\bin\perl.exe -Ilib Build.PL
Checking whether your kit is complete...
Looks good
* Optional prerequisite Module::Signature isn't installed
* Optional prerequisite ExtUtils::ParseXS isn't installed
* Optional prerequisite ExtUtils::CBuilder isn't installed
ERRORS/WARNINGS FOUND IN PREREQUISITES. You may wish to install the versions
of the modules indicated above before proceeding with this installation.

Feature 'YAML_support' enabled.

Creating new 'Build' script for 'Module-Build' version '0.2610'
-- OK
Running make test
'test' is not recognized as an internal or external command,
operable program or batch file.
test -- NOT OK
Running make install
make test had returned bad status, won't install without force
*** Cannot install without Module::Build. Exiting ...
Running make test
Make had some problems, maybe interrupted? Won't test
Running make install
Make had some problems, maybe interrupted? Won't install
Running install for module Params::Validate
Running make for D/DR/DROLSKY/Params-Validate-0.76.tar.gz
Checksum for \.cpan\sources\authors\id\D\DR\DROLSKY\Params-Validate-0.76.tar.gz
ok
Params-Validate-0.76/
Params-Validate-0.76/t/
Params-Validate-0.76/t/13-taint.t
Params-Validate-0.76/t/18-depends.t
Params-Validate-0.76/t/with.pl
Params-Validate-0.76/t/08-noop_with.t
Params-Validate-0.76/t/10-noop_regex.t
Params-Validate-0.76/t/03-attribute.t
Params-Validate-0.76/t/defaults.pl
Params-Validate-0.76/t/callbacks.pl
Params-Validate-0.76/t/07-with.t
Params-Validate-0.76/t/16-normalize.t
Params-Validate-0.76/t/regex.pl
Params-Validate-0.76/t/tests.pl
Params-Validate-0.76/t/12-noop_cb.t
Params-Validate-0.76/t/05-noop_default.t
Params-Validate-0.76/t/21-can.t
Params-Validate-0.76/t/04-defaults.t
Params-Validate-0.76/t/19-untaint.t
Params-Validate-0.76/t/02-noop.t
Params-Validate-0.76/t/01-validate.t
Params-Validate-0.76/t/11-cb.t
Params-Validate-0.76/t/15-case.t
Params-Validate-0.76/t/17-callbacks.t
Params-Validate-0.76/t/14-no_validate.t
Params-Validate-0.76/t/06-options.t
Params-Validate-0.76/t/09-regex.t
Params-Validate-0.76/Changes
Params-Validate-0.76/lib/
Params-Validate-0.76/lib/Params/
Params-Validate-0.76/lib/Params/ValidateXS.pm
Params-Validate-0.76/lib/Params/ValidatePP.pm
Params-Validate-0.76/lib/Params/Validate.pm
Params-Validate-0.76/lib/Attribute/
Params-Validate-0.76/lib/Attribute/Params/
Params-Validate-0.76/lib/Attribute/Params/Validate.pm
Params-Validate-0.76/MANIFEST
Params-Validate-0.76/TODO
Params-Validate-0.76/META.yml
Params-Validate-0.76/ppport.h
Params-Validate-0.76/Validate.xs
Params-Validate-0.76/LICENSE
Params-Validate-0.76/Makefile.PL
Params-Validate-0.76/README
Removing previously used \.cpan\build\Params-Validate-0.76

CPAN.pm: Going to build D/DR/DROLSKY/Params-Validate-0.76.tar.gz

Testing if you have a C compiler

Microsoft (R) Program Maintenance Utility Version 7.10.3077
Copyright (C) Microsoft Corporation. All rights reserved.

cl /c test.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
Copyright (C) Microsoft Corporation 1984-2002. All rights reserved.

test.c

*** NOTE ***

You can safely ignore the warnings below about 'Too late to run
CHECK/INIT blocks'.

*************

Checking if your kit is complete...
Looks good
Writing Makefile for Params::Validate
-- OK
Running make test
'test' is not recognized as an internal or external command,
operable program or batch file.
test -- NOT OK
Running make install
make test had returned bad status, won't install without force
Running make for D/DR/DROLSKY/DateTime-0.28.tar.gz
Is already unwrapped into directory \.cpan\build\DateTime-0.28

CPAN.pm: Going to build D/DR/DROLSKY/DateTime-0.28.tar.gz

-- OK
Running make test
'test' is not recognized as an internal or external command,
operable program or batch file.
test -- NOT OK
Running make install
make test had returned bad status, won't install without force


I omitted a bunch of stuff at the top because I don't think the problem lies there. I have Visual Studio .NET 2003 installed. I think the problem has something to do with test (Running make test
'test' is not recognized as an internal or external command). I'm not sure, and I looked for other ways to install the DateTime module but with little success.

Sida
  #8 (permalink)  
Old 04-19-05, 03:32 PM
Chas Chas is offline
Coding Addict
 
Join Date: Oct 2003
Location: California
Posts: 359
Thanks: 0
Thanked 0 Times in 0 Posts
Use PPM instead if you are running ActiveState Perl:

Code:
C:> ppm install Time::Piece
You might have to find a repository that has it, though. There is a nice list of additional PPM repositories here: http://crazyinsomniac.perlmonk.org/perl/ppm/5.8/

~Charlie
  #9 (permalink)  
Old 04-20-05, 02:07 AM
kofcrazy kofcrazy is offline
Newbie Coder
 
Join Date: Apr 2005
Posts: 5
Thanks: 0
Thanked 0 Times in 0 Posts
Thanks Charlie, I got almost everything to work. I have two more things to ask. Can you explain this line of code:

my ($ip, $ts) = $header =~ /\[(.*)\]\).*>; (.*)$/;

I ask because some of the email headers differ from the email header I posted, so I need to know how this line works and its syntax so I can modify it.
For example, it couldn't find the ip and date of this email:

Return-Path: <privateertuwl@boxfrog.com>
Delivered-To: em-ca-bait-excelled@em.ca
Received: (qmail 752 invoked from network); 31 Oct 2003 23:13:30 -0000
Received: from c-24-1-237-171.client.comcast.net (HELO boxfrog.com) (24.1.237.171)
by churchill.factcomp.com with SMTP; 31 Oct 2003 23:13:30 -0000
Received: from unknown (77.147.134.86)
by mail.gimmicc.net with esmtp; Sat, 01 Nov 2003 22:11:48 +0900
Received: from unknown (213.56.7.98)
by mail.webhostings4u.com with NNFMP; 02 Nov 2003 07:04:35 -0200
Received: from unknown (HELO smtp18.yenddx.com) (200.141.45.150)
by smtp4.cyberemailings.com with smtp; 02 Nov 2003 04:57:22 -0600
Message-ID: <9d3001c3a0c0$420cdc90$af916784@kdlxtqaaf>
Reply-To: <privateertuwl@boxfrog.com>
From: <privateertuwl@boxfrog.com>
To: "polemicmailcity" <bait-excelled@em.ca>
Subject: Lillian is a Cartoon looking for a master
Date: Sat, 01 Nov 2003 13:36:49 -0800


And when I run the code on the folder containing the emails, I get a lot of this:

Reading c:/2003/2003/10/1067645710.15548_35.txt... Done.
Reading c:/2003/2003/10/1067645711.15548_37.txt... Done.
Reading c:/2003/2003/10/1067645712.15548_40.txt... Done.


Oh, I would like to extract only the last IP, that is, the first IP going from bottom to top. In the case above it would be 200.141.45.150. Once I get this IP and the date that accompanies it, I would like to stop searching this email for any more IPs. I was thinking of reading from the end of the file and stopping once it finds one IP. I'm not sure if this can be done.

Thanks
Sida
  #10 (permalink)  
Old 04-24-05, 01:46 AM
Chas Chas is offline
Coding Addict
 
Join Date: Oct 2003
Location: California
Posts: 359
Thanks: 0
Thanked 0 Times in 0 Posts
Quote:
Originally Posted by kofcrazy
Thanks Charlie, I got almost everything to work. I have two more things to ask. Can you explain this line of coding

my ($ip, $ts) = $header =~ /\[(.*)\]\).*>; (.*)$/;
See the link I posted two messages ago. It links to the Perl regex beginner's guide that will explain most of that. I still struggle with regexes myself and I would have a hard time putting that regex to words that would make sense to you.


Quote:
because some of the email headers differs from the email header I posted. So I need to know how this line works and its syntax so I can modify it.
For example, it couldn't find the ip and date of this email:
[snip /]
I figured as much. That's why I mentioned that in my original post. This varies from mail server to mail server and it's going to be difficult at best to nail this down to one regex. At least it would be for me.


Quote:
And when I run the code on the folder containing the emails, I get a lot of this:

Reading c:/2003/2003/10/1067645710.15548_35.txt... Done.
Reading c:/2003/2003/10/1067645711.15548_37.txt... Done.
Reading c:/2003/2003/10/1067645712.15548_40.txt... Done.
I like to add print statements so I can see what's going on. You can remove those if you don't want to see that output or add a verbose flag to enable/disable the print statements.
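
A verbose flag could be wired up roughly like this, using the core Getopt::Long module. The option name --verbose and the helper progress() are only illustrative; they are not part of the script posted above.

```perl
use strict;
use warnings;
use Getopt::Long;

# Off by default; running the script with --verbose turns the messages on.
my $verbose = 0;
GetOptions('verbose' => \$verbose) or die "Usage: $0 [--verbose]\n";

# Hypothetical helper: print progress messages only when enabled.
# Returns 1 if something was printed, 0 otherwise.
sub progress {
    my ($enabled, @msg) = @_;
    return 0 unless $enabled;
    print @msg;
    return 1;
}

# Example calls; the file name here is a placeholder.
progress($verbose, "Reading some_message.txt...");
progress($verbose, " Done.\n");
```

Each existing print in the loop would then become a progress($verbose, ...) call, so the output only shows up when you ask for it.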


Quote:
Oh I would like to extract only the last ip, or the first ip from bottom to top. In the case above it would be 200.141.45.150. Once I get this ip and the date that accompanies it, I would like to stop searching this email for any more ips. I was thinking to read starting from the end of the file and stop once it finds one ip. I'm not sure if this can be done.
Of course it can be done. Try something like this when reading in the file:

Code:
foreach my $header (reverse(<MSG>)) {
Make sure there is a last; right after the line that prints the IP to the file, so the loop stops at the first occurrence of an IP address and moves on to the next e-mail.
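
For headers like the second sample, the IP sits in plain parentheses and the date is on a continuation line, so reversing the lines alone will not be enough. A sketch of one way to handle both layouts follows. It assumes that continuation lines in the actual files start with a space or tab (the forum paste shows them flush left, which may just be lost whitespace), and the sub name last_received_ip_and_date is made up for this example.

```perl
use strict;
use warnings;

my %MONTH = (Jan => 1, Feb => 2, Mar => 3, Apr => 4,  May => 5,  Jun => 6,
             Jul => 7, Aug => 8, Sep => 9, Oct => 10, Nov => 11, Dec => 12);

# Hypothetical helper: given the full text of one message, return the IP and
# YYYYMMDD date from the bottom-most "Received: from" header that has an IP.
sub last_received_ip_and_date {
    my ($text) = @_;
    my ($head) = split /\r?\n\r?\n/, $text, 2;   # headers end at the first blank line
    return unless defined $head;
    $head =~ s/\r?\n[ \t]+/ /g;                  # unfold continuation lines
    for my $line (reverse split /\r?\n/, $head) {
        next unless $line =~ /^Received: from/;
        # IP in either [1.2.3.4] or (1.2.3.4) form
        my ($ip) = $line =~ /[\[(](\d{1,3}(?:\.\d{1,3}){3})[\])]/ or next;
        # Date after the ';', with or without a leading weekday
        my ($d, $mon, $y) = $line =~ /;\s*(?:\w{3},\s*)?(\d{1,2}) (\w{3}) (\d{4})/ or next;
        next unless $MONTH{$mon};
        return ($ip, sprintf('%04d%02d%02d', $y, $MONTH{$mon}, $d));
    }
    return;   # nothing found
}
```

Calling it with the slurped contents of one message file gives back a two-element list, or an empty list when no usable Received: line exists, so the caller can skip that message.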


Quote:
Thanks
Sida
No worries.

~Charlie
All times are GMT -5.