Question about file splitting


Socraties
10-28-2002, 07:39 PM
Here is what I have to do. I have a file that needs to be parsed and split into smaller files based on a few characters in each line. Each line in the file has this layout:

N 2425 name address phone

The number 2425 varies from line to line; sometimes it repeats and sometimes it is different. I need to be able to split the file based on this value.

if <= '2425' files will go to file1
if > '2425' && <='46' files will go to file 2
if > '46' && <= '677' files will go to file 3
if > '677' && <= 'E02' files will go to file 4
if > 'E02' files will go to file 5.

Here is what I have come up with. I was wondering if you guys could tell me whether what I have written will work:

$first_split = substr($line,2,4);
$second_split = substr($line,2,3);
$third_split = substr($line,2,2);

$i = 0;
foreach $line (<BIG_FILE>){
    if($first_split <= '2425'){
        print FIRST_PART $line;
    }elsif ($first_split > '2425' && $third_split <= '46'){
        print SECOND_PART $line;
    }elsif ($third_split > '46' && $second_split <= '677'){
        print THIRD_PART $line;
    }elsif ($second_split > '677' && $second_split <= 'E02'){
        print FOURTH_PART $line;
    }elsif ($second_split > 'E02'){
        print FIFTH_PART $line;
    }
    $i++;
    if ($i >= 5) { $i = 0; }
}

Just to note, the files these lines will be split out to work fine, and I can read the input file just fine. I am just not sure whether this part of the code will work, and the only way I have to test this wonderful stuff is in a live, functioning environment. If there is a better way to do this comparison on a split string, please give me your input.
Thanks.

Mouldy_Goat
10-31-2002, 12:55 AM
Hi Socrates,

I'm a bit confused as to what exactly you're trying to do. What I think is, you've got a file with a format as you've described on each line.

But this:
if <= '2425' files will go to file1
if > '2425' && <='46' files will go to file 2
if > '46' && <= '677' files will go to file 3
if > '677' && <= 'E02' files will go to file 4
if > 'E02' files will go to file 5.

Doesn't make much sense. Not many numbers are going to be greater than 2425 and less than 46. And I'm not sure what E02 is meant to be... perhaps 1 x 10^2?
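
To make the point about the ranges concrete: in Perl, `<=` and `>` compare their operands as numbers, while `le` and `gt` compare them as strings, and the two can disagree. The following is only a small sketch using values from the post above; the helper name `compare_both` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: compare two values numerically and as strings.
sub compare_both {
    my ($x, $y) = @_;
    no warnings 'numeric';    # 'E02' is not a number; silence that warning here
    return {
        numeric_gt => ($x > $y)  ? 1 : 0,
        string_gt  => ($x gt $y) ? 1 : 0,
    };
}

my $r = compare_both('2425', '46');
# As numbers 2425 is greater than 46; as strings "2425" sorts before "46".
print "numeric: $r->{numeric_gt}, string: $r->{string_gt}\n";   # prints "numeric: 1, string: 0"

my $e = compare_both('E02', '677');
# 'E02' has no leading digits, so it counts as 0 in a numeric comparison.
print "numeric: $e->{numeric_gt}, string: $e->{string_gt}\n";   # prints "numeric: 0, string: 1"
```

So a value such as E02 can only take part in a meaningful range check if the comparison is done on strings.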

This code:
$first_split = substr($line,2,4);
$second_split = substr($line,2,3);
$third_split = substr($line,2,2);

Occurs before you've defined $line, so all three scalars there will have no value. And if $line was defined, $first_split would get four characters starting from and including the 3rd character in $line, $second_split would get three, and $third_split would get two (the 3rd and 4th characters).
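
The substr results described above can be checked directly. This is a minimal sketch that uses the sample line from the first post plus one made-up line with a three-character id; the real file's column layout is an assumption, and `split_fields` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: the three substr calls from the original post,
# applied to a line that is actually defined.
sub split_fields {
    my ($line) = @_;
    # substr(STRING, OFFSET, LENGTH): offsets start at 0, so offset 2 is the 3rd character.
    return (substr($line, 2, 4), substr($line, 2, 3), substr($line, 2, 2));
}

# The substr calls run once per line, inside the loop, not once before it.
for my $line ("N 2425 name address phone", "N E02 name address phone") {
    my ($first_split, $second_split, $third_split) = split_fields($line);
    print "[$first_split] [$second_split] [$third_split]\n";
}
# prints "[2425] [242] [24]" and then "[E02 ] [E02] [E0]"
```

Note the trailing space in "E02 ": a fixed width of four characters picks up the separator when the id is shorter than four characters.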

And I can't see why you've decided to have a looping counter which resets after 5, since it's not used in your code.

If you cleared it up a little I'm sure people would be happy to help.

Socraties
10-31-2002, 07:32 PM
I can see now what you mean; it is a bit confusing. That piece of code is for parsing a file. This part I put in that says:

if <= '2425' files will go to file1
if > '2425' && <='46' files will go to file 2
if > '46' && <= '677' files will go to file 3
if > '677' && <= 'E02' files will go to file 4
if > 'E02' files will go to file 5.


is for a file: the number 2425 is part of an office id. This file is a patient file in which each patient is associated with a particular doctor's office. Where I have numbers that are only 2 digits, they are part of a longer number, and I only need to separate the lines out based upon that 2-character mapping. The E02 is just part of an office id.
I don't need to split this file in exact numerical order; I am splitting it out based upon office id values that are greater than or less than that character mapping. It does seem hard for numbers to be over 2425, but with this file it isn't, because these numbers are assigned by the clinic. This file will contain roughly 200,000 lines, if not more. A typical file that we get from them is about a 10 MB text file.
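
For a file of that size, it may be worth noting that `foreach $line (<FH>)` first reads every line into a list, whereas `while (my $line = <FH>)` reads one line per iteration. A minimal sketch, with an in-memory filehandle and made-up records standing in for the real file:

```perl
use strict;
use warnings;

# Stand-in data; the real input would be opened from a path instead.
my $data = "N 2425 a\nN 2446 b\nN E02 c\n";
open(my $in, '<', \$data) or die "Could not open in-memory data: $!";

my $count = 0;
# Only the current line is held in memory on each pass through the loop.
while (my $line = <$in>) {
    chomp $line;
    $count++;
}
close $in;

print "read $count lines\n";   # prints "read 3 lines"
```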

The counter is set up higher in the code. All the counter is used for is to write these files out to the particular directory system. Here is the entire Perl script. I have had to take out the important directory structure, but it does work currently. The script is too long to post in this message, so I will post it in a new message just after this one.

Socraties
10-31-2002, 07:33 PM
Here is the Perl script in its entirety.


#!/usr/bin/perl
#--------------------------------------------------------------------------------------------------
# Script to transfer medic data from uucp to web01 ftp directory for icx machine
#--------------------------------------------------------------------------------------------------
# 1: scan /dir/dir/dir (FTP site drop location) for files that match "textfile*"
# 2: check whether the file was sent today or not
# 3: if found open uucp log and parse out lines for this file
# 4: if sent today it will check lines for completed transmission
# 5: check duplicate file
# 6: concatenate nonduplicate file to one big file
# 7: adjust the time by the area code
# 8: ftp file to //dir/dir/dir/dir/dir/
# FASTCALL.TXT is no longer transferred to web01 06/01/2000!!! Above #8 is disabled
# 9: split file into two different files (2/3 and 1/3) for pickup by the 2 ICX machines
#10. copy file to /dir/dir/dir/dir/dir
#--------------------------------------------------------------------------------------------------

$source_dir = "/dir/dir/dir/dir/dir";
$archive_dir = "/dir/dir/dir/dir/dir";
$source_dir2 = "/dir/dir/dir/dir/dir";
$archive_dir2 = "/dir/dir/dir/dir/dir";
$msg_file = "/dir/dir/dir/dir/dir";
$source_file = "FILE";
$pre_file = "prefile";
$icx_import_dir = "/dir/dir/dir/dir/dir";
$trans_filename = "TEXTFILE.TXT";


$today = `date`;
$hour = `date "+%H"`;
$mins = `date "+%M"`;
$tag = `date "+%m%d%H%M"`;
$senddate= `date "+%y%m%d"`;

chop $tag;
open ( MSG, ">>$msg_file.$tag" );

@file_names = `ls $source_dir | grep fastcall`;
@pre_file_names = `ls $archive_dir | grep fastcall`;
@former_files = `ls $icx_import_dir | grep FASTCALL`;

# NEXT LINE LOOKS FOR ANY DATA NOT PICKED UP BY ICX MACHINES (more than zero files)
if ( $#former_files >= 0 )
{
open ( WARN, ">/dir/dir/textfile.txt" );
print WARN "there is an ICX machine not picking up its files, \n";
print WARN "It's $today";
close WARN;
`mail me\@localhost.net < /dir/dir/textfile.txt`;
`mail me\@localhost.net < /dir/dir/textfile.txt`;
}
# NEXT LINE LOOKS FOR LACK OF FILES (less than one file)
if ( $#file_names < 0 )
{
print MSG "\n$today no file has been sent yet \n";
close MSG;
`mail -s "Report from med_trans.pl" me\@localhost.com < $msg_file.$tag`;
at_job();
}


# step 3

# NEXT LINE LOOKS FOR ONE FILE (less than 2), or more than ONE (else)
# ISN'T PERL FUN?
if ($#file_names < 1 ){
print MSG "Today $today we have received the following file\n";
}
else {
print MSG "Today $today we have received the following files\n";
}
foreach $file ( @file_names ) {
print MSG "$file\n";
}
print MSG "The following files have been concatenated\n";
print MSG "If the file is not on the list, it is a \n";
print MSG "duplicate file and can be found in $archive_dir directory.\n";

$completeFlag = 0;
$flag = 1;

foreach $file ( @file_names ) {
$flag = 0;
chomp ( $file );

@fileNameParts = split(/\./, $file);

if ($fileNameParts[1] == $senddate)
{
&CheckCompleteFile;
if (@sub_log_lines){
foreach $line ( @sub_log_lines )
{
if ( $line =~ /Call complete/ )
{
$completeFlag = 1;
}
}
}
}
else
{
$completeFlag = 1;
}
if ( $completeFlag == 1)
{
if ( $file =~ "gz" ) {
`gunzip $source_dir/$file`;

# remove '.gz' from $file
chop $file; chop $file; chop $file;
}
## concatenate the files together
if ($#pre_file_names > 0 ){
## check duplicate files
foreach $old_file ( @pre_file_names) {
chomp ($old_file);
if ( $old_file =~ /$file/ )
{
$flag=1;
}
}
}

if ( $flag ==0) {
`cat $source_dir/$file >> $source_dir/$pre_file.$hour`;
$msg1=`wc -l $source_dir/$file`;
print MSG "$msg1\n";
}
`mv $source_dir/$file $archive_dir/$file`;
}
}

@havefile =`ls $source_dir|grep prefile`;
if ( $#havefile < 0 ){
print MSG "Duplicate file received. No file has been transfered to $icx_import_dir at this time\n";
}
else{
#$pre_areacode ="";

## 07/28/00 medic no longer needs the time zone adjustment.
## adjust_times will count total of invalid data.
&adjust_times;

if ($count > 0 ){
print MSG "\nTotal rows for status of 61, 71 or 72 = $count\n";
}
print MSG "\nTotal number of rows in concatenated file\n";
$msg=`wc -l $source_dir/$source_file.$tag`;
print MSG "$msg\n";
print MSG "which has been transfered to $icx_import_dir.";
`rm $source_dir/$pre_file.$hour`;
}

#if ($flag ==0){
#transfer_file();
#}

@trans_file = `ls $source_dir/$trans_filename`;
if ( $#trans_file < 0)
{
`cp $source_dir/$source_file.$tag $icx_import_dir/$trans_filename`;
}
else
{
`cat $source_dir/$source_file.$tag >> $icx_import_dir/$trans_filename`;
print MSG "icx hasn't picked up the previous file yet\n";
print MSG "The new file has been concatenated to the previous file!\n";

}

close MSG;

split_file();
`mv $icx_import_dir/$trans_filename $icx_import_dir/fastarchive/$trans_filename.OLD`;
`mail -s "Report from med_trans.pl" ops\@smarttalk.com < $msg_file.$tag`;

#--------------------------------------------------------------------------------------------------
# Functions
#--------------------------------------------------------------------------------------------------

#--------------------------------------------------------------------------------------------------
# Notes: This file will be split out in a new way. The following format is:
# <= '2425' : Machine 1
# '2425' > & <= '2446' : Machine 2
# '2446' > & <= '2677' : Machine 3
# '2677' > & <= 'E02' : Machine 4
# 'E02' : Machine 5
# These Changes have been made as of 10/28/02.
#--------------------------------------------------------------------------------------------------
sub split_file {
    open ( BIG_FILE, "$icx_import_dir/$trans_filename" ) || die "Could not open $icx_import_dir/$trans_filename";
    open ( FIRST_PART, ">> $icx_import_dir/FASTCALL11.TXT" ) || die "Could not open $icx_import_dir/FASTCALL11.TXT";
    open ( SECOND_PART, ">> $icx_import_dir/FASTCALL13.TXT" ) || die "Could not open $icx_import_dir/FASTCALL13.TXT";
    open ( THIRD_PART, ">> $icx_import_dir/FASTCALL14.TXT" ) || die "Could not open $icx_import_dir/FASTCALL14.TXT";
    open ( FOURTH_PART, ">> $icx_import_dir/FASTCALL15.TXT" ) || die "Could not open $icx_import_dir/FASTCALL15.TXT";
    open ( FIFTH_PART, ">> $icx_import_dir/FASTCALL16.TXT" ) || die "Could not open $icx_import_dir/FASTCALL16.TXT";
    #format of substr() function is: substr(string,offset,length);
    $first_split = substr($line,2,4);
    $second_split = substr($line,2,3);
    $third_split = substr($line,2,2);

    $i = 0;
    foreach $line (<BIG_FILE>){
        if($first_split <= '2425'){
            print FIRST_PART $line;
        }elsif ($first_split > '2425' && $third_split <= '46'){
            print SECOND_PART $line;
        }elsif ($third_split > '46' && $second_split <= '677'){
            print THIRD_PART $line;
        }elsif ($second_split > '677' && $second_split <= 'E02'){
            print FOURTH_PART $line;
        }elsif ($second_split > 'E02'){
            print FIFTH_PART $line;
        }
        $i++;
        if ($i >= 5) { $i = 0; }
    }
    #CURRENT WAY FILE IS BEING PARSED, BY EACH RECORD IN THE FILE, NO PARTICULAR SEQUENCE.
}#end split file
#old part of split file.
# $i = 0;
# foreach $line (<BIG_FILE>) {
# if ($i == 0) {
# print FIRST_PART $line;
#}
#elsif ($i == 1) {
#print SECOND_PART $line;
#}
#elsif ($i == 2) {
#print THIRD_PART $line;
#}
#elsif ($i == 3) {
#print FOURTH_PART $line;
#}
#elsif ($i == 4) {
#print FIFTH_PART $line;
#}
#$i++;
#if ($i >= 5) { $i = 0; }
#}
#close BIG_FILE; close FIRST_PART; close SECOND_PART; close THIRD_PART; close FOURTH_PART; close FIFTH_PART;
#}end split_file.


sub transfer_file {
$ftp = Net::FTP->new('new.localhost.com');
$ftp->login( 'NONE', 'NONE' );
$ftp->cwd( "/dir/dir/dir/dir" );
#$ftp->cwd( "/dir/dir/dir/dir" );
$ftp->append( "$source_dir/$source_file.$tag", "FASTCALL.TXT" );
#$ftp->put( "$source_dir/$source_file.$tag", "F_test" );
$ftp->quit();
chomp ( $file );
#`mail me\@localhost.com < $msg_file.$tag`;
#`mail me\@localhost.com < $msg_file.$tag`;
#`mail me\@localhost.com < $msg_file.$tag`;
`rm $msg_file.$tag`;
}


sub adjust_times {

$dumpfile = "$source_dir/$pre_file.$hour";
$outfile = "$source_dir/$source_file.$tag";

open (DUMP, "<$dumpfile") || die "Couldn't open file$dumpfile";
open (OUT, ">$outfile") || die "Couldn't open file$outfile";

$count=0;
while ( $string = <DUMP> ){

$areacode= substr($string, 76, 3);
$val = substr($string, 144,2);
$val2 = substr($string, 148,2);
$val3 = substr($string, 0,2);
foreach $code ( 61,71,72 ) {
if ($val3 == $code ){

++$count;
}
}
printf OUT $string;

}
close DUMP;
close OUT;
}

sub error_mail {
}

sub at_job {
exit ( 0 );
}

sub CheckCompleteFile {
open ( UUCPLOG, "/dir/dir/dir/dir/Log" ) || error_mail( "cannot open uucp log" );

foreach $line ( <UUCPLOG> ) {
@lines = ( @lines, $line );
}
close ( UUCPLOG );

foreach $line ( @lines ) {
if ( $line =~ /$file/ ) {

@line_parts = split ( / /, $line );
$trans_num = $line_parts[5];
chop ( $trans_num );
$trans_num = " ".$trans_num;

if ( $line =~ /$trans_num/ )
{
@sub_log_lines = ( @sub_log_lines, $line );
}
}
else { $completeFlag = 1; }
}
}

I hope this is more helpful.
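
For reference, here is one way the split described in the notes above split_file (upper bounds 2425, 2446, 2677 and E02) might be sketched so that it can be tried outside the live environment. It rests on several assumptions that the thread does not confirm: that the office id is the second whitespace-separated field, that ids are meant to be compared as strings (`le`) rather than as numbers, and that blank lines can be skipped. The names `bucket_for` and `split_lines` are hypothetical, and in-memory filehandles stand in for the real FASTCALL files.

```perl
use strict;
use warnings;

# Upper bounds for machines 1-4, taken from the notes in the script;
# anything above the last bound goes to machine 5.
my @upper_bounds = ('2425', '2446', '2677', 'E02');

# Hypothetical helper: return a bucket number 1..5 for one record.
sub bucket_for {
    my ($line) = @_;
    # Assumption: the office id is the second whitespace-separated field.
    my $office_id = (split ' ', $line)[1];
    return unless defined $office_id;          # blank or malformed line
    for my $i (0 .. $#upper_bounds) {
        # String comparison, so ids like E02 can take part in the ordering.
        return $i + 1 if $office_id le $upper_bounds[$i];
    }
    return scalar(@upper_bounds) + 1;          # above every bound
}

# Read $in line by line and print each record to the handle for its bucket.
sub split_lines {
    my ($in, $outs) = @_;
    while (my $line = <$in>) {
        my $bucket = bucket_for($line);
        next unless defined $bucket;
        print { $outs->[$bucket - 1] } $line;
    }
}

# Try it with made-up records and in-memory handles instead of real files.
my $input = join '', map { "N $_ name address phone\n" } qw(2425 2430 2500 3000 E02 E10);
open(my $in, '<', \$input) or die "Could not open in-memory input: $!";

my @buffers = ('') x 5;
my @outs;
for my $i (0 .. 4) {
    open(my $fh, '>', \$buffers[$i]) or die "Could not open buffer $i: $!";
    push @outs, $fh;
}

split_lines($in, \@outs);
close $_ for $in, @outs;

for my $n (1 .. 5) {
    my $lines = () = $buffers[$n - 1] =~ /\n/g;   # count newlines in the buffer
    print "bucket $n: $lines line(s)\n";
}
```

With these sample ids, 3000 and E02 both land in bucket 4 and E10 lands in bucket 5. Whether plain string ordering matches the clinic's numbering scheme would need to be checked against real data, particularly for ids of different lengths.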