Friday, September 5, 2008

Unique Sorting Of Lists And Lists Of Lists With Perl For Linux Or Unix

Hey There,

Yesterdays post on sorting Perl lists and removing duplicates did end up being a bit heavy at the end, as I feared. I received my fair share of "here's how you can do it better" emails, which (although I'm not trying to do my most succinct and efficient work here) I always appreciate. In all seriousness, I really do enjoy getting different perspectives on methods and practices put up on this blog. I'm always looking to better my own skill-set, so keep the free advice coming. I'll probably devote an entire future post to the suggestions I've gotten regarding this (with attribution to the contributor's, of course) as the myriad means to an end are quite diverse, clever and interesting. And, yes, we do print contrary opinions as well, but they have to be logical arguments or, at the very least, scathing put-downs that make sense within the given context. If you just write us to tell us to go #### ourselves, we may take you up on that (it can get lonely out here in cyberspace ;) but it won't ever end up as a comment attached to a post.

For today, we're going to break down two areas which generated the most response (mostly because, although I tried, the code just wasn't self-explanatory enough). Sometimes it's actually harder to try and explain something you already do by rote than to explain something you're in the process of picking up on yourself. The two examples for today fall into the former category. My blinders made me think that the code I put out would be easily digested, but (after reading a few comments) I agree that perhaps I was a bit presumptuous. But, life goes on and we learn from our mistakes (even if we just keep making them over and over again ;)

The basic questions, and a bit more detailed of a breakdown:

Q. Why the klunky foreach block at the top of the attached code? Shouldn't sorting be simpler?

A. Actually, sorting data is simpler, but, in this case we were dealing with sorting on two levels, which may not have been readily apparent. The @s_perms array that gets returned from the public domain "permute" subroutine actually contains multiple scalar pseudo-lists. That is to say that $s_perms[0] could be equal to "1 3 2 4 5." While this resembles our lists that we want to sort, it is actually just one variable made up of numbers and spaces. So, for each member of the @s_perms array, we did the following:

1. First we split the scalar variable into an actual array, so that "1 3 2 4 5" became @temparray, with values of 0 through 4 ($temparray[0] = 1, $temparray[1] = 3, etc), like so:

@temparray = split(" ", $unsorted);

2. Then, we sorted that newly created array. The reason we sorted each value returned by the "permute" subroutine was so that each value would be in order. When we do it this way (below), every @temparray list ends up with the members, in order, of "1 2 3 4 5." If we had just called a numeric sort ( $a <=> $b ) on @s_perms, it would have sorted all of the individual lists using their initial numeric value as the sort index. To put it more simply, assuming @s_perms contained only 3 scalar values of:

"1 2 3 5 4"
"4 5 3 2 1"
"1 5 4 2 3"


a straight-up sort on @s_perms would generate a sorted list like this (in order, top to bottom):

"1 2 3 5 4" <-- All 3 scalar values ("1 2 3 5 4," "1 5 4 2 3," etc, being considered single entities) are being sorted as individual elements within an array.
"1 5 4 2 3"
"4 5 3 2 1"


Since our Objective states that all permutations of the same lists of values should be considered equal, the straight-up sort wouldn't make it easier (later) to determine if any, or all, of the 3 scalars were representative of unique or duplicate lists. When we break down each scalar value within the @s_perms array and split that scalar into a temporary array (@temparray) and then do the sort on that, only the values in @temparray get sorted. That is to say that "1 2 3 5 4" gets sorted, and then "1 5 4 2 3," etc. This results in our generating a sorted list like this:

"1 2 3 4 5" <-- Each scalar is split into an array with the "@temparray = split(" ", $unsorted);" statement from above and that array is sorted independently of each succeeding or preceding one!
"1 2 3 4 5"
"1 2 3 4 5"


3. Then we get to the step that caused some folks to wonder why it wasn't the first action taken:

@temparray = sort { $a <=> $b } (@temparray);

Now, we're sorting just like we would have in step 2 (if we did it this way), except we're sorting array variables that have already had their "insides" sorted. Having a list of array elements like:

"1 2 3 4 5"
"1 2 3 4 5"
"1 2 3 4 5"


makes it a lot easier to weed out the duplicates later.

4. Then, we add that array (of individually, and universally, sorted arrays) to the right hand side of the @sorted_perms array, which we use further on in the script.

unshift(@sorted_perms, "@temparray");

Simple enough ;) Hopefully, I'm not making this any worse for the folks out there who wrote in ;) Seriously, though, I do hope this explanation helps out at least a little.

The second question was just as valid:

Q. What's with the @unik and @last_unik variables. I understand that you're comparing arrays, but what are you doing in your "elsif" equality test?

A. This is a great question, because, on its face it seems that all we would have to do is check if the two arrays are equal and that would be that. Unfortunately, this practice has one fatal flaw (and, curiously enough, the behaviour is the same if you use the numeric or alpha comparison operators!) The basic test of the "entire array" actually only considers quantity of indices as indicators of uniqueness. For instance, this small script:

#!/usr/bin/perl

@bob = qw (1 2 3 4 5);
@joe = qw (1 3 2 5 4);

if ( @bob == @joe ) {
print "BOB and JOE are == Equal\n";
}

if ( @bob eq @joe ) {
print "BOB and JOE are EQ Equal\n";
}


produces this output:

host #./test
BOB and JOE are == Equal
BOB and JOE are EQ Equal


Since the basic equality operators only count indices to compare, I used that as the "outside loop" test. I figured, if the two arrays didn't have the same number of members, they couldn't possibly be equal, and we'd move on to the next loop without doing all the extra work. However, if the index counts match, then we need to check and see if the "content" of the arrays are exactly equal. So, if:

} elsif ( @unik == @last_unik ) {

returns true (both @unik and @last_unik have the same number of members), we'll go check them out index by index. I started out by setting three variables, just for ease of reading and to establish controls.

1. $total = @unik; <-- This populates the $total scalar variable with the length of the @unik array

2. $zero = $nope = 0 < -- This is just simple shorthand for assigning the variables $zero and $nope the value of "0."

3. Then, I iterated through both arrays at the same time (At this point in the program, every array has been sorted, so they should all be "1 2 3 4 5"). This was accomplished by a simple while loop that incremented the $zero variable (which represented the index of the @unik and @last_unik arrays) and compared those index values with each other. If, at any point, an index in @unik didn't match the corresponding index in @last_unik, the variable $nope was set to 1 and the while loop was broken (no sense calculating the rest of the index values, since we just hit a difference and the two arrays can't be identical anymore). Every time the corresponding values matched, the value of $zero was incremented by 1, and the process repeated, until the value of $zero reached the value of $total (the length of the @unik array:

while ( $zero < $total ) {
if ( "$unik[$zero]" ne "$last_unik[$zero]" ) {
$nope = 1;
last;
} else {
$zero++;
}
}


4. The block of code following that would only execute if we had found a difference between the two arrays, and the value of $nope was set to 1. This would be an indicator that @unik was a different (unique) array than the previous one and should be printed out. Then (and this shows up more than once), after the value is printed, I set @last_unik to be equal to @unik (Note that the = "assignment" operator is used here, rather than the == "comparative" operator). This sets us up to test the next @unik against @last_unik (which was @unik in the previous iteration) and easily be able to tell if that next array is just more of the same. This is set elsewhere within the larger construct, as well (on a different condition).

if ( $nope ) {
print "$s_p";
@last_unik = @unik;
}


5. As a side note for this piece of code, again, it was written to be simple to follow and works on the assumption that we've sorted the individual arrays numerically (see the first question at the beginning of this book ...I mean, post ;) and they are all the same. For instance, if the array @sorted_perms contained @array1, @array2 and @array3, and they were populated like this:

@array1 = "1 2 3"
@array2 = "3 2 1"
@array3 = "1 2 3"


You would not get a return of 2 values ("123" and "321") since we're not maintaining state in any other way. Since array's 1, 2 and 3 are processed in order and different from each other (in order, one after the other), they would all show up as unique (and be printed) in our example. If we wanted to make sure that we only got "123" and "321" back, we'd have to implement another method to double check that we weren't repeating ourselves. Of course, that would add another level of complexity, and I'm trying to keep this as simple as possible, but, since you asked, you can easily avoid duplicate unique values (??? Can those really exist in a sane universe? ;), you can do this most simply (I think), using grep, and Perl hashes, in the right place, like in the pseudo-code below:

foreach $bogus_value (@bogus_array) {
unless ($seen{$bogus_value}) {
$seen{$item} = 1;
unshift(@bogus_array, $bogus_value);
}
}


and, that would guarantee you that no "duplicate unique" values would appear in your output :)

That's about all I can sneak in at lunch here today ;) Until next time.
Cheers,

, Mike




Please note that this blog accepts comments via email only. See our Mission And Policy Statement for further details.