Using perl (or something that can be called and output fed back into perl) to grab file headers
|
Author | Content |
---|---|
techiem2 Jul 20, 2011 2:18 PM EDT |
I'm trying to process some files that a certain program adds a file header to before the primary file headers.
I found the info on opening a file in binary mode and reading x characters.
The problem is, the header is not fixed length.
Does anyone know how I could determine where that header ends, so I can parse the header and then process everything after it? I noticed that if I read too far (I have it printing the output), it prints the header text followed by some odd ASCII character or something (in my terminal it's the white circle with a question mark). I believe that would be the first character of the actual file data after the added header. Does anybody know a way to check for this so I can extract just the header data? Thanks! |
penguinist Jul 20, 2011 3:16 PM EDT |
Application header data is not standardized, so you need to know a little about the structure of the header that this application writes before you can reliably read it.
One common format is to first write a binary or ASCII length field followed by that amount of data. This type of header can be read in two steps: first read the short (usually 2 or 4 byte) length field, then use its value to read exactly to the end of the corresponding data.
Since your header data is text, it might also be possible that what you are looking at is a null-terminated or EOL-terminated string. To see the header data in detail (including the unprintable length or termination characters) you can do this:
xxd myfile | less
Then browse your data and see what you have. Perhaps you could post the first bytes from a sample file, and we could help interpret the format. |
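The two-step read of a length-prefixed header described above can be sketched as follows. This is only an illustration: the 4-byte little-endian length field and the helper name `read_length_prefixed` are assumptions, not the format of any particular application.

```ruby
require "stringio"

# Read one length-prefixed record from an IO-like object.
# Assumes a 4-byte little-endian unsigned length field (hypothetical layout).
def read_length_prefixed(io)
  len_bytes = io.read(4)
  raise EOFError, "truncated length field" if len_bytes.nil? || len_bytes.bytesize < 4

  length = len_bytes.unpack1("V") # "V" = 32-bit unsigned, little-endian
  header = io.read(length)
  raise EOFError, "truncated header" if header.nil? || header.bytesize < length

  header
end

# Demo on an in-memory stream: a 5-byte header followed by the file data.
sample = StringIO.new([5].pack("V") + "hello" + "DATA")
header = read_length_prefixed(sample)
rest   = sample.read # the stream is now positioned at the real content
puts header
puts rest
```

A 2-byte length field would use `io.read(2)` and the `"v"` unpack directive instead; which one applies has to be confirmed from a hex dump.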
cr Jul 20, 2011 3:22 PM EDT |
Were this me doing it, my first step would be to view maybe a half-dozen affected files in a hex-editor to see if that "odd ascii character" is a reliable landmark; if not, you've given us too little info to work with.
Assuming it's reliable, do you need to parse that header, or merely discard it ('extract' is ambiguous here)? If the latter:
- read() in a bufferful twice as big as you think you'll need to reliably contain that landmark, into a scalar.
- Take an index() to that landmark. Safety-test to be sure the index returned is other than minus-one. (In C that'd be a strchr() or, if your landmark is in fact multibyte, strstr(); in Perl, both cases are handled by index().)
- seek() to index-plus-one. Your next read() will pull in the start of the post-header file content, which you can then pump out into a new header-free file.
[edit: amend index-offset] |
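The read / index / seek steps above can be sketched like this. The helper name `split_at_landmark`, the default buffer size, and the sample bytes are all made up for illustration, and the stream is assumed to be seekable and positioned at the start of the file.

```ruby
require "stringio"

# Copy everything after the first occurrence of `landmark` from `src` to `dst`.
# Returns the header bytes that precede the landmark.
def split_at_landmark(src, dst, landmark, buf_size = 4096)
  landmark = landmark.b               # compare as raw bytes
  buf = (src.read(buf_size) || "").b  # one generous bufferful from the start
  idx = buf.index(landmark)
  raise "landmark not found in first #{buf_size} bytes" if idx.nil?

  # For a 1-byte landmark this is index-plus-one; for a multibyte landmark
  # it skips the whole landmark.
  src.seek(idx + landmark.bytesize)
  dst.write(src.read)                 # the rest is the post-header content
  buf.byteslice(0, idx)
end

# Demo with in-memory streams and a single NUL byte as the landmark.
src = StringIO.new("HEADER TEXT\x00payload bytes".b)
dst = StringIO.new(String.new)
header = split_at_landmark(src, dst, "\x00")
puts header
puts dst.string
```

If the landmark is itself the first byte of the real file data rather than a separator, the seek should go to `idx` instead so that byte is kept.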
techiem2 Jul 20, 2011 4:21 PM EDT |
Yeah, a friend is looking at it with me.
There does seem to be an odd ascii character after the added header that we are trying to figure out how to work with.
I do need the header. Basically what I'm trying to do (and you'll probably laugh at this) is write a script to recover files from a Norton360 Backup archive (hey, you never know, it could be useful sometime...and I thought it would be fun to try).
As I discovered when looking last night, basically it takes the file and adds a header with the original file path/name before the raw file data. (I was quite surprised that that's apparently all they do to the files - no archive, no encryption, just a header prepended.)
I have a test file (a boring jpg) from the archive I'm working with at http://www.techiem2.net/files/test0.n360 if you actually want to look at it.
Basically I want to write up a script that will walk a directory in the archive and "Restore" the files to their original directory structure (but on the Linux box in a given base dir of course, for the needed files to be safely copied back to the winders machine). So basically I need to read the n360 header section to get the path/filename, then read the rest of the file and output it to the proper location/name. |
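One piece of the restore step, mapping the original Windows path onto a location under a Linux base directory, might look like the sketch below. It assumes the header stores a full path with a drive letter; the helper name `restore_path`, the base directory, and the sample path are hypothetical.

```ruby
# Map an original Windows path to a location under a base directory.
# Assumes paths like "C:\\dir\\file.jpg" (hypothetical; check real headers).
def restore_path(base_dir, windows_path)
  parts = windows_path.split(/[\\\/]+/).reject(&:empty?)
  parts = parts.map { |part| part.delete(":") } # "C:" -> "C"
  raise ArgumentError, "empty path" if parts.empty?
  # Refuse components that could escape the base directory.
  if parts.any? { |part| part.empty? || part == "." || part == ".." }
    raise ArgumentError, "unsafe path: #{windows_path}"
  end

  File.join(base_dir, *parts)
end

puts restore_path("/srv/restore", "C:\\Users\\me\\Pictures\\DSC00699.JPG")
```

Before writing the file data, the target's parent directory would still need to be created, for example with `FileUtils.mkdir_p(File.dirname(target))`.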
penguinist Jul 20, 2011 4:38 PM EDT |
Looking at your file in a hex view, a few things stand out about the header structure:
The header begins with a null-terminated string: "NB20Y". This probably identifies the Norton Backup header version.
Next comes a null-terminated Unicode string containing the file path: "...DSC00699.JPG"
Next comes the JPG data itself, starting with the Exif fields containing photo date, camera used, etc.
So, to display this header you would read and display the two null-terminated strings. To remove the header, you would locate the second null terminator and start your file copy on the following byte. |
penguinist Jul 20, 2011 5:19 PM EDT |
For example, this code snippet in Ruby prints your file's header. I'm sure Perl would be similar, but you can see the logic with this example.
File.open("test0.n360") do |f|
  puts f.gets("\x00").chop      # print the NB20Y version identifier
  puts f.gets("\x00\x00").chop  # print the path and file name
end |
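A variant of the snippet above that also hands back the file data could look like this. The layout is an assumption pieced together from the hex-view description (identifier, one NUL byte, a UTF-16LE path, two NUL bytes, then the data), not a documented format, and `split_n360` and the sample path are invented names.

```ruby
# Split an .n360-style blob into [magic, path, payload].
# Assumed layout: magic, NUL, UTF-16LE path, NUL NUL, file data.
def split_n360(data)
  data = data.b
  magic_end = data.index("\x00")
  raise "no magic terminator" if magic_end.nil?

  path_start = magic_end + 1
  pos = path_start
  # Step two bytes at a time so the terminator is matched on a 16-bit
  # boundary, not on the high byte of one character plus the next NUL.
  pos += 2 while pos + 1 < data.bytesize && data.byteslice(pos, 2) != "\x00\x00"
  raise "no path terminator" unless data.byteslice(pos, 2) == "\x00\x00"

  magic = data.byteslice(0, magic_end)
  path = data.byteslice(path_start, pos - path_start)
             .force_encoding("UTF-16LE").encode("UTF-8")
  [magic, path, data.byteslice(pos + 2, data.bytesize)]
end

# Synthetic sample built in memory (the path here is made up).
sample = "NB20Y\x00".b +
         "C:\\demo\\DSC00699.JPG".encode("UTF-16LE").b +
         "\x00\x00".b + "\xFF\xD8fake jpeg bytes".b
magic, path, payload = split_n360(sample)
puts magic
puts path
puts payload.bytesize
```

Working on the whole file's bytes (for example from `File.binread`) avoids having to guess a read length that is large enough for every header.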
techiem2 Jul 20, 2011 5:25 PM EDT |
Yeah, that got me on the right track and I've now got the filename pulled:
#!/usr/bin/perl -w
# portably open a binary file for reading
open(my $fh, "<", "test0.n360") or die "Cannot open test0.n360: $!";
binmode($fh);
# read up to 255 bytes of header into $header
read($fh, my $header, 255, 0); |
techiem2 Jul 20, 2011 5:26 PM EDT |
erm...the forum ate some of that code...like all the backslashes...lol |
techiem2 Jul 20, 2011 10:54 PM EDT |
Ok, I think I'm almost there with the basic functionality, but I've run into a nasty glitch.
Since the extracted data is from a binary file, the strings all have that encoded data in them.
I.e., I can't use them as normal ASCII strings to build paths/filenames/etc. So how do I turn the string I'm pulling from the header into a real string? |
techiem2 Jul 20, 2011 11:24 PM EDT |
Aha! Turns out in that header there's just a null between each character. So doing $filenamepath =~ s/\x00//g cleaned it up nicely. |
tuxchick Jul 20, 2011 11:56 PM EDT |
pastebin is good for sharing code online: http://pastebin.com/ |
techiem2 Jul 21, 2011 2:37 AM EDT |
When I'm finished testing it I'll post it up. It seems to be working properly now. I had to increase the length of the read to get the full header for some files, but so far my test run is looking good. |
techiem2 Jul 23, 2011 11:29 AM EDT |
Blog post created and script uploaded. http://bit.ly/qO6CeQ Thanks for the help all! |
penguinist Jul 23, 2011 6:39 PM EDT |
By the way, your idea of using a regular expression to convert the unicode string into an ascii string will work fine as long as there are no international characters in the string. If you or someone else ever wants to publish this script for general use, it would probably be a good idea to do proper unicode handling. Other than that fine point, your solution looks great! |
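The difference between stripping NULs and decoding can be seen in a small sketch. It assumes the header path is UTF-16LE, which matches the "null between each character" observation but is not confirmed by any spec; the helper names and the accented file name are made up. In Perl, the `Encode` module's `decode("UTF-16LE", $bytes)` plays the same role as `decode_utf16le` here.

```ruby
# Naive approach from the thread: drop every NUL byte.
def strip_nuls(raw)
  raw.b.delete("\x00".b)
end

# Decode the bytes as UTF-16LE (an assumption about the header's encoding).
def decode_utf16le(raw)
  raw.b.force_encoding("UTF-16LE").encode("UTF-8")
end

ascii  = "DSC00699.JPG".encode("UTF-16LE")
accent = "Caf\u00E9.jpg".encode("UTF-16LE") # hypothetical accented name

# Pure ASCII survives NUL stripping unchanged.
puts strip_nuls(ascii)
# The accented name is left with a lone 0xE9 byte, which is not valid UTF-8.
puts strip_nuls(accent).force_encoding("UTF-8").valid_encoding?
# Decoding keeps the character intact.
puts decode_utf16le(accent)
```

Characters above U+00FF fare worse under NUL stripping, since their high byte is not zero and ends up as a stray byte in the name.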