Mass processing .doc files

Forum: LinuxTotal Replies: 7
techiem2

Feb 07, 2008
12:43 PM EDT
Ok, so apparently we need to make the syllabi available to the students, which means we want them in PDF (easy enough). The problem is, they contain a Course Outline section that's basically an anticipated plan for the class schedule, which of course the students aren't supposed to have. So what I need to do is read the Word file, process OUT that section (wherever it may be - there isn't a standard layout), and dump the "fixed" version to PDF (without losing the formatting). At one point my boss had a Word macro to do it, but that has long since been lost. I assume there's a way to make OO.o do this, but I have no experience whatsoever with macros and such.

Any pointers? Ideas? Other methods?

Thanks guys and gals as always!

(as an aside, my sign control system is coming along nicely, I'm at the point of figuring out how to control mplayer with it - the button to force conversion of a file doesn't work [the script starts then dies soon after], but that process could be cron'd anyway, so it's not vital if I can't get it working from my php interface)

Mark II
Sander_Marechal

Feb 07, 2008
1:01 PM EDT
Try running OOo in headless mode on a server and script it from the outside using Python. Converting to PDF is really easy. Take a look at Mirko Nasato's PyODConverter: http://www.artofsolving.com/opensource/pyodconverter
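
For reference, the core of a PyODConverter-style conversion looks roughly like the sketch below. It is not Mirko's script, just a minimal outline that assumes OOo is already listening on port 8100 and that you run it with the Python that ships with OOo (the `uno` module is not installable separately). The names `file_url` and `convert_to_pdf` are made up for this example.

```python
from pathlib import Path

# Assumed to be running already, OOo-era syntax:
#   soffice -headless -accept="socket,host=localhost,port=8100;urp;" -nofirststartwizard


def file_url(path):
    """Turn a filesystem path into the file:// URL form that UNO expects."""
    return Path(path).resolve().as_uri()


def convert_to_pdf(doc_path, pdf_path, port=8100):
    """Load doc_path in a running headless OOo and export it as PDF."""
    # Imported here so the module still loads where UNO is not available.
    import uno
    from com.sun.star.beans import PropertyValue

    def prop(name, value):
        p = PropertyValue()
        p.Name = name
        p.Value = value
        return p

    # Connect to the listening office instance.
    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(
        "uno:socket,host=localhost,port=%d;urp;StarOffice.ComponentContext" % port)
    desktop = ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)

    # Open the document hidden, export through the Writer PDF filter, close.
    document = desktop.loadComponentFromURL(
        file_url(doc_path), "_blank", 0, (prop("Hidden", True),))
    try:
        document.storeToURL(
            file_url(pdf_path), (prop("FilterName", "writer_pdf_Export"),))
    finally:
        document.close(True)
```

Any stripping of the unwanted section would have to happen between `loadComponentFromURL` and `storeToURL`; that part is not shown here because it depends on how the section is marked up in each document.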

Try converting that script so it will strip the required section before converting to PDF. Then write up a cron job that watches some directory for .doc files and converts anything that it finds into PDF. Shouldn't be that hard. I highly recommend subscribing to OOo's dev mailing list at api.openoffice.org for help.
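
The "watch a directory from cron" half could be sketched like this. It is only an outline: `pending_docs` and `process_directory` are names invented for this example, and `convert` stands for whatever function ends up doing the strip-and-export step. A file counts as already handled when a .pdf with the same base name sits next to it.

```python
import os


def pending_docs(directory):
    """Return the .doc files in directory that have no sibling .pdf yet."""
    pending = []
    for name in sorted(os.listdir(directory)):
        base, ext = os.path.splitext(name)
        if ext.lower() != ".doc":
            continue
        if not os.path.exists(os.path.join(directory, base + ".pdf")):
            pending.append(os.path.join(directory, name))
    return pending


def process_directory(directory, convert):
    """Call convert(doc_path, pdf_path) for each pending file.

    Returns the list of PDF paths that were requested.
    """
    produced = []
    for doc_path in pending_docs(directory):
        pdf_path = os.path.splitext(doc_path)[0] + ".pdf"
        convert(doc_path, pdf_path)
        produced.append(pdf_path)
    return produced


# Hypothetical crontab entry (script name and path are placeholders):
#   */10 * * * * python /path/to/watch_syllabi.py /srv/syllabi
```

Because the check is just "is there a PDF yet", running it every few minutes is harmless: a second run finds nothing left to do.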
ColonelPanik

Feb 07, 2008
3:11 PM EDT
Do it in Moodle
dinotrac

Feb 07, 2008
3:47 PM EDT
Sander -

Yup. Headless OOo is an amazing thing -- an amazing thing that most people and businesses know nothing about. I know at least one company that paid tons of money for a fancy document server that, in their use, did nothing more than running the files through OOo would have done.

Dopes.
azerthoth

Feb 07, 2008
4:20 PM EDT
I haven't played with it, but I do remember seeing, when playing with Samba configs, an example path to a PDF printer, i.e. feed in a file and get out a .pdf.
techiem2

Feb 07, 2008
10:30 PM EDT
Yeah, I have cups-pdf set up. The main thing I'm trying to figure out is how to automatically process the silly docs to rip out what's not wanted... I'll start by taking a look at the Python and OOo headless stuff when I have some time to work on it. Could be useful.
dinotrac

Feb 07, 2008
11:43 PM EDT
If you've got doc files, you can do it a couple of ways. I think you can get headless OO to do OO's own scripting, but I've never tried that. You can pretty easily convert the .doc to ODF. From there you can do amazing things with stylesheets and/or scripts, because it's all xml.
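
To give an idea of the "it's all xml" route: an .odt file is a zip archive whose text lives in content.xml, so a script can cut a section out directly. The sketch below rests on a big assumption, namely that the converted syllabus marks "Course Outline" as a real heading (a `text:h` element); documents that fake headings with bold paragraphs would need a different test. `strip_section` and `strip_odt` are names made up for this example.

```python
import xml.etree.ElementTree as ET
import zipfile

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
HEADING = "{%s}h" % TEXT_NS
LEVEL = "{%s}outline-level" % TEXT_NS

# Keep the usual prefixes for these two namespaces when writing back out.
# Other namespaces get generated prefixes; lxml would preserve them better.
ET.register_namespace("text", TEXT_NS)
ET.register_namespace("office", OFFICE_NS)


def strip_section(content_xml, title="Course Outline"):
    """Drop the heading containing title plus everything up to the next
    heading of the same or a higher level. Returns the new XML as bytes."""
    root = ET.fromstring(content_xml)
    body = root.find("{%s}body/{%s}text" % (OFFICE_NS, OFFICE_NS))
    skipping_level = None
    for child in list(body):
        if child.tag == HEADING:
            level = int(child.get(LEVEL, "1"))
            if skipping_level is not None and level <= skipping_level:
                skipping_level = None  # the unwanted section has ended
            heading_text = "".join(child.itertext())
            if skipping_level is None and title.lower() in heading_text.lower():
                skipping_level = level  # the unwanted section starts here
        if skipping_level is not None:
            body.remove(child)
    return ET.tostring(root)


def strip_odt(src_path, dst_path, title="Course Outline"):
    """Copy an .odt archive, rewriting only content.xml."""
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "content.xml":
                data = strip_section(data, title)
            # ODF requires the mimetype entry to be stored uncompressed.
            if info.filename == "mimetype":
                compress = zipfile.ZIP_STORED
            else:
                compress = zipfile.ZIP_DEFLATED
            dst.writestr(info.filename, data, compress_type=compress)
```

The stripped .odt can then go through the normal PDF export, so the remaining formatting is whatever OOo would have produced anyway.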
Sander_Marechal

Feb 08, 2008
12:38 AM EDT
I'm pretty sure that it's possible using Python. I've adapted that PyODConverter script myself to do things like updating the table of contents. The biggest problem is that OOo's API is huge and the docs are all geared to Java, not Python.
