Not a huge package, but in case you get lost:
Path | Contents
wv2 | Holds some build system stuff and general build information.
wv2/doc | Here we keep some information for developers and a Doxygen file to generate the API documentation.
wv2/src | Contains 99% of the sources. As we don't want a build-time dependency on Perl, the generated code is also checked into the CVS tree.
wv2/src/extra | These are the libraries we ship with wv2. Right now it's a mini-glib and libole2.
wv2/src/extra/glib | A tiny version of glib 1.2.7, by Dom Lachowicz.
wv2/src/extra/libole2 | The sources of GNOME libole2 0.2.4.
wv2/src/generator | Two Perl scripts, some template files, and the available file format specifications for Word 8 and Word 6. This stuff generates the scanner code. Once you have finished reading this document, you might want to check out the file format specifications in this directory.
wv2/src/tests | Mainly self-checking unit tests and function tests for the library. Use "make check" to build them.
Viewed from far, far away, the filter structure looks something like this:
A Word document consists of a number of streams embedded in one file. This file-system-in-a-file is called OLE structured storage, and we're using libole2 to get hold of the real data. The filter itself consists of some central "intelligence" which performs the steps needed to parse the document, plus some utility classes supporting that task. During the parsing process we send the gathered information to the consumer, i.e. the program loading the Word file. This program has to process the delivered information and either assemble a native file or stream the contents directly into the application.
The interface to the documents is a C++ wrapper around the libole2 library. libole2 allows reading and writing OLE streams from and to the document file. It's quite inconvenient to use, though, so we created a class representing the whole document (OLEStorage), and two classes for reading and writing a single stream (OLEStreamReader and OLEStreamWriter).
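To illustrate how these classes fit together, here's a minimal usage sketch. The method names (open, createStreamReader, readU16) follow the descriptions in this document, but the exact signatures are assumptions, so double-check the headers:

    // Minimal usage sketch; method names and signatures are assumptions
    // based on the descriptions in this document -- check the headers.
    #include "olestorage.h"
    #include "olestream.h"
    #include <string>

    void peekAtDocument(const std::string& fileName)
    {
        OLEStorage storage(fileName);
        if (!storage.open(OLEStorage::ReadOnly))
            return;  // not an OLE structured storage file

        // The main text lives in the "WordDocument" stream.
        OLEStreamReader* reader = storage.createStreamReader("WordDocument");
        if (reader) {
            unsigned short magic = reader->readU16();  // wIdent of the FIB
            (void)magic;  // a real consumer would check it here
            delete reader;
        }
        storage.close();
    }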
The external API for the users of the library should consist of at least two, but maybe more, layers. These range from a low-level, fine-grained API, where a lot of work is needed on the consumer side, to a very high-level API which basically returns enriched text, at the cost of flexibility.
Another main task of that API is to hide differences between Word versions where that's feasible. In any case, even the low-level layer of the API shouldn't expose too much of the ugliness of Word documents. The parser logic will differ between format versions, and this has to be considered in all further design decisions. Most likely we will choose some strategy pattern approach within the parsing section of the code, to replace the logic behind the scenes while keeping the same API.
Currently we have a Parser base class for all parsers (Is-A), and a Parser9x base class for the Word 9x filters (Is-Implemented-In-Terms-Of).
This part of the code is surely the most demanding from a design point of view. I'd be very pleased to hear some of your ideas :-)
The core part of the whole filter. This part of the code ensures that the utility classes are used in the correct order, and it manages the communication between the various parts of the library. It's also quite challenging to design. Various versions contain similar or even identical chunks, but other parts differ a lot. The aim is to find a design which allows us to reuse much of the parser code across several versions.
Right now it seems that we have found a nice mixture of plain interfaces with virtual methods and fancy functor-like objects for more complex structures like footnote information. The advantage of this mixture is that common operations are reasonably fast (just a virtual method call), and yet we provide enough flexibility for the consumer to trigger the parsing of the more complex structures itself. This means that you can easily cope with different concepts in the file formats by delaying the parsing of, say, headers and footers until after you have read all the main body text.
This flexibility of course isn't free, but the functor concept is pretty lightweight, totally typesafe, and it allows us to hide parts of the API. I'd like to hear your opinions on that topic.
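To sketch the idea (the names here are made up for illustration; the real wv2 functors differ): a functor captures everything needed to parse one complex structure, and the consumer invokes it whenever that suits the destination format:

    // Illustrative sketch of the functor idea. HeaderFunctor and
    // parseHeaders() are hypothetical names, not the actual wv2 API.
    class Parser9x;   // the object which knows how to do the parsing

    class HeaderFunctor
    {
    public:
        HeaderFunctor(Parser9x* parser, int section)
            : m_parser(parser), m_section(section) {}

        // Invoking the functor triggers the parsing of the headers of
        // one section -- whenever the consumer decides it's time.
        void operator()() const;

    private:
        Parser9x* m_parser;  // hidden from the consumer
        int m_section;       // the state needed to parse later on
    };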
We agreed to use Harri Porten's UString class from kjs, a clean implementation of an implicitly shared UCS-2 string class (host order Unicode values). The same file (ustring.h) also contains a CString class, but we'll use std::string for ASCII strings.
The iconv library is used to convert text stored as CP 1252 or similar to UCS-2. This is done by the Textconverter class, which wraps libiconv. Some systems ship a broken/incomplete version of libiconv (e.g. Darwin, older Solaris versions, ...), so we have a configure option --with-iconv-dir to specify the path of an alternative iconv installation.
To reduce the complexity of the code we try to write small entities designed to do one task (e.g. all the code in styles.cpp is used to read, and later on probably write, the stylesheet information contained in every Word file; lists.cpp cares about lists; and so on). We use a naming scheme to distinguish code which works for all versions (at least Word 6 and newer) from code for one specific version: all the *97.(cpp|h) files are designed to work with Word 8 or newer, while files without such a number should work with all versions.
This part of the code also consists of a number of templates to handle the different ways arrays and more complex structures are stored in a Word file (e.g. the meta structures PLF, PLCF, and FKP). If that sounds like Greek to you, it's probably a good idea to read the Definitions section at the top of the file format specification in wv2/src/generator.
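To give a rough idea of the PLCF meta structure: it's an array of n+1 file/character positions, followed by n fixed-size structures, one per interval between two neighboring positions. A simplified sketch (the real template in the wv2 sources differs in the details):

    // Simplified sketch of the PLCF concept: n+1 positions followed by
    // n fixed-size payload structures, one per interval. The real
    // template in the wv2 sources differs in the details.
    #include <vector>

    template<class T>
    class PLCF
    {
    public:
        // Reader stands in for the OLEStreamReader API used in wv2.
        template<class Reader>
        PLCF(Reader& reader, unsigned int byteCount)
        {
            // byteCount == (n + 1) * 4 + n * T::sizeOf  =>  solve for n
            const unsigned int n = (byteCount - 4) / (4 + T::sizeOf);
            for (unsigned int i = 0; i <= n; ++i)
                m_positions.push_back(reader.readU32());
            for (unsigned int i = 0; i < n; ++i)
                m_items.push_back(T(reader));  // each T reads itself
        }

    private:
        std::vector<unsigned int> m_positions;  // n + 1 CPs or FCs
        std::vector<T> m_items;                 // the n payload structures
    };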
It's a tedious job to implement the most basic part of the filter -- reading and writing the structures in Word documents. It is boring, repetitive, and error prone, so we decided to generate this ultra-low-level code. We're using two Perl scripts and the available HTML specifications for Word 8 and Word 6. One script, called generate.pl, is used to scan the HTML file and output the reading/writing code and some test files. The other script, convert.pl, generates code to convert Word 6 structures to Word 8 structures. We need to do this because we want to present the files as Word 8 files to the outside world. The idea behind that is to hide all the subtle differences between the formats from the user of this library. For Word 6 this seems to be possible; no idea whether that will work out for older formats.
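To give you an idea of what the generated code looks like, here's a hand-written sketch in the same spirit. DTTM is a real Word structure (a packed date/time), but the code below is simplified compared to the actual output of generate.pl:

    // Hand-written sketch in the spirit of the generated structures; the
    // real code produced by generate.pl differs in the details.
    struct DTTM
    {
        DTTM() : mint(0), hr(0), dom(0), mon(0), yr(0), wdy(0) {}

        // Reads the packed structure field by field from the stream;
        // readU16() stands in for the real OLEStreamReader API.
        template<class Reader>
        void read(Reader& reader)
        {
            unsigned short shifterU16 = reader.readU16();
            mint = shifterU16 & 0x3F;          // minutes, 6 bits
            hr   = (shifterU16 >> 6) & 0x1F;   // hours, 5 bits
            dom  = (shifterU16 >> 11) & 0x1F;  // day of month, 5 bits
            shifterU16 = reader.readU16();
            mon  = shifterU16 & 0xF;           // month, 4 bits
            yr   = (shifterU16 >> 4) & 0x1FF;  // year, 9 bits
            wdy  = (shifterU16 >> 13) & 0x7;   // weekday, 3 bits
        }

        unsigned short mint, hr, dom, mon, yr, wdy;
    };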
A vital part of the whole library is its set of self-checking unit and function tests, which help us avoid introducing hard-to-find bugs while implementing new features. The goal is to test the major components, but it's close to impossible to test everything. Please run the unit tests before you commit bigger changes, to see if something breaks. If you find out that some test is broken on your platform, please send me the whole output, some platform information, and the document you used for testing.
It's a bit hard to test the proper parsing of a file, so there aren't many fully automatic tests for that part of the code yet, and I honestly don't see any easy way out. We probably have to think of some way to abuse the KWord wv2 consumer to perform the synthetic tests. Suggestions are highly welcome.
At the moment most of the basic work for the filter is done or close to being done. We have a working build system, code to read and write OLE streams, code to scan the basic building blocks of Word documents, and some utility classes like the string class. The filter is able to read the text including properties, and it handles fonts, lists, headers/footers, sections, fields (to some extent), and more.
This section of the document lists all the existing items and the main idea behind them, and briefly discusses the design and the reasons we chose exactly that design. It also contains a discussion about code we don't have yet, and some ideas about a possible design.
The OLE chunk of the code is basically done. It utilizes libole2 and provides a stream-based API to read and write from/to the file. OLEStorage is the class to handle the whole document and travel through the "directories." OLEStream provides the common interface for readers and writers, like seeking in the stream and pushing and popping the current cursor position. OLEStreamReader and OLEStreamWriter inherit OLEStream and provide the real reading/writing methods.
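The push/pop mechanism is handy whenever a structure contains a file offset pointing somewhere else: push the current position, seek to the offset, read, and pop to continue where you left off. A sketch (the method names follow the description above; the exact signatures are assumptions):

    // Sketch of the push/pop idiom; the exact signatures are assumptions.
    void readIndirectData(OLEStreamReader* reader, int offset)
    {
        reader->push();          // remember the current cursor position
        reader->seek(offset);    // jump to the referenced structure
        unsigned short cb = reader->readU16();
        (void)cb;                // a real reader would process cb bytes here
        reader->pop();           // continue where we left off
    }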
This part of the code, contained in the ole* files, is generally straightforward. The main pending issue is the dependency on libole2; we might want to switch to libgsf or something completely different (LGPL code?) later on.
The API is a mixture of a good old "Hollywood Principle" API (don't call us, we'll call you) and a fancy functor-based approach. The Hollywood part of the API can be found in the handler.h file; it's split across several smaller interfaces. We are incrementally adding, moving, and removing functionality there, so please don't expect that API to be stable.
The main reason to choose this approach is that the very common callbacks like TextHandler::runOfText are as lightweight as possible. More complex callbacks like TextHandler::headersFound allow a good deal of flexibility in parsing, as the consumer decides when to parse e.g. the headers. This helps to avoid nasty hacks if the concepts of the destination file format differ from the MS Word ones.
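A consumer-side sketch, assuming a simplified TextHandler interface (the real one in handler.h has more callbacks and different parameter types):

    // Consumer-side sketch; the real TextHandler in handler.h has more
    // callbacks and different parameter types.
    #include <vector>

    class MyTextHandler : public TextHandler
    {
    public:
        // Lightweight, very frequent callback: just a virtual method call.
        virtual void runOfText(const UString& text)
        {
            m_body.append(text);
        }

        // Heavyweight callback: we merely store the functor and invoke
        // it after the main body text has been processed completely.
        virtual void headersFound(const HeaderFunctor& parseHeaders)
        {
            m_pendingHeaders.push_back(parseHeaders);
        }

    private:
        UString m_body;
        std::vector<HeaderFunctor> m_pendingHeaders;
    };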
The main task in the parser section is to find a design which allows us to share the common code between different file format versions. Another important task is to keep the coupling of the code reasonably low. I see a lot of places in the specification where information from various blocks of our design is needed, and I really hate code where every object holds five pointers to other objects just because it needs to query some information from each of these objects once in its lifetime. Code like that is a pain to maintain.
For the code sharing topic, my current idea is to have a small hierarchy of Parser* classes like this one:
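    // Minimal sketch of the intended hierarchy; the method names are
    // placeholders, not the real wv2 API.
    class Parser                    // the interface the outside world sees
    {
    public:
        virtual ~Parser() {}
        virtual bool parse() = 0;   // start the whole parsing process
    };

    class Parser9x                  // shared Word 6/Word 8 implementation
    {
    protected:
        virtual ~Parser9x() {}
        void parseText();           // non-virtual helper, shared code

        // Template method: the skeleton is shared, the details are
        // virtual and filled in by the concrete parsers.
        void parseStyleSheet() { /* common bookkeeping */ readStyle(); }
        virtual void readStyle() = 0;
    };

    class Parser97 : public Parser, private Parser9x
    {
    public:
        virtual bool parse();       // Is-A: implements the public interface
    private:
        virtual void readStyle();   // Is-Implemented-In-Terms-Of: supplies
                                    // the Word 8 specific details
    };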
Parser is an abstract base class providing a few methods to start the parsing and so on. This is the interface the outside world sees and uses. Parser97 (and also Parser95, which would be at the same position in the hierarchy as Parser97) are the two real classes doing all the work. So far this is perfectly normal Is-A inheritance and nothing to talk about. The Parser9x class is a bit different from what you would expect, though. Parser97 inherits privately from Parser9x (Is-Implemented-In-Terms-Of inheritance). Parser9x will consist of a number of non-virtual helper methods which can be shared between Word 6 and Word 8, and additionally implement the template method pattern for more complex shared algorithms (due to the virtual template methods we need inheritance; plain delegation isn't sufficient).
The whole parsing process is divided into different stages, and all this code is chopped into nice little pieces and put into various helper/template methods. We take care to separate methods in a way that as many of them as possible can be "bubbled up" the inheritance hierarchy right to Parser9x.
Right now Parser9x is empty, as the Word 6 parser isn't started yet. As soon as this gets done we can move around the code which lives in the parser97.* files for now.
To keep the coupling between the blocks of the design low, the parser has to implement the Mediator pattern or something similar. It is the only block in our design containing "intelligence", in the sense that it's the only block knowing about the sequence of parsing and the interaction of the encapsulated components, like the OLE subsystem and the stylesheet-handling utility classes.
The main classes, UString and std::string, are well tested and known to work well. One piece of advice: take a lot of care when using UString::ascii, because the buffer for the ASCII string is shared among all instances of UString (it's a static buffer)! As we need that method for debugging only, this is no problem. UString is implicitly shared, so copying strings is rather cheap as long as you don't modify them (copy-on-write semantics).
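A short demonstration of the pitfall:

    // Demonstrates the static-buffer pitfall of UString::ascii().
    UString one("Hello");
    UString two("World");

    const char* a = one.ascii();  // points into the shared static buffer
    const char* b = two.ascii();  // reuses (and overwrites) that buffer!

    // Both 'a' and 'b' now point at "World". Never keep the returned
    // pointer around; use it immediately, e.g. in a debug statement.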
Older Word versions don't store the text as Unicode strings, but encoded using some codepage like CP 1252. libiconv helps us to convert all these encodings to UCS-2 (sloppy: 16-bit Unicode). We don't use libiconv directly from within the library; instead we use a small wrapper class (Textconverter) for convenience.
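Under the hood the conversion boils down to plain libiconv calls like these. This is a sketch of what the wrapper does, not the Textconverter API itself (note that on some platforms iconv() takes a const char** input pointer):

    // Sketch of a libiconv-based conversion; this is not the
    // Textconverter API itself.
    #include <iconv.h>

    // Converts CP 1252 bytes to UCS-2 (host order); returns the number
    // of bytes written to 'out', or 0 on error.
    size_t convertToUCS2(const char* in, size_t inLen, char* out, size_t outLen)
    {
        iconv_t handle = iconv_open("UCS-2", "CP1252");
        if (handle == (iconv_t)(-1))
            return 0;  // this iconv doesn't support the conversion

        char* inPtr = const_cast<char*>(in);  // iconv's historic signature
        char* outPtr = out;
        const size_t outStart = outLen;
        iconv(handle, &inPtr, &inLen, &outPtr, &outLen);
        iconv_close(handle);
        return outStart - outLen;  // bytes actually written
    }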
Utility classes perform one specific, encapsulated task, like reading in the whole stylesheet information and providing convenient access to it. These classes are, IMHO, the key to clean code. Classes for the programming infrastructure, like the SharedPtr class, also belong to this category. If we manage to encapsulate many of the more complex structures in a Word document, the code inside the parser will get a lot simpler.
Currently we have code to read stylesheets (styles.*) and some code which helps us to read the meta structures in Word documents, like the PLCF template in word97_helper.*. This code is quite simple; the only thing to watch out for is that using the C++ sizeof() operator is dangerous. The reason is that the structures in the Word file are "packed", meaning there are no padding bytes between the variables. In our generated code we can't achieve that in a portable manner, so we decided not to use packing at all. Due to that, reading a whole structure in at once doesn't work, not even on little-endian platforms, but we have the appropriate read() methods anyway. Another consequence is that we can't use sizeof() on the structures, as it will almost always return values that are too large. For structures where you need that information we can add a sizeOf variable (please check the code generation script for more information).
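A small example of why sizeof() lies here (the structure is made up for illustration; the real ones are generated):

    // Made-up structure illustrating the padding problem; the real
    // structures are generated.
    struct Example
    {
        unsigned char fFlag;   // 1 byte in the file
        unsigned int  fcData;  // 4 bytes in the file

        static const unsigned int sizeOf = 5;  // 1 + 4 bytes in the file

        template<class Reader>
        void read(Reader& reader)  // field by field, endian-safe
        {
            fFlag  = reader.readU8();
            fcData = reader.readU32();
        }
    };
    // sizeof(Example) is typically 8, not 5: the compiler inserts three
    // padding bytes before fcData to align it. Reading the whole struct
    // with one fread()-style call would therefore go wrong.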
As stated various times above, we generate a few thousand lines of code from the HTML specification. The design of this code is non-existent; it's just a number of structures supporting reading, writing, copying, assignment, and so on. Some of the structures are only partly generated (like the apply() method of the main property structures PAP, CHP, SEP, and others). Some structures are commented out, as it would be too hard to generate them. These few structures have to be written manually if they are needed.
Generally we just parse the specification to get the information out, but sometimes we need a few hints from the programmer to know what to do. These hints are mostly given by adding special comments to the HTML specification. For further information on these hints, and on the available tricks, please have a look at the top of the Perl scripts. The comments are quite detailed and it should be easy to figure out what I intend to do with the hints.
Another way to influence the generated code is to manipulate certain parts of the script itself. You need to do that to change the order of the structures in the file, disable a structure completely, and so on. You can also select structures to derive from the Shared class, to be able to use them with the SharedPtr class.
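The Shared/SharedPtr pair implements simple intrusive reference counting, roughly like this (a sketch; the real classes differ in the details):

    // Rough sketch of intrusive reference counting in the spirit of
    // Shared/SharedPtr; the real classes differ in the details.
    class Shared
    {
    public:
        Shared() : m_count(0) {}
        virtual ~Shared() {}
        void ref() { ++m_count; }
        bool deref() { return --m_count == 0; }
    private:
        int m_count;
    };

    template<class T>  // T has to derive from Shared
    class SharedPtr
    {
    public:
        explicit SharedPtr(T* p) : m_ptr(p) { if (m_ptr) m_ptr->ref(); }
        SharedPtr(const SharedPtr& rhs) : m_ptr(rhs.m_ptr) { if (m_ptr) m_ptr->ref(); }
        ~SharedPtr() { if (m_ptr && m_ptr->deref()) delete m_ptr; }
        T* operator->() const { return m_ptr; }
    private:
        SharedPtr& operator=(const SharedPtr&);  // omitted in this sketch
        T* m_ptr;
    };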
The whole generated file might need some minor tweaking: a license, #includes, and maybe even some declarations or code. This is what the template files in wv2/src/generator are for -- the code gets copied verbatim into the generated file. Never manipulate a generated file; all your changes will be lost when the code is regenerated!
If you think you found a bug in the specification, you can try to correct the HTML file and regenerate the scanner code using the command make generated. In case you aren't satisfied with the resulting C++ code, or if you found a bug in the scripts, please contact me. If you aren't scared by a bit of Perl code, feel free to fix or extend the code yourself.
There's not much to say about the unit tests. If you add new code, please also add a test for it, or at least tell me to do so. The header test.h contains a trivial test method and a method to convert integers to strings (as std::string doesn't have such functionality).
If you decide to create a unit test, please ensure that it's self-checking: if it runs to the end, everything is all right; if it stops somewhere in between, something unexpected happened. Oh, and let me repeat the warning that UString::ascii() might produce unexpected results due to the static buffer.
Please send comments, corrections, condolences, patches, and suggestions to Werner Trobin. Thanks in advance. If you really read this document all the way to here, I owe you a beverage of your choice next time we meet :-)