AbiWord Development Area
Place content here related to AbiWord Development.
General Description of AbiWord's Layout code
See the Layout Code section for this.
Proposals and Whacky Ideas
Online document reader
I am thinking of creating a document reader based on abiword.
This would mean adding a facility to hyperlink between documents, and having a document location toolbar, and the facility to communicate via hyper text transfer protocol.
There would be no changes server side, but documents would be simply mapped as abiword files using a URL as follows:
Benefits of using abiword over a browser
The benefits of using abiword over a browser is that abiword supports multipage documents, and it is easy to insert page breaks, and other such formatting in abiword.
Use a cooperate code base rather than fork
Ideally it would be useful for the online document reader to use a cooperative codebase with abiword, rather than creating a fork at this stage. This is because abiword still has not matured and some of the document editing facilities, (especially in relation to table editing, etc) are not yet perfected. A fork would mean that the reader would inherit all of the current abiword bugs. The other benefit of using a cooperative code base, is that the insert hyperlink facility could browse existing files, and display anchor points for existing documents on the system, and creating a relative path hyperlink for use by the reader.
Using a common code base
I propose some sort of code base structure as follows:
abiword - This is abiword abireader - This is the document reader abicommon - This is the common code base used by both abiword and abireader
Planning for future versions, etc. Hardcopy of ml or IRC discussions in general:
AbiWord s performance degrades with document size until it becomes barely usable at around 400 pages with reasonable computers.
I think that there are a few areas where we can substantially improve performance. To start with we improved the PieceTable as described below.
Abiword has always used a piece table to store the text (and formatting, more on that later) of a document. The piece table mission is store the text of the document, and allow you to insert, delete and lookup characters from random locations quickly. Another structure that can be used in a minimalistic editor can be simply an array with some free space at the end (like a std::vector). You will be able to lookup very quickly from an array as v[i] is just a direct lookup in memory, a O(1) operation, but insertion will be very slow as to insert in the i position you have first to move all the elements j > i one position to the right. In the worst case, if you insert at i=0, you will have to move N elements (an O(n) operation). You have the same complexity for a delete operation.
As this used to be too slow even for moderately sized documents (I don't know how this will work in modern computers), people started using more complicated algorithms to store text where they can make modifications faster. For instance emacs implemented a "gap array", where they have an array exactly like the one I described, except that the free space (the "gap") was not at the end of the array, but somewhere in the middle (to be more exact, it was at the cursor position). This way to insert a character you add it in the free space and you shrink the gap 1 character. To delete a character you expand the gap 1 character, and to move the cursor position to the left or to the right you just copy 1 character from the left side of the gap to the right side (effectively moving the gap to the left or to the right). As you can imagine, that's not very efficient if you want to make the cursor jump quickly from the beginning of the document to the end, and you get a bit blocked when the gap is exhausted, as emacs has to reallocate the buffer and duplicate it with a new gap (you may haveexperienced this when editing huge documents in emacs).
I think this strategy is not appropriate for editors with scroll bars, where people are used to jump randomly in the document way more than a classic emacs user, but it's simple and it works well enough.
Abiword (as MS Word) uses a "piece table", it's actually fairly simple. You have 2 arrays, one of them is immutable and contains the original string of characters to edit (the "initial array"), the other one (the "change array") is an empty array with some free space at the end (like a std::vector<char>). We will only append characters to the change array.
You also need a list (can be a linked list, more on that later) of "Node"s, each one of them points to a part of the text, they will contain:
- the array that they are pointing to ("initial" or "change")
- the offset of the text inside this array
- the size of the text that they point to
For instance, if your arrays are:
- Initial: [This is an example]
- Change: 
and you have a Node with:
- buffer: Initial
- offset: 5
- size: 2
This node points to the word "is". You start con a list of Nodes of 1 element, the first and only Node will be:
- buffer: Initial
- offset: 0
- size: size of Initial (18 in our example)
- (from now on, I will represent this like [Initial, 0, 18])
This obviously points to the entire original text. When the user inserts a new character, it gets appended to the change buffer, and Nodes are created / splitted to represent the new text. For instance, let's say that you want to insert the character "X" at the end of "This" in the above example, you will end up with the following values in the buffers and in the list of nodes:
- Initial: [This is an example]
- Change: [X]
- [Initial, 0, 4], [Change, 0, 1], [Initial, 4, 14]
To delete a character you don't delete anything in the buffers, you just change a Node from the list of Nodes (splitting it, removing it or shrinking it, depending of the position of the character).
Now, insertion and deletion don't depend anymore on the size of the document, but on the number of modifications that the user has performed in the document. Lookup is also slower, as now it has to find the right piece for a particular offset, again an operation that depends on the number of modifications (O(M)).
That fine for MS Word, but at least in Abiword the pieces have been used not only to store the modifications, but also styles. Ie, a node needs to have a uniform style (font, size, etc.). So an empty document doesn't start with a single Node, but with as many nodes as different spans of text you have in the document. The number of Nodes is thus quite big, and this O(M) complexity in lookup, insert and delete is in practice quite expensive.
Martin did a few years ago a hack to improve lookup, after every mutation to the piece table, he walked the entire list of Nodes (or "pieces") and created a vector that stored these Nodes and their positions. This way you could do a binary search for positions (a O(log(M)) operation. Insert and deletion became even slower and still O((M))
The data structure that Joaquin devised was a way to store this "list of Nodes" in such a way that all the operations are worst case O(log(M)). From here, you can read http://e98cuenc.free.fr/wordprocessor/piecetable.html and it should hopefully make sense.
Abiword has a doubly linked-list of fragments which can be any sort of document structure including paragraph blocks and text structures. Each fragment knows its position in the document and has links to the next and previous fragment.
To speed up searches for character positions we have a cache containing the most recently used fragment and a vector of fragment positions. The combination allows to either find the location immediately and to do binary searches for the location of a given position at speed that decreases as log(N) where N is the number of fragments in the document.
However upon insertions or deletions in the piecetable we need to update every fragment downstream of the change. This operation is order (N) where N is the number of fragments.
For a typical 300 page document Tomas estimates we do about 16000 updates for each insertion/deletion event. That is every time the user presses a key to insert text we need to do 16,000 updates of our document.
A few years ago Joaquin Cuenca Abela developed a prototype Piece Table based on a balanced Red/Black tree structure where all operations (search,insert,delete) require Order(Log(N)) operations.
I estimate that the example only requires 30 operations for Joaquins Piece Table. This is one of the major performance improvements we can make to AbiWord.
This was implemented during the Google Summer of Code 2009 project by Aditya Manthramurthy and Martin Sevior. The code was merged into AbiWord trunk in May 2010. So now AbiWord has a state of the art PieceTable where all searches, inserts and deletes are done in speed O(LOG(N)).
The UpdateLayout methods in the fl_SectionLayout classes
Every time a user presses a key in AbiWord the PieceTable is modified and new ChangeRecord describing the change is created. These ChangeRecords are both saved in an undo stack in the AbiWord document and pointers to them are broadcast to every view attached to the document via the FL_DocListener class.
The change record has a pointer to fl_BlockLayout classes that contains the changed location. This fl_BlockLayout class is called via the appropriate method and updates itself and the physical runs (fp_Run classes) contained by the block.
It then calls a method called "format" which breaks text runs into lines which fit into the container that contains the block.
At end of every single edit, the method fl_DocLayout::updateLayout(..) is called.
This scans the entire document looking for blocks that need to be updated. We can speed things up enormously by making adding pointers to a local cache of blocks in each section that need to be formatted.
This would speed the process so that instead of requiring N operations for each keypress, it only requires O(1-3) operation.
This has now been implemented as of release 2.8.0 for many operations.
This method in fb_ColumnBreaker assembles lines (fp_Line) and tables (fp_TableContainer) onto pages. It needs to iterate through an entire section after a line changes its position or height. The length of time this takes is Order(N) where N is the number of lines.
It might be possible to solve this order(N) problem for a number of cases using Joaquins Red/Black tree but doing so will be very tricky. Especially when taken in conjunction with the complexities of break tables across pages.
A less ambitious goal and one that would make Abiword appear to be faster, is to just update the pages from the beginning of the document (?) until just after the current on screen page during the keystroke processing, then finish the rest of the update during idle between keystrokes.
_Oops—I had trouble understanding Martins original paragraph and made some changes including inserting the phrase "from the beginning of the document" in an attempt to clarify. On further thought ;-), and pending clarification from Martin, I realize he might have meant soemthing like "from the insertion point (for the current change / keystroke)" -- Main.RandyKramer - 22 Mar 2005_
- Note*: a while back we tried to format the layout only until the end of the screen. This caused huge problems (crashes) because the x,y coordinates of various layout elements get queried and used even when the elements are not on screen (not sure why). Leaving part of the layout update into idle time would have potentially similar consequences; this is probably the way to go, but it might be initially more disruptive than it appears.
The current table layout algorithm is adapted from the algorithm used by Gtk+. It is very robust amd handles tables nested to arbitary depth. Unfortunately it takes Order N^2 time to layout tables where is the number of cells. Ive implemented a few hacks to speed things up while typing for abiword-2.2 but the fundamental problem persists.
It may well be worth examining the algorithms used by the gecko layout engine which can position complex tables with thousands of cells extremely quickly.
This section describes some of the issues various plugins have, which are not directly bugs.