PDF Import plugin (with style)

From AbiWiki

Current revision as of 19:05, 26 December 2010

By Jauco Noordzij

Synopsis

AbiWord currently has a basic PDF import plugin based on libPoppler. This plugin extracts only the plain text from a file, not the styles and their meanings, such as headings. I propose to build a plugin that recovers those as well.

What I propose

I propose to create an AbiWord import plugin that takes a PDF file, analyses it, and applies a set of heuristics to it. The result will be a document loaded in AbiWord with markup and styles.

A PDF file consists of a number of blocks of text. These blocks might be paragraphs or (parts of) sentences; sometimes even a single word is split across multiple blocks. My plugin will apply heuristics (rules of thumb) to these blocks to join them into paragraphs and to infer which style was associated with each paragraph. The first heuristic might be "people read from left to right and top to bottom", which translates into sorting the blocks first on y and then on x placement. Then, using the heuristic "the most frequent and unobtrusive style is called normal", we can aggregate the markup styles, see which style is seen the most, and call that style "normal". I mean "seen" literally, as in seen by the user. The styles that are always applied to short paragraphs placed above "normal" paragraphs can be called headings. Of course more rules must be defined, and probably some exceptions too.

I have tried some heuristics in a bash/XSLT proof-of-concept script that used the XML output of pdf2html. It was able to arrange text in the order it would be read and to discover columns, headings and paragraphs with an indent on the first line. However, getting from that to a proper framework in C++ would still be quite some work.

This idea won't produce a mathematically proven way of always importing PDFs correctly, but it will work well because it uses cues that people take the trouble to put into their documents. People want the headings to look the same, the titles to be at the beginning, and the page numbers to be in the margin and to increment by one. The visual cues are the only thing that is more or less guaranteed to be applied consistently in any document.
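The reading-order, "normal" and heading heuristics described above can be sketched in a few lines. This is only an illustration in Python, not the C++ the plugin would be written in. The `Block` structure, its field names, the style labels and the 80-character limit for a "short" paragraph are all assumptions made for the example, and "seen the most" is approximated by counting characters.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One text block from a PDF page (hypothetical structure)."""
    page: int
    x: float    # left edge of the block
    y: float    # top edge of the block, growing downwards
    style: str  # assumed label for font and size, e.g. "Times-10"
    text: str


def reading_order(blocks):
    """Sort blocks by page, then top to bottom, then left to right."""
    return sorted(blocks, key=lambda b: (b.page, b.y, b.x))


def normal_style(blocks):
    """Return the style that covers the most visible characters."""
    seen = Counter()
    for block in blocks:
        seen[block.style] += len(block.text)
    return seen.most_common(1)[0][0]


def heading_styles(blocks, max_chars=80):
    """Return styles used only on short blocks, at least once above 'normal' text."""
    ordered = reading_order(blocks)
    normal = normal_style(ordered)
    candidates = set()
    rejected = set()
    for current, following in zip(ordered, ordered[1:]):
        if current.style == normal:
            continue
        if len(current.text) > max_chars:
            # A style that also appears on long blocks is not a heading style.
            rejected.add(current.style)
        elif following.style == normal:
            candidates.add(current.style)
    return candidates - rejected


# Example input: blocks arrive in arbitrary order, as they might in a PDF.
example = [
    Block(1, 72, 300, "Times-10", "Second paragraph of body text, long enough to dominate."),
    Block(1, 72, 100, "Helvetica-Bold-14", "Introduction"),
    Block(1, 72, 130, "Times-10", "First paragraph of body text, set in the body style."),
]
print([b.text for b in reading_order(example)])
print(normal_style(example), heading_styles(example))
```

A real implementation would need the exceptions mentioned above; for instance, this sketch would also treat a short caption that happens to sit above body text as a heading.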

Because visual cues vary from culture to culture, the heuristics the plugin uses are partly based on defaults and partly on a set that can be selected by the user. The left-to-right heuristic will not work for Arabic or for vertically set Chinese, for example. Some sort of choice might be deduced from the character set or the language that is used, but I haven't thought of a solution yet, mainly because I'm not experienced in the possible differences between languages.
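One possible way to deduce a default from the character set, offered purely as a sketch and not as a worked-out solution, is to count characters by their Unicode bidirectional class. The function names and the `(x, y, text)` tuple layout below are assumptions for the example; only `unicodedata.bidirectional` is a standard library call.

```python
import unicodedata


def guess_direction(text):
    """Guess 'rtl' or 'ltr' from the bidirectional class of each character."""
    rtl = 0
    ltr = 0
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # Hebrew-like and Arabic-like letters
            rtl += 1
        elif bidi == "L":        # Latin, Cyrillic, CJK and other left-to-right letters
            ltr += 1
    return "rtl" if rtl > ltr else "ltr"


def sort_key_for(direction):
    """Return a sort key for blocks given as (x, y, text) tuples."""
    if direction == "rtl":
        # Top to bottom, then right to left.
        return lambda block: (block[1], -block[0])
    # Top to bottom, then left to right.
    return lambda block: (block[1], block[0])


blocks = [(50, 10, "first"), (300, 10, "second")]
direction = guess_direction(" ".join(text for _, _, text in blocks))
print(sorted(blocks, key=sort_key_for(direction)))
```

This only separates left-to-right from right-to-left scripts. It cannot tell whether a page is set vertically, so that choice would probably still have to come from the user or from the layout itself.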

Who I am

I'm a 23-year-old student of Information Science at Utrecht University in the Netherlands. During my bachelor's I have focused on usability engineering, software project management, software modeling and basic programming, rather than on advanced algorithm construction, code optimization and the like. I usually say that the focus was more on the "other" side of programming. I'm finishing my bachelor's project this month, but I have already started my master's. In the Dutch schooling system you sometimes have to wait a long time to take those final classes.

In my master's phase, I'm focusing on Content and Knowledge Engineering. This is a discipline that seeks to solve problems related to structured document retrieval and storage, and related issues such as knowledge management. I'm also following the Business Informatics program, which is basically about all the problems related to keeping information systems aligned with business goals and managing the implementation of new systems. It's really useful, though not where my heart lies.

In the Netherlands it is common for students to have a job next to their studies. After some time spent in various jobs in and out of the IT industry I have started a small freelance business, and I am currently phasing out of my regular job and into this one. Examples of customers I have done assignments for during the past few months are the universities of Delft and Utrecht and the municipality of Baarn. I try to keep the focus of my assignments within the domain of Content and Knowledge Engineering. I work mainly in web scripting languages, Delphi or Java, so I'm not that experienced with lower-level languages, but when I taught myself programming in high school I started out with assembly, C and C++. Back then I started programming more out of a fascination with computers, to try to understand how they work. Building programs was a means, not a goal, and these languages were the best way to learn the inner workings. Of course, when using Linux you have to learn bash scripting and a bit of Perl, and during my studies I have to use Prolog for knowledge systems and XSLT/XML for knowledge codification.

During a university project last year I had to think of ways to infer metadata from raw documents. We thought about harvesting keywords and adding structural information to documents, but settled on an approach where we built an inference engine to generate new metadata from the existing metadata in related documents, because we could only implement one project and we were all pretty excited about that one. Later I read about the PDF document import in AbiWord, and the brainstorming we did on harvesting and adding structure came back to me. I figured that it should be possible to recognize the formatting markup of the text and use it, combined with some basic heuristics on reading, to infer paragraphs and their styles.

The Summer of Code looked like the perfect opportunity to implement this: the subsidy would mean that I don't actively have to pursue new assignments for a while, the summer would mean that I don't have classes, and the project itself would mean that I have a solid base on which I can continue to work in my spare time. I might even be able to start some other project on top of it for my master's thesis next year.

What I will deliver

At the end of the project I will deliver a plugin for AbiWord that will be able to read basic PDF files such as scientific papers. It will load them and display them with markup, and will at least have interpreted the following: headings, page numbers, columns and the table of contents. The plugin will be a good enough framework that other functions can be implemented later on. A framework for supporting non-western documents will also be built in, although the heuristics for such documents probably will not. PDFs with extravagant functions or really non-standard markup, such as posters, might not be imported perfectly. We might wish to compile a set of test documents that have to pass at the end of the project.
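As an illustration of one of these deliverables, the page-number cue ("in the margin and incrementing by one") could be checked roughly as follows. This is a sketch under stated assumptions: pages are given as lists of `(y, text)` tuples with y growing downwards, the margin is taken to be 72 points, and the function name is made up for the example.

```python
import re


def find_page_numbers(pages, page_height, margin=72.0):
    """Return {page_index: printed_number} for margin numbers that count up by one.

    `pages` holds one list per page of (y, text) tuples (assumed layout).
    """
    # Step 1: collect every purely numeric block in the top or bottom margin.
    candidates = []
    for blocks in pages:
        numbers = set()
        for y, text in blocks:
            in_margin = y < margin or y > page_height - margin
            if in_margin and re.fullmatch(r"\d+", text.strip()):
                numbers.add(int(text.strip()))
        candidates.append(numbers)

    # Step 2: keep a number only if a neighbouring page continues the sequence.
    result = {}
    for index, numbers in enumerate(candidates):
        for number in numbers:
            prev_ok = index > 0 and (number - 1) in candidates[index - 1]
            next_ok = index + 1 < len(candidates) and (number + 1) in candidates[index + 1]
            if prev_ok or next_ok:
                result[index] = number
    return result


pages = [
    [(40, "A study of things"), (400, "Body text 2006"), (800, "1")],
    [(400, "More body text"), (800, "2")],
    [(400, "12"), (800, "3")],
]
print(find_page_numbers(pages, page_height=842))
```

Blocks recognised this way can be left out of the text flow, so that a page number does not end up in the middle of a paragraph that continues on the next page.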

A rough timeline

Since I will be working on the project on my own, and since there is a real possibility that I will make some wrong assumptions about the implementation at the beginning, I'll start with throwaway prototyping. With this methodology, errors made at the beginning won't carry through to the finished product, where they are really hard to fix. I will develop separately (1) a routine that creates a document that can be passed to AbiWord, (2) a PDF importing routine that loads the information we need in a form we can work with, (3) a routine that applies separately defined heuristics to a sample text and (4) a way of selecting a subset of heuristics based on properties (probably tags). The problem with a bottom-up approach like prototyping is that the architecture might not be very well thought out, because the focus is not wide enough. So when I have enough knowledge of the process I will reverse and begin a top-down cycle, starting with (5) describing the global architecture in UML, (6) implementing it and then (7) filling in the blank spaces step by step. This second part will make sure I deliver a solid framework that can be worked on, and not just a bunch of code that even I no longer understand.
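Prototypes (3) and (4) could start from something as small as the sketch below: heuristics are defined separately as plain functions, each carries a set of tags, and a subset is selected by tag before being applied in turn. The decorator, the tag names and the dictionary layout of a block are all assumptions for the example, not a design decision.

```python
HEURISTICS = []  # (tags, function) pairs in registration order


def heuristic(*tags):
    """Register a function as a heuristic carrying the given tags."""
    def register(func):
        HEURISTICS.append((frozenset(tags), func))
        return func
    return register


@heuristic("order", "ltr")
def sort_left_to_right(blocks):
    return sorted(blocks, key=lambda b: (b["y"], b["x"]))


@heuristic("order", "rtl")
def sort_right_to_left(blocks):
    return sorted(blocks, key=lambda b: (b["y"], -b["x"]))


@heuristic("cleanup")
def drop_empty_blocks(blocks):
    return [b for b in blocks if b["text"].strip()]


def select(required=(), excluded=()):
    """Pick heuristics that have all `required` tags and none of the `excluded` ones."""
    required, excluded = set(required), set(excluded)
    return [func for tags, func in HEURISTICS
            if required <= tags and not (excluded & tags)]


def apply_heuristics(blocks, heuristics):
    """Run each selected heuristic over the blocks, in order."""
    for func in heuristics:
        blocks = func(blocks)
    return blocks


blocks = [
    {"x": 200, "y": 10, "text": "right"},
    {"x": 50, "y": 10, "text": "left"},
    {"x": 50, "y": 5, "text": "  "},
]
western = select(excluded={"rtl"})
print([b["text"] for b in apply_heuristics(blocks, western)])
```

Keeping the heuristics as data rather than hard-coding them into the import routine is what would let a user, or a language-detection step, swap in a different set later.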

I'd like to finish each prototype in one to two weeks; I give myself five weeks for all of them, plus any other research I find I need to perform.

Then the midterm evaluation hits. I will have four prototypes which show that the separate steps work, so the feasibility of the project can be tested.

(5): This will be done during and after the prototype test. I will have it finished during the sixth/seventh week.

(6): Converting the UML model to a code framework is trivial and largely automated. This won't really take up much time.

(7): The implementation will be a successive set of steps that I have six more weeks to implement. The precise timeline will have to be based on the architecture I create in the previous weeks.

What is in it for AbiWord

AbiWord is a solid cross-platform word processor at the moment. However, it hasn't got the immense feature set of MS Word or OpenOffice.org. Instead of focusing on supporting all of MS Word's features and thus always staying one step behind, AbiWord follows its own route and comes up with original functionality such as real-time collaboration.

PDF import fits this strategy because it is one of the most requested features from people receiving PDF documents and, as far as I know, has not been implemented by any other party. Usually it is even said to be impossible.

PDF import will allow people who use AbiWord on devices with a non-standard screen size (phones/PDAs) to reflow the document. Knowledge of headings allows people to insert tables of contents into their documents, and of course the ability to load a document and edit some parts of it is always useful.

I understand that one of the main considerations at the time you are reading this proposal is whether or not the project will be completed by the end of the summer. I have experience with planning software projects, meeting deadlines, and preventing things like feature creep. I am also able to spend the summer working on this project, and I have the ambition to use it as the stepping stone for my master's thesis next year.

I hope that with this proposal I have been able to fully explain the merits and functions of this software project. If there are any questions left unanswered, please don't hesitate to call, IM or mail me at the addresses I put at the top of this proposal.
