OdtToText2.java – Revision 1.2

May 29, 2007

Added support for “sections” in ODT documents.

View the CVS log here


OdtToText2.java (Initial Version Committed)

May 21, 2007

The initial version of OdtToText2.java was committed to the CVS by Bernd Eilers. View the CVS log here

Features

  1. Extracts text and headings from an ODT file
  2. Uses the classes TextBody, BlockContent, Element, etc., in odf.text
  3. No manual SAX parsing

To Do

  1. Extend it to extract other information from an ODT file, such as table information.

    Test Run:

    • Input – ODT file containing simple text, a heading and a list of elements
    • Output:

    DEBUG unhandled elem is org.openoffice.odf.text.UnknownElement node=office:forms
    DEBUG unhandled elem is org.openoffice.odf.text.UnknownElement node=text:sequence-decls
    He heard quiet steps behind him. That didn’t bode well. Who could be following him this late at night and in this deadbeat part of town? And at this particular moment, just after he pulled off the big time and was making off with the greenbacks. Was there another crook who’d had the same idea, and was now watching him and waiting for a chance to grab the fruit of his labor?

    *  aaa
    *  bbb
    *  ccc
    *  ddd
    *  eee
    *  ffff

    ==== Heading ====
    text below heading
    *  aaa
    *  bbb
    *  ccc
    *  ddd
    *  eee
    *  ffff


    Challenge to be taken care of (Bernd Eilers)

    May 19, 2007

    What's returned as java.util.List by the textBody.getContent() call is
    in fact an instance of class BlockContent, which extends
    java.util.AbstractList. The listIterator() and iterator() methods of
    AbstractList, which are currently not overridden in BlockContent,
    would likely call the get(int index) method with an advancing index
    for every next() call on the Iterator. The get(int index) method that
    we have in BlockContent now basically starts at the first childElement
    on every call and advances until it reaches the element at the given
    index. Taken together, this means that the current iterator is highly
    inefficient, especially for large documents: a full pass over n
    elements costs on the order of n² steps. So it would be a good idea to
    implement an inner class that implements ListIterator for BlockContent
    and to override the iterator() and listIterator() methods to return an
    instance of that class. This inner class should keep a pointer into
    the DOM tree to remember where it is and just call
    factory.getElement(), similar to how it is done in the get(int index)
    method.
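
The idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the odf4j code: ChildList is a hypothetical stand-in for BlockContent, it hands back raw org.w3c.dom nodes instead of calling factory.getElement(), and for brevity it overrides only iterator() (a full ListIterator would also need previous(), which could use getPreviousSibling()). The getCalls counter and the sampleBody helper exist purely for demonstration.

```java
import java.util.AbstractList;
import java.util.Iterator;
import java.util.NoSuchElementException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class CursorIterationSketch {

    /** Hypothetical stand-in for BlockContent: a read-only list view over a DOM node's children. */
    static class ChildList extends AbstractList<Node> {
        private final Node parent;
        int getCalls = 0; // counts get(int) calls, for demonstration only

        ChildList(Node parent) {
            this.parent = parent;
        }

        /** Index-based access: walks from the first child on every call. */
        @Override
        public Node get(int index) {
            getCalls++;
            if (index < 0) {
                throw new IndexOutOfBoundsException("index: " + index);
            }
            Node n = parent.getFirstChild();
            for (int i = 0; i < index && n != null; i++) {
                n = n.getNextSibling();
            }
            if (n == null) {
                throw new IndexOutOfBoundsException("index: " + index);
            }
            return n;
        }

        @Override
        public int size() {
            int count = 0;
            for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
                count++;
            }
            return count;
        }

        /** Cursor-based iterator: remembers its position in the DOM tree, never calls get(int). */
        @Override
        public Iterator<Node> iterator() {
            return new Iterator<Node>() {
                private Node next = parent.getFirstChild();

                @Override
                public boolean hasNext() {
                    return next != null;
                }

                @Override
                public Node next() {
                    if (next == null) {
                        throw new NoSuchElementException();
                    }
                    Node current = next;
                    next = next.getNextSibling(); // one step per call
                    return current;
                }
            };
        }
    }

    /** Builds a small in-memory DOM tree with the given number of child elements (demo helper). */
    static Node sampleBody(int paragraphs) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Node body = doc.appendChild(doc.createElement("body"));
            for (int i = 0; i < paragraphs; i++) {
                body.appendChild(doc.createElement("p")).setTextContent("paragraph " + i);
            }
            return body;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ChildList content = new ChildList(sampleBody(3));
        for (Node n : content) { // for-each uses the overridden iterator()
            System.out.println(n.getTextContent());
        }
        System.out.println("get(int) calls: " + content.getCalls);
    }
}
```

With the override in place, a for-each loop over the list advances one sibling per step and the get(int) counter stays at zero. Note that listIterator() is not overridden in this sketch, so it would still fall back to the index-based path inherited from AbstractList.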


    Amit Kumar Saha (JCA)

    May 19, 2007

    My JCA has been approved, which makes me an OpenOffice.org contributor. The complete list of contributors is available here


    #odftoolkit Meeting

    May 19, 2007

    I attended my first IRC meeting at #odftoolkit. Not many turned up due to a public holiday, but I discussed a few things with Bernd Eilers regarding OdtToText.java.


    OdtToText2.java

    May 15, 2007

    Currently working on a rewrite of OdtToText.java (by LO), called OdtToText2.java. This will be upwardly compatible as it uses the new classes TextDocument, Body, ElementFactory, etc.


    Test Case for TextDocument.java

    May 11, 2007

    I successfully wrote a test case (TextDocumentTest.java) for TextDocument.java. It is not yet committed to the CVS.


    Project Membership Confirmed

    May 2, 2007

    I am now an official member of the ODF Toolkit Project. My current role is “Observer”. However, I expect an upgrade to “Developer” soon.

    See the list of members here


    Getting Started

    May 1, 2007

    • Checked out the CVS and updated my local repository
      • U odf4j/src/org/openoffice/odf/text/BlockContent.java
        U odf4j/src/org/openoffice/odf/text/BlockElement.java
        U odf4j/src/org/openoffice/odf/text/Body.java
        U odf4j/src/org/openoffice/odf/text/Element.java
        U odf4j/src/org/openoffice/odf/text/ElementFactory.java
        U odf4j/src/org/openoffice/odf/text/Heading.java
        U odf4j/src/org/openoffice/odf/text/InlineElement.java
        U odf4j/src/org/openoffice/odf/text/List.java
        U odf4j/src/org/openoffice/odf/text/ListItem.java
        U odf4j/src/org/openoffice/odf/text/Paragraph.java
        U odf4j/src/org/openoffice/odf/text/ParagraphContent.java
        U odf4j/src/org/openoffice/odf/text/Portion.java
        U odf4j/src/org/openoffice/odf/text/Section.java
    • Lots of new additions to the repository by Lars. These seem to follow the class diagram
    • Now will analyze the code and work on the API
    • Set up the project blog

    April 30, 2007

    • Lars informs me of new additions to the CVS
    • I send in the filled-out JCA so that I can become an official OpenOffice.org contributor

    April 26, 2007

    • Consulting the OpenOffice Text API and the AODL Text API so that the odf4j Text API stays consistent with them
    • For the next 14 days, April 27 to May 11, I will be working on finalizing the odf4j Text API

    April 25, 2007

    • Checked out some documents on ODF and especially ODT. This book is an excellent resource. Link
    • I checked out the source code from the CVS using NetBeans and built it. “BUILD SUCCESSFUL”! I tried OdtToText.java and used it to extract some text from an ODT file. Worked fine!
    • Finalized with Lars (Lars Oppermann) that I am going to work on the functioning of the ODT module for the upcoming 2-3 weeks.
    • Lars sent me the preliminary class diagram.