Sunday, January 01, 2012

Merging Word documents with docx4j

Recently I needed to merge some Word .docx documents, and the tool we chose for this was docx4j (www.docx4java.org). This is a Java library for working with Microsoft Open XML. There were two things that we needed to accomplish:

  1. Binding XML to various templates
  2. Merging documents as a result of the binding

I will write about merging documents, as binding is already explained well on the docx4j site. The author of docx4j also offers a commercial package for merging documents, but if you want to try it yourself, here are a couple of things that I managed to do with pretty decent results.

In docx4j version 2.7.1 you can work with java.io.File or java.io.InputStream. The first will do a good job if the file is present on your drive, and the second if you keep the content in a database (for example). When merging Word documents you have to take care of the relationships in the document itself. There are several elements with relationships that can span the whole document, but we were interested in just a few of them (images, footers and headers). These are listed as resources and have references that you can use in your paragraphs. It is worth mentioning that if you miss one relationship, your document will be unreadable (in most cases).
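To see which relationships a document actually carries before merging, it can help to look at the raw package. The sketch below is not docx4j code and not part of the original utility: it uses only the JDK to open a .docx as a zip and read the main document's relationship part (word/_rels/document.xml.rels). The class name DocxRelationships and its list method are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: lists the relationships of the main document part of a .docx. */
final class DocxRelationships {

    private static final String RELS_ENTRY = "word/_rels/document.xml.rels";
    private static final String RELS_NS =
            "http://schemas.openxmlformats.org/package/2006/relationships";

    /** Returns relationship Id -> Target in file order; empty if the entry is absent. */
    static Map<String, String> list(InputStream docx) throws Exception {
        Map<String, String> result = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!RELS_ENTRY.equals(entry.getName())) {
                    continue;
                }
                // Buffer the entry first so the XML parser never touches the zip stream.
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document xml = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(buffer.toByteArray()));
                NodeList rels = xml.getElementsByTagNameNS(RELS_NS, "Relationship");
                for (int i = 0; i < rels.getLength(); i++) {
                    Element rel = (Element) rels.item(i);
                    result.put(rel.getAttribute("Id"), rel.getAttribute("Target"));
                }
                break;
            }
        }
        return result;
    }
}
```

Because the method takes an InputStream, it works the same way for a file on disk and for content pulled from a database. Headers, footers and images can have their own relationship parts as well, which this sketch does not follow.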

So, to start we would:

  1. Load our initial file into a WordprocessingMLPackage (this is the file to which we want to attach the rest of the files, so in the end they look like one)
  2. Create a unique section template
  3. Reset sections (this removes all references from the existing template; remember that a section defines the page layout)
  4. Remove the body section (we can add it back at the end)
  5. Loop through the attachment files (if you do not have sections separating pages, you might add page breaks)
  6. Copy the relationships that you are interested in
  7. Copy the elements
  8. If you do not want page breaks, you can add an empty section instead
  9. Add the body section
  10. Reapply all headers and footers to the empty sections

This all might sound complicated, but in the end, once you get to know the structure of the WordprocessingMLPackage, it becomes easier.
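The heart of the relationship-copying step is that relationship ids (rId1, rId2, ...) are only unique within one document, so an attachment's ids will usually collide with the ones already in the destination. The sketch below is a deliberately simplified model that uses plain maps instead of docx4j parts; RelationshipRemapper and copyInto are hypothetical names, not anything from the library or the original utility. It shows the general idea only: give each copied relationship a fresh id and remember the old-to-new mapping so that copied content can be re-pointed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical, simplified model of relationship copying; not docx4j API. */
final class RelationshipRemapper {

    /**
     * Copies every source relationship (id -> target) into dest under a fresh id
     * and returns oldId -> newId so references in copied content can be rewritten.
     */
    static Map<String, String> copyInto(Map<String, String> dest, Map<String, String> source) {
        Map<String, String> oldToNew = new LinkedHashMap<>();
        int next = dest.size() + 1;
        for (Map.Entry<String, String> rel : source.entrySet()) {
            // Advance past any id that is already taken in the destination.
            while (dest.containsKey("rId" + next)) {
                next++;
            }
            String newId = "rId" + next;
            dest.put(newId, rel.getValue());
            oldToNew.put(rel.getKey(), newId);
        }
        return oldToNew;
    }
}
```

In a real merge the targets need the same care as the ids: two documents can both contain media/image1.png or header1.xml, so the copied parts must also get unique names. This model leaves that out.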

These are the code snippets that might be useful:

//...
public class MergeUtil implements IMergeUtil {

    //...

    private void mergeDocxFiles(WordprocessingMLPackage initialFile, List<WordprocessingMLPackage> attachementFiles,
            String outputFile) throws Exception {
        //...
        resetSections(wordMLPackageDest);
        //...
        addEmptySection(wordMLPackageDest, SectionType.PARAGRAPH);

        for (WordprocessingMLPackage wordprocessingMLPackage : attachementFiles) {
            //...
            traverseAndCopyRelationships(wordprocessingMLPackage.getPackage().getRelationshipsPart());
            traverseAndCopyElements(wordprocessingMLPackage.getPackage().getRelationshipsPart(),
                    wordprocessingMLPackage.getMainDocumentPart().getContent());
            //...

        }

        addEmptySection(wordMLPackageDest, SectionType.BODY);
        assignHeaderFooterData(wordMLPackageDest);
        //...                

    }

    //...

    private void addPageBreak() {
        logger.debug("Adding page break");
        org.docx4j.wml.P p = new org.docx4j.wml.P();
        org.docx4j.wml.R r = new org.docx4j.wml.R();
        org.docx4j.wml.Br br = new org.docx4j.wml.Br();
        br.setType(STBrType.PAGE);
        r.getContent().add(br);
        p.getContent().add(r);
        wordMLPackageDest.getMainDocumentPart().addObject(p);
    }

    @SuppressWarnings({ "restriction", "rawtypes" })
    private void traverseAndCopyElements(RelationshipsPart rp, List<Object> content) throws InvalidFormatException {
        for (Object o : content) {

            //...
            findResourceById(rp, ((org.docx4j.dml.picture.Pic) o6).getBlipFill().getBlip()
                    //...
                    .getEmbed());
            findResourceByName(wordMLPackageDest.getPackage().getRelationshipsPart(),
                    imageRelPartName);
            //...
        }
    }

    //...        
    private void findResourceById(RelationshipsPart rp, String lastId) {
        for (Relationship r : rp.getRelationships().getRelationship()) {
            Part part = rp.getPart(r);
            //...
            if (part.getRelationshipsPart(false) != null) {
                findResourceById(part.getRelationshipsPart(false), lastId);
            }
        }
    }

    private void findResourceByName(RelationshipsPart rp, String imageName) {
        for (Relationship r : rp.getRelationships().getRelationship()) {
            Part part = rp.getPart(r);
            //...
            if (part.getRelationshipsPart(false) != null) {
                findResourceByName(part.getRelationshipsPart(false), imageName);
            }
        }
    }

    private void traverseAndCopyRelationships(RelationshipsPart rp) throws InvalidFormatException {
        for (Relationship r : rp.getRelationships().getRelationship()) {
            Part part = rp.getPart(r);
            if (part != null) {
                //...
                if (part instanceof org.docx4j.openpackaging.parts.WordprocessingML.BinaryPartAbstractImage
                        || part instanceof org.docx4j.openpackaging.parts.WordprocessingML.FooterPart
                        || part instanceof org.docx4j.openpackaging.parts.WordprocessingML.HeaderPart) {
                    //...
                }

                if (part.getRelationshipsPart(false) != null) {
                    traverseAndCopyRelationships(part.getRelationshipsPart(false));
                }
            }
        }
    }
    //...
    private void resetSections(WordprocessingMLPackage wordMLPackage) throws InvalidFormatException {
        Document doc = (Document) wordMLPackage.getMainDocumentPart().getJaxbElement();

        for (Object o : doc.getBody().getContent()) {
            if (o instanceof org.docx4j.wml.P) {
                if (((org.docx4j.wml.P) o).getPPr() != null) {
                    org.docx4j.wml.PPr ppr = ((org.docx4j.wml.P) o).getPPr();
                    if (ppr.getSectPr() != null) {
                        //...
                    }
                }
            }
        }

        wordMLPackage.getMainDocumentPart().getJaxbElement().getBody().setSectPr(null);
    }
    //...
    private void addEmptySection(WordprocessingMLPackage wordMLPackage, SectionType type) {
        if (type.equals(SectionType.BODY)) {
            org.docx4j.wml.SectPr sectPr = objectFactory.createSectPr();
            sectPr.setPgSz(this.defaultSectionpgSz);
            sectPr.setPgMar(this.defaultSectionpgMar);
            sectPr.setCols(this.defaultSectioncols);
            sectPr.setDocGrid(this.defaultSectiondocGrid);
            wordMLPackage.getMainDocumentPart().getJaxbElement().getBody().setSectPr(sectPr);
        } else {
            //...
        }
    }
    //...
    private void assignHeaderFooterData(WordprocessingMLPackage wordMLPackage) throws InvalidFormatException {
        Document doc = (Document) wordMLPackage.getMainDocumentPart().getJaxbElement();
        int sectionCounter = 0;
        wordMLPackage.getMainDocumentPart().getContent();

        HeaderPart headerPart = new HeaderPart();
        headerPart.setPackage(wordMLPackage);
        headerPart.setJaxbElement(objectFactory.createHdr());
        Relationship rHdr = wordMLPackage.getMainDocumentPart().addTargetPart(headerPart);
        FooterPart footerPart = new FooterPart();
        footerPart.setPackage(wordMLPackage);
        footerPart.setJaxbElement(objectFactory.createFtr());
        Relationship rFtr = wordMLPackage.getMainDocumentPart().addTargetPart(footerPart);

        for (Object o : doc.getBody().getContent()) {
            if (o instanceof org.docx4j.wml.P) {
                if (((org.docx4j.wml.P) o).getPPr() != null) {
                    org.docx4j.wml.PPr ppr = ((org.docx4j.wml.P) o).getPPr();
                    if (ppr.getSectPr() != null) {
                        //...
                        if(!StringUtils.isEmpty(hr.getId()))
                            ppr.getSectPr().getEGHdrFtrReferences().add(hr);
                        //...
                    }
                }
            }
        }

        HeaderReference hr = objectFactory.createHeaderReference();
        hr.setType(HdrFtrRef.DEFAULT);
        FooterReference fr = objectFactory.createFooterReference();
        fr.setType(HdrFtrRef.DEFAULT);
        hr.setId(findRelationshipByTarget(wordMLPackage.getRelationshipsPart(),  String.format("/word/header1_%d.xml", lastHeaderReference)));
        fr.setId(findRelationshipByTarget(wordMLPackage.getRelationshipsPart(),  String.format("/word/footer1_%d.xml", lastHeaderReference)));

        if(!StringUtils.isEmpty(hr.getId()))
            wordMLPackage.getMainDocumentPart().getJaxbElement().getBody().getSectPr().getEGHdrFtrReferences().add(hr);
        if(!StringUtils.isEmpty(fr.getId()))
            wordMLPackage.getMainDocumentPart().getJaxbElement().getBody().getSectPr().getEGHdrFtrReferences().add(fr);
    }

    private String findRelationshipByTarget(RelationshipsPart rp, String target) throws InvalidFormatException {
        //...            
    }
}
Note: All code displayed above is the property of Sapiens North America.
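For readers who have not looked inside a .docx, it may help to see what the paragraph built in addPageBreak corresponds to in the document XML: a paragraph (w:p) containing a run (w:r) containing a break (w:br) of type "page". The sketch below is my own illustration using only the JDK's DOM API, with the hypothetical class name PageBreakXml. It is not docx4j code; it just builds the same element structure.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical illustration of the XML behind a page-break paragraph. */
final class PageBreakXml {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Builds <w:p><w:r><w:br w:type="page"/></w:r></w:p> and returns the w:p element. */
    static Element build() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().newDocument();

        Element p = doc.createElementNS(W_NS, "w:p");
        Element r = doc.createElementNS(W_NS, "w:r");
        Element br = doc.createElementNS(W_NS, "w:br");
        br.setAttributeNS(W_NS, "w:type", "page");

        r.appendChild(br);
        p.appendChild(r);
        doc.appendChild(p);
        return p;
    }
}
```

Section breaks work differently: they are carried by a w:sectPr element, either inside a paragraph's properties or as the last child of the body, which is why the snippets above treat sections separately from page breaks.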

3 comments:

  1. Hi Cavlin, I'm new to docx4j and I have a problem merging .docx files with it. I read your code snippets but I don't know how to make them work. Can you please send the MergeUtil class and the IMergeUtil interface to mrcancer91@gmail.com? Or give some more code in the snippets?
    I'm sorry if my English is bad.
    Thank you very much!

  2. Hi,

    Thanks for leaving the message. Unfortunately I do not have the code anymore, as I no longer work for the company where I developed it. The purpose of the code in this text was to show the general approach you need to merge two Word documents. The final code is much more complex. Knowledge of the Word document structure is very important to accomplish this. That is why Plutext offers a merge utility commercially as an extension to docx4j. You really need to gather resources from both documents, make sure sections are properly placed and structured, and replace or add resources in the document that you want to use as the finished template.

    Regards

  3. Hey, your code really helped me to alter my document. I have a problem: how do I insert a TOC on the second page?
