.docx is the file extension for files created using the default format of Microsoft Word 2007 or higher. This is the Microsoft Office Open XML WordProcessingML format, which is based on a zipped collection of eXtensible Markup Language (XML) files. Microsoft Office Open XML WordProcessingML is mostly standardized in ECMA-376 and ISO/IEC 29500.
Formerly, Microsoft Office used binary formats (.xls, .doc, .ppt), such as BIFF (Binary Interchange File Format) for Excel. It now uses the OOXML (Office Open XML) format. These files (.xlsx, .xlsm, .docx, .docm, .pptx, .pptm) are zipped XML.
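Because these formats are ordinary zip archives, a file can usually be recognized as a zip container from its first bytes. The following is a minimal sketch, not part of docxtemplater: the function name looksLikeZip is made up for this illustration, and the check only looks at the leading signature bytes, so it says nothing about whether the archive is a valid OOXML document.

```javascript
// Signature of a zip local file header: "PK" followed by 0x03 0x04.
const ZIP_SIGNATURE = [0x50, 0x4b, 0x03, 0x04];

// Returns true when the given bytes start with the zip signature.
// (Hypothetical helper, for illustration only.)
function looksLikeZip(bytes) {
  if (bytes.length < ZIP_SIGNATURE.length) {
    return false;
  }
  return ZIP_SIGNATURE.every((byte, i) => bytes[i] === byte);
}

// First bytes of a typical zipped file such as a .docx:
console.log(looksLikeZip(Uint8Array.from([0x50, 0x4b, 0x03, 0x04, 0x14]))); // true
// First bytes of a legacy binary (compound file) document:
console.log(looksLikeZip(Uint8Array.from([0xd0, 0xcf, 0x11, 0xe0]))); // false
```

With Node.js, the bytes could come from fs.readFileSync on the file to inspect, since a Buffer can be indexed like the Uint8Array used here.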
.docx is the new default Word format. It cannot contain any VBA (for security reasons, as stated by Microsoft). .docm is the Word format that can store VBA and execute macros.
The .docx format is a zipped file that contains the following folders and files:
+--docProps
| + app.xml // Contains the name of the software used for
| | // creating the document, the number of pages, characters,
| | // and some other configuration
| \ core.xml // Contains the name of the creator of the document,
| // the revision, and the last modification date.
+--word
| // This folder contains most of the files
| // that control the content of the document
|
| + document.xml // Contains the main content of the document.
| + endnotes.xml
| + fontTable.xml
| + footer1.xml // Contains the content of the footer of the document
| | // (there can be multiple footers called footer1.xml, footer2.xml, ...)
| + footnotes.xml
| +--media // This folder contains all images embedded in the word document
| | \ image1.jpeg
| + settings.xml
| + styles.xml
| + stylesWithEffects.xml
| +--theme
| | \ theme1.xml
| + webSettings.xml
| \--_rels
| \ document.xml.rels // This document tells where each image is located,
| // and all Relationships
+ [Content_Types].xml
\--_rels
\ .rels
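Inside the zip archive, these folders are not separate objects: each file is stored under a flat entry name that uses "/" as a separator. The sketch below rebuilds the top level of the tree from such entry names. The list of names is copied from the tree above (shortened), and groupByTopLevelFolder is a hypothetical helper written for this illustration.

```javascript
// Entry names as they would appear in the archive (a subset of the tree above).
const entryNames = [
  "[Content_Types].xml",
  "_rels/.rels",
  "docProps/app.xml",
  "docProps/core.xml",
  "word/document.xml",
  "word/styles.xml",
  "word/media/image1.jpeg",
  "word/_rels/document.xml.rels",
];

// Groups entry names by their top-level folder.
// Files stored at the root of the archive are grouped under "".
function groupByTopLevelFolder(names) {
  const groups = {};
  for (const name of names) {
    const slash = name.indexOf("/");
    const folder = slash === -1 ? "" : name.slice(0, slash);
    if (!groups[folder]) {
      groups[folder] = [];
    }
    groups[folder].push(name);
  }
  return groups;
}

console.log(groupByTopLevelFolder(entryNames));
```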
The main content of a docx file resides in word/document.xml.
A typical word/document.xml looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document>
<w:body>
<w:p w:rsidR="001A6335" w:rsidRPr="0059122C" w:rsidRDefault="0059122C" w:rsidP="0059122C">
<w:r>
<w:t>Hello </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r w:rsidR="008B4316">
<w:t>W</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t>orld</w:t>
</w:r>
<w:bookmarkStart w:id="0" w:name="_GoBack"/>
<w:bookmarkEnd w:id="0"/>
</w:p>
<w:sectPr w:rsidR="001A6335" w:rsidRPr="0059122C" w:rsidSect="001A6335">
<w:headerReference w:type="default" r:id="rId7"/>
<w:footerReference w:type="default" r:id="rId8"/>
<w:pgSz w:w="12240" w:h="15840"/>
<w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0"/>
<w:cols w:space="720"/>
<w:docGrid w:linePitch="360"/>
</w:sectPr>
</w:body>
</w:document>
The top-level tag is w:body (for the whole document). The body is divided into multiple w:p (paragraphs), followed by a w:sectPr, which defines the headers and footers used for that document.
Inside a w:p, there are multiple w:r (runs). Every run defines its own style (color of the text, font size, …), and every run can contain one or more w:t (text parts).
As you can see, a simple sentence like Hello World might be split across multiple w:t elements, which makes many things difficult to implement.
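To make the splitting concrete, here is a small sketch that collects the w:t contents of the paragraph shown earlier (with the rsid attributes and bookmarks removed for brevity). The regular expression and the helper name extractTextParts are assumptions made for this example; it does not decode XML entities and is not how docxtemplater reads the document.

```javascript
// The "Hello World" paragraph from above, without rsid attributes and bookmarks.
const paragraphXml =
  "<w:p><w:r><w:t>Hello </w:t></w:r>" +
  '<w:proofErr w:type="spellStart"/>' +
  "<w:r><w:t>W</w:t></w:r>" +
  '<w:proofErr w:type="spellEnd"/>' +
  "<w:r><w:t>orld</w:t></w:r></w:p>";

// Collects the content of every <w:t> element, in document order.
// The optional group allows attributes such as xml:space="preserve"
// while not matching other tags that merely start with "w:t".
function extractTextParts(xml) {
  const textTag = /<w:t(?:\s[^>]*)?>([^<]*)<\/w:t>/g;
  return Array.from(xml.matchAll(textTag), (match) => match[1]);
}

const textParts = extractTextParts(paragraphXml);
console.log(textParts); // [ 'Hello ', 'W', 'orld' ]
console.log(textParts.join("")); // Hello World
```

The sentence only becomes readable once the three parts are joined, which is why a placeholder typed as a single word in Word may end up spread over several w:t elements.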
When you generate a document from a template, there are basically two steps:
/*
* Step 1: compilation. This only takes the zip file as an argument and
* checks the template for errors, such as a missing closing delimiter.
* For example, with the following template:
*
* ```docx
* Hello {name
* ```
*
* An error will be thrown to tell that there is an unclosed tag.
*
* For a valid document, such as :
*
* ```docx
* Hi {user}
* ```
*
* It will return a valid compiled document.
*/
const doc = new Docxtemplater(zip, {
paragraphLoop: true,
linebreaks: true,
});
/*
* Step 2: rendering (add data to a compiled document).
* This takes a compiled document and fills in all placeholders.
*/
doc.render({
description: "New Website",
});
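The kind of check performed during compilation can be illustrated with a simplified sketch. This is not docxtemplater's implementation: checkDelimiters is a hypothetical function that works on plain text, and the error messages are invented for the example. It only shows the idea that an unclosed tag such as the one in Hello {name can be detected before any data is provided.

```javascript
// Throws if a start delimiter is never closed, or if an end delimiter
// appears without a matching start delimiter. (Illustrative only.)
function checkDelimiters(text, start = "{", end = "}") {
  let openAt = -1; // offset of the currently open start delimiter, or -1
  let i = 0;
  while (i < text.length) {
    if (text.startsWith(start, i)) {
      if (openAt !== -1) {
        throw new Error(`Unclosed tag opened at offset ${openAt}`);
      }
      openAt = i;
      i += start.length;
    } else if (text.startsWith(end, i)) {
      if (openAt === -1) {
        throw new Error(`Unopened tag closed at offset ${i}`);
      }
      openAt = -1;
      i += end.length;
    } else {
      i++;
    }
  }
  if (openAt !== -1) {
    throw new Error(`Unclosed tag opened at offset ${openAt}`);
  }
}

checkDelimiters("Hi {user}"); // valid template: no error

try {
  checkDelimiters("Hello {name");
} catch (error) {
  console.log(error.message); // Unclosed tag opened at offset 6
}
```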
What happens during each of these steps is described below:
Docxtemplater processes each XML file that might contain user content.
In most cases, these are "word/document.xml", "word/header1.xml", "word/header2.xml", …
Each of those files goes through multiple steps:
1. Compilation: the file is parsed (using module.preparse(), module.matchers(), module.parse(), and module.postparse()) into several Parts.
2. Rendering: doc.render(data) calls module.render(part) for each part in the Parts array.
Here's an example with a very simple template containing the text Hi {user}:
This exact syntax is used in our tests to check that the software does exactly what is described here, so these internals are verified by the tests.
First, the following array is created, which we call the xmllexed array:
/* eslint-disable-next-line no-unused-vars */
const xmllexed = [
{
position: "start",
tag: "w:t",
text: true,
type: "tag",
value: "<w:t>",
},
{
type: "content",
value: "Hi {user}",
},
{
position: "end",
tag: "w:t",
text: true,
type: "tag",
value: "</w:t>",
},
];
At this point, only the XML tags have been handled.
We've chosen a flat structure for our XML parsing instead of a nested structure, in order to easily handle all types of tags regardless of which tags should be ignored. This is one of our core principles in docxtemplater: we only touch what we have to, and the rest of the XML stays exactly the same as before, which ensures high-fidelity output.
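A flat lexer of this kind can be sketched in a few lines. The version below is an approximation written for this page, not the lexer used by docxtemplater: the function name xmlLex and the "selfclosing" position label are assumptions, and the text key shown in the arrays on this page is left out.

```javascript
// Splits an XML string into a flat list of tag and content tokens.
// No nesting is tracked: every tag becomes one token, in document order.
function xmlLex(xml) {
  const tokens = [];
  // The capturing group makes split() keep the tags in the result.
  for (const chunk of xml.split(/(<[^>]+>)/)) {
    if (chunk === "") {
      continue;
    }
    if (chunk.startsWith("<")) {
      let position = "start";
      if (chunk.startsWith("</")) {
        position = "end";
      } else if (chunk.endsWith("/>")) {
        position = "selfclosing";
      }
      // Tag name: what follows "<" or "</", up to a space, "/" or ">".
      const tag = chunk.replace(/^<\/?/, "").split(/[\s/>]/)[0];
      tokens.push({ type: "tag", position, tag, value: chunk });
    } else {
      tokens.push({ type: "content", value: chunk });
    }
  }
  return tokens;
}

console.log(xmlLex("<w:t>Hi {user}</w:t>"));
```

Because every token keeps its original value, joining the value fields of all tokens gives back the input string unchanged, which is the property that makes it possible to leave untouched XML exactly as it was.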
In a second step, the delimiters ({ and } in the default configuration) are parsed, and each of them gets its own object.
/* eslint-disable-next-line no-unused-vars */
const lexed = [
{
type: "tag",
position: "start",
value: "<w:t>",
text: true,
tag: "w:t",
},
{ type: "content", value: "Hi ", position: "insidetag" },
{ type: "delimiter", position: "start" },
{ type: "content", value: "user", position: "insidetag" },
{ type: "delimiter", position: "end" },
{
type: "tag",
value: "</w:t>",
text: true,
position: "end",
tag: "w:t",
},
];
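The delimiter step can be approximated as follows for the text of a single content token. This is a sketch under simplifying assumptions (the hypothetical lexDelimiters only looks at one string, whereas in a real document a delimiter pair can span several w:t elements), using the same token shapes as the lexed array above.

```javascript
// Splits the text of one content token on the start and end delimiters.
function lexDelimiters(text, start = "{", end = "}") {
  const tokens = [];
  let buffer = ""; // text collected since the last delimiter

  // Emits the collected text, if any, as a content token.
  const flush = () => {
    if (buffer !== "") {
      tokens.push({ type: "content", value: buffer, position: "insidetag" });
      buffer = "";
    }
  };

  let i = 0;
  while (i < text.length) {
    if (text.startsWith(start, i)) {
      flush();
      tokens.push({ type: "delimiter", position: "start" });
      i += start.length;
    } else if (text.startsWith(end, i)) {
      flush();
      tokens.push({ type: "delimiter", position: "end" });
      i += end.length;
    } else {
      buffer += text[i];
      i++;
    }
  }
  flush();
  return tokens;
}

console.log(lexDelimiters("Hi {user}"));
```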
A third step is to parse the start and end delimiters into placeholders ({type: "placeholder"}) in our data structure:
/* eslint-disable-next-line no-unused-vars */
const parsed = [
{
type: "tag",
position: "start",
value: "<w:t>",
text: true,
tag: "w:t",
},
{ type: "content", value: "Hi ", position: "insidetag" },
{ type: "placeholder", value: "user" },
{
type: "tag",
value: "</w:t>",
text: true,
position: "end",
tag: "w:t",
},
];
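The transformation from delimiters to placeholders can be sketched like this. The function toPlaceholders is hypothetical and assumes that the delimiters have already been validated (every start has a matching end) and that only content tokens appear between them.

```javascript
// The middle of the lexed array from the previous step (tags omitted).
const lexedTokens = [
  { type: "content", value: "Hi ", position: "insidetag" },
  { type: "delimiter", position: "start" },
  { type: "content", value: "user", position: "insidetag" },
  { type: "delimiter", position: "end" },
];

// Replaces each "start delimiter, content, end delimiter" sequence
// with a single placeholder token; other tokens are kept as they are.
function toPlaceholders(tokens) {
  const result = [];
  let current = null; // text collected since the last start delimiter
  for (const token of tokens) {
    if (token.type === "delimiter" && token.position === "start") {
      current = "";
    } else if (token.type === "delimiter" && token.position === "end") {
      result.push({ type: "placeholder", value: current });
      current = null;
    } else if (current !== null && token.type === "content") {
      current += token.value;
    } else {
      result.push(token);
    }
  }
  return result;
}

console.log(toPlaceholders(lexedTokens));
```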
If a variable in the template starts with an "@" sign, such as {@input}, the tag is transformed into a rawxml tag (see Rawxml Tag syntax). This is done using the module.matchers() API, in the code for the RawXML module. There, module.matchers() returns [["@", "rawxml"]], which means that {@input} is transformed into { type: "placeholder", value: "input", module: "rawxml" }.
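The effect of such a matcher can be illustrated with a simplified sketch. It only covers the [prefix, moduleName] pair form shown above; applyMatchers and prefixMatchers are names invented for this example, and the real matching logic in docxtemplater may handle more cases.

```javascript
// Matchers in the [prefix, moduleName] form described above.
const prefixMatchers = [["@", "rawxml"]];

// Returns a placeholder tagged with the first module whose prefix matches,
// with the prefix removed from the value. Placeholders that match no
// prefix are returned unchanged (no "module" key).
function applyMatchers(placeholder, matchers) {
  for (const [prefix, moduleName] of matchers) {
    if (placeholder.value.startsWith(prefix)) {
      return {
        type: "placeholder",
        value: placeholder.value.slice(prefix.length),
        module: moduleName,
      };
    }
  }
  return placeholder;
}

console.log(applyMatchers({ type: "placeholder", value: "@input" }, prefixMatchers));
// { type: 'placeholder', value: 'input', module: 'rawxml' }
console.log(applyMatchers({ type: "placeholder", value: "user" }, prefixMatchers));
// { type: 'placeholder', value: 'user' }
```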
The last step is to call module.postparse(parsed) for each module, which transforms the array.
For example, in its postparse function, the RawXmlModule expands the current tag to the enclosing paragraph: because a raw XML tag replaces the whole paragraph, the paragraph needs to be "embedded" into that part.
This expansion is implemented in the RawXmlModule.
/* eslint-disable-next-line no-unused-vars */
const postparsed = [
{
type: "tag",
position: "start",
value: '<w:t xml:space="preserve">',
text: true,
tag: "w:t",
},
{ type: "content", value: "Hi ", position: "insidetag" },
{ type: "placeholder", value: "user" },
{
type: "tag",
value: "</w:t>",
text: true,
position: "end",
tag: "w:t",
},
];
Once all these steps are finished (for all XML files that should be templated), the document is compiled.
All the code above is run when you call:
/* eslint-disable-next-line no-unused-vars, no-undef */
const doc = new Docxtemplater(zip, options);
During the rendering phase, the compiled template receives its data.
It calls module.render for each module and for each part, i.e.:
/* eslint-disable no-undef */
const parts = [
{
type: "tag",
position: "start",
value: '<w:t xml:space="preserve">',
text: true,
tag: "w:t",
},
{ type: "content", value: "Hi ", position: "insidetag" },
{ type: "placeholder", value: "user" },
{
type: "tag",
value: "</w:t>",
text: true,
position: "end",
tag: "w:t",
},
];
const options = {
filePath: "word/document.xml",
};
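// Ask each module in turn to render a part; the first truthy result wins.
function renderPart(part) {
  for (const mod of modules) {
    const moduleRendered = mod.render(part, options);
    if (moduleRendered) {
      return moduleRendered;
    }
  }
  return null; // no module handled this part
}
for (const part of parts) {
  renderPart(part);
}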
A module called the "Render" module is attached by default. It renders standard placeholders (those that have no "module" key).
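To tie the rendering loop together, here is a minimal, self-contained sketch of what such a default module could look like. Several details are assumptions made for this example rather than docxtemplater's actual API: the data is passed in a hypothetical options.data key, a module is assumed to return an object with a value key (or null when it does not handle the part), and missing values are rendered as an empty string.

```javascript
// Escapes the characters that are not allowed in XML text content.
function escapeXml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Sketch of a default module: it only handles placeholders without a "module" key.
const renderModule = {
  render(part, options) {
    if (part.type !== "placeholder" || part.module) {
      return null;
    }
    const value = options.data[part.value];
    return { value: value == null ? "" : escapeXml(String(value)) };
  },
};

// Renders every part with the first module that handles it.
// Parts that no module handles are copied to the output unchanged.
function renderParts(parts, modules, options) {
  return parts
    .map((part) => {
      for (const mod of modules) {
        const rendered = mod.render(part, options);
        if (rendered) {
          return rendered.value;
        }
      }
      return part.value;
    })
    .join("");
}

const compiledParts = [
  { type: "tag", position: "start", value: '<w:t xml:space="preserve">', text: true, tag: "w:t" },
  { type: "content", value: "Hi ", position: "insidetag" },
  { type: "placeholder", value: "user" },
  { type: "tag", value: "</w:t>", text: true, position: "end", tag: "w:t" },
];

console.log(renderParts(compiledParts, [renderModule], { data: { user: "John" } }));
// <w:t xml:space="preserve">Hi John</w:t>
```

Tags and content are passed through untouched, and only the placeholder is replaced, which mirrors the principle of changing nothing in the XML beyond what the template requires.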