Apache POI: Extract Text from Word File in Java

In this tutorial, we'll look at two ways to read a Word file in Java, one for the .doc extension and one for the .docx extension, and how to extract its text content.

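Before you begin, download Apache POI and add it to your project's classpath. The .doc classes (HWPF) are in the poi-scratchpad module, and the .docx classes (XWPF) are in the poi-ooxml module.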

Reading a Word .doc file with WordExtractor

Apache POI provides the WordExtractor class, which extracts the text content of an entire .doc file. Make sure the file has the .doc extension and is in the MS Word 97-2003 format. WordExtractor lets you extract all the text in a Word document, including paragraphs, tables, headers, and footers.
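A file extension can be misleading, so it can help to look at the first bytes of the file before choosing an extractor. The sketch below is a hypothetical helper, not part of Apache POI: the class name WordFormatSniffer and its methods are assumptions made for illustration. It relies on two well-known signatures: binary .doc files are OLE2 compound files, and .docx files are ZIP archives. Note that other formats share these containers (for example .xls is also OLE2, and any ZIP archive starts with the same bytes), so a match is a hint rather than a guarantee.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper (not part of Apache POI): guesses the container
// format of a Word file from its first bytes.
class WordFormatSniffer {

    enum Format { DOC_OLE2, DOCX_ZIP, UNKNOWN }

    // Signature of an OLE2 compound file, the container of binary .doc files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature that starts a ZIP archive ("PK" 03 04).
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // Classifies a buffer holding the first bytes of a file.
    static Format sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Format.DOC_OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Format.DOCX_ZIP;
        return Format.UNKNOWN;
    }

    // Reads at most 8 bytes from the file and classifies them.
    static Format sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return sniff(in.readNBytes(8));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java WordFormatSniffer <file>");
            return;
        }
        System.out.println(sniff(Path.of(args[0])));
    }
}
```

A DOC_OLE2 result suggests the WordExtractor route, and DOCX_ZIP suggests the XWPFWordExtractor route. The sketch uses readNBytes and Path.of, so it assumes Java 11 or later.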

Use getParagraphText() or getText() to retrieve the text from the Word file. getParagraphText() returns a String array in which each element holds one paragraph, while getText() returns the whole text as a single String.

The HWPFDocument class accepts an input stream on the file as a parameter. It acts as a container that holds the entire structure of the document.

import java.io.*;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class LireDoc
{
    public static void main(String[] args)
    {
        File file = new File("nouveaudoc.doc");
        // try-with-resources closes the input stream even if reading fails
        try (FileInputStream fis = new FileInputStream(file))
        {
            // Load the whole .doc structure into memory
            HWPFDocument document = new HWPFDocument(fis);
            WordExtractor extractor = new WordExtractor(document);
            // One array element per paragraph
            String[] text = extractor.getParagraphText();
            for (int i = 0; i < text.length; i++)
            {
                if (text[i] != null)
                    System.out.println(text[i]);
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}
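The paragraph strings in such an array may carry trailing line terminators, and some elements may be empty, which produces blank lines when each one is printed with println. The sketch below shows one way to tidy the array before printing or further processing. ParagraphCleaner is a hypothetical helper, not part of Apache POI, and the sample array in main is made-up data standing in for the result of getParagraphText().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Apache POI): tidies a paragraph array
// such as the one returned by WordExtractor.getParagraphText().
class ParagraphCleaner {

    // Drops null and blank entries and strips surrounding whitespace,
    // including trailing "\r" and "\n" characters.
    static List<String> clean(String[] paragraphs) {
        List<String> result = new ArrayList<>();
        for (String paragraph : paragraphs) {
            if (paragraph == null) {
                continue;
            }
            String stripped = paragraph.strip();
            if (!stripped.isEmpty()) {
                result.add(stripped);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for extractor.getParagraphText()
        String[] sample = { "First paragraph\r\n", "\r\n", null, "  Second paragraph\r\n" };
        clean(sample).forEach(System.out::println);
    }
}
```

In the example above, this would be called as ParagraphCleaner.clean(extractor.getParagraphText()). String.strip() requires Java 11 or later; trim() is a close substitute on older versions.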

Reading a Word .docx file with XWPFWordExtractor

Since Microsoft Office 2007, Word documents use the .docx format, which stores the information (text, styles, colors, fonts, and so on) as XML files packaged in a ZIP archive. The XWPFWordExtractor class extracts the text from this format, which is called OOXML (Office Open XML).
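To see this structure without Apache POI, a .docx file can be opened like any ZIP archive using only the JDK. The sketch below lists the entries of the archive; DocxEntryLister is a hypothetical name chosen for illustration. In a typical Word document the list includes word/document.xml, which holds the main body text, and [Content_Types].xml.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: lists the parts stored inside a .docx package
// using only the JDK's ZIP support.
class DocxEntryLister {

    // Returns the entry names in the order they appear in the archive.
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocxEntryLister <file.docx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            listEntries(in).forEach(System.out::println);
        }
    }
}
```

This only inspects the package layout. Extracting readable text from the XML parts is the job that XWPFWordExtractor takes care of.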

To get the text as a String, call XWPFWordExtractor.getText(). The code is shorter than for the old .doc format, because getText() returns the entire text in one call and no loop over paragraphs is needed.

import java.io.*;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class ReadDocx
{
    public static void main(String[] args)
    {
        File file = new File("nouveaudoc.docx");
        // try-with-resources closes the input stream even if reading fails
        try (FileInputStream fis = new FileInputStream(file))
        {
            // Parse the OOXML package into a document object
            XWPFDocument document = new XWPFDocument(fis);
            XWPFWordExtractor extractor = new XWPFWordExtractor(document);
            // The whole text of the document as a single String
            String text = extractor.getText();
            System.out.println(text);
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}