Read the content of a Word document with PHP

Today we will read a Word document with PHP. By "read" I mean extracting the text the document contains. This works for .docx files, which are ZIP archives that keep the main text in word/document.xml.

First we create this function:

<?php
    // Extract the plain text of a .docx file; returns false on failure.
    function read_file_docx($filename){

        $stripped_content = '';
        $content = '';

        if(!$filename || !file_exists($filename)) return false;

        // A .docx file is a ZIP archive; zip_open() returns an integer error code on failure.
        $zip = zip_open($filename);

        if (!$zip || is_numeric($zip)) return false;

        while ($zip_entry = zip_read($zip)) {

            // Only the main document part is of interest; skip everything else without opening it.
            if (zip_entry_name($zip_entry) != "word/document.xml") continue;

            if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

            $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

            zip_entry_close($zip_entry);
        }// end while

        zip_close($zip);

        // Separate table cells with a space and end each paragraph with a line break.
        $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
        $content = str_replace('</w:r></w:p>', "\r\n", $content);

        // Remove the remaining XML tags so that only the text is left.
        $stripped_content = strip_tags($content);

        return $stripped_content;
    }

?>
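
The zip_open() family of functions used above is deprecated as of PHP 8.0, so on newer installations the same idea could be sketched with the ZipArchive class instead. This is only a sketch, assuming the zip extension is enabled; the name read_file_docx_ziparchive is made up for this example, and the tag handling is copied unchanged from the function above.

```php
<?php
    // Hypothetical variant of read_file_docx() built on ZipArchive (the name is an assumption).
    function read_file_docx_ziparchive($filename){

        if(!$filename || !file_exists($filename)) return false;

        $zip = new ZipArchive();

        // open() returns true on success and an integer error code on failure.
        if ($zip->open($filename) !== true) return false;

        // Fetch the main document part directly by name; false if it is missing.
        $content = $zip->getFromName('word/document.xml');
        $zip->close();

        if ($content === false) return false;

        // Same tag handling as in read_file_docx().
        $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
        $content = str_replace('</w:r></w:p>', "\r\n", $content);

        return strip_tags($content);
    }
?>
```

It is called the same way as read_file_docx() and also returns false when the file cannot be read.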

Second we use the function:

<?php
$filename = "file.docx";
// or /var/www/html/file.docx
$content = read_file_docx($filename);
if ($content !== false) {
    // nl2br() turns the line breaks into <br> tags for display in the browser.
    echo nl2br($content);
} else {
    echo 'Couldn\'t read the file. Please check the file.';
}
?>

This can read an MS Word document. For instance, this document:

[Screenshot of the original Word document, 2013-08-12 15:58:27]

Is read as:

[Screenshot of the extracted text, 2013-08-12 16:04:00]

As you can see, PHP is able to read the content. Of course, some characters come out differently from how MS Word displays them, but the extracted text can serve as input that you process further to get the final result you want.
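
As one example of that kind of further processing: strip_tags() removes the XML tags but does not decode XML entities, so the returned text still contains sequences such as &amp; wherever the document has an ampersand. That is fine when the text is echoed into an HTML page, but not when it is saved to a plain text file. The sketch below shows one possible clean-up step for that second case; clean_docx_text() is a hypothetical helper that is not part of the function above, and it assumes the text is UTF-8.

```php
<?php
    // Hypothetical helper: tidy the extracted text for use outside a browser,
    // for example before saving it to a .txt file.
    function clean_docx_text($text){

        // Decode XML entities such as &amp; and &lt; that strip_tags() leaves in place.
        $text = html_entity_decode($text, ENT_QUOTES | ENT_XML1, 'UTF-8');

        // Normalise Windows line endings and drop trailing spaces or tabs on each line.
        $text = str_replace("\r\n", "\n", $text);
        $text = preg_replace('/[ \t]+$/m', '', $text);

        return trim($text);
    }

    // Example call with a made-up input string.
    echo clean_docx_text("Tom &amp; Jerry \r\nA &lt; B\r\n");
?>
```

Do not apply this step before echoing the text into a page with nl2br(), because the decoded characters would then need escaping again.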

Regards

One comment on “Read the content of a word document with PHP.”
  1. Kundan Singh says:

    With this we can read only .docx files, but I also want to read .doc, .pdf, and .txt files.
    How can we do this?
    Please answer quickly.
