Converting Word Documents to RTF, HTML, TXT, XML

In this tutorial you'll learn how to open Word documents (.doc) and convert them to the popular formats RTF, HTML, TXT and XML. Along the way you'll also learn the basics of working with COM objects.

Opening and manipulating Microsoft Office Word documents (.doc) can be done rather easily from a .NET application. You can open, edit and create Word documents with only a few lines of code. However, since the .NET Framework has no classes for handling the Word document format, the solution is to reference Word's COM objects in your project. The downside is that, to manage Word documents with the application we're going to create in this tutorial, the user running it needs to have Microsoft Word installed, preferably the same version the application was designed for.
The application in this tutorial was designed and tested with Microsoft Office version 11, more exactly Microsoft Office Word 2003. On other recent versions the application is likely to work, but it may require a few changes, especially to the Open() and SaveAs() calls, whose parameter lists differ between versions. Therefore, if the attached project doesn't work on your system and you don't have Microsoft Office 2003 installed, that's probably the cause.
Just to make things clear: there is a way to open, edit and save Word documents without requiring Word to be installed. However, it means writing your own .doc parser from scratch, a task that would require an entire team of experienced programmers and for which a language such as C++ might prove more efficient, unless you find a third-party component that does the job.

Start by creating a C# Windows application project. Add a total of six buttons and one label. Name the buttons btnOpen, btnClose, btnToHtml, btnToRTF, btnToText and btnToXml, and the label lblFilePath. Disable the four convert buttons and the close button (btnClose) by setting their Enabled property to false. We will enable them once the user chooses a file to convert. There are two more controls you need to add to the project from the Visual Studio Toolbox: an OpenFileDialog and a SaveFileDialog. Name them openDoc and saveDoc. We will use the first dialog (openDoc) to open the MS Word document that we want to convert, so we want to restrict the user to choosing only a Microsoft Word document (.doc). To do that, change the Filter property of the OpenFileDialog to the following value:

Word Document|*.doc

This ensures that the user will only be able to select a Word document. For more details on this object, please see the Using OpenFile Dialog to open files tutorial.
As for the other dialog, saveDoc, we're not going to define a filter right now, because the file type we save to depends on which button the user clicks (To HTML, To RTF, etc.). We're going to define the filter when the user clicks the button, because at that time we know the extension.
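
Since both the filter and a sensible default file name depend only on the target extension, they can be derived from it. The following is a minimal sketch, not part of the original project: the SaveDialogSettings class and its two methods are hypothetical names introduced here for illustration.

```csharp
using System;
using System.IO;

// Hypothetical helper (not in the original project): derives the
// SaveFileDialog settings from the target extension.
static class SaveDialogSettings
{
    // Builds a filter string in the "Description|*.ext" form the dialogs expect.
    public static string BuildFilter(string description, string extension)
    {
        return description + "|*." + extension;
    }

    // Suggests a save path by swapping the extension of the opened document.
    public static string SuggestPath(string sourcePath, string extension)
    {
        return Path.ChangeExtension(sourcePath, extension);
    }
}
```

In a button handler this could be used as saveDoc.Filter = SaveDialogSettings.BuildFilter("HTML Files", "html"); followed by saveDoc.FileName = SaveDialogSettings.SuggestPath(openDoc.FileName, "html"); so that the suggested name follows the document that was opened.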

Now let's do what is needed to open a Word document. Right-click the project name in Solution Explorer and choose Add Reference. Switch to the COM tab and scroll down until you find Microsoft Word 11.0 Object Library. If this item is not listed, you probably don't have Microsoft Office installed, so unfortunately the tutorial ends here for you. If you see a different version of the object library, such as Microsoft Word 10.0 Object Library or Microsoft Word 9.0 Object Library, you have an older version of Office. Normally you should be able to adjust the code from this tutorial to match your Word version easily.

After you add the Word Object Library to your project, Solution Explorer shows that some new items were added under References: Microsoft.Office.Core, VBIDE and Word.

Now that we have Microsoft.Office.Core, VBIDE and Word added as references, we are ready to start coding. Switch to code view; the first thing we want to do is create three objects in the Form1 class, right above the constructor:

private Word.ApplicationClass WordApp;
private Word.Document WordDoc;
private object DocNoParam = Type.Missing;

The first object is the Word Application class, which we can access thanks to the COM reference we added earlier. We're going to use it to start the Microsoft Word engine, which will do the work of converting the document to the other formats. WordApp will also be the one opening the document; the document will then be stored inside WordDoc, which is the second object we create.
The third object seems kind of odd: it holds Type.Missing. The functions we are going to call for opening and saving the document take a handful of parameters, but we only want to specify a few of them. For the parameters we have no value for, we're going to pass this missing object, as in "parameter is missing".
The reason for this small inconvenience is that the Office COM object model was designed mainly with Visual Basic in mind. Classic Visual Basic has no method overloading, so a single method takes a long list of optional parameters instead, and Visual Basic allows the caller to skip them. In C# (before version 4.0, which added optional arguments) we can't skip these parameters, so we have to pass a placeholder for each one, similar to specifying null.
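
Type.Missing is not specific to Word; it is a general .NET placeholder meaning "use the default for this parameter". As an illustration that runs without Office installed, the sketch below passes it to an ordinary method through reflection. The MissingDemo class and its Describe method are hypothetical stand-ins for a COM call, and the optional-parameter syntax assumes C# 4.0 or later.

```csharp
using System;
using System.Reflection;

// Hypothetical demo (not from the tutorial): shows Type.Missing acting as
// "argument not supplied" outside of Word.
static class MissingDemo
{
    // Stand-in for a COM method with an optional parameter.
    public static string Describe(string file, string format = "doc")
    {
        return file + " as " + format;
    }

    // Invokes Describe via reflection, passing Type.Missing for 'format'
    // so that the parameter's declared default value is used.
    public static string CallWithMissing(string file)
    {
        MethodInfo method = typeof(MissingDemo).GetMethod("Describe");
        return (string)method.Invoke(null, new object[] { file, Type.Missing });
    }
}
```

Word treats the DocNoParam arguments the same way: each one tells the method to fall back to its own default for that parameter.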

Now that we have these objects ready, we can open the Word document. To do that, double-click btnOpen to create its Click event handler. Use the following code:

private void btnOpen_Click(object sender, EventArgs e)
{
   // Create an instance of the Word Application
   WordApp = new Word.ApplicationClass();
   // We don't want to display the Microsoft Word window
   WordApp.Visible = false;

   // If the user chose a file to open
   if (this.openDoc.ShowDialog() == DialogResult.OK)
   {
      // Set the label to the new file path
      lblFilePath.Text = openDoc.FileName;

      // Enable the convert and close buttons, since now we have a document opened
      btnToHtml.Enabled = true;
      btnToRTF.Enabled = true;
      btnToText.Enabled = true;
      btnToXml.Enabled = true;
      btnClose.Enabled = true;

      // Create and set the objects we're going to pass to the Open() function
      object DocFileName = openDoc.FileName;
      object DocReadOnly = false;
      object DocVisible = true;

      // Open the document by passing the path. The 16 arguments are, in order:
      // FileName, ConfirmConversions, ReadOnly, AddToRecentFiles,
      // PasswordDocument, PasswordTemplate, Revert, WritePasswordDocument,
      // WritePasswordTemplate, Format, Encoding, Visible,
      // OpenAndRepair, DocumentDirection, NoEncodingDialog, XMLTransform
      WordDoc = WordApp.Documents.Open(
         ref DocFileName, ref DocNoParam, ref DocReadOnly, ref DocNoParam,
         ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
         ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocVisible,
         ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam);
      WordDoc.Activate();
   }
}

The above code opens the Word document specified by the user in the OpenFileDialog window, enables the convert and close buttons, and sets the label to the path of the file so that we remember which file is opened.
As we discussed before, the Documents.Open method takes a long list of parameters, but for most of them we pass a reference to DocNoParam, which contains Type.Missing, meaning plain and simple that we don't want to pass anything for that parameter. The Office COM object was designed with the Visual Basic language in mind; that's why this line would be far shorter in Visual Basic, where we would only have to pass values to the parameters that we are interested in.
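
For comparison, C# 4.0 and later narrow this gap with Visual Basic: for calls on COM interfaces the compiler accepts named and optional arguments and lets you omit ref. The sketch below assumes such a compiler and the same Word 11.0 reference; the WordOpenHelper class is a hypothetical name, and the tutorial's own code does not rely on this feature.

```csharp
using System;

// Hypothetical helper, assuming C# 4.0+ and a reference to the
// Microsoft Word 11.0 Object Library (namespace Word).
static class WordOpenHelper
{
    // Opens a document while naming only the parameters we care about.
    // Every parameter left out is filled in with Type.Missing by the compiler.
    public static Word.Document OpenForConversion(Word.Application app, string path)
    {
        return app.Documents.Open(FileName: path, ReadOnly: false, Visible: true);
    }
}
```

Naming the arguments also guards against a value landing in the wrong position of a 16-parameter list.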

Now that we have the Word document opened and can manipulate it as we want, let's accomplish the main task of our program and save this document in different formats. The first button is supposed to save to HTML, so double-click it to get to its Click event handler and use the following code:

private void btnToHtml_Click(object sender, EventArgs e)
{
   // Suggest a path for saving
   saveDoc.FileName = @"C:\Test Document.html";
   // The file extension to which we want to save
   saveDoc.Filter = "HTML Files|*.html";

   // If the user chose a path where to save the file
   if (this.saveDoc.ShowDialog() == DialogResult.OK)
   {
      // Set the save path object
      object SaveToPath = saveDoc.FileName;
      // Set the format type to HTML (wdFormatHTML)
      object SaveToFormat = Word.WdSaveFormat.wdFormatHTML;
      // Save the document to the specified path and format
      WordDoc.SaveAs(
         ref SaveToPath, ref SaveToFormat, ref DocNoParam, ref DocNoParam,
         ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
         ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
         ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam);
   }
}

As you can see in the code above, when btnToHtml is clicked we prompt the user to save the document in the HTML format. The whole magic is in the object SaveToFormat = Word.WdSaveFormat.wdFormatHTML; line, where we specify the format we wish to use. In this case we specify wdFormatHTML to save the file as an HTML document. Upon clicking this button, the document will be converted from its specific .doc format to HTML tags. Along with the HTML file, sometimes a folder is also created that holds the pictures referenced in the HTML document.

For the remaining three buttons the code gets repetitive, with only small changes to account for the different extension and format.

The C# code for converting to RTF:


private void btnToRTF_Click(object sender, EventArgs e)
{
   // Suggest a path for saving
   saveDoc.FileName = @"C:\Test Document.rtf";
   // The file extension to which we want to save
   saveDoc.Filter = "RTF Files|*.rtf";

   // If the user chose a path where to save the file
   if (this.saveDoc.ShowDialog() == DialogResult.OK)
   {
      // Set the save path object
      object SaveToPath = saveDoc.FileName;
      // Set the format type to RTF (wdFormatRTF)
      object SaveToFormat = Word.WdSaveFormat.wdFormatRTF;
      // Save the document to the specified path and format
      WordDoc.SaveAs(
         ref SaveToPath, ref SaveToFormat, ref DocNoParam, ref DocNoParam,
         ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
         ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
         ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam);
   }
}

The C# code for converting to plain text:


private void btnToText_Click(object sender, EventArgs e)
{
   // Suggest a path for saving
   saveDoc.FileName = @"C:\Test Document.txt";
   // The file extension to which we want to save
   saveDoc.Filter = "Text Files|*.txt";

   // If the user chose a path where to save the file
   if (this.saveDoc.ShowDialog() == DialogResult.OK)
   {
      // Set the save path object
      object SaveToPath = saveDoc.FileName;
      // Set the format type to TXT (wdFormatText)
      object SaveToFormat = Word.WdSaveFormat.wdFormatText;
      // Save the document to the specified path and format
      WordDoc.SaveAs(
         ref SaveToPath, ref SaveToFormat, ref DocNoParam, ref DocNoParam,
         ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
         ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
         ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam);
   }
}

The C# code for converting to XML:


private void btnToXml_Click(object sender, EventArgs e)
{
   // Suggest a path for saving
   saveDoc.FileName = @"C:\Test Document.xml";
   // The file extension to which we want to save
   saveDoc.Filter = "XML Files|*.xml";

   // If the user chose a path where to save the file
   if (this.saveDoc.ShowDialog() == DialogResult.OK)
   {
      // Set the save path object
      object SaveToPath = saveDoc.FileName;
      // Set the format type to XML (wdFormatXML)
      object SaveToFormat = Word.WdSaveFormat.wdFormatXML;
      // Save the document to the specified path and format
      WordDoc.SaveAs(
         ref SaveToPath, ref SaveToFormat, ref DocNoParam, ref DocNoParam,
         ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
         ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
         ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam);
   }
}
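
Because the four convert handlers differ only in the suggested path, the filter and the WdSaveFormat value, the shared part could be pulled into one method. This is a sketch of one possible refactoring, not the author's code: WordExport and PromptAndSave are hypothetical names, and the block assumes the same Word 11.0 reference used throughout the tutorial.

```csharp
using System;
using System.Windows.Forms;

// Hypothetical refactoring (not in the original project): one method that
// prompts for a path and saves the document in the requested format.
static class WordExport
{
    public static void PromptAndSave(Word.Document doc, SaveFileDialog dialog,
        string suggestedPath, string filter, Word.WdSaveFormat format)
    {
        dialog.FileName = suggestedPath;
        dialog.Filter = filter;

        // Do nothing if the user cancels the dialog
        if (dialog.ShowDialog() != DialogResult.OK)
        {
            return;
        }

        object path = dialog.FileName;
        object saveFormat = format;
        object missing = Type.Missing;

        // SaveAs takes 16 parameters in Word 2003; only the first two are set
        doc.SaveAs(
            ref path, ref saveFormat, ref missing, ref missing,
            ref missing, ref missing, ref missing, ref missing,
            ref missing, ref missing, ref missing, ref missing,
            ref missing, ref missing, ref missing, ref missing);
    }
}
```

With this in place, a handler such as btnToRTF_Click would shrink to a single call: WordExport.PromptAndSave(WordDoc, saveDoc, @"C:\Test Document.rtf", "RTF Files|*.rtf", Word.WdSaveFormat.wdFormatRTF);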

There's one last thing we need to do. The hidden Word instance keeps every document we open until we close it, and WinWord.exe stays in memory, so you'll want to press the close button before opening another document or closing the application. In the Click event handler of btnClose we tell Word to close the document without saving any changes:

private void btnClose_Click(object sender, EventArgs e)
{
   // Since we don't want to save changes to the original document
   object SaveChanges = false;
   // Close the document, save no changes
   WordDoc.Close(ref SaveChanges, ref DocNoParam, ref DocNoParam);
}
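
One caveat, offered as a sketch rather than as part of the original project: Close() closes only the document, while the Word process started by new Word.ApplicationClass() keeps running until the application's Quit() method is called. The WordCleanup class below is a hypothetical helper that does both and releases the COM wrappers; it assumes the same Word 11.0 reference.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper (not in the original project): closes the document
// and then shuts down the hidden Word instance.
static class WordCleanup
{
    public static void CloseAndQuit(Word._Document doc, Word._Application app)
    {
        object saveChanges = false;      // discard any changes
        object missing = Type.Missing;

        if (doc != null)
        {
            doc.Close(ref saveChanges, ref missing, ref missing);
            Marshal.ReleaseComObject(doc);
        }

        if (app != null)
        {
            // Quit takes SaveChanges, OriginalFormat and RouteDocument
            app.Quit(ref saveChanges, ref missing, ref missing);
            Marshal.ReleaseComObject(app);
        }
    }
}
```

It could be called from btnClose_Click as WordCleanup.CloseAndQuit(WordDoc, WordApp); after which WordDoc and WordApp should be set to null and the convert buttons disabled again, since the document is no longer open.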

Here is the entire application code in case you want to have an overall look:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;

namespace OpenWord
{
   public partial class Form1 : Form
   {
      private Word.ApplicationClass WordApp;
      private Word.Document WordDoc;
      private object DocNoParam = Type.Missing;

      public Form1()
      {
         InitializeComponent();
      }

      private void btnOpen_Click(object sender, EventArgs e)
      {
         // Create an instance of the Word Application
         WordApp = new Word.ApplicationClass();
         // We don't want to display the Microsoft Word window
         WordApp.Visible = false;

         // If the user chose a file to open
         if (this.openDoc.ShowDialog() == DialogResult.OK)
         {
            // Set the label to the new file path
            lblFilePath.Text = openDoc.FileName;

            // Enable the convert and close buttons, since now we have a document opened
            btnToHtml.Enabled = true;
            btnToRTF.Enabled = true;
            btnToText.Enabled = true;
            btnToXml.Enabled = true;
            btnClose.Enabled = true;

            // Create and set the objects we're going to pass to the Open() function
            object DocFileName = openDoc.FileName;
            object DocReadOnly = false;
            object DocVisible = true;

            // Open the document by passing the path. The 16 arguments are, in order:
            // FileName, ConfirmConversions, ReadOnly, AddToRecentFiles,
            // PasswordDocument, PasswordTemplate, Revert, WritePasswordDocument,
            // WritePasswordTemplate, Format, Encoding, Visible,
            // OpenAndRepair, DocumentDirection, NoEncodingDialog, XMLTransform
            WordDoc = WordApp.Documents.Open(
               ref DocFileName, ref DocNoParam, ref DocReadOnly, ref DocNoParam,
               ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
               ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocVisible,
               ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam);
            WordDoc.Activate();
         }
      }

      private void btnToHtml_Click(object sender, EventArgs e)
      {
         // Suggest a path for saving
         saveDoc.FileName = @"C:\Test Document.html";
         // The file extension to which we want to save
         saveDoc.Filter = "HTML Files|*.html";

         // If the user chose a path where to save the file
         if (this.saveDoc.ShowDialog() == DialogResult.OK)
         {
            // Set the save path object
            object SaveToPath = saveDoc.FileName;
            // Set the format type to HTML (wdFormatHTML)
            object SaveToFormat = Word.WdSaveFormat.wdFormatHTML;
            // Save the document to the specified path and format
            WordDoc.SaveAs(
               ref SaveToPath, ref SaveToFormat, ref DocNoParam, ref DocNoParam,
               ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
               ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
               ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam);
         }
      }

      private void btnToRTF_Click(object sender, EventArgs e)
      {
         // Suggest a path for saving
         saveDoc.FileName = @"C:\Test Document.rtf";
         // The file extension to which we want to save
         saveDoc.Filter = "RTF Files|*.rtf";

         // If the user chose a path where to save the file
         if (this.saveDoc.ShowDialog() == DialogResult.OK)
         {
            // Set the save path object
            object SaveToPath = saveDoc.FileName;
            // Set the format type to RTF (wdFormatRTF)
            object SaveToFormat = Word.WdSaveFormat.wdFormatRTF;
            // Save the document to the specified path and format
            WordDoc.SaveAs(
               ref SaveToPath, ref SaveToFormat, ref DocNoParam, ref DocNoParam,
               ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
               ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
               ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam);
         }
      }

      private void btnToText_Click(object sender, EventArgs e)
      {
         // Suggest a path for saving
         saveDoc.FileName = @"C:\Test Document.txt";
         // The file extension to which we want to save
         saveDoc.Filter = "Text Files|*.txt";

         // If the user chose a path where to save the file
         if (this.saveDoc.ShowDialog() == DialogResult.OK)
         {
            // Set the save path object
            object SaveToPath = saveDoc.FileName;
            // Set the format type to TXT (wdFormatText)
            object SaveToFormat = Word.WdSaveFormat.wdFormatText;
            // Save the document to the specified path and format
            WordDoc.SaveAs(
               ref SaveToPath, ref SaveToFormat, ref DocNoParam, ref DocNoParam,
               ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
               ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
               ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam);
         }
      }

      private void btnToXml_Click(object sender, EventArgs e)
      {
         // Suggest a path for saving
         saveDoc.FileName = @"C:\Test Document.xml";
         // The file extension to which we want to save
         saveDoc.Filter = "XML Files|*.xml";

         // If the user chose a path where to save the file
         if (this.saveDoc.ShowDialog() == DialogResult.OK)
         {
            // Set the save path object
            object SaveToPath = saveDoc.FileName;
            // Set the format type to XML (wdFormatXML)
            object SaveToFormat = Word.WdSaveFormat.wdFormatXML;
            // Save the document to the specified path and format
            WordDoc.SaveAs(
               ref SaveToPath, ref SaveToFormat, ref DocNoParam, ref DocNoParam,
               ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
               ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
               ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam);
         }
      }

      private void btnClose_Click(object sender, EventArgs e)
      {
         // Since we don't want to save changes to the original document
         object SaveChanges = false;
         // Close the document, save no changes
         WordDoc.Close(ref SaveChanges, ref DocNoParam, ref DocNoParam);
      }
   }
}

Nathan Pakovskie is an esteemed senior developer and educator in the tech community, best known for his contributions to Geekpedia.com. With a passion for coding and a knack for simplifying complex tech concepts, Nathan has authored several popular tutorials on C# programming, ranging from basic operations to advanced coding techniques. His articles, often characterized by clarity and precision, serve as invaluable resources for both novice and experienced programmers. Beyond his technical expertise, Nathan is an advocate for continuous learning and enjoys exploring emerging technologies in AI and software development. When he’s not coding or writing, Nathan engages in mentoring upcoming developers, emphasizing the importance of both technical skills and creative problem-solving in the ever-evolving world of technology. Specialties: C# Programming, Technical Writing, Software Development, AI Technologies, Educational Outreach
