How to convert HTML to PDF using iTextSharp

First, HTML and PDF are not related, although they were created around the same time. HTML is intended to convey higher-level information such as paragraphs and tables. Although there are methods to control it, it is ultimately up to the browser to draw these higher-level concepts. PDF is intended to convey documents, and documents must "look" the same wherever they are rendered.

In an HTML document you might have a paragraph that is 100% wide. Depending on the width of your monitor it might take 2 lines or 10 lines, when you print it it might take 7 lines, and when you look at it on your phone it might take 20 lines. A PDF file, however, must be independent of the rendering device, so regardless of your screen size it must always render exactly the same.

Because of the "musts" above, PDF doesn't support abstract things like "tables" or "paragraphs". There are three basic things that PDF supports: text, lines/shapes and images. (There are other things like annotations and movies, but I'm trying to keep it simple here.) In a PDF you don't say "here's a paragraph, browser do your thing!". Instead you say "draw this text at this exact X,Y location using this exact font, and don't worry, I've previously calculated the width of the text so I know that it will all fit on this line". You also don't say "here's a table"; instead you say "draw this text at this exact location and then draw a rectangle at this other exact location that I've previously calculated, so I know it will appear to be around the text".
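To make the "exact X,Y location" idea concrete, here is a minimal sketch using iTextSharp 5's low-level PdfContentByte canvas. The file name, coordinates, font and padding are arbitrary values picked for illustration, and this is only one way to do it, not something the higher-level parsers require you to write yourself.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

class AbsoluteTextDemo {
    static void Main() {
        //"absolute.pdf" is an arbitrary output name chosen for this sketch
        using (var fs = new FileStream("absolute.pdf", FileMode.Create))
        using (var doc = new Document(PageSize.A4))
        using (var writer = PdfWriter.GetInstance(doc, fs)) {
            doc.Open();

            //The low-level canvas: everything here is placed by absolute coordinates
            PdfContentByte cb = writer.DirectContent;

            //One of the standard fonts, not embedded
            var font = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
            const float size = 12f;
            const string text = "Hello, PDF";

            //Position in points, measured from the bottom-left corner of the page
            const float x = 72f, y = 720f;

            //Pre-calculate the width of the text so we know what it will occupy
            float width = font.GetWidthPoint(text, size);

            //Draw the text at the exact location using the exact font
            cb.BeginText();
            cb.SetFontAndSize(font, size);
            cb.ShowTextAligned(PdfContentByte.ALIGN_LEFT, text, x, y, 0);
            cb.EndText();

            //Draw a rectangle that merely appears to surround the text;
            //the padding values are rough guesses for this sketch
            cb.Rectangle(x - 2, y - 4, width + 4, size + 4);
            cb.Stroke();

            doc.Close();
        }
    }
}
```

Nothing in the resulting file says "this box belongs to that text". The two only look related because the numbers were calculated to line up.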

Second, iText and iTextSharp parse HTML and CSS. That's it. ASP.Net, MVC, Razor, Struts, Spring, etc., are all HTML frameworks, but iText/iTextSharp is 100% unaware of them. The same goes for DataGridViews, Repeaters, Templates, Views, etc., which are all framework-specific abstractions. It is your responsibility to get the HTML out of your framework of choice; iText won't help you. If you get an exception saying The document has no pages, or you think that "iText isn't parsing my HTML", it is almost certain that you don't actually have HTML, you only think you do.
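One way to catch the "you don't actually have HTML" case early is to check the string before handing it to the parser. The sketch below assumes your framework has already given you the markup as a string; the HtmlToPdf class and its FromString method are hypothetical names made up for this example, not part of iTextSharp.

```csharp
using System;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.tool.xml;

//Hypothetical helper for illustration only
static class HtmlToPdf {
    public static byte[] FromString(string html) {
        //Fail with a clear message instead of a later "The document has no pages"
        if (string.IsNullOrWhiteSpace(html)) {
            throw new ArgumentException("The framework produced no HTML, so there is nothing to convert.", "html");
        }

        using (var ms = new MemoryStream()) {
            using (var doc = new Document()) {
                using (var writer = PdfWriter.GetInstance(doc, ms)) {
                    doc.Open();

                    //XMLWorker reads from a TextReader, not directly from a string
                    using (var sr = new StringReader(html)) {
                        XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, sr);
                    }

                    doc.Close();
                }
            }

            //ToArray() still works after the PDF objects have closed the stream
            return ms.ToArray();
        }
    }
}
```

While debugging, it can also help to write the same string to a file and open it in a browser, to confirm it really is the HTML you expected.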

Third, the built-in class that has been around for years is HTMLWorker; however, it has been replaced by XMLWorker (Java/.Net). No work is being done on HTMLWorker, which doesn't support CSS files, has only limited support for the most basic CSS properties, and actually breaks on certain tags. If you don't see the HTML attribute or the CSS property and value in this file, it probably isn't supported by HTMLWorker. XMLWorker can be more complicated at times, but those complications also make it more extensible.

Below is C# code that shows how to parse HTML tags into iText abstractions that get automatically added to the document you are working on. C# and Java are very similar, so it should be relatively easy to convert this. Example #1 uses the built-in HTMLWorker to parse the HTML string. Since only inline styles are supported, the class="headline" gets ignored, but everything else should work. Example #2 is the same as the first except that it uses XMLWorker instead. Example #3 also parses the simple CSS example.

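//Namespaces used by the unqualified type names below
using System;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

//Create a byte array that will eventually hold our final PDF
Byte[] bytes;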

//Boilerplate iTextSharp setup here
//Create a stream that we can write to, in this case a MemoryStream
using (var ms = new MemoryStream()) {

    //Create an iTextSharp Document which is an abstraction of a PDF but **NOT** a PDF
    using (var doc = new Document()) {

        //Create a writer that's bound to our PDF abstraction and our stream
        using (var writer = PdfWriter.GetInstance(doc, ms)) {

            //Open the document for writing
            doc.Open();

            //Our sample HTML and CSS
            var example_html = @"<p>This <em>is </em><span class=""headline"" style=""text-decoration: underline;"">some</span> <strong>sample <em> text</em></strong><span style=""color: red;"">!!!</span></p>";
            var example_css = @".headline{font-size:200%}";

            /**************************************************
             * Example #1                                     *
             *                                                *
             * Use the built-in HTMLWorker to parse the HTML. *
             * Only inline CSS is supported.                  *
             * ************************************************/

            //Create a new HTMLWorker bound to our document
            using (var htmlWorker = new iTextSharp.text.html.simpleparser.HTMLWorker(doc)) {

                //HTMLWorker doesn't read a string directly but instead needs a TextReader (which StringReader subclasses)
                using (var sr = new StringReader(example_html)) {

                    //Parse the HTML
                    htmlWorker.Parse(sr);
                }
            }

            /**************************************************
             * Example #2                                     *
             *                                                *
             * Use the XMLWorker to parse the HTML.           *
             * Only inline CSS and absolutely linked          *
             * CSS is supported                               *
             * ************************************************/

            //XMLWorker also reads from a TextReader and not directly from a string
            using (var srHtml = new StringReader(example_html)) {

                //Parse the HTML
                iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, srHtml);
            }

            /**************************************************
             * Example #3                                     *
             *                                                *
             * Use the XMLWorker to parse HTML and CSS        *
             * ************************************************/

            //In order to read CSS as a string we need to switch to a different constructor
            //that takes Streams instead of TextReaders.
            //Below we convert the strings into UTF8 byte array and wrap those in MemoryStreams
            using (var msCss = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_css))) {
                using (var msHtml = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_html))) {

                    //Parse the HTML
                    iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, msHtml, msCss);
                }
            }


            doc.Close();
        }
    }

    //After all of the PDF "stuff" above is done and closed but **before** we
    //close the MemoryStream, grab all of the active bytes from the stream
    bytes = ms.ToArray();
}

//Now we just need to do something with those bytes.
//Here I'm writing them to disk but if you were in ASP.Net you might Response.BinaryWrite() them.
//You could also write the bytes to a database in a varbinary() column (but please don't) or you
//could pass them to another function for further PDF processing.
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
System.IO.File.WriteAllBytes(testFile, bytes);

2017 update

There is good news for HTML-to-PDF demands. As this answer showed, the W3C standard css-break-3 will solve the problem ... It is a Candidate Recommendation with plans to become a definitive Recommendation this year, after tests.

As not-so-standard solutions, there are plugins for C#, as shown by print-css.rocks.

As of 2018, there is also iText7 (the next iteration of the old iTextSharp library) and its HTML-to-PDF package: itext7.pdfhtml

Usage is straightforward:

HtmlConverter.ConvertToPdf(
    new FileInfo(@"Path\to\Html\File.html"),
    new FileInfo(@"Path\to\Pdf\File.pdf")
);

The method has many more overloads.
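For instance, one of those overloads takes the HTML as a string and writes the PDF to a Stream, which is convenient when the markup is produced in memory. The following is a minimal sketch assuming the itext7.pdfhtml package is referenced; the sample markup and the "sample.pdf" file name are placeholders.

```csharp
using System.IO;
using iText.Html2pdf;

class Html2PdfDemo {
    static void Main() {
        //Placeholder markup; in practice this comes from your framework of choice
        var html = "<p>This is <strong>sample</strong> text</p>";

        //Any writable Stream works; a FileStream is used here for simplicity
        using (var pdfStream = new FileStream("sample.pdf", FileMode.Create)) {
            HtmlConverter.ConvertToPdf(html, pdfStream);
        }
    }
}
```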

Update: The iText* family of products has a dual licensing model: free for open source, paid for commercial use.

@Chris Haas has explained very well how to use itextSharp to convert HTML to PDF, very helpful.
My addition is:
By using HtmlTextWriter I put HTML tags inside an HTML table with inline CSS, and I got my PDF the way I wanted without using XMLWorker.
Edit: adding sample code.
ASPX page:

<asp:Panel runat="server" ID="PendingOrdersPanel">
 <!-- to be shown on PDF-->
 <table style="border-spacing: 0;border-collapse: collapse;width:100%;display:none;" >
 <tr><td><img src="abc.com/webimages/logo1.png" style="display: none;" width="230" /></td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla.</td></tr>
 <tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla.</td></tr>
 <tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla</td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla</td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:11px;color:#10466E;padding:0px;text-align:center;"><i>blablabla</i> Pending orders report<br /></td></tr>
 </table>
<asp:GridView runat="server" ID="PendingOrdersGV" RowStyle-Wrap="false" AllowPaging="true" PageSize="10" Width="100%" CssClass="Grid" AlternatingRowStyle-CssClass="alt" AutoGenerateColumns="false"
   PagerStyle-CssClass="pgr" HeaderStyle-ForeColor="White" PagerStyle-HorizontalAlign="Center" HeaderStyle-HorizontalAlign="Center" RowStyle-HorizontalAlign="Center" DataKeyNames="Document#" 
      OnPageIndexChanging="PendingOrdersGV_PageIndexChanging" OnRowDataBound="PendingOrdersGV_RowDataBound" OnRowCommand="PendingOrdersGV_RowCommand">
   <EmptyDataTemplate><div style="text-align:center;">no records found</div></EmptyDataTemplate>
    <Columns>                                           
     <asp:ButtonField CommandName="PendingOrders_Details" DataTextField="Document#" HeaderText="Document #" SortExpression="Document#" ItemStyle-ForeColor="Black" ItemStyle-Font-Underline="true"/>
      <asp:BoundField DataField="Order#" HeaderText="order #" SortExpression="Order#"/>
     <asp:BoundField DataField="Order Date" HeaderText="Order Date" SortExpression="Order Date" DataFormatString="{0:d}"></asp:BoundField> 
    <asp:BoundField DataField="Status" HeaderText="Status" SortExpression="Status"></asp:BoundField>
    <asp:BoundField DataField="Amount" HeaderText="Amount" SortExpression="Amount" DataFormatString="{0:C2}"></asp:BoundField> 
   </Columns>
    </asp:GridView>
</asp:Panel>

C# code:

protected void PendingOrdersPDF_Click(object sender, EventArgs e)
{
    if (PendingOrdersGV.Rows.Count > 0)
    {
        //to allow paging=false & change style.
        PendingOrdersGV.HeaderStyle.ForeColor = System.Drawing.Color.Black;
        PendingOrdersGV.BorderColor = Color.Gray;
        PendingOrdersGV.Font.Name = "Tahoma";
        PendingOrdersGV.DataSource = clsBP.get_PendingOrders(lbl_BP_Id.Text);
        PendingOrdersGV.AllowPaging = false;
        PendingOrdersGV.Columns[0].Visible = false; //export won't work if there's a link in the gridview
        PendingOrdersGV.DataBind();

        //to PDF code --Sam
        string attachment = "attachment; filename=report.pdf";
        Response.ClearContent();
        Response.AddHeader("content-disposition", attachment);
        Response.ContentType = "application/pdf";
        StringWriter stw = new StringWriter();
        HtmlTextWriter htextw = new HtmlTextWriter(stw);
        htextw.AddStyleAttribute("font-size", "8pt");
        htextw.AddStyleAttribute("color", "Grey");

        PendingOrdersPanel.RenderControl(htextw); //Name of the Panel
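        //A4 page with left, right, top and bottom margins of 5, 5, 15 and 5 points
        Document document = new Document(PageSize.A4, 5, 5, 15, 5);
        //Note: the returned font is discarded, so this call does not affect the output
        FontFactory.GetFont("Tahoma", 50, iTextSharp.text.BaseColor.BLUE);
        PdfWriter.GetInstance(document, Response.OutputStream);
        document.Open();

        StringReader str = new StringReader(stw.ToString());
        HTMLWorker htmlworker = new HTMLWorker(document);
        htmlworker.Parse(str);

        //Closing the document writes the finished PDF to Response.OutputStream
        document.Close();
        //End the response so nothing else is appended after the PDF bytes
        Response.End();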
    }
}

Of course, include the iTextSharp references in the cs file:

using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.html.simpleparser;
using iTextSharp.tool.xml;

Hope this helps!
Thanks