How to convert HTML to PDF using iTextSharp

First, HTML and PDF are not related, although they were created around the same time. HTML is intended to convey higher-level information such as paragraphs and tables. Although there are ways to control it, it is ultimately up to the browser to draw these higher-level concepts. PDF is intended to convey documents, and the documents must "look" the same wherever they are rendered.

In an HTML document you might have a paragraph that's 100% wide. Depending on the width of your monitor it might take 2 lines or 10 lines, when you print it it might take 7 lines, and when you look at it on your phone it might take 20 lines. A PDF file, however, must be independent of the rendering device, so regardless of your screen size it must always render exactly the same.

Because of the musts above, PDF doesn't support abstract things like "tables" or "paragraphs". There are three basic things that PDF supports: text, lines/shapes, and images. (There are other things like annotations and movies, but I'm trying to keep it simple here.) In a PDF you don't say "here's a paragraph, browser do your thing!". Instead you say "draw this text at this exact X,Y location using this exact font, and don't worry, I've previously calculated the width of the text so I know it will all fit on this line". You also don't say "here's a table"; instead you say "draw this text at this exact location and then draw a rectangle at this other exact location that I've previously calculated, so I know it will appear to be around the text".
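To make the "exact X,Y" idea concrete, here is a minimal sketch using iTextSharp 5's low-level PdfContentByte canvas. The coordinates, padding, sample text, and output file name are assumptions chosen for illustration, and the rectangle is only an approximate box around the text, not a real table cell.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

class AbsoluteDrawingSketch
{
    static void Main()
    {
        using (var ms = new MemoryStream())
        {
            var doc = new Document(PageSize.A4);
            var writer = PdfWriter.GetInstance(doc, ms);
            doc.Open();

            // Low-level canvas: everything drawn here is placed at absolute coordinates
            PdfContentByte cb = writer.DirectContent;

            // One of the standard PDF fonts, not embedded
            var font = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);

            // Assumed values for illustration (PDF units are points, origin at bottom-left)
            const string text = "Hello";
            const float size = 12f, x = 72f, y = 720f, pad = 4f;

            // Measure the text up front, as the paragraph above describes
            float width = font.GetWidthPoint(text, size);

            // "Draw this text at this exact X,Y location using this exact font"
            cb.BeginText();
            cb.SetFontAndSize(font, size);
            cb.ShowTextAligned(PdfContentByte.ALIGN_LEFT, text, x, y, 0f);
            cb.EndText();

            // "Then draw a rectangle at this other exact location" (approximate box around the text)
            cb.Rectangle(x - pad, y - pad, width + 2 * pad, size + 2 * pad);
            cb.Stroke();

            doc.Close();
            File.WriteAllBytes("absolute.pdf", ms.ToArray());
        }
    }
}
```

Nothing in the resulting file says "cell" or "paragraph"; the PDF only records the text operator and the rectangle path at the positions computed above.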

Second, iText and iTextSharp parse HTML and CSS. That's it. ASP.Net, MVC, Razor, Struts, Spring, etc. are all HTML frameworks, but iText/iTextSharp is 100% unaware of them. The same goes for DataGridViews, Repeaters, Templates, Views, etc., which are all framework-specific abstractions. It is your responsibility to get the HTML from your framework of choice; iText won't help you. If you get an exception saying The document has no pages, or you think that "iText isn't parsing my HTML", it is almost certain that you don't actually have HTML, you only think you do.
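One way to check whether you really have HTML is to look at the string before it ever reaches iText. The sketch below is plain C# with no iText calls; the RequireHtml helper, the sample string, and the debug.html file name are hypothetical names introduced only for illustration.

```csharp
using System;
using System.IO;

static class HtmlGuardSketch
{
    // Hypothetical helper: fail early with a clearer message than "The document has no pages"
    public static string RequireHtml(string html)
    {
        if (string.IsNullOrWhiteSpace(html))
            throw new ArgumentException("The framework produced no HTML, so there is nothing to parse.", nameof(html));
        return html;
    }

    static void Main()
    {
        // Stand-in for whatever your framework rendered (a view, a control, a template...)
        string html = "<p>Hello</p>";

        // Dump the exact string you are about to parse so you can open it in a browser
        File.WriteAllText("debug.html", RequireHtml(html));
    }
}
```

If the dumped file is empty, or contains server-side markup instead of plain HTML, the problem is upstream of iText.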

Third, the built-in class that's been around for years is HTMLWorker; however, it has been replaced by XMLWorker (Java / .Net). Zero work is being done on HTMLWorker, which doesn't support CSS files, has only limited support for the most basic CSS properties, and actually breaks on certain tags. If you do not see the HTML attribute or the CSS property and value in this file, then it probably isn't supported by HTMLWorker. XMLWorker can be more complicated at times, but those complications also make it more extensible.

Below is C# code that shows how to parse HTML tags into iText abstractions that get added automatically to the document you are working on. C# and Java are very similar, so it should be relatively easy to convert. Example #1 uses the built-in HTMLWorker to parse the HTML string. Since only inline styles are supported, the class="headline" gets ignored, but everything else should actually work. Example #2 is the same as the first except that it uses XMLWorker instead. Example #3 also parses the simple CSS example.

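//Requires: using System; using System.IO; using iTextSharp.text; using iTextSharp.text.pdf;
//XMLWorker (Examples #2 and #3) ships as a separate library from the iTextSharp core
//Create a byte array that will eventually hold our final PDF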
Byte[] bytes;

//Boilerplate iTextSharp setup here
//Create a stream that we can write to, in this case a MemoryStream
using (var ms = new MemoryStream()) {

    //Create an iTextSharp Document which is an abstraction of a PDF but **NOT** a PDF
    using (var doc = new Document()) {

        //Create a writer that's bound to our PDF abstraction and our stream
        using (var writer = PdfWriter.GetInstance(doc, ms)) {

            //Open the document for writing
            doc.Open();

            //Our sample HTML and CSS
            var example_html = @"<p>This <em>is </em><span class=""headline"" style=""text-decoration: underline;"">some</span> <strong>sample <em> text</em></strong><span style=""color: red;"">!!!</span></p>";
            var example_css = @".headline{font-size:200%}";

            /**************************************************
             * Example #1                                     *
             *                                                *
             * Use the built-in HTMLWorker to parse the HTML. *
             * Only inline CSS is supported.                  *
             * ************************************************/

            //Create a new HTMLWorker bound to our document
            using (var htmlWorker = new iTextSharp.text.html.simpleparser.HTMLWorker(doc)) {

                //HTMLWorker doesn't read a string directly but instead needs a TextReader (which StringReader subclasses)
                using (var sr = new StringReader(example_html)) {

                    //Parse the HTML
                    htmlWorker.Parse(sr);
                }
            }

            /**************************************************
             * Example #2                                     *
             *                                                *
             * Use the XMLWorker to parse the HTML.           *
             * Only inline CSS and absolutely linked          *
             * CSS is supported                               *
             * ************************************************/

            //XMLWorker also reads from a TextReader and not directly from a string
            using (var srHtml = new StringReader(example_html)) {

                //Parse the HTML
                iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, srHtml);
            }

            /**************************************************
             * Example #3                                     *
             *                                                *
             * Use the XMLWorker to parse HTML and CSS        *
             * ************************************************/

            //In order to read CSS as a string we need to switch to a different constructor
            //that takes Streams instead of TextReaders.
            //Below we convert the strings into UTF8 byte array and wrap those in MemoryStreams
            using (var msCss = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_css))) {
                using (var msHtml = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_html))) {

                    //Parse the HTML
                    iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, msHtml, msCss);
                }
            }


            doc.Close();
        }
    }

    //After all of the PDF "stuff" above is done and closed but **before** we
    //close the MemoryStream, grab all of the active bytes from the stream
    bytes = ms.ToArray();
}

//Now we just need to do something with those bytes.
//Here I'm writing them to disk but if you were in ASP.Net you might Response.BinaryWrite() them.
//You could also write the bytes to a database in a varbinary() column (but please don't) or you
//could pass them to another function for further PDF processing.
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
System.IO.File.WriteAllBytes(testFile, bytes);

2017 update

There is good news for HTML-to-PDF needs. As this answer showed, the W3C standard css-break-3 will solve the problem... It is a Candidate Recommendation, with a plan to become a definitive Recommendation this year, after testing.

As less standard options, there are solutions, with plugins for C#, as shown by print-css.rocks.

As of 2018, there is also iText7 (the next iteration of the old iTextSharp library) and its HTML-to-PDF package: itext7.pdfhtml

Usage is straightforward:

HtmlConverter.ConvertToPdf(
    new FileInfo(@"Path\to\Html\File.html"),
    new FileInfo(@"Path\to\Pdf\File.pdf")
);

The method has many more overloads.
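As one example of another overload, here is a minimal sketch that converts an in-memory HTML string and writes the PDF to a stream instead of working with two files. The sample markup and the output file name are assumptions for illustration; the itext7.pdfhtml package must be referenced.

```csharp
using System.IO;
using iText.Html2pdf;

class HtmlStringToPdfSketch
{
    static void Main()
    {
        // Assumed sample markup; in practice this is the HTML your framework rendered
        string html = "<p>This is <strong>sample</strong> text.</p>";

        // Any writable Stream works here (FileStream, MemoryStream, a response stream...)
        using (var pdf = new FileStream("from-string.pdf", FileMode.Create))
        {
            HtmlConverter.ConvertToPdf(html, pdf);
        }
    }
}
```

The stream-based form is handy in a web application, where the HTML is usually already a string and the PDF is sent straight to the client rather than saved to disk.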

Update: the iText* family of products has a dual licensing model: free for open source, paid for commercial use.

@Chris Haas has explained very well how to use iTextSharp to convert HTML to PDF, which was very helpful.
My addition is:
By using HtmlTextWriter, I put the HTML tags inside an HTML table with inline CSS, and I got my PDF the way I wanted without using XMLWorker.
Edit: adding sample code:
ASPX page:

<asp:Panel runat="server" ID="PendingOrdersPanel">
 <!-- to be shown on PDF-->
 <table style="border-spacing: 0;border-collapse: collapse;width:100%;display:none;" >
 <tr><td><img src="abc.com/webimages/logo1.png" style="display: none;" width="230" /></td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla.</td></tr>
 <tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla.</td></tr>
 <tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla</td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla</td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:11px;color:#10466E;padding:0px;text-align:center;"><i>blablabla</i> Pending orders report<br /></td></tr>
 </table>
<asp:GridView runat="server" ID="PendingOrdersGV" RowStyle-Wrap="false" AllowPaging="true" PageSize="10" Width="100%" CssClass="Grid" AlternatingRowStyle-CssClass="alt" AutoGenerateColumns="false"
   PagerStyle-CssClass="pgr" HeaderStyle-ForeColor="White" PagerStyle-HorizontalAlign="Center" HeaderStyle-HorizontalAlign="Center" RowStyle-HorizontalAlign="Center" DataKeyNames="Document#" 
      OnPageIndexChanging="PendingOrdersGV_PageIndexChanging" OnRowDataBound="PendingOrdersGV_RowDataBound" OnRowCommand="PendingOrdersGV_RowCommand">
   <EmptyDataTemplate><div style="text-align:center;">no records found</div></EmptyDataTemplate>
    <Columns>                                           
     <asp:ButtonField CommandName="PendingOrders_Details" DataTextField="Document#" HeaderText="Document #" SortExpression="Document#" ItemStyle-ForeColor="Black" ItemStyle-Font-Underline="true"/>
      <asp:BoundField DataField="Order#" HeaderText="order #" SortExpression="Order#"/>
     <asp:BoundField DataField="Order Date" HeaderText="Order Date" SortExpression="Order Date" DataFormatString="{0:d}"></asp:BoundField> 
    <asp:BoundField DataField="Status" HeaderText="Status" SortExpression="Status"></asp:BoundField>
    <asp:BoundField DataField="Amount" HeaderText="Amount" SortExpression="Amount" DataFormatString="{0:C2}"></asp:BoundField> 
   </Columns>
    </asp:GridView>
</asp:Panel>

C# code:

protected void PendingOrdersPDF_Click(object sender, EventArgs e)
{
    if (PendingOrdersGV.Rows.Count > 0)
    {
        //to allow paging=false & change style.
        PendingOrdersGV.HeaderStyle.ForeColor = System.Drawing.Color.Black;
        PendingOrdersGV.BorderColor = Color.Gray;
        PendingOrdersGV.Font.Name = "Tahoma";
        PendingOrdersGV.DataSource = clsBP.get_PendingOrders(lbl_BP_Id.Text);
        PendingOrdersGV.AllowPaging = false;
        PendingOrdersGV.Columns[0].Visible = false; //export won't work if there's a link in the gridview
        PendingOrdersGV.DataBind();

        //to PDF code --Sam
        string attachment = "attachment; filename=report.pdf";
        Response.ClearContent();
        Response.AddHeader("content-disposition", attachment);
        Response.ContentType = "application/pdf";
        StringWriter stw = new StringWriter();
        HtmlTextWriter htextw = new HtmlTextWriter(stw);
        htextw.AddStyleAttribute("font-size", "8pt");
        htextw.AddStyleAttribute("color", "Grey");

        PendingOrdersPanel.RenderControl(htextw); //Name of the Panel
        //Create the document once, with the page size and margins (left, right, top, bottom)
        Document document = new Document(PageSize.A4, 5, 5, 15, 5);
        PdfWriter.GetInstance(document, Response.OutputStream);
        document.Open();

        StringReader str = new StringReader(stw.ToString());
        HTMLWorker htmlworker = new HTMLWorker(document);
        htmlworker.Parse(str);

        document.Close();
        Response.End(); //end the response so nothing else is appended after the PDF bytes
    }
}

Of course, include the iTextSharp references in the .cs file:

using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.html.simpleparser;
using iTextSharp.tool.xml;

Hope this helps!
Thank you