How to make LaTeX PDF output copy-and-pasteable, searchable and diffable
Have you ever tried to diff a PDF file generated by pdflatex? Have you ever tried to copy and paste from one? If so, your face probably looked as surprised as mine after you attempted it: it doesn’t work! The characters that make it to your clipboard are gibberish, even though the PDF looks entirely normal.
This same behaviour will bite you if you try to index or search those PDFs. Or if you try to diff them, for example if you manage them using git.
I can’t tell you what causes it, but I can tell you the solution.
How to generate better PDF files from LaTeX
There are two approaches. The first one worked better for me, but it apparently only works if you use T1 encoding (which is probably what everybody uses in this day and age). The solution is to add the line
\usepackage{cmap}
before all other packages.
(In case you’re wondering: to activate T1, add the line \usepackage[T1]{fontenc}.)
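Putting the two lines above together, a minimal test document might look like the sketch below. The document class, the inputenc line and the sample sentence are my own assumptions for illustration; the only parts taken from the fix itself are the cmap and fontenc lines and their order.

```latex
% Minimal sketch for pdflatex: cmap first, then the T1 font encoding.
\documentclass{article}
\usepackage{cmap}            % loaded before all other packages
\usepackage[T1]{fontenc}     % activates T1 encoding
\usepackage[utf8]{inputenc}  % assumption: the source file is UTF-8
\begin{document}
% Accented letters and ligatures (fi, ffi) are typical trouble spots.
A na\"ive caf\'e owner files the official figures.
\end{document}
```

Compile it with pdflatex, then copy the sentence out of your PDF viewer and paste it into a text editor to see whether it survives.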
I found mention of another approach. This one didn’t work for me, but I’m including it here in case it’s more useful to you.
\input glyphtounicode
\pdfgentounicode=1
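For completeness, here is a sketch of where those two lines would go in a full document. Both commands belong in the preamble and are specific to pdflatex. Loading lmodern is my own assumption, not part of the original tip: as far as I understand it, this approach relies on glyph names inside the font, so it needs a scalable (Type 1) font rather than a bitmap one.

```latex
% Minimal sketch for pdflatex: the glyphtounicode approach.
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{lmodern}      % assumption: use a Type 1 font family
\input{glyphtounicode}    % table mapping glyph names to Unicode
\pdfgentounicode=1        % ask pdfTeX to write ToUnicode mappings
\begin{document}
% Accented letters and ligatures (fi, ffi) are typical trouble spots.
A na\"ive caf\'e owner files the official figures.
\end{document}
```

The same copy-and-paste check as before tells you whether it made a difference on your setup.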
Now copy and paste, diffing, and all other uses of the plain text in PDFs generated by LaTeX should work.