015-arabic

Latest commit: d3d2503 · Apr 16, 2023

Name	Last commit message	Last commit date
README.md	MISC: Update PyPDF2 to pypdf (#19)	Jan 22, 2023
habibi-oneline-cmap.pdf	ENH: Add PDF with Arab text (#13)	Jul 17, 2022
habibi-rotated.pdf	Add rotated arabic file	Apr 16, 2023
habibi.html	ENH: Add PDF with Arab text (#13)	Jul 17, 2022
habibi.pdf	ENH: Add PDF with Arab text (#13)	Jul 17, 2022
rotate.py	Add rotated arabic file	Apr 16, 2023

README.md

Arabic script for testing text extraction

habibi.pdf was generated with WeasyPrint 54.1-3 on Debian unstable in July 2022, using the following command:

weasyprint habibi.html habibi.pdf

CMap Structure

habibi-oneline-cmap.pdf is the same file, except that the beginbfchar stanza of the ToUnicode CMap separates its <srcString> <dstString> pairings with ASCII spaces rather than newlines. That is, where habibi.pdf contains:

6 beginbfchar
<0003> <>
<03f2> <>
<0392> <>
<03f4> <>
<02f4> <>
<03a3> <062d064e0628064a0628064a0020>
endbfchar

habibi-oneline-cmap.pdf contains:

6 beginbfchar
<0003> <> <03f2> <> <0392> <> <03f4> <> <02f4> <> <03a3> <062d064e0628064a0628064a0020>
endbfchar

Otherwise the two files are identical.
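To illustrate why the two stanzas carry the same information, here is a minimal sketch of a token-based bfchar reader. It is not how pypdf or any other library parses CMaps; the names `parse_bfchar`, `MULTILINE` and `ONELINE` are made up for this example, and it assumes hex strings contain no embedded white space and that destination strings are UTF-16BE.

```python
import re

# A stanza is everything between the beginbfchar and endbfchar keywords.
STANZA = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
# One <srcString> <dstString> pairing. \s* accepts spaces and newlines alike,
# so the layout of the pairings does not matter.
PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]*)>")


def parse_bfchar(cmap_text: str) -> dict[int, str]:
    """Map each source code to its destination text (hypothetical helper)."""
    mapping = {}
    for stanza in STANZA.finditer(cmap_text):
        for src, dst in PAIR.findall(stanza.group(1)):
            # Assumption: destination strings are UTF-16BE encoded.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


# The stanza as shown for habibi.pdf (one pairing per line).
MULTILINE = """6 beginbfchar
<0003> <>
<03f2> <>
<0392> <>
<03f4> <>
<02f4> <>
<03a3> <062d064e0628064a0628064a0020>
endbfchar"""

# The stanza as shown for habibi-oneline-cmap.pdf (space-delimited).
ONELINE = """6 beginbfchar
<0003> <> <03f2> <> <0392> <> <03f4> <> <02f4> <> <03a3> <062d064e0628064a0628064a0020>
endbfchar"""
```

With this reader, `parse_bfchar(MULTILINE)` and `parse_bfchar(ONELINE)` produce the same six-entry mapping, because the pairings are found by token rather than by line.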

I believe text extraction should behave the same way on both files. From what I understand of the PDF specification, they are syntactically equivalent, since spaces and newlines are both white-space characters that only separate tokens.
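One way to check this expectation is to extract the text of both files and compare the results. The sketch below uses pypdf's `PdfReader` and `extract_text`; the helper names `extract_text_with_pypdf` and `texts_match` are hypothetical, and the extractor is passed in as a parameter so that another library could be swapped in.

```python
from pathlib import Path
from typing import Callable


def extract_text_with_pypdf(path: Path) -> str:
    """Concatenate the extracted text of every page using pypdf."""
    # Third-party dependency, imported lazily so that this module still
    # loads when pypdf is not installed.
    from pypdf import PdfReader

    reader = PdfReader(str(path))
    return "\n".join(page.extract_text() for page in reader.pages)


def texts_match(
    first: Path,
    second: Path,
    extract: Callable[[Path], str] = extract_text_with_pypdf,
) -> bool:
    """Return True if both files yield exactly the same extracted text."""
    return extract(first) == extract(second)
```

Run from this directory, `texts_match(Path("habibi.pdf"), Path("habibi-oneline-cmap.pdf"))` should return True if the extractor treats the two CMap layouts alike.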
