January 14, 2008, 01:21

A script that uses CAM::PDF to check font embedding in a PDF

This script uses Perl's CAM::PDF module to list the fonts in a PDF and, in particular, to check whether each font is embedded.

Example run:

$ ./listfont.pl 2.pdf
$VAR1 = [
          {
            'subtype' => 'CIDFontType0',
            'embedded' => 'yes',
            'page' => 1,
            'embeddedtype' => 'Type 1 (CID)',
            'basefont' => 'RAUSIV+GothicBBBPro-Medium',
            'fontfamily' => 'þÿA-OTF N-0´0·0Ã0¯BBB Pro Medium'
          },
          {
            'subtype' => 'CIDFontType2',
            'embedded' => 'yes',
            'page' => 1,
            'embeddedtype' => 'TrueType (CID)',
            'basefont' => 'BCSASB+SymbolMT',
            'fontfamily' => 'Symbol'
          },
          {
            'subtype' => 'TrueType',
            'embedded' => undef,
            'page' => 1,
            'embeddedtype' => undef,
            'basefont' => 'ArialMT',
            'fontfamily' => 'Arial'
          }
        ];
$
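
The first fontfamily value in the output above begins with þÿ, which is how the bytes FE FF look when printed one byte at a time. That pair is the byte order mark that PDF uses for UTF-16BE text strings, so the value is probably a UTF-16BE string that the script prints undecoded. A minimal sketch of decoding such a value follows; decode_pdf_text is a hypothetical helper name, the sample bytes are an assumption for illustration, and strings without the mark are returned unchanged (PDFDocEncoding is not handled).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode a PDF text string: UTF-16BE if it starts with the FE FF byte order
# mark, otherwise return the bytes unchanged.
sub decode_pdf_text {
    my ($bytes) = @_;
    return $bytes unless defined $bytes;
    if ( $bytes =~ s/\A\xFE\xFF// ) {
        return decode( 'UTF-16BE', $bytes );
    }
    return $bytes;
}

# Sample bytes (assumed for illustration): BOM followed by U+4E2D U+30B4.
my $raw = "\xFE\xFF\x4E\x2D\x30\xB4";
binmode STDOUT, ':encoding(UTF-8)';
print decode_pdf_text($raw), "\n";
```

In listfont.pl this could be applied to the fontfamily value before it is stored, in place of or in addition to decode_fontfamily.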

The source is as follows.

Filename: listfont.pl

#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;
use Data::Dumper;
binmode STDOUT => ':utf8';
my $infile = shift;
defined $infile or die "usage: $0 file.pdf\n";
my $doc = CAM::PDF->new($infile) || die "$CAM::PDF::errstr\n";
 
my $fonts;
for my $page (1 .. $doc->numPages()) {
  foreach my $fontname (sort $doc->getFontNames($page)) {
    my $font = $doc->getFont($page, $fontname);
    $font = parsefont($font);
    push @$fonts,
    {
      page         => $page,
      embedded     => $font->{embedded},
      fontfamily   => decode_fontfamily($font->{fontfamily}),
      # scalar() keeps the key/value pairs aligned even if a key is missing
      basefont     => scalar $doc->getValue($font->{BaseFont}),
      subtype      => scalar $doc->getValue($font->{Subtype}),
      embeddedtype => $font->{embeddedtype},
    };
  }
}
 
print Dumper($fonts);
exit;
 
 
sub parsefont {
  my $font = shift;
  my $fontdescriptor  = $font->{FontDescriptor};
  if ($fontdescriptor) {
    my $ref = $doc->getValue($fontdescriptor);
    $font->{fontfamily} = $doc->getValue($ref->{FontFamily});
    my $fontfile  = $doc->getValue($ref->{FontFile });
    my $fontfile2 = $doc->getValue($ref->{FontFile2});
    my $fontfile3 = $doc->getValue($ref->{FontFile3});
    if ( ($fontfile) || ($fontfile2) || ($fontfile3) ) {
      $font->{embedded} = 'yes';
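      # Per the PDF spec: FontFile holds a Type 1 program, FontFile2 a TrueType
      # program (CID-keyed or not), FontFile3 a program typed by its /Subtype.
      $font->{embeddedtype} ="Type 1"         if $fontfile;
      $font->{embeddedtype} ="TrueType (CID)" if $fontfile2;
      $font->{embeddedtype} ="Type 1 (CID)"   if $fontfile3;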
    }
    return $font;
  }
  my $descendantfonts = $doc->getValue($font->{DescendantFonts});
  if ($descendantfonts) {
    if (ref $font->{DescendantFonts}->{value}) {
      return parsefont(
               $doc->getObjValue(
                 $font->{DescendantFonts}->{value}->[0]->{value}
               )
             );
    } else {
      return parsefont(
               $doc->getObjValue(
                 $doc->getObjValue(
                   $font->{DescendantFonts}->{value}
                 )->[0]->{value}
               )
             );
    }
  }
  # Neither key is present (e.g. a font with no descriptor): return it as is.
  return $font;
}
 
sub decode_fontfamily {
  my $str = shift;
  return $str unless defined $str;  # fonts without a FontFamily entry
  $str =~ s/#([0-9A-Fa-f]{2})/pack('C', hex($1))/eg;
  return $str;
}
 
__END__
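
Since the point of the script is to find fonts that are not embedded, the list it builds can be filtered instead of dumped whole. The following is a minimal sketch under that assumption: the sample records are copied from the example output above and trimmed to three keys, and not_embedded is a hypothetical helper name.

```perl
use strict;
use warnings;

# Sample records copied from the example output (trimmed to three keys).
my $fonts = [
    { page => 1, basefont => 'RAUSIV+GothicBBBPro-Medium', embedded => 'yes' },
    { page => 1, basefont => 'BCSASB+SymbolMT',            embedded => 'yes' },
    { page => 1, basefont => 'ArialMT',                    embedded => undef },
];

# Keep only the records whose 'embedded' field was never set to 'yes'.
sub not_embedded {
    my ($list) = @_;
    return grep { !defined $_->{embedded} || $_->{embedded} ne 'yes' } @$list;
}

for my $font ( not_embedded($fonts) ) {
    print "page $font->{page}: $font->{basefont} is not embedded\n";
}
```

With the sample data this reports only ArialMT, the one font whose embedded field is undef in the example output.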

The recursive part has two different return paths. This is because an object such as

1647 0 obj<</Subtype/Type0/DescendantFonts[1646 0 R]/BaseFont/DIEAIA+SymbolMT/ToUnicode 1644 0 R/Encoding/Identity-H/Type/Font>>

is parsed as

$VAR1 = {
          'DescendantFonts' => bless( {
                                        'gennum' => '0',
                                        'value' => [
                                                     bless( {
                                                              'gennum' => '0',
                                                              'value' => '1646',
                                                              'type' => 'reference',
                                                              'objnum' => '1647'
                                                            }, 'CAM::PDF::Node' )
                                                   ],

whereas an object such as

9 0 obj<</Subtype/Type0/DescendantFonts 15 0 R/BaseFont/RAUSIV+GothicBBBPro-Medium/Encoding/Identity-H/Type/Font>>

is parsed as

$VAR1 = {
          'DescendantFonts' => bless( {
                                        'gennum' => '0',
                                        'value' => '15',
                                        'type' => 'reference',
                                        'objnum' => '9'
                                      }, 'CAM::PDF::Node' ),

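I don't fully understand why, but the original PDFs write DescendantFonts in two ways (an inline array holding a reference in the first case, an indirect reference to another object in the second), and each is parsed exactly as written, so I branched on the two cases.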
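
The two notations can be normalized by resolving references before indexing into the array. The sketch below shows the idea with plain hashes that imitate the {type, value} layout of the dumps above; they are not real CAM::PDF nodes. The object table, the contents of objects 15, 16 and 1646, the 'array' and 'dictionary' type names, and the helpers resolve and first_descendant are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical object table (object number => node). The nodes imitate the
# {type, value} layout of the dumps above; they are not real CAM::PDF nodes.
my %objects = (
    15   => { type => 'array', value => [ { type => 'reference', value => 16 } ] },
    16   => { type => 'dictionary', value => { BaseFont => 'GothicBBBPro-Medium' } },
    1646 => { type => 'dictionary', value => { BaseFont => 'SymbolMT' } },
);

# Follow reference nodes until a non-reference node is reached.
sub resolve {
    my ($node) = @_;
    while ( $node->{type} eq 'reference' ) {
        $node = $objects{ $node->{value} }
            or die "unknown object\n";
    }
    return $node;
}

# Return the first descendant font node for either notation.
sub first_descendant {
    my ($descendant_fonts) = @_;
    my $array = resolve($descendant_fonts);    # the array node in both cases
    return resolve( $array->{value}[0] );
}

# Inline array, as in object 1647: /DescendantFonts[1646 0 R]
my $inline = { type => 'array', value => [ { type => 'reference', value => 1646 } ] };
# Indirect reference, as in object 9: /DescendantFonts 15 0 R
my $indirect = { type => 'reference', value => 15 };

print first_descendant($inline)->{value}{BaseFont},   "\n";
print first_descendant($indirect)->{value}{BaseFont}, "\n";
```

CAM::PDF's getValue is documented as following references, so it may be possible to collapse the two branches in parsefont the same way; that is untested here.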

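As you can tell from how shaky this code is, it is not a definitive version: I myself do not know whether it finds every font without omissions or mistakes. It may be better to simply press Ctrl+D in Acrobat or Adobe Reader and check the Fonts tab.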

Posted 2008.01.14 01:21 by 大野義貴 [Script]