Synthetic Font Dataset Generation

For OCR (optical character recognition) tasks, single-character images are used to train the machine-learning model. These images range from handwritten datasets to synthetic datasets generated by a script. A synthetic dataset is a fast way to produce training examples in large quantities. For some applications (e.g. scanning printer-generated documents), a synthetic dataset alone may be sufficient.

I have written the following Python script to generate such a dataset. The script generates 12×20 images of the characters a-z, A-Z and 0-9 for a selected set of fonts. For my application I only need regular English fonts, so I listed them in a text file (Fonts_list.txt). The script collects all fonts on my system (on Windows these are usually in C:\Windows\Fonts\*.ttf) and keeps only those whose names contain an entry from Fonts_list.txt. To generate a dataset for every available system font, the script needs to be modified accordingly. Instead of globbing C:\Windows\Fonts\*.ttf, the ttfquery module (ttfquery.findsystem.findFonts()) can also be used to locate the fonts.
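As a rough sketch of the selection step, the snippet below filters a list of font paths by the name fragments kept in Fonts_list.txt. The helper name `select_fonts` and the sample paths are illustrative only; matching is a case-insensitive substring test against the file name, which is my reading of how the list is meant to be used.

```python
import ntpath

def select_fonts(font_paths, wanted):
    """Return the paths whose file name contains any wanted fragment (case-insensitive)."""
    # Drop blank entries: an empty fragment would match every font
    wanted = [w.strip().lower() for w in wanted if w.strip()]
    selected = []
    for path in font_paths:
        name = ntpath.basename(path).lower()
        if any(w in name for w in wanted):
            selected.append(path)
    return selected

# Hypothetical sample paths, standing in for glob.glob("C:\\Windows\\Fonts\\*.ttf")
sample_paths = [
    "C:\\Windows\\Fonts\\arialbd.ttf",
    "C:\\Windows\\Fonts\\wingding.ttf",
    "C:\\Windows\\Fonts\\calibri.ttf",
]
print(select_fonts(sample_paths, ["Calibri", "Arial"]))
```

Matching on the file name rather than the full path avoids accidental hits on the directory part of the path.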

For each selected font, the script generates every character (a-z, A-Z, 0-9) at nine different positions, shifting the text by one pixel in the left-right and top-bottom directions. The generated images are named as follows:

<font>_<l_u_d_flag>_<position_0_8>_<character>.jpg

Example: arialbd_u_3_H.jpg

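I had to add l_u_d_flag (l = lowercase, u = uppercase, d = digit) to distinguish the image for ‘h’ from the one for ‘H’. Windows file names are case-insensitive, so without the flag the later image would overwrite the earlier one.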
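To make the naming concrete, here is a small sketch that builds the file names for all nine shifted positions of one character. The helper names `case_flag` and `image_name` are mine, not part of the script; the position index runs from 0 to 8, row by row.

```python
def case_flag(ch):
    """Return 'l' for lowercase, 'u' for uppercase and 'd' for digits."""
    if ch.isdigit():
        return "d"
    return "l" if ch.islower() else "u"

def image_name(font_file, pos_idx, ch):
    """Build '<font>_<flag>_<position>_<character>.jpg'."""
    return "{}_{}_{}_{}.jpg".format(font_file, case_flag(ch), pos_idx, ch)

# Nine one-pixel shifts around (0, 0): y is the outer loop, x the inner one
offsets = [(x, y) for y in (-1, 0, 1) for x in (-1, 0, 1)]

for pos_idx, (x, y) in enumerate(offsets):
    print(pos_idx, (x, y), image_name("arialbd", pos_idx, "H"))
```

With the flag in place, the names for ‘h’ and ‘H’ still differ after lowercasing, so they no longer collide on a case-insensitive file system.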

This naming scheme makes it easy to determine the target class when the images are used to train the machine-learning model.
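One possible way to recover the target class from a file name is sketched below. The function name `label_from_name` is hypothetical; it assumes the four-part naming pattern described above and splits from the right, so that underscores inside a font file name are left alone.

```python
import os

def label_from_name(file_name):
    """Split '<font>_<flag>_<position>_<character>.jpg' into its four parts."""
    stem = os.path.splitext(os.path.basename(file_name))[0]
    # Split from the right: the last three fields never contain an underscore,
    # but the font file name might (e.g. 'BOD_R')
    font, flag, pos, ch = stem.rsplit("_", 3)
    return font, flag, int(pos), ch

font, flag, pos, ch = label_from_name("arialbd_u_3_H.jpg")
print(font, flag, pos, ch)
```

The last field (`ch`) is the character itself and can serve directly as the class label.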


Fonts_list.txt:

Calibri
Times
Arial
Aparaj
Agency
Bell
Brln
Bod_
Book
CALIST
Cambria
Candara
Century
Consola
Constan
Corbel
DokChamp
Elephnt
Euphemia
FRAD
FRAH
FRAMD
FRAB
Gadugi
Gara
Georgia
Impact
MS
Poor
Verdana

Script:

<pre>from PIL import Image, ImageDraw, ImageFont
#import ttfquery.findsystem  #only needed for the findFonts() alternative below
import string
import ntpath
import numpy as np
import os
import glob

fontSize = 20
imgSize = (12,20)
position = (0,0)

#All images will be stored in 'Synthetic_dataset' directory under current directory
dataset_path = os.path.join (os.getcwd(), 'Synthetic_dataset')
if not os.path.exists(dataset_path):
   os.makedirs(dataset_path)

lower_case_list = list(string.ascii_lowercase)
upper_case_list = list(string.ascii_uppercase)
digits_list = list(string.digits)

all_char_list = lower_case_list + upper_case_list + digits_list

#Read the wanted font names; skip blank lines, since an empty name would match every font
with open('Fonts_list.txt', 'r') as fhandle:
   fonts_list = [line.strip() for line in fhandle if line.strip()]

total_fonts = len(fonts_list)
#paths = ttfquery.findsystem.findFonts()
all_fonts = glob.glob("C:\\Windows\\Fonts\\*.ttf")
f_flag = np.zeros(total_fonts)

for sys_font in all_fonts:
   #Font file name without directory and extension, e.g. 'arialbd'
   font_file = os.path.splitext(ntpath.basename(sys_font))[0]
   s_lower = font_file.lower()
   for f_idx, font_name in enumerate(fonts_list):
      #Check desired font (case-insensitive match against the file name)
      if font_name.lower() in s_lower:
         font = ImageFont.truetype(sys_font, fontSize)
         f_flag[f_idx] = 1   #mark this entry of fonts_list as found
         for ch in all_char_list:
            ##without this flag, 'Calibri_a.jpg' and 'Calibri_A.jpg' are the same file
            ##on Windows, so the uppercase image overwrites the lowercase one
            l_u_d_flag = "u"
            if ch.islower():
               l_u_d_flag = "l"
            elif ch.isdigit():
               l_u_d_flag = "d"
            pos_idx = 0
            #Shift the character by one pixel around (0,0): nine positions in total
            for y in [-1, 0, 1]:
               for x in [-1, 0, 1]:
                  #Start from a blank image for every position; drawing all nine
                  #shifts on one image would pile the glyphs on top of each other
                  image = Image.new("RGB", imgSize, (255,255,255))
                  draw = ImageDraw.Draw(image)
                  draw.text((x,y), ch, (0,0,0), font=font)
                  file_name = font_file + '_' + l_u_d_flag + '_' + str(pos_idx) + '_' + ch + '.jpg'
                  file_name = os.path.join(dataset_path,file_name)
                  image.save(file_name)
                  pos_idx = pos_idx + 1

</pre>

