Whenever we eat a packaged food item, we sometimes wonder how many calories (if we are conscious about them) or how much nutritional value it adds to our daily diet. We may also want to track that intake over days and weeks. Several applications for this are available on the Android and iOS marketplaces. Most of them scan the barcode and rely on a central database to get this information. I was curious whether a computer vision tool could be built that ‘reads’ the ‘Nutrition Facts’ table from a photo of it. This post is about that attempt.

For the initial attempt I used Python, partly to sharpen my Python skills 🙂 and partly because OpenCV has straightforward Python APIs. I also find installing other modules easy in Python. The biggest disadvantage of using Python here is that it is slow, and if I target a mobile application I will need a faster response time. So I have a to-do item to port this to C++; let's see how long I can procrastinate on it.

In summary, here is how the whole flow works:

  • The image is converted to grayscale and then preprocessed.
  • In the preprocessing step, the contrast of the image is increased using CLAHE so that MSER regions are detected more accurately. The image is then scaled to SVGA size (800×600).
  • MSER regions are detected.
  • MSER regions are filtered according to the aspect ratio of their bounding box (BBOX) and their stroke-width ratio.
  • A K-means algorithm is used to align MSER regions in the horizontal direction and cover the regions on one line with a single BBOX, so that the whole line is read at once.
  • Google’s Tesseract, through the pytesser wrapper, is used to read the text from each of these BBOXes.
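The line-grouping step can be sketched in isolation. The following is a minimal pure-Python sketch, not the code used in this post: it runs a plain 1-D K-means on the vertical centers of toy bounding boxes and merges the boxes of each cluster into one line box. The function names, the `(xmin, ymin, xmax, ymax)` box layout and the toy boxes are assumptions made for the illustration, and the extra step of merging nearby clusters is left out.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Plain 1-D K-means (Lloyd's algorithm) on a list of numbers."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach every value to its nearest center.
        new_labels = [min(range(len(centers)), key=lambda k: abs(v - centers[k]))
                      for v in values]
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [v for v, lab in zip(values, new_labels) if lab == k]
            if members:
                centers[k] = sum(members) / float(len(members))
        # Stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


def group_boxes_into_lines(boxes, image_height, char_height=20.0):
    """Merge (xmin, ymin, xmax, ymax) boxes that share a text line."""
    y_centers = [(b[1] + b[3]) / 2.0 for b in boxes]
    # One initial center every char_height pixels, spread over the image.
    num_clusters = max(1, int(image_height / char_height))
    init = [image_height / float(num_clusters) * k for k in range(num_clusters)]
    labels, _ = kmeans_1d(y_centers, init)
    lines = {}
    for box, lab in zip(boxes, labels):
        if lab in lines:
            cur = lines[lab]
            lines[lab] = (min(cur[0], box[0]), min(cur[1], box[1]),
                          max(cur[2], box[2]), max(cur[3], box[3]))
        else:
            lines[lab] = tuple(box)
    # Return the line boxes from top to bottom.
    return sorted(lines.values(), key=lambda b: b[1])


# Toy example: two characters on each of two text lines.
boxes = [(10, 10, 20, 28), (25, 11, 40, 29), (10, 50, 30, 68), (35, 52, 60, 70)]
print(group_boxes_into_lines(boxes, image_height=80))
# [(10, 10, 40, 29), (10, 50, 60, 70)]
```

Clustering only on the vertical center works because characters on the same printed line share roughly the same height on the page, as long as the photo is not strongly rotated.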

For the stroke-width ratio calculation, the following paper by Huizhong Chen et al. is used:


The following image is the example input image:



Image size = 1195 X 600 MSER_AREAS = 6620
K-mean START …
K-mean DONE…
[<Recursion on list with id=452914120>,
‘utrition Facts’,
‘Sewing Size 1 cup (2289)’,
‘Servings Per Container 2’,
‘Amount Per Sonny’,
‘calories 250 Calories from Fat 110’,
‘% Daily Value\xe2\x80\x98’,
‘Total Fat 12g 18%’,
‘Saturated Fat 3g 15%’,
‘Trans Fat 3g’,
‘cholesterol 30mg 10%’,
‘Sodium 470mg 20%’,
‘Potassium 700mg 20%’,
‘Total carbohydrate 31g 10%’,
‘Dietary Fiber 0g 0%’,
‘sugars 59’,
‘Protein 5g’,
‘Wamin A’,
‘Wtamin C’,

The following image shows the detected text regions:


Future directions:

  1. Text detection might not be 100% accurate, so the top-level application might need regular expressions or some sort of ‘match score’ to link the output to the correct words, for example “Sewing Size” to “Serving Size”.
  2. Port it to C++.
  3. For the example image shown here, some information (such as the vitamin percentages) is missed.
  4. Accuracy depends on the quality of the image, so this needs to be improved.
  5. Performance is slow. Most of the time is spent in the stroke-width calculation. Porting to C++ will help, but an alternative approach to the stroke-width calculation is still needed.
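The ‘match score’ idea from item 1 could look roughly like the sketch below, which uses `difflib` from the Python standard library. The label vocabulary, the `correct_label` helper and the 0.6 cutoff are assumptions made for this illustration and are not part of the code in this post.

```python
import difflib

# Hypothetical vocabulary: labels expected on a Nutrition Facts table.
KNOWN_LABELS = [
    "Nutrition Facts", "Serving Size", "Servings Per Container",
    "Amount Per Serving", "Calories", "Total Fat", "Saturated Fat",
    "Trans Fat", "Cholesterol", "Sodium", "Potassium",
    "Total Carbohydrate", "Dietary Fiber", "Sugars", "Protein",
    "Vitamin A", "Vitamin C",
]


def correct_label(ocr_text, known_labels=KNOWN_LABELS, cutoff=0.6):
    """Return the known label closest to ocr_text, or None if nothing is close."""
    matches = difflib.get_close_matches(ocr_text, known_labels, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for raw in ["Sewing Size", "Amount Per Sonny", "Wtamin C"]:
    print("%s -> %s" % (raw, correct_label(raw)))
```

The sketch only matches the label part of a line. In practice the numeric part (for example “1 cup” or “12g 18%”) would first have to be split off, for instance with a regular expression, before the remaining text is matched.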

Code snippet:

<pre>import os
import cv2
import scipy.misc as smp
import numpy as np
import json
from pytesser import *
import pprint

#Hardcoded pink color to highlight detected text region
color = (170, 28, 155)
char_height = 20.0
#color = (0, 0, 0)

def bbox (points):
    res = np.zeros((2,2))
    res[0,:] = np.min(points, axis=0)
    res[1,:] = np.max(points, axis=0)
    return res

def bbox_width(bbox):
    return (bbox[1,0] - bbox[0,0] + 1)

def bbox_height(bbox):
    return (bbox[1,1] - bbox[0,1] + 1)

def aspect_ratio(region):
    bb = bbox(region)
    return (bbox_width(bb)/bbox_height(bb))

def filter_on_ar(regions):
    "Filter text regions based on aspect ratio < 3.0"
    return [x for x in regions if aspect_ratio(x) < 3.0]

def dbg_draw_txt_contours(img, mser):
    "Draws contours on original image to show detected text region"
    overlapped_img = cv2.drawContours(img, mser, -1, color)
    new_img = smp.toimage(overlapped_img)
    new_img.show()

def dbg_draw_txt_rect(img, bbox_list):
   img = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR, dstCn=3)
   scratch_image_name = 'nutro.tmp.bmp'
   for b in bbox_list:
      pt1 = tuple(map(int, b[0]))
      pt2 = tuple(map(int, b[1]))
      img = cv2.rectangle(img, pt1, pt2, color, 1)
   new_img = smp.toimage(img)
   new_img.show()

def preprocess_img(img):
    "Enhance contrast and resize the image"
    # create a CLAHE object (Arguments are optional).
    # It is adaptive localized hist-eq and also avoid noise
    # amplification with cliplimit
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
    img = clahe.apply(img)
    #Resize to match SVGA size
    height, width = img.shape
    #SVGA size is 800 X 600
    if width > height:
        scale = 800. / width
    else:
        scale = 600. / width
    #Avoid shrinking
    #if scale < 1.0:
    #    scale = 1.0
    dst = cv2.resize(img, (0,0), None, scale, scale, cv2.INTER_LINEAR)
    return dst

def swt_window_func(l):
    center = l[4]
    filtered_l = np.append(l[:4], l[5:])
    res = [n for n in filtered_l if n < center]
    return res

def swt(gimg):
    #TODO: fix threshold logically
    threshold = 90
    maxval = 255
    #THRESH_BINARY_INV because we want to find distance from foreground pixel to background pixel
    temp, bimg = cv2.threshold(gimg, threshold, maxval, cv2.THRESH_BINARY_INV)
    rows, cols = bimg.shape
    #Pad 0 pixel on bottom-row to avoid Infinite distance
    row_2_pad = np.zeros([1, cols], dtype=np.uint8)
    bimg_padded = np.concatenate((bimg, row_2_pad), axis=0)
    dist = cv2.distanceTransform(bimg_padded, cv2.DIST_L2, cv2.DIST_MASK_PRECISE)
    dist = np.take(dist, range(rows), axis=0)
    dist = dist.round()
    #print dist
    it = np.nditer([bimg, dist],
                   flags = ['multi_index', 'multi_index'])

    #Look-up operation
    #while not it.finished:
    lookup = []
    max_col = 0
    max_row = 0
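    for cur_b, cur_d in it:
        if it.multi_index[0] > max_row:
            max_row = it.multi_index[0]
        if it.multi_index[1] > max_col:
            max_col = it.multi_index[1]
        #Neighbours of the current pixel that have a smaller distance value
        cur_lup = []
        if cur_b:
            pval = cur_d
            row = it.multi_index[0]
            #Clamp neighbour rows/columns at the image border
            if row!=0:
                row_l = row-1
            else:
                row_l = row
            if row!=rows-1:
                row_u = row+1
            else:
                row_u = row
            row_list = [row_l, row, row_u]
            col = it.multi_index[1]
            if col!=0:
                col_l = col-1
            else:
                col_l = col
            if col!=cols-1:
                col_u = col+1
            else:
                col_u = col
            col_list = [col_l, col, col_u]
            #TODO: avoid for loop for look-up operation
            for i in row_list:
                for j in col_list:
                    if i!=row and j!=col:
                        cur = dist[i,j]
                        if cur < pval:
                            cur_lup.append((i,j))
        #One entry per pixel, so that lookup can be reshaped to (rows, cols)
        lookup.append(cur_lup)
    #Store the neighbour lists in an object array (they differ in length)
    lookup_arr = np.empty(len(lookup), dtype=object)
    for idx, neigh in enumerate(lookup):
        lookup_arr[idx] = neigh
    lookup = lookup_arr.reshape(rows, cols)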
    d_max = int(dist.max())
    for stroke in np.arange(d_max, 0, -1):
        stroke_index = np.where(dist==stroke)
        stroke_index = [(a,b) for a,b in zip(stroke_index[0], stroke_index[1])]
        for stidx in stroke_index:
            neigh_index = lookup[stidx]
            for nidx in neigh_index:
                dist[nidx] = stroke

    sw = []
    #Rewind the iterator, which was exhausted by the first pass
    it.reset()
    for cur_b, cur_d in it:
        if cur_b:
            sw.append(float(cur_d))
    return sw

def get_swt_frm_mser(region, rows, cols, img):
    "Given image and total rows and columns, extracts SWT values from MSER region"
    bb = bbox(region)
    xmin = int(bb[0][0])
    ymin = int(bb[0][1])
    width = int(bbox_width(bb))
    height = int(bbox_height(bb))
    selected_pix = []
    xmax = xmin + width
    ymax = ymin + height
    #Collect the pixels inside the bounding box, one row at a time
    for h in range(ymin, ymax):
        row = np.take(img, (h, ), axis=0)
        horz_pix = np.take(row, range(xmin, xmax))
        selected_pix.append(horz_pix)
    selected_pix = np.array(selected_pix)
    sw = swt(selected_pix)
    return sw

def filter_on_sw(region_dict):
    filtered_dict = {}
    distance_th = 4.0
    group_num = 0
    for rkey in region_dict.keys():
        med = region_dict[rkey]['sw_med']
        height = bbox_height(region_dict[rkey]['bbox'])
        added = False
        for fkey in filtered_dict:
            for k in filtered_dict[fkey]:
                elem_med = filtered_dict[fkey][k]['sw_med']
                elem_height = bbox_height(filtered_dict[fkey][k]['bbox'])
                m_ratio = med/elem_med
                h_ratio = height/elem_height
                if m_ratio > 0.66 and m_ratio < 1.5 and h_ratio > 0.5 and h_ratio < 2.0:
                    filtered_dict[fkey][rkey] = region_dict[rkey]
                    added = True
                    break
            if added:
                break
        if not added:
            name = 'group' + str(group_num)
            filtered_dict[name] = {}
            filtered_dict[name][rkey] = region_dict[rkey]
            group_num = group_num + 1
    return filtered_dict

def get_y_center(bb):
        ll = bb[0]
        ur = bb[1]
        return ((ll[1] + ur[1])/2.0)

def kmean(region_dict, rows, num_clusters):
    print("K-mean START ...")
    clusters = (float(rows)/num_clusters) * np.arange(num_clusters)
    cluster_vld = [True] * num_clusters
    #calculate initial cost assuming all regions assigned to cluster-0
    cost = 0.0
    for rkey in region_dict:
        center_y = get_y_center(region_dict[rkey]['bbox'])
        cost += center_y * center_y
    cost = cost/len(region_dict.keys())

    iter_no = 0
    while True:
        iter_no = iter_no + 1
        #Assign cluster-id to each region
        for rkey in region_dict:
            center_y = get_y_center(region_dict[rkey]['bbox'])
            dist_y = np.abs(clusters - center_y)
            cluster_id = dist_y.argmin()
            region_dict[rkey]['clid'] = cluster_id

        #find new cost with assigned clusters
        new_cost = 0.0
        for i, c in enumerate(clusters):
            if cluster_vld[i]:
                num_regions = 0
                cluster_cost = 0.0
                for rkey in region_dict:
                    if(region_dict[rkey]['clid'] == i):
                        center_y = get_y_center(region_dict[rkey]['bbox'])
                        cluster_cost += (center_y - clusters[i]) ** 2
                        num_regions += 1
                if num_regions:
                    cluster_cost /= num_regions
                new_cost += cluster_cost

        #Stop when new cost is within 5% of old cost
        if new_cost >= 0.95 * cost:
            break
        else:
            cost = new_cost

        for i, c in enumerate(clusters):
            if cluster_vld[i]:
                num_regions = 0
                clusters[i] = 0.0
                for rkey in region_dict:
                    if(region_dict[rkey]['clid'] == i):
                        center_y = get_y_center(region_dict[rkey]['bbox'])
                        clusters[i] += center_y
                        num_regions += 1
                if num_regions:
                    clusters[i] = clusters[i] / num_regions
                else:
                    #No region is assigned to this cluster any more
                    cluster_vld[i] = False

    #Merge nearby clusters
    for i, cur_cl in enumerate(clusters):
        if cluster_vld[i]:
            for j, iter_cl in enumerate(clusters):
                if abs(cur_cl - iter_cl) <= (char_height/2.0) and i != j:
                    cluster_vld[j] = False
                    for rkey in region_dict:
                        #Update cluster-id to updated one
                        if region_dict[rkey]['clid'] == j:
                            region_dict[rkey]['clid'] = i

    print("K-mean DONE...")
    return cluster_vld

def dbg_get_cluster_rect (cluster_vld, region_dict):
    bbox_list = []
    for cl_no, vld in enumerate(cluster_vld):
        if vld:
            cur_lL = [100000, 100000]
            cur_uR = [-100000, -100000]
            for rkey in region_dict:
                if region_dict[rkey]['clid'] == cl_no:
                    region_lL = region_dict[rkey]['bbox'][0]
                    region_uR = region_dict[rkey]['bbox'][1]
                    #update min/max of x/y
                    if region_lL[0] < cur_lL[0]:
                        cur_lL[0] = region_lL[0]
                    if region_lL[1] <= cur_lL[1]:
                        cur_lL[1] = region_lL[1]
                    if region_uR[0] >= cur_uR[0]:
                        cur_uR[0] = region_uR[0]
                    if region_uR[1] >= cur_uR[1]:
                        cur_uR[1] = region_uR[1]
            bbox_list.append([cur_lL, cur_uR])
    return bbox_list

def get_bbox_img(gimg, bb):
    #print bb, gimg.shape
    y_start = int(bb[0][1])
    y_end = int(bb[1][1])
    x_start = int(bb[0][0])
    x_end = int(bb[1][0])
    #print x_start, x_end, y_start, y_end
    row_extracted = gimg.take(range(y_start, y_end+1), axis=0)
    #print gimg
    extracted = row_extracted.take(range(x_start, x_end+1), axis=1)
    return  extracted

def get_text_from_cluster(cluster_vld, region_dict, gimg):
    bbox_list = dbg_get_cluster_rect(cluster_vld, region_dict)
    #scratch_image_name = 'nutro.tmp.bmp'
    str_list = []
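    for bb in bbox_list:
      extracted = get_bbox_img(gimg, bb)
      #print extracted
      ext_img = smp.toimage(extracted)
      found = image_to_string(ext_img, cleanup=False)
      str_list.append(found)
    #NOTE: this inserts the list into itself, which is why the printed output
    #starts with a "<Recursion on list ...>" entry
    str_list.insert(0, str_list)
    print("TEXT FOUND: ")
    pprint.pprint(str_list)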

def run(fimage):
    ar_thresh_max = 3.0
    ar_thresh_min = 0.5
    sw_ratio_thresh = 0.5

    org_img = cv2.imread(fimage)
    gray_img = cv2.cvtColor(org_img, cv2.COLOR_BGR2GRAY)
    gray_img = preprocess_img(gray_img)
    mser = cv2.MSER_create()
    mser_areas = mser.detectRegions(gray_img, None)
    region_dict = {}
    rows, cols = gray_img.shape
    print("Image size = %d X %d  MSER_AREAS = %d path = %s" % (rows, cols, len(mser_areas), fimage))
    region_num = 0
    for m in mser_areas:
        name = 'mser_' + str(region_num)
        bb = bbox(m)
        ar = bbox_width(bb)/bbox_height(bb)
        #Filter based on AspectRatio
        if ar < ar_thresh_max: # and ar > ar_thresh_min: #commented min check because '1' is getting filtered
            #print "SW for region: ", region_num
            sw = get_swt_frm_mser(m, rows, cols, gray_img)
            sw_std = np.std(sw)
            sw_mean = np.mean(sw)
            sw_ratio = sw_std/sw_mean
            #2nd filter based on Stroke-Width
            if sw_ratio < sw_ratio_thresh:
                sw_med = np.median(sw)
                region_dict[name] = {'bbox': bb, 'sw_med': sw_med}
                region_num = region_num + 1
                region_num = region_num + 1

    num_clusters = int(rows/char_height)
    cluster_vld = kmean(region_dict, rows, num_clusters)
    bbox_list = dbg_get_cluster_rect(cluster_vld, region_dict)
    get_text_from_cluster(cluster_vld, region_dict, gray_img)

    cpy_img = np.copy(gray_img)
    dbg_draw_txt_rect(cpy_img, bbox_list)

if __name__ == '__main__':
    db_path = r'<<<PATH NOT SHOWN INTENTIONALLY>>>' ### Edit this line
    img_name = r'good_ex.png'
    #img_name = r'cropped.png'
    #img_name = r'cropped2.png'
    #img_name = r'real_img.png'
    #img_name = r'Real1.JPG'
    fimage = os.path.join(db_path,img_name)
    run(fimage)
</pre>




Synthetic Font Dataset Generation

For the OCR (optical character recognition) task, single-character text images are used to train the machine-learning model. These single-character images range from handwritten text datasets to synthetic text datasets generated by a script. A synthetic text dataset is a faster way to generate training examples in large quantities. Also, for some applications (e.g. scanning printer-generated documents), a synthetic text dataset may be sufficient.

I have written the following Python script to generate this dataset. The script generates 12×20 images of the characters a-z, A-Z and 0-9 for selected fonts. For the application I am interested in, I only need regular English fonts, so I listed those fonts in a text file (Fonts_list.txt). The script collects all fonts on my system (on Windows they are usually in C:\Windows\Fonts\*.ttf) and keeps only the fonts listed in Fonts_list.txt. If a dataset for all available system fonts is needed, the script has to be modified accordingly. Also, instead of getting all fonts from C:\Windows\Fonts\*.ttf, we can use the ttfquery module (ttfquery.findsystem.findFonts()).
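The font-selection step can be sketched on its own, without touching the file system. This is a minimal sketch under stated assumptions: `select_fonts` is a hypothetical helper, and it matches the font file name exactly (ignoring case and extension), which is stricter than a substring test against the full path.

```python
import ntpath


def select_fonts(system_font_paths, wanted_names):
    """Map each wanted font name to the system font file with the same name.

    The comparison ignores case and the file extension. Names with no
    matching file are simply left out of the result.
    """
    by_stem = {}
    for path in system_font_paths:
        # ntpath handles Windows-style paths on any platform.
        stem = ntpath.basename(path).rsplit('.', 1)[0].lower()
        by_stem.setdefault(stem, path)
    selected = {}
    for name in wanted_names:
        path = by_stem.get(name.strip().lower())
        if path is not None:
            selected[name] = path
    return selected


# Toy input standing in for glob.glob("C:\\Windows\\Fonts\\*.ttf")
system_fonts = [
    "C:\\Windows\\Fonts\\arial.ttf",
    "C:\\Windows\\Fonts\\ARIALBD.TTF",
    "C:\\Windows\\Fonts\\wingding.ttf",
]
print(select_fonts(system_fonts, ["Arial", "calibri"]))
```

With exact matching, asking for Arial does not also pull in arialbd; a substring test would select both, which may or may not be wanted.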

For each selected font, the script generates each of the above characters (a-z, A-Z, 0-9) at nine different positions, shifting the text by one pixel in the left-right and top-bottom directions. Generated images are named according to the following convention:

[font file name]_[l_u_d_flag]_[position index]_[character].jpg

Example: arialbd_u_3_H.jpg

I had to use l_u_d_flag (l for lowercase, u for uppercase, d for digit) to differentiate between the images for ‘h’ and ‘H’; otherwise the later image would overwrite the earlier one.

This naming convention is useful for deciding the target class when these images are used to train the machine-learning model.
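As a rough illustration of how the target class can be recovered from a file name, here is a small sketch. `parse_sample_name` and the keys of the returned dictionary are hypothetical names chosen for this example; the only thing taken from the script is the order of the fields in the file name.

```python
import os

# Meaning of the one-letter flag in the file name.
FLAG_NAMES = {"l": "lowercase", "u": "uppercase", "d": "digit"}


def parse_sample_name(file_name):
    """Split 'font_flag_position_char.jpg' into its four fields."""
    stem = os.path.splitext(os.path.basename(file_name))[0]
    # Split from the right so that font names containing '_' stay intact.
    font, flag, pos_idx, ch = stem.rsplit("_", 3)
    return {
        "font": font,
        "kind": FLAG_NAMES[flag],
        "position": int(pos_idx),
        "label": ch,  # the target class for training
    }


print(parse_sample_name("arialbd_u_3_H.jpg"))
```

The `label` field is what a classifier would be trained to predict, while `font` and `position` are only useful for bookkeeping, for example to keep all nine shifted copies of a character in the same train or test split.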



<pre>from PIL import Image, ImageDraw, ImageFont
import ttfquery.findsystem 
import string
import ntpath
import numpy as np
import os
import glob

fontSize = 20
imgSize = (12,20)
position = (0,0)

#All images will be stored in 'Synthetic_dataset' directory under current directory
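dataset_path = os.path.join (os.getcwd(), 'Synthetic_dataset')
if not os.path.exists(dataset_path):
   os.makedirs(dataset_path)

fhandle = open('Fonts_list.txt', 'r')
lower_case_list = list(string.ascii_lowercase)
upper_case_list = list(string.ascii_uppercase)
digits = range(0,10)

#Digits as one-character strings, like the letters above
digits_list = []
for d in digits:
   digits_list.append(str(d))

all_char_list = lower_case_list + upper_case_list + digits_list

#Read the desired font names, one per line, skipping blank lines
fonts_list = []
for line in fhandle:
   name = line.strip()
   if name:
      fonts_list.append(name)
fhandle.close()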
total_fonts = len(fonts_list)
#paths = ttfquery.findsystem.findFonts()
all_fonts = glob.glob("C:\\Windows\\Fonts\\*.ttf")
f_flag = np.zeros(total_fonts)

for sys_font in all_fonts:
   #print "Checking "+p
   font_file = ntpath.basename(sys_font)
   font_file = font_file.rsplit('.')
   font_file = font_file[0]
   f_idx = 0
   for font in fonts_list:
      f_lower = font.lower()
      s_lower = sys_font.lower()
      #Check desired font
      if f_lower in s_lower:
         path = sys_font
         font = ImageFont.truetype(path, fontSize)
         f_flag[f_idx] = 1
         for ch in all_char_list:
            pos_x = 0
            pos_y = 0
            #Index of the shifted position (0-8) within the 3x3 grid
            pos_idx = 0
            for y in [pos_y-1, pos_y, pos_y+1]:
               for x in [pos_x-1, pos_x, pos_x+1]:
                  #Start from a blank white image for every position
                  image = Image.new("RGB", imgSize, (255,255,255))
                  draw = ImageDraw.Draw(image)
                  position = (x,y)
                  draw.text(position, ch, (0,0,0), font=font)
                  ##without this flag, it creates 'Calibri_a.jpg' even for 'Calibri_A.jpg'
                  ##which overwrites lowercase images
                  l_u_d_flag = "u"
                  if ch.islower():
                     l_u_d_flag = "l"
                  elif ch.isdigit():
                     l_u_d_flag = "d"
                  file_name = font_file + '_' + l_u_d_flag + '_' + str(pos_idx) + '_' + ch + '.jpg'
                  file_name = os.path.join(dataset_path,file_name)
                  image.save(file_name)
                  pos_idx = pos_idx + 1
      f_idx = f_idx + 1
</pre>



