Invoice / OCR: Detect two important points in invoice image

I find my data in the invoices based on the x-values of the text.

However, I need to know the scale of the invoice and its offset from the left/right edges before I can do any real calculations with the data I have retrieved.

What have I tried so far?

1) Making the image monochrome and using the left and right bounds of the first appearance of a black pixel. This fails because people can write on invoices, which shifts those bounds.

2) Dividing the invoice into vertical sections and using the sections with the highest number of black pixels. This fails because the distribution is not always uniform among similar templates. (A sketch of both attempts is shown below.)
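For reference, here is roughly what those two attempts look like in code (my reconstruction, assuming OpenCV and NumPy; the filename, threshold value, and section count are placeholders):

```python
import cv2
import numpy as np

img = cv2.imread("invoice.png", cv2.IMREAD_GRAYSCALE)
# Invert so ink becomes 255 on a 0 background.
_, mono = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY_INV)

# Attempt 1: left/right bounds of the first black pixel in any column.
cols_with_ink = np.flatnonzero(mono.sum(axis=0) > 0)
left, right = cols_with_ink[0], cols_with_ink[-1]  # handwriting in the margin breaks this

# Attempt 2: vertical sections ranked by their amount of black pixels.
n_sections = 20
sections = np.array_split(mono, n_sections, axis=1)
densities = [s.sum() for s in sections]
densest_first = np.argsort(densities)[::-1]  # ordering varies across similar templates
```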

I could really use your help on (1) how to identify important points in invoices and (2) what I should focus on as the important points.

I hope the question is clear enough as it is quite hard to explain.

asked Oct 1, 2013 at 10:12

What fixed parts of the invoice can you rely on? Will the form itself, its black boxes in particular, be used in all scans? Are the gray backgrounds usable as well, or will they be lost on some scans? Will the scale be the same, even if the image should be rotated for scanning, or do you expect scale variations as well?

Commented Oct 1, 2013 at 12:34

What kind of technique do you use to locate the description and the number in the table? Commented Apr 1, 2019 at 16:13

2 Answers

Detecting rotation

I would suggest you start by detecting straight lines.

Look (perhaps randomly) for small areas of high contrast, i.e. mostly white but with a fair number of very black pixels as well. Then try to fit a line to those black pixels, e.g. using the least-squares method. Drop the outliers, and fit another line to the remaining points. Iterate as required. Evaluate how good the fit is, i.e. how many of the pixels in the observed area lie really close to the line, and how far the line extends beyond the observed area. Do this for a number of regions, and you should end up with a weighted list of lines.
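A minimal sketch of that fit-and-drop loop (my own code, not the answer's; it uses total least squares via SVD rather than ordinary least squares so that near-vertical lines don't break the fit, and it leaves the selection of high-contrast patches to you):

```python
import numpy as np

def fit_line_robust(points, n_iter=3, keep=0.8):
    """Fit a line to black-pixel coordinates by least squares,
    iteratively dropping the worst-fitting points."""
    pts = np.asarray(points, dtype=float)
    for _ in range(n_iter):
        # Total least squares via PCA: the principal direction is the line.
        mean = pts.mean(axis=0)
        centered = pts - mean
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        direction = vt[0]                   # unit vector along the line
        normal = vt[1]                      # unit vector orthogonal to it
        dist = np.abs(centered @ normal)    # point-to-line distances
        # Keep the best-fitting fraction of points and refit.
        order = np.argsort(dist)
        pts = pts[order[: max(2, int(len(pts) * keep))]]
    residual = np.abs((pts - pts.mean(axis=0)) @ normal).mean()
    angle = np.degrees(np.arctan2(direction[1], direction[0]))
    return angle, residual, len(pts)
```

A line's weight for the next step could combine the residual and the number of surviving pixels, for example.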

For each line, you can compute the direction of the line itself and the direction orthogonal to it. One of these can be chosen from the interval [0°, 90°); the other is 90° plus that value, so storing one is enough. Take all these directions and find the single angle which best matches all of them. You can do that with a sliding window of e.g. 5°: slide across that (cyclic) range, find the position where the maximal number of lines fall within the window, then compute the average or median of the angles within that window. All of this can be done taking the weights of the lines into account.
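One possible implementation of that sliding-window vote (again a sketch under my own naming; angles are folded into [0°, 90°) as described):

```python
import numpy as np

def dominant_angle(angles_deg, weights, window=5.0, step=0.5):
    """Find the direction in [0, 90) that best matches a weighted
    set of line angles, using a cyclic sliding window."""
    a = np.asarray(angles_deg) % 90.0
    w = np.asarray(weights, dtype=float)
    best_score, best_angle = -1.0, 0.0
    for center in np.arange(0.0, 90.0, step):
        # Cyclic distance in the 90-degree-periodic angle space.
        d = np.abs((a - center + 45.0) % 90.0 - 45.0)
        inside = d <= window / 2.0
        score = w[inside].sum()
        if inside.any() and score > best_score:
            best_score = score
            # Weighted mean of the angles inside the window,
            # expressed as offsets from the window center.
            offsets = ((a[inside] - center + 45.0) % 90.0) - 45.0
            best_angle = (center + np.average(offsets, weights=w[inside])) % 90.0
    return best_angle
```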

Once you have found the direction of lines, you can rotate your image so that the lines are perfectly aligned to the coordinate axes.
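With OpenCV, the deskew itself is short (assuming `angle` is the detected deviation in degrees; the sign of the correction depends on your angle convention, so flip it if the result tilts the other way):

```python
import cv2

h, w = img.shape[:2]
# Rotate about the image center; fill the revealed corners with white (paper).
M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
rotated = cv2.warpAffine(img, M, (w, h), borderValue=255)
```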

Detecting translation

Assuming the image wasn't scaled at any point, you can then use an FFT-based correlation to match the image to the template. Convert both images to gray and pad them with zeros until the originals take up at most half the edge length of the padded image, which should preferably be a power of two. FFT both images in both directions, multiply one transform element-wise by the complex conjugate of the other (a plain element-wise product would give a convolution, not a correlation), and inverse-FFT the result. The resulting image encodes how well the two images agree for each shift relative to one another. Simply find the maximum, and you know how to make them match.
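In NumPy the correlation looks something like the sketch below (my code and naming, not the answer's). Note the complex conjugate on one transform and the padding to twice the size, which prevents the circular correlation from wrapping around:

```python
import numpy as np

def fft_correlate(scan, template):
    """Return (dy, dx) such that scan[y + dy, x + dx]
    lines up with template[y, x]."""
    # Pad each axis to a power of two at least twice the larger extent.
    shape = [1 << int(np.ceil(np.log2(2 * max(a, b))))
             for a, b in zip(scan.shape, template.shape)]
    F = np.fft.fft2(scan.astype(float), shape)
    G = np.fft.fft2(template.astype(float), shape)
    # F * conj(G) gives correlation; F * G would give convolution.
    corr = np.real(np.fft.ifft2(F * np.conj(G)))
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Peaks past the midpoint correspond to negative shifts.
    if dy > shape[0] // 2: dy -= shape[0]
    if dx > shape[1] // 2: dx -= shape[1]
    return dy, dx
```

If the raw correlation peak is dominated by large bright areas, normalizing the product by its magnitude (phase correlation) usually gives a sharper, more reliable peak.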

Added text will cause no problems at all. This method will work best for large areas, like the company logo and gray background boxes. Thin lines will provide a poorer match, so in those cases you might have to blur the picture before doing the correlation, to broaden the features. You don't have to use the blurred image for further processing; once you know the offset you can return to the rotated but unblurred version.
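For the blurring, a Gaussian works fine; the kernel size below is a guess and should be tuned to the scan resolution:

```python
import cv2

# Broaden thin lines so they contribute to the correlation peak.
scan_blur = cv2.GaussianBlur(rotated, (9, 9), 0)
template_blur = cv2.GaussianBlur(template, (9, 9), 0)
dy, dx = fft_correlate(scan_blur, template_blur)  # from the sketch above
```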

Now you know both rotation and translation, and since we assumed no scaling or shearing, you know exactly which portion of the template corresponds to which portion of the scan. Proceed.
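Composing the two results, a template coordinate maps to the original scan by applying the detected shift and then undoing the deskew rotation. A sketch (sign conventions depend on how you rotated, so verify it on a known point):

```python
import numpy as np

def template_to_scan(x, y, angle_deg, dx, dy, cx, cy):
    """Map a template point to the original (unrotated) scan:
    apply the detected shift, then invert the deskew rotation
    about the center (cx, cy) used in warpAffine."""
    xs, ys = x + dx, y + dy        # position in the deskewed scan
    t = np.radians(-angle_deg)     # inverse of the deskew rotation
    xr = np.cos(t) * (xs - cx) - np.sin(t) * (ys - cy) + cx
    yr = np.sin(t) * (xs - cx) + np.cos(t) * (ys - cy) + cy
    return xr, yr
```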