
The expectations of OCR (Optical Character Recognition) are high. Users now expect it to recognize and correctly interpret all characters out of the box, much as a human would, and accuracy should remain consistent regardless of changes in lighting or other environmental conditions. Rule-based approaches lack the interpretive flexibility that neural network approaches acquire through training on many different examples, which explains the enormous performance of the latter. However, state-of-the-art technology alone does not guarantee a project's success: it is just as important that the system is easy to use, performant, and user-friendly to maintain. Even with OCR systems based on deep learning, quality, execution speed, and usability are anything but a given.

Optical character recognition is still one of the most difficult disciplines in image processing and machine intelligence. The sheer variety of possible characters, and of methods for applying them to different surfaces, gives an idea of the challenges involved. Converting such complex visual data into clear, structured text is complicated by dirt, reflections, and shape distortions caused by scratching, embossing, or laser engraving on solid materials. In addition, overlapping or incomplete characters, as well as generally low pixel resolution of the image data, can quickly make characters almost impossible to distinguish from one another: an 8, for example, quickly becomes a 3. The image processing market is constantly evolving to improve the accuracy and reliability of text recognition. But what are the decisive factors when choosing an OCR system?

Comprehensive database with reproducible accuracy

To be convincing from the outset, an OCR system must be simple to use and offer high reading performance. This requires a well-developed network architecture that has been pre-trained on a large number of varied sample images. Images from real applications are just as indispensable here as synthetic data: on the one hand, many additional special cases and variations can be learned; on the other, this ensures far more robust recognition of the relevant features. After all, nothing should be left to chance, especially in industrial automation or quality assurance.
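As a simplified illustration of how synthetic training variants can be derived from real sample images, the following sketch generates label-preserving variants of a grayscale character image with NumPy. The function name and parameter ranges are illustrative only; production OCR pipelines use far richer distortion models covering perspective, embossing, or engraving effects.

```python
import numpy as np

def synth_variants(char_img, n, rng=None):
    """Generate n synthetic variants of a grayscale character image
    (values in [0, 1]) by combining simple, label-preserving distortions:
    brightness shifts, additive noise, and small translations.
    Illustrative sketch only, not a production augmentation pipeline."""
    if rng is None:
        rng = np.random.default_rng(0)
    out = []
    for _ in range(n):
        img = char_img.copy()
        # simulate uneven lighting with a global brightness offset
        img = np.clip(img + rng.uniform(-0.2, 0.2), 0.0, 1.0)
        # simulate sensor noise or dirt with Gaussian pixel noise
        img = np.clip(img + rng.normal(0.0, 0.05, size=img.shape), 0.0, 1.0)
        # simulate imprecise character localisation with a small shift
        dy, dx = rng.integers(-2, 3, size=2)
        img = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
        out.append(img)
    return out
```

Each distortion is chosen so that the character's identity (its label) is preserved, which is the property that makes such data usable for supervised pre-training.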

This is where an AI vision solution for individual image analysis must deliver. Users should have access not only to a high-performance OCR model built on leading AI technology, but to one that is constantly evolving. Strict versioning is essential: application developers must be able to access all development stages, but also update to a new, improved version at any time to ensure versatile and robust reading. A kind of quality center for testing and verifying the performance and reproducibility of the trained networks against sample data sets should also be available. This is essential for quality assurance before a production system is upgraded with new software.

Figure 1: OCR systems should read reliably from the outset in different applications, even without fine-tuning. For example, tire markings with little contrast, heavily deformed and small numbers on crown caps or information on separating discs with considerable overprinting, even with a highly inhomogeneous background.

Of transformers & large language models

Another hallmark of a good OCR model is its ability to recognize not only individual characters but also the relationships between them, as in character sequences such as serial numbers or words, and to take this knowledge into account during recognition. The better the OCR can predict subsequent characters and weight the reading result accordingly, the more robustly and precisely special applications can be solved. The generative and combinatorial properties of transformer networks and large language models (LLMs), such as those used in ChatGPT, can further improve such predictions and thus the reading quality. However, these architectures tend to be slow in execution and require considerable system resources. It is therefore all the more important to use such cutting-edge technologies to just the right extent for the requirements of the customer's use case, because image processing, especially in the automation sector, operates not in seconds but in the low millisecond range. A trained neural network should therefore remain fast and lightweight so that it can run on “normal” hardware. If high recognition accuracy and speed in production were only achievable with almost unlimited system performance, applications would hardly be economically viable.
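The idea of weighting a raw reading with knowledge about likely character sequences can be sketched with a simple character-bigram model. This is only a stand-in for the far more powerful transformer and LLM priors discussed above, and all sample strings, scores, and parameter names here are illustrative.

```python
import math
from collections import defaultdict

def train_bigrams(samples):
    """Character-bigram log-probabilities (with start symbol '^') learned
    from known-good strings, with add-one smoothing over the seen vocabulary."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in samples:
        prev = "^"
        for ch in s:
            counts[prev][ch] += 1
            prev = ch
    vocab = {c for s in samples for c in s}
    logp = {}
    for prev, nxt in counts.items():
        total = sum(nxt.values()) + len(vocab)
        logp[prev] = {c: math.log((nxt.get(c, 0) + 1) / total) for c in vocab}
    return logp, vocab

def rescore(candidates, logp, vocab, lm_weight=0.5):
    """candidates: list of (reading, ocr_logprob) pairs. Returns the reading
    whose combined OCR confidence + language-model score is highest."""
    def lm_score(s):
        prev, score = "^", 0.0
        for ch in s:
            table = logp.get(prev, {})
            # transitions from an unseen character get a uniform floor
            score += table.get(ch, math.log(1 / (len(vocab) + 1)))
            prev = ch
        return score
    return max(candidates, key=lambda c: c[1] + lm_weight * lm_score(c[0]))[0]
```

Given serial numbers that always start with “AB”, the sequence prior can overrule a slightly more confident but implausible raw reading such as “A8123”, which is exactly the “8 becomes 3” class of confusion a sequence model helps resolve.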

Simple correction and retraining

If the OCR fails to read characters, whether because of an error or an unknown character, font, or language, the user must be able to correct the reading results or train new characters with little effort. This fine-tuning, however, is not simply a matter of “continuing to train” the network. Imagine that the OCR model has been trained with 2 million images and the user now wants to teach it something new with a few additional images of their own. What weight should this new information be given in the model so that it makes a difference without changing everything that is already there? This is precisely where the provider’s expertise is required: the AI must be extended in such a way that the adaptation does not degrade previously stable recognition. An example: an OCR model has problems with numbers for some reason, and in subsequent training runs the user annotates only numbers, never letters. The goal is an intelligent “knowledge backup” that prevents the network from eventually reading only numbers because it concludes it no longer needs to read letters.

A well-designed training system therefore generates additional synthetic data for all new image data during fine-tuning, in order to further train and weight the network to just the right extent. This prevents the OCR from losing its existing capabilities, no matter how long it continues to be trained. Such complex “retraining” processes are best hidden from the user behind a simple user interface and processed quickly and efficiently in the background with sufficient system resources, so that the user is not kept waiting for long periods. In the best case, however, the basic capabilities of the OCR model are so good that little or no additional training is required.
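Conceptually, this “knowledge backup” resembles rehearsal-based fine-tuning: every retraining set mixes the user's new samples with synthetic samples covering all previously learned classes, so that a digits-only correction set cannot erase the model's knowledge of letters. A minimal sketch, in which the function names and the `ratio` parameter are assumptions for illustration and not a description of any vendor's implementation:

```python
import random

def build_finetune_set(new_samples, base_classes, synth_fn, ratio=1.0, seed=0):
    """Mix user-annotated new samples with freshly generated synthetic samples
    covering all previously learned classes (a simple "knowledge backup").

    new_samples:  list of (image, label) pairs supplied by the user
    base_classes: labels the base model already knows
    synth_fn:     callable label -> synthetic (image, label) sample
    ratio:        synthetic samples generated per new user sample
    """
    rng = random.Random(seed)
    n_synth = max(len(base_classes), int(len(new_samples) * ratio))
    # cover every known class at least once, then sample the rest randomly
    labels = list(base_classes) + [rng.choice(list(base_classes))
                                   for _ in range(n_synth - len(base_classes))]
    synth = [synth_fn(lbl) for lbl in labels]
    mixed = new_samples + synth
    rng.shuffle(mixed)
    return mixed
```

Because every previously learned class appears in each fine-tuning set, the gradient updates cannot drift toward a numbers-only solution, however often the user retrains with one-sided annotations.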

Figure 2: Fine-tuning is designed to quickly improve the reading quality of an OCR system with less user interaction or expert knowledge required.

Advantages of cloud training

A cloud-based AI system with a direct connection to a large data center can provide the resources needed for complex training routines far more easily and quickly than self-hosted hardware, and only when they are actually needed: training capacity can be scaled up quickly and scaled back down as required.

If all functions and services of an OCR training system are executed entirely in the cloud, any fine-tuning with a customer's own image data always runs on an up-to-date, controlled software basis rather than on an arbitrary software release on arbitrary local hardware. Continuous development in the technical backend also makes the basic OCR model increasingly robust against difficulties that have already been solved. As a result, more and more customer applications can be realized without major adjustments or additional training.

The cloud solution also adds value for the user in support scenarios. If difficulties arise with data in a use case, for example with unknown characters, technical support in the backend can quickly provide a remedy and improve recognition performance. Changes to the network architecture, or optimizations to the generation of synthetic additional data, can be made directly in the customer's use case without any loss of time, without exporting and importing data, and without the risk that differing build systems or software versions lead to different results. Not having to transfer sensitive data also minimizes the risk of unauthorized access by third parties.

In the best case, issuing a training command to the cloud system works like a software-as-a-service offering that trains a large number of suitable network models with different architectures in the background and ultimately provides the user with the best result.
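The “train many candidates, keep the best” idea behind such a service can be sketched as a plain model-selection loop. Here `train_fn` and `evaluate_fn` are placeholders for whatever training and validation routines a platform actually uses; nothing in this sketch reflects a specific vendor's API.

```python
def train_best_model(train_fn, evaluate_fn, architectures, train_data, val_data):
    """Train one candidate model per architecture and keep the one with the
    best validation score.

    train_fn(arch, data)      -> trained model
    evaluate_fn(model, data)  -> validation score (higher is better)
    """
    best_model, best_arch, best_score = None, None, float("-inf")
    for arch in architectures:
        model = train_fn(arch, train_data)      # train this candidate
        score = evaluate_fn(model, val_data)    # score it on held-out data
        if score > best_score:
            best_model, best_arch, best_score = model, arch, score
    return best_model, best_arch, best_score
```

In a real cloud backend the candidates would be trained in parallel on elastic compute, but the selection logic, comparing architectures on a held-out validation set, is the same.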

Cutting-edge OCR system makes the difference

  • Synthetic data - Each time new images are uploaded, image variants are automatically generated to expand and strengthen model capabilities in a systematic way.
  • Ease of use + time savings - Intuitive tools such as "Autoprediction" and "1-Click Annotation" require no prior knowledge and reduce testing, preparation and maintenance time.
  • Cutting-edge technology - Knowledge of the latest network architectures, such as Transformer or Large Language Models, is continuously incorporated into the development of OCR. 
  • Smart architecture - Fully automated training independently selects the most suitable architecture for the task.
  • Cloud training - Always up to date with cutting-edge technology and continuous improvement of the network base.
  • Fast and economical local execution - The goal is an optimally accurate, lean and fast model for local execution in a closed application environment.

OCR simply and economically from a single source

There are many providers of OCR solutions in the AI vision market, and there is a veritable race for the best networks. Experienced users can also draw on many open-source tools and public network architectures to quickly gain initial experience and achieve results. Without in-depth technical knowledge of how AI technology, cutting-edge networks, and large vision models can be used and combined economically and efficiently, however, many OCR tasks remain unsolved.

It is therefore important to have an experienced partner who, ideally, can offer all the image processing components needed for fast, reliable, and economical OCR from a single source and adapt them to special requirements. Some customers already benefit from all these advantages. And for those who are still searching: it costs nothing to try it out.