OCR language support reference and tips for different scripts

OCR accuracy varies widely across languages and scripts, and the wrong settings can produce unusable results.

Understanding which language models to select and how to optimize settings for each script system maximizes recognition accuracy.

01 Supported languages overview

Tesseract.js inherits the multilingual capabilities of the Tesseract OCR engine and supports text recognition in over 100 languages. The most commonly used languages and their characteristics are listed below.

Each language has a corresponding trained model file that is downloaded automatically on first use (most files are between 1 and 15 MB). Once downloaded, models are cached by the browser and do not need to be fetched again.

  • English (eng) — Latin script, highest recognition rate
  • Chinese Simplified (chi_sim) — Simplified Chinese characters for mainland documents
  • Chinese Traditional (chi_tra) — Traditional Chinese characters for HK/TW documents
  • Japanese (jpn) — Hiragana, Katakana, and Kanji characters
  • Korean (kor) — Hangul syllabic script
  • French (fra) — Latin script with accented characters
  • Spanish (spa) — Latin script with special characters ñ, ¿, ¡
  • German (deu) — Latin script with umlauts ä, ö, ü, ß
  • Russian (rus) — Cyrillic script
  • Arabic (ara) — Right-to-left writing system
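
The list above can also be kept as a small lookup table, which is handy when populating a language picker. The sketch below is only an illustration: the language codes come from the list, while the `LanguageInfo` type and the helper names are hypothetical and not part of Tesseract.js.

```typescript
// Language codes are taken from the list above.
// LanguageInfo, findLanguage and codesForScript are illustrative names,
// not part of the Tesseract.js API.
interface LanguageInfo {
  code: string;
  name: string;
  script: string;
  rightToLeft: boolean;
}

const LANGUAGES: LanguageInfo[] = [
  { code: "eng", name: "English", script: "Latin", rightToLeft: false },
  { code: "chi_sim", name: "Chinese Simplified", script: "Han (Simplified)", rightToLeft: false },
  { code: "chi_tra", name: "Chinese Traditional", script: "Han (Traditional)", rightToLeft: false },
  { code: "jpn", name: "Japanese", script: "Kana and Kanji", rightToLeft: false },
  { code: "kor", name: "Korean", script: "Hangul", rightToLeft: false },
  { code: "fra", name: "French", script: "Latin", rightToLeft: false },
  { code: "spa", name: "Spanish", script: "Latin", rightToLeft: false },
  { code: "deu", name: "German", script: "Latin", rightToLeft: false },
  { code: "rus", name: "Russian", script: "Cyrillic", rightToLeft: false },
  { code: "ara", name: "Arabic", script: "Arabic", rightToLeft: true },
];

// Look up one entry by its language code; returns undefined for unknown codes.
function findLanguage(code: string): LanguageInfo | undefined {
  return LANGUAGES.find((lang) => lang.code === code);
}

// Collect the codes of all listed languages that share a script.
function codesForScript(script: string): string[] {
  return LANGUAGES.filter((lang) => lang.script === script).map((lang) => lang.code);
}
```

For example, `codesForScript("Latin")` returns the codes for English, French, Spanish and German in list order.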

02 CJK character recognition tips

CJK characters pose a far greater challenge for OCR engines than Latin scripts due to their massive character sets and complex strokes. Achieving good results requires attention to several key factors.

First, image resolution is critical. Because CJK strokes are dense, low-resolution images easily cause strokes to merge or blur. Ensure text height in the image is at least 30 pixels.

Second, correctly distinguishing between Simplified and Traditional Chinese is essential. Using the wrong model not only lowers accuracy but can produce many incorrect characters. For Japanese documents with heavy Kanji usage, consider loading both Japanese and Chinese models.

Finally, vertical text layouts (common in traditional Chinese and Japanese typesetting) may yield less accurate results than horizontal text. When possible, rotate images to horizontal orientation before processing.

For Chinese OCR, use images of at least 300 DPI. If you are capturing from a screen, zoom to 200% before taking the screenshot.
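
To see how the 30-pixel minimum and the 300 DPI guideline relate, the arithmetic can be sketched as follows. This is a rough estimate only: it treats the font size in points as the text height (1 point = 1/72 inch), which slightly overstates the height of the visible glyphs, and it assumes a 96 DPI screen for the screenshot case. The function names are hypothetical.

```typescript
// Minimum text height recommended in this section.
const MIN_TEXT_HEIGHT_PX = 30;

// Approximate text height in pixels for a font size in points at a given DPI.
// Assumption: the point size is used as the text height (1 pt = 1/72 inch).
function textHeightPx(fontSizePt: number, dpi: number): number {
  return (fontSizePt * dpi) / 72;
}

// Factor by which an image should be enlarged so that text reaches the
// minimum height. Returns 1 when the text is already tall enough.
function upscaleFactor(currentHeightPx: number, minHeightPx: number = MIN_TEXT_HEIGHT_PX): number {
  if (currentHeightPx <= 0) {
    throw new RangeError("currentHeightPx must be positive");
  }
  return Math.max(1, minHeightPx / currentHeightPx);
}

// 12 pt text scanned at 300 DPI is about 50 px tall, well above the minimum.
const scanned = textHeightPx(12, 300); // 50

// The same text on an assumed 96 DPI screen is about 16 px tall,
// so it needs roughly a 1.875x enlargement; a 200% zoom gives 32 px.
const onScreen = textHeightPx(12, 96); // 16
const neededZoom = upscaleFactor(onScreen); // 1.875
```

Under these assumptions, a 200% zoom is just enough for 12 pt text, while smaller fonts would need a larger zoom.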

03 Handling mixed-language documents

In practice, many documents use multiple languages. Technical documents often mix Chinese and English, academic papers may include Latin or Greek symbols, and business documents may span several languages.

Tesseract.js allows loading multiple language models simultaneously to handle mixed-language documents. Simply select all relevant languages in the language picker. For example, for a Chinese-English document, select both "English" and "Chinese Simplified".

Be aware that the more language models are loaded, the slower the processing and the higher the memory usage. Select only the languages actually present in the document and avoid loading unnecessary models. In most cases, two or three languages strike a good balance between coverage and speed.
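
Behind a language picker, Tesseract conventionally takes several languages as one string joined with `+`, for example `eng+chi_sim`. The sketch below builds such a string from a selection and flags selections larger than the guideline above. The helper name, the result shape and the limit of 3 are assumptions for illustration; how the string is passed to a worker depends on the Tesseract.js version and is left out here.

```typescript
// Upper end of the "2-3 languages" guideline from this section (an assumption,
// not a limit enforced by Tesseract.js).
const MAX_RECOMMENDED_LANGUAGES = 3;

interface LanguageSelection {
  langs: string;      // e.g. "eng+chi_sim"
  warning?: string;   // set when more languages are selected than recommended
}

// Hypothetical helper: trims codes, drops empties and duplicates,
// then joins the rest with "+" in the order they were selected.
function buildLanguageString(selected: string[]): LanguageSelection {
  const cleaned = selected.map((code) => code.trim()).filter((code) => code.length > 0);
  const unique = cleaned.filter((code, index) => cleaned.indexOf(code) === index);

  if (unique.length === 0) {
    throw new Error("Select at least one language");
  }

  const result: LanguageSelection = { langs: unique.join("+") };
  if (unique.length > MAX_RECOMMENDED_LANGUAGES) {
    result.warning =
      `${unique.length} languages selected; expect slower processing and higher memory usage`;
  }
  return result;
}
```

For a Chinese-English document, `buildLanguageString(["eng", "chi_sim"])` yields `{ langs: "eng+chi_sim" }` with no warning.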

04 Accuracy optimization techniques

Regardless of the language used, the following tips can noticeably improve OCR accuracy.

Image quality is the primary factor affecting recognition. Ensure images are sharp, text edges are crisp, and there's sufficient contrast between background and text. If the original image quality is poor, preprocess it with an image editing tool first.

Text orientation and alignment also matter. Ensure text in the image is horizontal with no visible tilt. Even a small skew angle (2-3 degrees) can significantly impact accuracy. Most image editors provide rotation and correction features.

  • Use images at 300 DPI or higher resolution
  • Ensure high contrast between text and background (dark text on light background is ideal)
  • Crop unnecessary margins and non-text areas
  • Correct skewed images to keep text horizontal
  • Avoid heavily compressed JPEGs (compression artifacts interfere with recognition)
  • For blurry images, try sharpening before recognition

FAQ

How many languages does Tesseract.js support?

Tesseract.js supports over 100 languages, including all major world languages and many regional ones. The most commonly used include English, Simplified/Traditional Chinese, Japanese, Korean, French, Spanish, German, Russian, and Arabic.

How can I improve low Chinese OCR accuracy?

Key steps to improve Chinese OCR accuracy: use high-resolution images (at least 300 DPI), ensure you've selected the correct Chinese model (Simplified or Traditional), crop to keep only the text region, and make sure text is neither blurry nor skewed. For mixed Chinese-English text, select both models.

Can OCR recognize multiple languages at once?

Yes. Tesseract.js supports loading multiple language models simultaneously. Select all needed languages in the language picker. However, it's best not to exceed 2-3 languages, or processing speed will decrease and accuracy may drop.

Can OCR handle right-to-left languages like Arabic and Hebrew?

Tesseract.js supports RTL (right-to-left) languages like Arabic and Hebrew. However, due to the cursive nature and directional specifics of these scripts, accuracy may not match Latin-script results. Ensure sufficient image clarity for the best outcome.

How large are language model files? Will they take up a lot of storage?

Most language model files range from 1-15 MB. The English model is around 4 MB, while Chinese models are around 10-15 MB. These files are cached by the browser and won't be re-downloaded. If you need to free up space, clearing your browser cache will remove downloaded models.
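
The figures above allow a rough estimate of the total download for a given selection. The sketch below uses only the approximate sizes quoted in this answer (about 4 MB for English, 10-15 MB for Chinese, 1-15 MB for anything else); actual file sizes may differ, and the names are hypothetical.

```typescript
interface SizeRangeMB {
  min: number;
  max: number;
}

// Approximate sizes quoted in this FAQ answer; not measured values.
const KNOWN_SIZES_MB: Record<string, SizeRangeMB> = {
  eng: { min: 4, max: 4 },
  chi_sim: { min: 10, max: 15 },
  chi_tra: { min: 10, max: 15 },
};

// Fallback for languages without a quoted size: the general 1-15 MB range.
const DEFAULT_SIZE_MB: SizeRangeMB = { min: 1, max: 15 };

// Sum the lower and upper size bounds for the selected language codes.
function estimateDownloadMB(codes: string[]): SizeRangeMB {
  let min = 0;
  let max = 0;
  for (const code of codes) {
    const size = KNOWN_SIZES_MB[code] ?? DEFAULT_SIZE_MB;
    min += size.min;
    max += size.max;
  }
  return { min, max };
}
```

With these figures, selecting English and Simplified Chinese comes to roughly 14-19 MB on first use.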

TOOLS.SURIED.COM