Textract
💡 Definition
Amazon Textract is a machine learning service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.
🔑 Key Concepts
- OCR++: Returns not just raw text but also the document's structure (form fields, tables).
- Forms: Identifies key-value pairs (e.g., "Name: John Doe").
- Tables: Preserves the structure of data in tables (rows/columns).
- Handwriting: Can read handwritten text.
⚙️ How it Works
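You submit a document (an image such as JPEG or PNG, or a PDF), either as raw bytes or as a file stored in Amazon S3. Textract analyzes it and returns the raw text plus structural information, for example which text is the value of which form field, or which table cell it belongs to.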
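The flow above can be sketched in Python. This is a minimal illustration, not a reference implementation: `analyze_file` and `extract_key_values` are hypothetical helper names, and `SAMPLE_BLOCKS` is hand-written data that only mimics the shape of a Textract response (a list of blocks linked by IDs). The one real service call shown is boto3's `analyze_document`, which needs AWS credentials and is therefore not executed in the demo.

```python
def analyze_file(path):
    """Send a local file to Textract and return its blocks.

    Requires boto3 and AWS credentials; defined for illustration, not called below.
    """
    import boto3  # imported lazily so the parsing demo runs without boto3

    client = boto3.client("textract")
    with open(path, "rb") as f:
        response = client.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["FORMS"],  # ask for key-value pairs, not just raw text
        )
    return response["Blocks"]


def extract_key_values(blocks):
    """Pair each KEY block with its VALUE block and return {key_text: value_text}."""
    by_id = {b["Id"]: b for b in blocks}

    def child_text(block):
        # Join the WORD children of a block into one string.
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for child_id in rel["Ids"]:
                    child = by_id[child_id]
                    if child["BlockType"] == "WORD":
                        words.append(child["Text"])
        return " ".join(words)

    pairs = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(child_text(by_id[v]) for v in rel["Ids"])
        pairs[child_text(block)] = value_text
    return pairs


# Hand-written sample in the shape of a Textract response (assumed, not real output).
SAMPLE_BLOCKS = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "John"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
]

print(extract_key_values(SAMPLE_BLOCKS))  # {'Name:': 'John Doe'}
```

The point for the exam is the shape of the result rather than the code: plain OCR would stop at the three words, while Textract also reports that "John Doe" is the value of the "Name:" field.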
🎯 Use Cases
- Digitization: Converting physical records (medical, legal) into searchable data.
- Automated Processing: Extracting data from invoices or receipts for accounting.
- Form Processing: Automating loan applications or insurance claims.
💰 Pricing Model
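- Pages: Charged per page processed. The per-page rate depends on the features used (plain text detection costs less than form or table analysis).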
📝 Exam Tips (CLF-C02)
- Keywords: "Extract text and data", "Scanned documents", "Forms and Tables".
- Better than simple OCR because it understands structure.
See Also:
- Rekognition
- Comprehend