Unstructured is an open source service, available self-hosted or as a SaaS, that uses machine learning to efficiently extract your data into usable text and images. It currently handles plain text files (.txt/.text), PDFs (.pdf), Word documents (.doc/.docx), PowerPoint presentations (.ppt/.pptx), images (.jpg/.jpeg), emails (.eml/.msg), HTML (.html), and Markdown files (.md).

The Unstructured core module is a simple API module that can be extended by any service.

This project comes with a submodule that works together with the AI Automators: it takes a file of any of these types and fills a long formatted text field or a long plain text field with the structured content.

Features

  • Import txt, pdf, doc, ppt, jpg, eml, html, csv, or md files into a text field.
  • Output can be plain text, Markdown, or HTML.
  • With Markdown and HTML output, the images inside the document are also extracted.
  • Extract tables from Excel files, PDFs, Word files, and images into a TableField.
  • Extract images from PDFs and image files into image fields via AI Automators.
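
To illustrate the plain text and Markdown output options, here is a minimal Python sketch of how a list of Unstructured elements could be rendered. It is not this module's implementation. It assumes the service returns JSON elements with "type" and "text" keys, and the function names and the heading level are hypothetical choices made for the example.

```python
from __future__ import annotations


def elements_to_text(elements: list[dict]) -> str:
    """Join the text of every non-empty element, one per line."""
    texts = (el.get("text", "").strip() for el in elements)
    return "\n".join(t for t in texts if t)


def elements_to_markdown(elements: list[dict]) -> str:
    """Render elements as Markdown blocks separated by blank lines.

    Assumption: "Title" and "ListItem" are the element types worth
    special-casing; everything else is emitted as a plain paragraph.
    """
    blocks = []
    for el in elements:
        text = el.get("text", "").strip()
        if not text:
            continue  # skip elements that carry no text
        kind = el.get("type", "")
        if kind == "Title":
            blocks.append(f"## {text}")  # heading level is an arbitrary choice
        elif kind == "ListItem":
            blocks.append(f"- {text}")
        else:
            blocks.append(text)
    return "\n\n".join(blocks)
```

A real renderer would also need to handle tables and embedded images, which this sketch leaves out.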

Post-Installation

Visit admin/config/unstructured/settings to choose whether to connect to your own Unstructured server or to the SaaS. If you use the SaaS, an API key is required.
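
The difference between the two connection modes can be sketched in Python as follows. This is only an illustration of the rule above (the SaaS needs an API key, while a self-hosted server may not), not the module's settings schema. The class and field names are hypothetical, and the endpoint path and the `unstructured-api-key` header name are assumptions that should be checked against the Unstructured API documentation for your server version.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class UnstructuredSettings:
    """Hypothetical mirror of the connection settings."""
    base_url: str
    api_key: str | None = None
    is_saas: bool = False


def build_request(settings: UnstructuredSettings) -> tuple[str, dict[str, str]]:
    """Return the endpoint URL and headers for an extraction request."""
    if settings.is_saas and not settings.api_key:
        raise ValueError("The SaaS requires an API key.")
    # Assumption: the extraction endpoint lives at this path.
    url = settings.base_url.rstrip("/") + "/general/v0/general"
    headers = {"accept": "application/json"}
    if settings.api_key:
        # Assumption: the key is sent in this header.
        headers["unstructured-api-key"] = settings.api_key
    return url, headers
```

For a local server, `build_request(UnstructuredSettings("http://localhost:8000"))` returns a URL and headers without any key, while the same call with `is_saas=True` and no key raises a `ValueError`.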

DDEV/Self-hosted

Roberto Peruzzo added a DDEV plugin that can be used as a starting point to get Unstructured working locally. See https://github.com/robertoperuzzo/ddev-unstructured for setup instructions.

Additional Requirements

You need either an Unstructured server or an account on the SaaS (a free trial account also works).