Playing with Microsoft Florence 2

I created a playground space for Microsoft Florence 2 on Hugging Face. Florence is a new model from Microsoft that can be used for image captioning, segmentation and OCR tasks, and it can run on both CPU and GPU. Even though it's a smaller model, it performs close to the state of the art (SOTA).

This one picture from the paper illustrates the model perfectly:

Florence paper screenshot

How to use

Just follow the link, upload your image on the left, select a task and click on “Analyze Image”.

Some of the tasks require an additional prompt, and the UI lets you enter one. One use case where it works well is OCR, where it returns the recognized text along with bounding boxes, as seen below.
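To make the "additional prompt" idea concrete, here is a minimal sketch of how a task and its optional text could be combined into one prompt string. The task tokens are a subset of those listed on the Florence-2 model card, and the split into tasks with and without extra input is my reading of that card, so treat both as assumptions. The helper name build_prompt is hypothetical and is not taken from the playground's source.

```python
from typing import Optional

# Task tokens as listed on the Florence-2 model card (subset, assumed).
TASKS_WITHOUT_INPUT = {"<CAPTION>", "<DETAILED_CAPTION>", "<OD>", "<OCR>", "<OCR_WITH_REGION>"}
TASKS_WITH_INPUT = {
    "<CAPTION_TO_PHRASE_GROUNDING>",
    "<REFERRING_EXPRESSION_SEGMENTATION>",
    "<OPEN_VOCABULARY_DETECTION>",
}


def build_prompt(task: str, text_input: Optional[str] = None) -> str:
    """Hypothetical helper: join a task token with its optional extra text."""
    if task in TASKS_WITH_INPUT:
        if not text_input:
            raise ValueError(f"{task} needs an additional text prompt")
        # The extra text is appended directly after the task token.
        return task + text_input
    if task in TASKS_WITHOUT_INPUT:
        # These tasks are driven by the task token alone.
        return task
    raise ValueError(f"unknown task: {task}")


print(build_prompt("<OCR_WITH_REGION>"))
print(build_prompt("<CAPTION_TO_PHRASE_GROUNDING>", "a green car"))
```

The resulting string is what would be passed to the processor as the text input, next to the image.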

text with bounding boxes ocr
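If you want to draw boxes like these yourself, the sketch below turns an OCR result into plain rectangles. It assumes the post-processed output has the shape shown on the model card for the <OCR_WITH_REGION> task: a dict with "quad_boxes" (eight numbers per box, four x/y corner pairs) and "labels". The function name quads_to_rects and the sample data are made up for illustration.

```python
def quads_to_rects(ocr_result: dict) -> list:
    """Hypothetical helper: convert quadrilaterals to (label, (left, top, right, bottom))."""
    region = ocr_result["<OCR_WITH_REGION>"]
    rects = []
    for label, quad in zip(region["labels"], region["quad_boxes"]):
        # A quad is [x1, y1, x2, y2, x3, y3, x4, y4]: even indices are x, odd are y.
        xs, ys = quad[0::2], quad[1::2]
        rects.append((label, (min(xs), min(ys), max(xs), max(ys))))
    return rects


# Invented sample in the assumed output format, not real model output.
sample = {
    "<OCR_WITH_REGION>": {
        "quad_boxes": [[10, 20, 110, 22, 108, 60, 8, 58]],
        "labels": ["Hello"],
    }
}
print(quads_to_rects(sample))  # [('Hello', (8, 20, 110, 60))]
```

Axis-aligned rectangles are easier to draw than quads, at the cost of being slightly loose around rotated text.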

Source code

All the source is available at:

https://github.com/gavi/florence

The Gradio UI is pretty straightforward. The only interesting thing is that the version of transformers (the latest one at the time of writing) tries to import flash attention on Apple Silicon (M-series) Macs and fails. So here is the patch, with attribution to the discussion.

import os
from unittest.mock import patch

from transformers import AutoModelForCausalLM, AutoProcessor
from transformers.dynamic_module_utils import get_imports

def fixed_get_imports(filename: str | os.PathLike) -> list[str]:
    """Work around for https://huggingface.co/microsoft/phi-1_5/discussions/72."""
    imports = get_imports(filename)
    # Only Florence-2's modeling file needs patching; leave other modules untouched.
    if not str(filename).endswith("/modeling_florence2.py"):
        return imports
    # Drop flash_attn if it is listed, so a missing entry does not raise ValueError.
    if "flash_attn" in imports:
        imports.remove("flash_attn")
    return imports

# Swap in the patched function only while the remote model code is being loaded.
with patch("transformers.dynamic_module_utils.get_imports", fixed_get_imports):
    model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large-ft", trust_remote_code=True)
    processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large-ft", trust_remote_code=True)