Playing with Microsoft Florence 2
I created a playground Space for Microsoft Florence 2 on Hugging Face. Florence 2 is a new model from Microsoft that can be used for image captioning, segmentation, and OCR tasks, and it can run on both CPU and GPU. Even though it's a smaller model, it performs close to SOTA.
This one picture from the paper illustrates this model perfectly
How to use
Just follow the link, upload your image on the left, select a task, and click “Analyze Image”.
Some tasks require an additional prompt, and the UI lets you enter one. One use case where it works well is OCR, which can also return bounding boxes, as seen below.
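To make the prompt and bounding-box parts concrete, here is a minimal sketch in plain Python. The task tokens and the shape of the `<OCR_WITH_REGION>` result (`quad_boxes` plus `labels`) follow the Florence 2 model card as I understand it; the helper names (`build_prompt`, `quad_to_bbox`, `ocr_regions`) are hypothetical and are not taken from the playground's source, and the tasks the UI actually exposes may differ.

```python
# Task tokens that expect extra text after the token (per the model card;
# treat this set as an assumption, not an exhaustive list).
TASKS_NEEDING_TEXT = {
    "<CAPTION_TO_PHRASE_GROUNDING>",
    "<REFERRING_EXPRESSION_SEGMENTATION>",
    "<OPEN_VOCABULARY_DETECTION>",
}


def build_prompt(task: str, text_input: str = "") -> str:
    """Hypothetical helper: join a task token with its optional extra text."""
    if task not in TASKS_NEEDING_TEXT:
        return task  # e.g. "<OCR>" or "<CAPTION>" is used on its own
    text_input = text_input.strip()
    if not text_input:
        raise ValueError(f"{task} needs an additional text prompt")
    return task + text_input


def quad_to_bbox(quad):
    """Turn [x1, y1, ..., x4, y4] corner points into (left, top, right, bottom)."""
    xs, ys = quad[0::2], quad[1::2]
    return (min(xs), min(ys), max(xs), max(ys))


def ocr_regions(result: dict):
    """Pair each recognized text label with an axis-aligned bounding box."""
    ocr = result["<OCR_WITH_REGION>"]
    return [
        (label, quad_to_bbox(quad))
        for quad, label in zip(ocr["quad_boxes"], ocr["labels"])
    ]
```

The quadrilateral is flattened to a rectangle only to keep drawing simple; for rotated text you would probably want to keep all four corners.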
Source code
All source is available at
https://github.com/gavi/florence
The Gradio UI is pretty straightforward. The only interesting thing is that the version of transformers (the latest one at the time of writing) tries to import flash attention on M-series Macs and fails. So here is the patch, with attribution to the discussion.
import os
from unittest.mock import patch

from transformers import AutoModelForCausalLM, AutoProcessor
from transformers.dynamic_module_utils import get_imports


def fixed_get_imports(filename: str | os.PathLike) -> list[str]:
    """Work around for https://huggingface.co/microsoft/phi-1_5/discussions/72."""
    # Only Florence 2's modeling file needs the patch; leave other files alone.
    if not str(filename).endswith("/modeling_florence2.py"):
        return get_imports(filename)
    imports = get_imports(filename)
    # Drop the flash_attn requirement if present, so loading does not fail on machines without it.
    if "flash_attn" in imports:
        imports.remove("flash_attn")
    return imports


# The patch is active only while the remote model code is being loaded.
with patch("transformers.dynamic_module_utils.get_imports", fixed_get_imports):
    model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large-ft", trust_remote_code=True)
    processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large-ft", trust_remote_code=True)
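
Once `model` and `processor` are loaded, running a task looks roughly like the sketch below. `run_task` is a hypothetical helper, not code from the repository; the call sequence and the generation settings (`max_new_tokens=1024`, `num_beams=3`) follow the example on the Florence 2 model card, and `post_process_generation` comes from the model's remote processor code, so check it against the version you load.

```python
def run_task(model, processor, image, task: str, text_input: str = ""):
    """Hypothetical helper: run one Florence 2 task on a PIL image.

    `task` is a task token such as "<OCR>" or "<OCR_WITH_REGION>";
    `text_input` is the extra prompt some tasks need.
    """
    prompt = task + text_input
    # The processor tokenizes the prompt and preprocesses the image.
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,  # value taken from the model card example
        num_beams=3,          # value taken from the model card example
    )
    # Special tokens are kept because the post-processing step parses them.
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # Converts the raw text into a dict keyed by the task token.
    return processor.post_process_generation(
        generated_text, task=task, image_size=(image.width, image.height)
    )
```

For example, calling `run_task(model, processor, image, "<OCR_WITH_REGION>")` with an RGB `PIL.Image` should return a dict keyed by `"<OCR_WITH_REGION>"`.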