Model Card for NatLibFi/Qwen2-0.5B-Instruct-FinGreyLit-GGUF

This is an LLM fine-tuned for the specific task of extracting metadata from grey literature PDF documents. It is based on Qwen2-0.5B-Instruct, a relatively small (0.5B parameter) model that is suitable for running locally on a CPU.

The model has been quantized and is provided as a single 400MB GGUF file that can be run using llama.cpp.

Model Details

Model Description

Developed by: National Library of Finland
Model type: Transformer large language model
Language(s) (NLP): Finnish, Swedish, English (other languages may work but have not been extensively tested)
License: Apache 2.0
Finetuned from model: https://huggingface.co/Qwen/Qwen2-0.5B-Instruct

Model Sources

Repository: https://github.com/NatLibFi/FinGreyLit
Demo notebook: https://github.com/NatLibFi/FinGreyLit/blob/main/experiments/llm-inference-api/Extract-metadata-LLM-API.ipynb

Uses

The model is intended for the single purpose of metadata extraction, in particular the development of metadata extraction tools.

Direct Use

The model supports a ChatML template and expects the following "conversation":

  "messages": [
    {"role": "system", "content": "You are a skilled librarian specialized in meticulous cataloguing of digital documents."},
    {"role": "user", "content": "Extract metadata from this document. Return as JSON." + "\n\n" + doc_json},
  ]

where doc_json is a JSON document that contains text and embedded metadata extracted from a PDF file, like this:

{"pdfinfo":{"title":"AI coffee presentation: Extracting metadata using LLMs"},"pages":[{"page":1,"text":"Extracting metadata from grey literature using large language models\nOsma Suominen\n2023-11-01"},{"page":2,"text":"Grey literature? reports working papers government documents white papers preprints theses … semi-formal non-commercial\nPDFs published on the web – lots of them!"},{"page":6,"text":"First 5 pages of text from PDF"},{"page":7,"text":"Example of LLM extracted metadata\nDiff view: human vs. LLM generated"},{"page":8,"text":"What we found out"}]}
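As a rough illustration of the input format described above, the following sketch assembles doc_json and the two-message conversation in Python. The helper names build_doc_json and build_messages are hypothetical and not part of the FinGreyLit code; the sample texts are shortened from the example above, and the compact JSON serialization simply mirrors that example (whether whitespace matters to the model is not stated in this card).

```python
import json

SYSTEM_PROMPT = (
    "You are a skilled librarian specialized in meticulous "
    "cataloguing of digital documents."
)
USER_INSTRUCTION = "Extract metadata from this document. Return as JSON."


def build_doc_json(pdfinfo, page_texts):
    """Serialize embedded PDF metadata and page texts into one JSON string.

    pdfinfo: dict of embedded PDF metadata, e.g. {"title": "..."}
    page_texts: dict mapping page number -> extracted text
    (Hypothetical helper; the structure follows the example in this card.)
    """
    doc = {
        "pdfinfo": pdfinfo,
        "pages": [
            {"page": number, "text": text}
            for number, text in sorted(page_texts.items())
        ],
    }
    # Compact separators and unescaped non-ASCII mirror the example document.
    return json.dumps(doc, ensure_ascii=False, separators=(",", ":"))


def build_messages(doc_json):
    """Build the system + user conversation the model expects."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_INSTRUCTION + "\n\n" + doc_json},
    ]


doc_json = build_doc_json(
    {"title": "AI coffee presentation: Extracting metadata using LLMs"},
    {2: "Grey literature?", 1: "Extracting metadata from grey literature"},
)
messages = build_messages(doc_json)
```

The resulting messages list can be passed to any chat-completion client that applies the ChatML template.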

The response should be a JSON document that looks something like this:

{"language": "eng", "title": "Extracting metadata using large language models", "creator": ["Suominen, Osma"], "year": "2023", "publisher": ["Yrkesh\u00f6gskolan Novia"], "type_coar": "research article"}

Note that the extracted publisher is incorrect in the above example. The type_coar classification is also wrong.

The full metadata schema of the response is documented in the FinGreyLit repository.

Out-of-Scope Use

All uses other than metadata extraction are out of scope, although the model seems to retain some of the basic chat capability of the base model.

Bias, Risks, and Limitations

The model quite often produces inaccurate results. In particular, extracted titles for non-English documents are frequently wrong. The output may also not be a valid JSON document, and even when it is valid JSON, it does not necessarily follow the intended schema.
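Because of these limitations, downstream code may want to parse responses defensively. The sketch below is one possible approach, not part of the FinGreyLit tooling: parse_response is a hypothetical helper, and the field types are inferred only from the example response in this card. The full schema in the FinGreyLit repository may define more fields, so unknown keys are not reported as problems.

```python
import json

# Field names and types inferred from the example response in this card.
# Assumption: the real schema may contain additional fields.
EXAMPLE_FIELDS = {
    "language": str,
    "title": str,
    "creator": list,
    "year": str,
    "publisher": list,
    "type_coar": str,
}


def parse_response(raw):
    """Return (metadata, problems) for a raw model response string.

    metadata is None when the response is not a JSON object;
    problems is a list of human-readable descriptions of what went wrong.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not a JSON object"]
    # Only check the types of fields we know about; extra fields pass through.
    problems = [
        f"{key}: expected {expected.__name__}, got {type(data[key]).__name__}"
        for key, expected in EXAMPLE_FIELDS.items()
        if key in data and not isinstance(data[key], expected)
    ]
    return data, problems
```

A caller could, for example, retry the request or flag the document for manual review whenever problems is non-empty.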

Recommendations

The model is intended for local development of metadata extraction tools, not for any kind of production use. Developers may find it more efficient to work with a small, locally running LLM than with a larger, higher-quality LLM that is slower to run and/or needs a GPU.

Using JSON mode, i.e. setting "response_format": {"type": "json_object"}, is recommended when using this model with llama.cpp. This constrains the model to output syntactically valid JSON, although it does not guarantee that the output follows the intended schema.
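A minimal sketch of a JSON-mode request follows, assuming the model is served by the llama.cpp HTTP server through its OpenAI-compatible /v1/chat/completions endpoint. The base URL http://localhost:8080 is an assumption about the local setup, and build_request and extract_metadata are hypothetical helper names; the demo notebook remains the authoritative example.

```python
import json
import urllib.request


def build_request(messages, base_url="http://localhost:8080"):
    """Build a chat-completion request with JSON mode enabled.

    base_url is an assumption; adjust it to wherever the server listens.
    """
    payload = {
        "messages": messages,
        # JSON mode, as recommended in this card.
        "response_format": {"type": "json_object"},
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_metadata(messages, base_url="http://localhost:8080"):
    """Send the request and parse the model's JSON answer (needs a running server)."""
    with urllib.request.urlopen(build_request(messages, base_url)) as response:
        body = json.load(response)
    # OpenAI-style response layout: the text is in the first choice's message.
    return json.loads(body["choices"][0]["message"]["content"])
```

Here messages is the system and user conversation described under Direct Use.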

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

How to Get Started with the Model

See the demo notebook linked under Model Sources.

Training Details

Training Data

The model was fine-tuned on 620 training documents from the FinGreyLit data set.

Training Procedure

The model was fine-tuned in the University of Helsinki HPC environment on a single A100 GPU using the Axolotl tool and the LoRA method. See the notebook used for training for details such as hyperparameters.

Evaluation

The model has been evaluated using an evaluation methodology developed specifically for the metadata extraction task. The methodology is still being refined.

The current evaluation code is available.

The overall score for this model is 0.85. Full field-by-field results can be found in the FinGreyLit repository.

Model Card Contact

Please use the FinGreyLit GitHub repository.