Instruction-Tuned LLMs for Event Extraction

Research on enhancing event extraction using instruction-tuned large language models (LLMs), optimizing annotation guidelines, and improving fine-tuning techniques for NLP tasks.

---
Research conducted at George Mason University (GMU) - Natural Language Processing Lab
Research Advisor: Dr. Ziyu Yao
Authors: Saurabh Srivastava★, Sweta Pati★, Dr. Ziyu Yao  [★ Equal contribution]
Conference Submission: ACL 2025
---

TL;DR: We instruction-tune LLMs for event extraction using Python code prompts and machine-generated annotation guidelines — achieving strong generalization across domains, schemas, and model sizes with minimal supervision.


Tech Stack & Tools:

  • NLP and Machine Learning: LLaMA-3.1-8B, LLaMA-3.2-1B, and Qwen2.5-Coder-1.5B, Hugging Face Transformers, PyTorch, LoRA Fine-Tuning, Unsloth, Quantization
  • Data Processing: JSON, Python (Pandas, NumPy)
  • GPU Resources: HPC clusters, CUDA-enabled GPUs
  • Evaluation Metrics: Precision, Recall, F1-score

Context & Motivation:

Event Extraction (EE) is a fundamental task in information extraction but remains challenging due to:

  • Complex schemas
  • Domain shifts
  • Low-resource settings

Our work investigates how instruction-tuned LLMs can benefit from structured guidance in the form of event schemas and machine-generated annotation guidelines.


Approach:

This research enhances event extraction by leveraging instruction-tuned LLMs. Traditional EE models struggle with limited training data, ambiguous event definitions, and poor scalability. To address these challenges, our approach:

  • Synthesized annotation guidelines using GPT-4o:
    • Covering 500+ event types and 4000+ argument structures
    • Including positive examples, negative examples, and contrasts with sibling event types
    • Removing the reliance on manually written guidelines
  • Natural Language to Code-Based Prompt Conversion:
    • A modular conversion script transforms natural language event extraction data (in TextEE format) into structured Python @dataclass prompts with embedded annotation guidelines.
    • Events are represented as Python @dataclass definitions, and the annotation guidelines are integrated directly into the docstrings, defining triggers and arguments in natural language (see the sketch below).
  • Fine-tuning models:
    LLaMA-3.1-8B, LLaMA-3.2-1B, and Qwen2.5-Coder-1.5B, using LoRA and structured regularization

  • Designed a custom evaluation framework for EE tasks, built around structured, code-like annotations

This work aims to improve F1-score performance on event extraction tasks and enhance the reliability of LLM-generated event predictions.
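To make the prompt format concrete, here is a minimal sketch of one event rendered as a code-style prompt. The event type, role names, and guideline wording are illustrative stand-ins, not the paper's actual ACE05 prompt files:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ConflictAttack:
    """A violent physical act causing harm or damage (e.g., bombing, shooting).
    Positive example: "Troops *shelled* the city overnight."
    Negative example: verbal aggression ("criticized") does NOT count as an attack.
    Sibling type: Life.Injure covers resulting injuries, not the violent act itself.
    """
    mention: str                                         # trigger span, e.g. "shelled"
    attacker: List[str] = field(default_factory=list)    # who carries out the attack
    target: List[str] = field(default_factory=list)      # who/what is attacked
    instrument: List[str] = field(default_factory=list)  # weapon or device used
    place: List[str] = field(default_factory=list)       # where the attack occurs
```

At inference time the model can be asked to emit instantiations of such classes for each event it finds, which makes its output directly parseable.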


Experiments:

We conducted evaluations on:

  • Datasets: ACE05 and RichERE
  • Settings: Full-resource and low-resource (2k samples)
  • Variants: With and without contrastive training (negative sampling)

Key Results:

ACE05 (machine-generated guidelines, no negative sampling)

- Trigger Classification (TC): +10% improvement  
- Argument Classification (AC): +5% improvement  

RichERE (machine-generated guidelines, with negative sampling)

- Trigger Classification (TC): +30% improvement  
- Argument Classification (AC): +25% improvement  

Major Contributions

1. Annotation Guideline Optimization

  • Developed a structured framework for creating high-quality event extraction annotation guidelines.
  • Used GPT-4o to synthesize guidelines spanning the event schema: 500+ event types and 4000+ argument structures (a sketch of the synthesis step follows below).
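As a rough illustration of the synthesis step, the snippet below sketches how a guideline request for one event type might be assembled. The template wording, function name, and fields are hypothetical, not the paper's actual prompt:

```python
# Hypothetical guideline-synthesis template; the paper's exact prompt may differ.
GUIDELINE_PROMPT = """\
You are an expert annotator. Write an annotation guideline for the event type
"{event_type}" in the {dataset} schema.

The guideline must:
1. Define when a word or phrase counts as a trigger for "{event_type}".
2. Define each argument role: {roles}.
3. Give one positive example sentence with the trigger marked.
4. Give one negative example that is easily confused with this type.
5. Contrast "{event_type}" with its sibling types: {siblings}.
"""

def build_guideline_request(event_type, dataset, roles, siblings):
    """Fill the template for one event type; the GPT-4o response is stored
    and later embedded into the code prompt as a docstring."""
    return GUIDELINE_PROMPT.format(
        event_type=event_type,
        dataset=dataset,
        roles=", ".join(roles),
        siblings=", ".join(siblings),
    )
```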

2. Natural Language to Code-Based Prompt Conversion

  • Developed a script that translates natural language event definitions (triggers and arguments) into Python @dataclass-style code prompts
  • Embeds the annotation guidelines as docstrings for each event type and role, making the prompts both executable and interpretable by LLMs
  • Supports automatic guideline integration from synthesized or user-provided files
  • Generates LLM-compatible structured prompts for instruction tuning

  • This script is a core part of the pipeline, bridging textual datasets and code-style prompt learning for LLMs like LLaMA-3 and Qwen; a condensed sketch follows below.
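A condensed sketch of the conversion logic. The real script handles the full TextEE format; the function name, class-naming convention, and field layout here are assumptions for illustration:

```python
def event_to_dataclass(event_type: str, roles: list[str], guideline: str) -> str:
    """Render one event definition as @dataclass source code, embedding the
    machine-generated annotation guideline as the class docstring."""
    # "Conflict.Attack" or "Conflict:Attack" -> "ConflictAttack"
    class_name = "".join(p.capitalize() for p in event_type.replace(":", ".").split("."))
    lines = [
        "@dataclass",
        f"class {class_name}:",
        f'    """{guideline}"""',
        "    mention: str  # trigger span in the input text",
    ]
    lines += [f"    {role.lower()}: List[str]" for role in roles]
    return "\n".join(lines)

print(event_to_dataclass(
    "Conflict.Attack",
    ["Attacker", "Target", "Place"],
    "A violent physical act causing harm or damage.",
))
```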

3. LLM Fine-Tuning & Optimization

  • Implemented Low-Rank Adaptation (LoRA) for fine-tuning LLaMA-3.1-8B, LLaMA-3.2-1B, and Qwen2.5-Coder-1.5B on event extraction tasks using the Unsloth library (a sketch of the setup follows below).
  • Improved model efficiency using structured regularization techniques.
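A minimal sketch of the LoRA setup with Unsloth. The hyperparameters, checkpoint name, and placeholder dataset below are illustrative, not the paper's exact configuration (recent trl versions also move the tokenizer/text-field arguments into SFTConfig):

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset

# Placeholder dataset; in practice this holds the code-style prompts
# produced by the conversion script.
train_dataset = Dataset.from_dict({"text": ["<prompt + completion pairs go here>"]})

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",  # also: LLaMA-3.2-1B, Qwen2.5-Coder-1.5B
    max_seq_length=4096,
    load_in_4bit=True,  # quantized loading to fit on a single GPU
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,              # LoRA rank (illustrative)
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```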

4. Cost-Effective & Low-Latency Inference Pipelines

  • Designed a scalable NLP pipeline using GPU-optimized environments.
  • Built a custom evaluation framework that scores code-style event predictions against the annotations defined for each event type (a sketch of the scoring convention follows below).
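For context, EE evaluation typically scores sets of extracted tuples with micro precision/recall/F1. A minimal sketch of that convention follows; the exact matching rules in our framework may differ:

```python
def prf1(gold: set, pred: set) -> tuple[float, float, float]:
    """Micro precision/recall/F1 over sets of extracted tuples.
    Trigger classification: tuples of (trigger span, event type).
    Argument classification: tuples of (event type, role, argument span)."""
    tp = len(gold & pred)                       # exact-match true positives
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {("shelled", "Conflict.Attack")}
pred = {("shelled", "Conflict.Attack"), ("criticized", "Conflict.Attack")}
print(prf1(gold, pred))  # (0.5, 1.0, 0.666...)
```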

Performance Highlights:

  • Machine-generated guidelines consistently outperformed human-written ones.
  • The approach generalized well across domains, schema complexities, and model architectures.
  • In low-resource settings (2k samples), models trained with guidelines matched or exceeded full-data baselines.
  • Frequent and moderately rare event types showed notable improvement; extremely rare types remain a challenge.
  • Gains from guidelines and from negative sampling do not always combine: depending on the dataset, the two strategies can be complementary or redundant.

Future Work & Applications

  • Expanding dataset coverage to include multi-domain event extraction tasks
  • Exploring multimodal event extraction by integrating text, images, and video content


For collaboration or inquiries, feel free to reach out via LinkedIn or Email.