🌳 PHASE-Tree

Modeling Character-State Evolution in Long-Horizon Role-Playing Dialogue

English | 中文

A psychology-grounded, multi-timescale character-state representation that lets role-playing models speak from the character's current narrative state, not a frozen profile.

Figure 1. PHASE-Tree decomposes a character into an immutable identity root and three mutable strata at distinct time-scales (long-term persona, session-level session, and turn-level moment), each editable at the field level under resistance, evidence, and cooldown gates.

📖 Abstract

Long-horizon role-playing requires a character to remain recognizable while evolving with the narrative, yet existing methods and benchmarks mainly test persona preservation or memory recall rather than whether a model can speak from a character's currently evolved state — a failure mode we call stale-state failure. We introduce PHASE-Tree, a multi-timescale character-state tree with an immutable identity root and mutable persona, session, and moment strata, where each field is an addressable update target supporting localized intra-episode and cross-episode evolution. We further introduce LongEvoRoleBench, an evaluation suite for long-horizon character-state evolution: four long-dialogue corpora form the core cross-episode evolution test, and four short-dialogue corpora provide within-scene state-tracking checks under the same generation format. PHASE-Tree can condition generation via explicit textual provision (serializing the tree into the prompt) or implicit parametric adaptation (encoding it into LoRA weights). On LongEvoRoleBench, textual provision achieves the best score in 21 of 24 internal-ablation dataset–metric cells and improves long-dialogue macro averages over the strongest textual-provision external baseline for each metric by +0.49 Char (+19.6%), +0.41 Sem (+12.4%), and +0.04 Emb (+15.0%).

🧭 At a Glance

PHASE-Tree (Psychology-grounded Hierarchical Attribute-Structured Evolving Tree) is a three-part contribution:

Representation. A four-stratum character-state tree with a fixed identity root and three editable strata (persona / session / moment). Each schema field is an independently addressable update target, supporting intra-episode tracking and cross-episode evolution under resistance–evidence–cooldown gates.
Conditioning paradigms. The same flattened PHASE-Tree state drives generation through two complementary routes — explicit textual provision (serialize the tree into the prompt; primary validated path) and implicit parametric adaptation (encode it into LoRA via a profile-to-adapter hypernetwork; token-efficient deployment variant).
Benchmark — LongEvoRoleBench. A benchmark suite for long-horizon character-state evolution. It standardizes 8 role-playing corpora into a unified next-utterance generation format with random / OOD splits, state-aligned metrics, and baseline scores for both conditioning paradigms.
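To make the field-level addressing concrete, here is a minimal sketch of how such a tree might be held in memory. Only the four stratum names come from the description above; the `CharacterTree` class, the dotted-path helper `set_field`, and the example field names are illustrative assumptions, not the schema or API shipped in this repository.

```python
from dataclasses import dataclass, field

# Editable strata named in the README; the identity root is deliberately excluded.
MUTABLE_STRATA = ("persona", "session", "moment")


@dataclass
class CharacterTree:
    # Hypothetical container: one dict of fields per stratum.
    identity: dict = field(default_factory=dict)  # immutable root
    persona: dict = field(default_factory=dict)   # long-term
    session: dict = field(default_factory=dict)   # session-level
    moment: dict = field(default_factory=dict)    # turn-level

    def set_field(self, path: str, value) -> None:
        """Update one field addressed as '<stratum>.<field>'."""
        stratum, _, name = path.partition(".")
        if stratum not in MUTABLE_STRATA or not name:
            raise ValueError(f"{path!r} is not an editable field")
        getattr(self, stratum)[name] = value


# Made-up example values, for illustration only.
tree = CharacterTree(identity={"name": "Alice"}, persona={"occupation": "clerk"})
tree.set_field("persona.occupation", "manager")  # localized long-term update
tree.set_field("moment.mood", "nervous")         # turn-level update
```

Addressing fields by path keeps each update local: changing `moment.mood` leaves the persona untouched, and any attempt to write under `identity` is rejected.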

Textual provision vs. parametric adaptation

Figure 2. Two conditioning paradigms for the same flattened PHASE-Tree. Explicit Textual Provision (top, blue): the profile lives in the prompt, giving full state inspectability. Implicit Parametric Adaptation (bottom, red): the profile is absorbed into hypernetwork-generated LoRA weights, leaving a dialogue-only prompt with zero profile-token overhead.
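As a rough illustration of the textual-provision route, the sketch below flattens a nested state into `path: value` lines and prepends it to the dialogue context. The function names `flatten` and `build_prompt` and the bracketed section headers are assumptions made for this example; the actual prompt template lives in `evaluation/predict_prompt.py` and may differ.

```python
def flatten(tree: dict, prefix: str = "") -> list[str]:
    """Flatten a nested dict into 'path: value' lines, depth-first."""
    lines = []
    for key, value in tree.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            lines.extend(flatten(value, path))
        else:
            lines.append(f"{path}: {value}")
    return lines


def build_prompt(tree: dict, dialogue: list[str]) -> str:
    """Serialize the state into the prompt, followed by the dialogue so far."""
    profile = "\n".join(flatten(tree))
    context = "\n".join(dialogue)
    return f"[Character state]\n{profile}\n\n[Dialogue]\n{context}\n"


# Made-up state and dialogue, for illustration only.
state = {"identity": {"name": "Alice"}, "moment": {"mood": "calm"}}
print(build_prompt(state, ["Bob: How are you holding up?"]))
```

Under the parametric route the same flattened state would instead be fed to the hypernetwork, and the prompt would carry only the dialogue part.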

📰 News

2026-05 Code, models, data, and full evaluation results released on GitHub + Hugging Face.
2026-05 Paper submitted to EMNLP 2026.

🏆 Highlights

All results below are on LongEvoRoleBench with Qwen2.5-7B-Instruct as the backbone. Throughout this README, Ours (textual) and Ours (parametric) denote the two conditioning paradigms. Both refer to PHASE-Tree and differ only in the baseline block they are compared within (in the paper, Table 1 is the internal ablation and Table 2 is the external comparison, with the two Ours columns side by side).

Setting	Result
🏅 Internal ablation (textual)	PHASE-Tree achieves the best score in 21 of 24 dataset–metric cells.
📈 External baselines (long-dialogue macro, textual block)	Ours (textual) improves over the strongest textual-provision baseline for each metric by +0.49 Char (+19.6%) vs PAG, +0.41 Sem (+12.4%) vs RAG, and +0.04 Emb (+15.0%) vs RAG.
💸 Short-dialogue token efficiency	Ours (textual) uses 471 prompt tokens — 43% fewer than CFG, <50% of PAG — while leading on Sem.
🧩 Parametric adaptation	Ours (parametric) leads Sem on both short- and long-dialogue panels (3.748 / 3.434) and ties on long-dialogue Emb (0.283) at zero profile-token overhead.
🔬 Statistical significance	PT vs. NR and PT vs. ST (internal), and Ours vs. the strongest baseline in each block (external), are all significant under a paired t-test (p < 0.001).
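For readers unfamiliar with the paired test mentioned in the last row, the sketch below computes the paired t statistic from per-question scores of two methods. It is a generic textbook formula, not the repository's significance script, and the scores are toy values invented for the example.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a: list[float], b: list[float]) -> float:
    """t statistic of the per-question differences a[i] - b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 items")
    diffs = [x - y for x, y in zip(a, b)]
    # Mean difference divided by its standard error (sample stdev / sqrt(n)).
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Toy judge scores for the same five questions under two methods.
ours = [4, 3, 5, 4, 4]
baseline = [3, 3, 4, 3, 2]
t = paired_t_statistic(ours, baseline)
print(f"t = {t:.3f} with {len(ours) - 1} degrees of freedom")
```

The p-value then follows from the t distribution with n - 1 degrees of freedom. Pairing by question removes between-question variance, but it does not capture run-to-run variability.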

Headline numbers (long-dialogue, macro-average over 4 corpora × {random, OOD})

Block	Method	Char ↑	Sem ↑	Emb ↑
—	Base (no profile)	2.326	3.323	0.268
Textual Provision	RAG	2.405	3.289	0.273
Textual Provision	PAG	2.510	2.889	0.255
Textual Provision	CFG	2.389	2.429	0.225
Textual Provision	Ours (textual)	3.003	3.697	0.314
Parametric Adaptation	MT-LoRA	2.269	3.428	0.283
Parametric Adaptation	Activation Steering	2.381	2.350	0.249
Parametric Adaptation	OPPU	2.376	3.141	0.283
Parametric Adaptation	P2P	2.396	3.410	0.276
Parametric Adaptation	Ours (parametric)	2.307	3.434	0.283

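Best method within each paradigm block: Ours (textual) leads all three metrics in the textual block; in the parametric block, P2P leads Char (2.396), Ours (parametric) leads Sem (3.434), and MT-LoRA, OPPU, and Ours (parametric) tie on Emb (0.283). See § Reproducing the Paper for the full per-dataset tables.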
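The headline deltas can be rechecked directly from the table. The short script below copies the textual-provision rows and, for each metric, finds the strongest baseline and the absolute and relative gain of Ours (textual); the helper name `gain_over_strongest` is ours, introduced only for this example.

```python
# Long-dialogue macro averages copied from the textual-provision block above.
TEXTUAL_BASELINES = {
    "RAG": {"Char": 2.405, "Sem": 3.289, "Emb": 0.273},
    "PAG": {"Char": 2.510, "Sem": 2.889, "Emb": 0.255},
    "CFG": {"Char": 2.389, "Sem": 2.429, "Emb": 0.225},
}
OURS_TEXTUAL = {"Char": 3.003, "Sem": 3.697, "Emb": 0.314}


def gain_over_strongest(metric: str) -> tuple[str, float, float]:
    """Return (strongest baseline, absolute gain, relative gain in percent)."""
    name = max(TEXTUAL_BASELINES, key=lambda m: TEXTUAL_BASELINES[m][metric])
    best = TEXTUAL_BASELINES[name][metric]
    delta = OURS_TEXTUAL[metric] - best
    return name, delta, 100.0 * delta / best


for metric in OURS_TEXTUAL:
    name, delta, pct = gain_over_strongest(metric)
    print(f"{metric}: +{delta:.2f} ({pct:+.1f}%) vs {name}")
# Char: +0.49 (+19.6%) vs PAG
# Sem: +0.41 (+12.4%) vs RAG
# Emb: +0.04 (+15.0%) vs RAG
```

Note that the strongest baseline is chosen per metric, so the comparison target changes from PAG (Char) to RAG (Sem, Emb).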

Cost vs. quality

Figure 3. Prompt-token cost (leftmost panel, shorter bar = cheaper) paired with Char / Sem / Emb scores. On short dialogues, Ours (textual) is the cheapest profile-augmented method and still leads on Sem; on long dialogues the prompt grows to accommodate accumulated evolution history but yields the best Sem and Emb of any method in our comparison. All parametric-adaptation methods collapse to context-only cost.

📁 Repository Structure

PHASE-Tree/
├── preprocessing/         # Per-corpus profile extraction + dialogue conversion
├── src/
│   ├── tree_pipeline/     # Build raw → static → dynamic → PHASE-Tree profiles
│   ├── hyper_llm_modulator/  # Hyper-LoRA modulator (encoder, mixer, output heads)
│   ├── scripts/           # Hypernetwork training entry-points & launchers
│   └── configs/           # Training YAMLs
├── tasks/                 # Per-split metadata YAMLs (consumed by the SFT trainer)
├── evaluation/            # predict_* + judge.py + report.py + visualize.py
├── figures/               # Paper figures
├── .env                   # Placeholder template — fill in your API keys locally
├── requirements.txt       # Core Python dependencies
├── requirements-flash-attn.txt   # Optional FlashAttention-2 (recommended)
└── LICENSE                # MIT

⚠️ Three large directories are not tracked in this repository and must be fetched from Hugging Face on first use: phase_tree_data/, phase_tree_models/, and (optionally) results/.

🤗 Released Resources

Type	Repo	Hugging Face	Default local path	Approx. size
Dataset	`IAAR-Shanghai/phase_tree_data`	🤗 link	`phase_tree_data/`	≈ 8.4 GB
Model	`IAAR-Shanghai/phase_tree_models`	🤗 link	`phase_tree_models/`	≈ 1.7 GB
Results	`Mathematics-Yang/phase_tree_results`	🤗 link	`results/`	≈ 4.4 GB

One-shot download (run from the repo root):

hf download IAAR-Shanghai/phase_tree_data    --repo-type=dataset --local-dir phase_tree_data
hf download IAAR-Shanghai/phase_tree_models                       --local-dir phase_tree_models
# Optional — only if you want to skip re-running predictions/judging:
hf download Mathematics-Yang/phase_tree_results --repo-type=dataset --local-dir results

You also need the two base models on disk (default expected under models/):

hf download Qwen/Qwen2.5-7B-Instruct  --local-dir models/Qwen2.5-7B-Instruct
hf download Qwen/Qwen3-Embedding-4B   --local-dir models/Qwen3-Embedding-4B

🔧 Installation

Tested on Python 3.10 + CUDA 12.x (Linux) with a single A100 / H100 for the textual-provision route and a single A100 80 GB for hypernetwork SFT.

git clone https://github.com/MemTensor/PHASE-Tree.git
cd PHASE-Tree

python -m venv .venv && source .venv/bin/activate

# 1) Core stack (torch, transformers, peft, vllm, openai, ...)
pip install -r requirements.txt

# 2) (Optional, recommended) FlashAttention-2 kernels
#    MUST come AFTER requirements.txt and use --no-build-isolation
pip install -r requirements-flash-attn.txt --no-build-isolation

If FlashAttention cannot build on your machine, the codebase falls back to attn_implementation="sdpa" automatically — you lose some throughput but training and inference still work.

PyTorch / CUDA mismatch? Use the official PyTorch install selector to pick a wheel that matches your local driver, and install it before requirements.txt.

🔑 API Configuration

All API-dependent steps (profile extraction, persona evolution, LLM-as-Judge scoring, embedding similarity) read credentials from a .env at the repo root via python-dotenv. The shipped .env is a placeholder template:

# Option A — fill placeholders in place (do NOT commit your real keys)
$EDITOR .env

# Option B — keep .env as the public template, override locally (recommended)
cp .env .env.local
$EDITOR .env.local         # `.env.local` is git-ignored

The three model groups can point to the same OpenAI-compatible endpoint, or to different ones (e.g. a local vLLM server for embeddings):

Variable group	Used by
`LLM_*`	`preprocessing/*.py`, `src/tree_pipeline/*.py` (profile extraction + persona evolution)
`JUDGE_*`	`evaluation/judge.py` (LLM-as-Judge Char + Sem scoring)
`EMBED_*`	`evaluation/judge.py` (response-vs-reference cosine similarity)
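Once the `.env` values are in the process environment, a script can gather each variable group by prefix. The sketch below shows one way to do that with the standard library only; the helper `env_group` and the suffixes used in the example (`BASE_URL`, `MODEL`) are assumptions for illustration, since the exact variable names are defined by the shipped `.env` template.

```python
import os


def env_group(prefix: str, environ=os.environ) -> dict[str, str]:
    """Collect variables such as LLM_* into a dict keyed by the lowercased suffix."""
    return {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }


# Fake environment; the suffixes and values are made up for illustration.
fake_env = {"LLM_BASE_URL": "http://localhost:8000/v1", "JUDGE_MODEL": "judge-model"}
print(env_group("LLM_", fake_env))  # {'base_url': 'http://localhost:8000/v1'}
```

Keeping the three groups separate is what allows, for example, the judge to call a hosted endpoint while embeddings are served locally.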

🚀 Quick Start

A minimum end-to-end smoke test on one short-dialogue dataset, both routes:

# 0) Fetch data + checkpoint (once)
hf download IAAR-Shanghai/phase_tree_data   --repo-type=dataset --local-dir phase_tree_data
hf download IAAR-Shanghai/phase_tree_models                       --local-dir phase_tree_models

# 1) Textual provision: predict + judge + report on RAIDEN (both splits)
bash evaluation/run_prompt_eval.sh RAIDEN

# 2) Parametric adaptation (hypernetwork → LoRA): predict + judge + report on RAIDEN
bash evaluation/run_phase_tree_eval.sh RAIDEN

# 3) External baseline for context (e.g. RAG)
bash evaluation/run_comparison_eval.sh RAIDEN rag

Each launcher reads phase_tree_data/processed/<DATASET>/<METHOD>/{random_test,ood_test}.json, writes predictions to results/<DATASET>/<paradigm>/main/<METHOD>/<SPLIT>/predictions.jsonl, then chains judge.py → report.py → visualize.py.
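The path layout above can be written down as two small helpers, which is handy when locating a specific predictions file by hand. The function names are ours; the directory templates are taken from the sentence above, and whether `<SPLIT>` in the results tree uses the same `random_test` / `ood_test` names as the input files is an assumption.

```python
from pathlib import Path


def input_path(dataset: str, method: str, split: str) -> Path:
    """phase_tree_data/processed/<DATASET>/<METHOD>/<split>.json"""
    return Path("phase_tree_data") / "processed" / dataset / method / f"{split}.json"


def prediction_path(dataset: str, paradigm: str, method: str, split: str) -> Path:
    """results/<DATASET>/<paradigm>/main/<METHOD>/<SPLIT>/predictions.jsonl"""
    return Path("results") / dataset / paradigm / "main" / method / split / "predictions.jsonl"


print(input_path("RAIDEN", "m6_phase_tree", "random_test").as_posix())
print(prediction_path("RAIDEN", "prompt", "m6_phase_tree", "random_test").as_posix())
```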

🧬 Pipeline

PHASE-Tree is a 4-stage pipeline. Each stage is independently runnable; we ship the outputs of stages 1 and 2 (in phase_tree_data/) and the outputs of stage 3 (in phase_tree_models/) so you can start from any point.

Stage 1 · Per-corpus preprocessing

preprocessing/ contains one profile extractor + one dialogue converter per source corpus. Outputs land under phase_tree_data/processed/<Dataset>/intermediate/.

# Friends (long-dialogue example): seed initial profiles from Season 1
python preprocessing/extract_profiles_friends.py

# Convert Season 1–10 transcripts into next-utterance samples + temporal split
python preprocessing/preprocess_dialogues_friends.py

Short-dialogue corpora (RAIDEN, CharacterEval, SimsConv, ChatHaruhi) use a personality-clustering split; long-dialogue corpora (Friends, HPD, StarTrek_TNG, TheOffice) use a chronological train / OOD-temporal split.
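A chronological split of the kind used for the long-dialogue corpora can be sketched as follows. The `season` key and the cutoff argument are hypothetical; the real boundaries are set inside the per-corpus `preprocess_dialogues_*.py` scripts and are not stated in this README.

```python
def temporal_split(samples: list[dict], first_ood_season: int) -> tuple[list[dict], list[dict]]:
    """Earlier seasons form the in-distribution pool; later ones are held out as OOD."""
    in_dist = [s for s in samples if s["season"] < first_ood_season]
    ood = [s for s in samples if s["season"] >= first_ood_season]
    return in_dist, ood


# Toy samples; the cutoff value is arbitrary and only for illustration.
samples = [{"season": n, "utterance": f"line from season {n}"} for n in range(1, 6)]
in_dist, ood = temporal_split(samples, first_ood_season=4)
print(len(in_dist), len(ood))  # 3 2
```

Holding out later material means the OOD test probes character states that have evolved past anything seen in training, which is the situation stale-state failure is about.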

Stage 2 · Build the PHASE-Tree profile variants

Six ablation-chain profile variants are produced under phase_tree_data/processed/<Dataset>/:

Variant	Description	Paper tag
`m1_context_only`	No profile — pure dialogue context.	Base
`m2_raw_profile`	Raw extracted profile text.	RP
`m3_naive_rewrite`	LLM-rewritten profile, no structure.	NR
`m4_static_tree`	Flattened PHASE-Tree, identity + persona only.	ST
`m5_dynamic_tree`	Persona-only evolution across episodes (long-dialogue only).	DT
`m6_phase_tree`	Full PHASE-Tree: identity + evolved persona + session + moment.	PT (Ours)

The long-dialogue evolution orchestrator (pipeline_evolve_full) runs Stage A (evidence accumulation) → Stage B (resistance-gated update) → Stage C (deterministic post-update patches):

# Full run (LLM evolve + all patches)
python -m src.tree_pipeline.pipeline_evolve_full --dataset Friends

# Patches only (skip the slow / expensive LLM evolve step)
python -m src.tree_pipeline.pipeline_evolve_full --dataset Friends --skip_evolve

# Forward args to the inner evolve step (parallelism, single-episode test, ...)
python -m src.tree_pipeline.pipeline_evolve_full --dataset Friends --workers 8 --test_episode S05E10

Individual stages (evolve_persona, decay_stale_romantic, repair_inter_main_reciprocity, align_inverse_pair, forward_fill_continuity, ...) live under src/tree_pipeline/ and are independently runnable; see the docstring at the top of pipeline_evolve_full.py for the canonical order.
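The interplay of evidence accumulation (Stage A) and the resistance and cooldown gates (Stage B) can be illustrated with a toy field-level update rule. This is a deliberately simplified sketch: the `FieldState` class, the additive evidence score, and the threshold semantics are assumptions for illustration, and the deterministic patches of Stage C are not modeled. The real logic lives in `src/tree_pipeline/`.

```python
from dataclasses import dataclass


@dataclass
class FieldState:
    value: str
    resistance: float            # evidence required before the field may change
    cooldown: int                # episodes to wait after an accepted change
    evidence: float = 0.0        # support accumulated since the last change
    last_changed: int = -10**9   # episode index of the last accepted change


def try_update(state: FieldState, proposal: str, support: float, episode: int) -> bool:
    """Accumulate evidence, then apply the proposal only if both gates pass."""
    state.evidence += support  # Stage A: evidence accumulation
    in_cooldown = episode - state.last_changed < state.cooldown
    if in_cooldown or state.evidence < state.resistance:
        return False           # Stage B: the gate blocks the update
    state.value, state.evidence, state.last_changed = proposal, 0.0, episode
    return True


# Made-up field and numbers, for illustration only.
status = FieldState(value="single", resistance=1.0, cooldown=2)
print(try_update(status, "dating", 0.6, episode=0))  # False: not enough evidence yet
print(try_update(status, "dating", 0.6, episode=1))  # True: 1.2 >= 1.0
print(try_update(status, "single", 2.0, episode=2))  # False: still in cooldown
```

A high-resistance field such as a core value would need sustained evidence across episodes, while a low-resistance field can flip quickly; the cooldown prevents oscillation right after a change.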

Stage 3 · Train the hypernetwork (Implicit Parametric Adaptation route)

The hyper-LoRA modulator is a profile-to-adapter hypernetwork wrapped around Qwen2.5-7B-Instruct. The shipped launcher does a warm-start SFT on all 8 PHASE-Tree training sets:

# Warm-start from the released pretrained hypermod (default INIT_CKPT)
bash src/scripts/train_phase_tree_qwen_7b.sh

# Train from scratch (no warm-start)
INIT_CKPT="" bash src/scripts/train_phase_tree_qwen_7b.sh

# Override hyperparameters
LR=1e-5 EPOCHS=20000 WARMUP=0.1 bash src/scripts/train_phase_tree_qwen_7b.sh

The default config is src/configs/phase_tree_hyper_lora.yaml (lr 5e-6, warmup 0.05, 40 000 steps, hierarchical batch sampler, sqrt-size mixture). Checkpoints land in phase_tree_models/sft/<run>/{hypermod.pt, args.yaml, adapter_config.json} and are loadable by the same checkpoint reader used at inference time.
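One common reading of a "sqrt-size mixture" is that each training set is sampled with probability proportional to the square root of its size, which dampens the dominance of the largest corpora. The sketch below shows that computation under this assumption; it is not taken from the trainer, and the sizes are toy numbers rather than real corpus statistics.

```python
import math


def sqrt_mixture(sizes: dict[str, int]) -> dict[str, float]:
    """Sampling probability per dataset, proportional to sqrt(number of examples)."""
    roots = {name: math.sqrt(n) for name, n in sizes.items()}
    total = sum(roots.values())
    return {name: root / total for name, root in roots.items()}


# Toy sizes: B is 4x larger than A but is sampled only 2x as often.
print(sqrt_mixture({"A": 100, "B": 400}))
```

Compared with size-proportional sampling (20% / 80% here), the square-root rule yields roughly 33% / 67%, giving small corpora more weight.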

Stage 4 · Inference

Paradigm	Script	What it does
Textual Provision — Ours (textual)	`evaluation/predict_prompt.py`	Serializes the PHASE-Tree into the prompt; vLLM or HF backend.
Parametric Adaptation — Ours (parametric)	`evaluation/predict_phase_tree.py`	Generates per-character LoRA via the PHASE-Tree SFT hypermod; profile not in prompt.
Parametric Adaptation — P2P (baseline)	`evaluation/predict_hypernet.py`	Same architecture, but with the raw-profile P2P hypermod.
External baselines: RAG / PAG / CFG / Steering / MT-LoRA / OPPU	`evaluation/predict_{rag,cfg,steering,mt_lora,oppu}.py` (PAG is dispatched through `predict_rag.py` with `--profile_data`)	Reference baselines, three under each paradigm.

The recommended path is the per-paradigm launcher, which auto-detects short vs long dialogue, distributes tasks across GPUs, and chains predict → judge → report → visualize:

# Textual Provision — ablation chain (m1 + m2 + m3 + m4 + m6 for short; + m5 for long)
bash evaluation/run_prompt_eval.sh <DATASET>

# Parametric Adaptation — PHASE-Tree SFT hypermod, all methods, both splits, multi-GPU
bash evaluation/run_phase_tree_eval.sh                       # all 8 datasets
bash evaluation/run_phase_tree_eval.sh Friends long-term     # single dataset, explicit mode

# Parametric Adaptation — P2P pretrained baseline
bash evaluation/run_hypernet_p2p_eval.sh <DATASET>

# External baselines
bash evaluation/run_comparison_eval.sh <DATASET> rag         # also: pag, cfg, steering, mt_lora, oppu
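The short-vs-long distinction determines which ablation variants exist for a dataset. The sketch below encodes that rule using the dataset and variant names listed in this README; the function `ablation_variants` itself is hypothetical and only mirrors what the launcher is described as doing.

```python
LONG_DIALOGUE = {"Friends", "HPD", "StarTrek_TNG", "TheOffice"}
SHORT_DIALOGUE = {"RAIDEN", "CharacterEval", "SimsConv", "ChatHaruhi"}


def ablation_variants(dataset: str) -> list[str]:
    """Return the textual-provision ablation chain for one dataset."""
    if dataset not in LONG_DIALOGUE | SHORT_DIALOGUE:
        raise KeyError(f"unknown dataset: {dataset}")
    variants = ["m1_context_only", "m2_raw_profile", "m3_naive_rewrite", "m4_static_tree"]
    if dataset in LONG_DIALOGUE:
        # Persona-only evolution across episodes exists for long-dialogue corpora only.
        variants.append("m5_dynamic_tree")
    variants.append("m6_phase_tree")
    return variants


print(ablation_variants("RAIDEN"))   # five variants, no m5
print(ablation_variants("Friends"))  # six variants, including m5_dynamic_tree
```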

Evaluation metrics

Once predictions.jsonl exists, evaluation/judge.py writes two scoring files (full resume support):

File	Range	What it measures
`judge_scores.jsonl`	1 – 5	`character_score` (profile consistency) + `semantic_score` (contextual coherence). GPT-4.1 LLM-as-Judge with the rubric in `evaluation/persona_rubric.md`.
`embedding_scores.jsonl`	[-1,1]	Cosine similarity of `text-embedding-3-small` embeddings of prediction vs reference.
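For reference, the Emb metric reduces to a cosine similarity between two embedding vectors. The sketch below is a dependency-free version of that formula; obtaining the vectors from the embedding endpoint is out of scope here, and `judge.py` may well use a vectorized implementation instead.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors, in [-1, 1]."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0 (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite)
```

Because the score depends only on direction, it measures how close the prediction is to the reference in embedding space regardless of vector magnitude.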

Aggregation + figures:

python evaluation/report.py    --results_dir results/RAIDEN/prompt/main --baseline m2_raw_profile --per_character
python evaluation/visualize.py --results_dir results/RAIDEN/prompt/main --format pdf
python evaluation/autoreport.py                                          # cross-dataset roll-up

A reference-side ablation, which re-judges with the m2_raw_profile persona as the reference instead of the full PHASE-Tree, is in evaluation/run_ablation.sh.

📊 LongEvoRoleBench

Eight role-playing corpora standardized into a common next-utterance generation format with paired random and OOD test regimes:

Dataset	Lang	Pipeline	# Main characters	OOD axis
CharacterEval	ZH	short-term	77	unseen-character cluster
ChatHaruhi	EN+ZH	short-term	31	unseen-character cluster
RAIDEN	ZH	short-term	30	unseen-character cluster
SimsConv	EN	short-term	68	unseen-character cluster
Friends	EN	long-term	6	later-season temporal holdout
HPD (Harry Potter Dialogue)	EN	long-term	6	later-book temporal holdout
StarTrek_TNG	EN	long-term	6	later-season temporal holdout
TheOffice	EN	long-term	6	later-season temporal holdout

Each tasks/<name>/metadata.yaml registers a split with the hypernetwork SFT trainer; the same JSONs are consumed directly by the prediction scripts. Generation and scoring both condition on the same time-t character state, so Character Score measures consistency with the character's current state rather than with a frozen profile.

🧪 Reproducing the Paper

To regenerate every number in the paper:

# 0) Make sure data + the recommended SFT checkpoint are on disk
hf download IAAR-Shanghai/phase_tree_data   --repo-type=dataset --local-dir phase_tree_data
hf download IAAR-Shanghai/phase_tree_models                       --local-dir phase_tree_models

# 1) All textual-provision ablations across the 8 corpora
for D in RAIDEN CharacterEval HPD SimsConv ChatHaruhi Friends StarTrek_TNG TheOffice; do
    bash evaluation/run_prompt_eval.sh "$D"
done

# 2) All parametric-adaptation ablations
bash evaluation/run_phase_tree_eval.sh        # PHASE-Tree SFT hypermod — Ours (parametric)
bash evaluation/run_hypernet_p2p_eval.sh      # P2P baseline

# 3) External baselines (RAG, PAG, CFG, Steering, MT-LoRA, OPPU)
for D in RAIDEN CharacterEval HPD SimsConv ChatHaruhi Friends StarTrek_TNG TheOffice; do
    for M in rag pag cfg steering mt_lora oppu; do
        bash evaluation/run_comparison_eval.sh "$D" "$M"
    done
done

# 4) Roll everything up into the paper-style summary tables
bash evaluation/run_autoreport.sh

Token-pareto figures (Fig. 3 in the paper) are produced by evaluation/make_token_figures.py once summary.json files exist under each results/<DATASET>/<paradigm>/main/.

If you only want to inspect the numbers we report, download the precomputed results bundle instead:

hf download Mathematics-Yang/phase_tree_results --repo-type=dataset --local-dir results
bash evaluation/run_autoreport.sh

⚠️ Limitations

As discussed in the paper (§ Limitations), known caveats include:

Single backbone, single run, limited corpus coverage. All experiments use Qwen2.5-7B-Instruct under one decoding configuration and seed, without cross-family, cross-scale, multi-seed, or multilingual verification. The long-dialogue suite covers four scripted English-language corpora; broader genres and spontaneous dialogue are untested.
LLM-based evaluation and extraction. Char and Sem rely on a GPT-4.1 judge with no human-agreement study; state extraction and update judgment are also LLM-driven without gold-annotation auditing. Significance tests are question-level paired on a single run and do not reflect run-to-run variability.
Hand-tuned gating and parametric adaptation. Resistance, evidence, and cooldown thresholds are implementation defaults rather than swept or learned; the implicit parametric adaptation variant follows a P2P-style paradigm without fully exploiting the multi-timescale state.

🙏 Acknowledgements

The hyper-LoRA modulator architecture builds on the P2P codebase [Tan et al., 2025]. The eight evaluation corpora are derived from prior public releases — RAIDEN, CharacterEval, SimsConv, ChatHaruhi, HPD, the ConvoKit Friends Corpus, and public episode transcripts for The Office and Star Trek: TNG — and we thank the original authors and maintainers for making these resources available. The generation backbone (Qwen2.5-7B-Instruct) and embedding encoder (Qwen3-Embedding-4B) are from the Qwen team.

📄 License

Code in this repository is released under the MIT License.
Model checkpoints and evaluation results on Hugging Face are released under CC-BY-NC-4.0; see the model / dataset cards on Hugging Face for details.
The underlying dialogue corpora retain their original source licenses; please consult each source dataset for redistribution terms.

⭐ If this work is useful to you, please consider starring the repository; it helps others find it.
