A psychology-grounded, multi-timescale character-state representation that lets role-playing models speak from the character's current narrative state, not a frozen profile.
Figure 1. PHASE-Tree decomposes a character into an immutable identity root and three mutable strata at distinct time-scales (long-term persona, session-level session, and turn-level moment), each editable at the field level under resistance, evidence, and cooldown gates.
Long-horizon role-playing requires a character to remain recognizable while evolving with the narrative, yet existing methods and benchmarks mainly test persona preservation or memory recall rather than whether a model can speak from a character's currently evolved state, a failure mode we call stale-state failure. We introduce PHASE-Tree, a multi-timescale character-state tree with an immutable identity root and mutable persona, session, and moment strata, where each field is an addressable update target supporting localized intra-episode and cross-episode evolution. We further introduce LongEvoRoleBench, an evaluation suite for long-horizon character-state evolution: four long-dialogue corpora form the core cross-episode evolution test, and four short-dialogue corpora provide within-scene state-tracking checks under the same generation format. PHASE-Tree can condition generation via explicit textual provision (serializing the tree into the prompt) or implicit parametric adaptation (encoding it into LoRA weights). On LongEvoRoleBench, textual provision achieves the best score in 21 of 24 internal-ablation dataset-metric cells and improves long-dialogue macro averages over the strongest textual-provision external baseline for each metric by +0.49 Char (+19.6%), +0.41 Sem (+12.4%), and +0.04 Emb (+15.0%).
PHASE-Tree (Psychology-grounded Hierarchical Attribute-Structured Evolving Tree) is a three-part contribution:
- Representation. A four-stratum character-state tree with a fixed identity root and three editable strata (persona / session / moment). Each schema field is an independently addressable update target, supporting intra-episode tracking and cross-episode evolution under resistance-evidence-cooldown gates.
- Conditioning paradigms. The same flattened PHASE-Tree state drives generation through two complementary routes: explicit textual provision (serialize the tree into the prompt; primary validated path) and implicit parametric adaptation (encode it into LoRA via a profile-to-adapter hypernetwork; token-efficient deployment variant).
- Benchmark: LongEvoRoleBench. A benchmark suite for long-horizon character-state evolution. It standardizes 8 role-playing corpora into a unified next-utterance generation format with random / OOD splits, state-aligned metrics, and baseline scores for both conditioning paradigms.
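As a rough illustration of the representation bullet above, the sketch below models a fixed identity root plus three editable strata, with dotted paths as field addresses. The `CharacterState` class, the dotted-path convention, and the field names (`name`, `traits`, `emotion`) are assumptions made for illustration; they are not the repository's schema.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Any, Mapping

# Hypothetical stratum names taken from the README; everything else is illustrative.
EDITABLE_STRATA = ("persona", "session", "moment")


@dataclass
class CharacterState:
    """Toy PHASE-Tree-style state: read-only identity root + three editable strata."""
    identity: Mapping[str, Any]
    persona: dict = field(default_factory=dict)
    session: dict = field(default_factory=dict)
    moment: dict = field(default_factory=dict)

    def __post_init__(self):
        # Wrap the identity root in a read-only view so it cannot be edited.
        self.identity = MappingProxyType(dict(self.identity))

    def get(self, path: str) -> Any:
        """Read one field by its address, e.g. 'moment.emotion'."""
        stratum, key = path.split(".", 1)
        return getattr(self, stratum)[key]

    def set(self, path: str, value: Any) -> None:
        """Write one field; only the three editable strata accept updates."""
        stratum, key = path.split(".", 1)
        if stratum not in EDITABLE_STRATA:
            raise PermissionError(f"stratum {stratum!r} is not editable")
        getattr(self, stratum)[key] = value

    def flatten(self) -> dict:
        """Flatten the tree into {address: value} pairs, root first."""
        flat = {}
        for stratum in ("identity",) + EDITABLE_STRATA:
            for key, value in getattr(self, stratum).items():
                flat[f"{stratum}.{key}"] = value
        return flat
```

Addressing fields by path keeps an update local: changing `moment.emotion` leaves the persona and identity entries untouched.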
Figure 2. Two conditioning paradigms for the same flattened PHASE-Tree. Explicit Textual Provision (top, blue): the profile lives in the prompt, giving full state inspectability. Implicit Parametric Adaptation (bottom, red): the profile is absorbed into hypernetwork-generated LoRA weights, leaving a dialogue-only prompt with zero profile-token overhead.
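The prompt-side difference between the two paradigms can be sketched as follows. This is a minimal illustration, not the template used by `evaluation/predict_prompt.py`: the section markers (`[Character state]`, `[Dialogue]`), the bullet format, and both function names are assumptions.

```python
from typing import Dict, List, Optional


def serialize_state(flat_state: Dict[str, str]) -> str:
    """Render a flattened state as one 'address: value' bullet per field."""
    return "\n".join(f"- {path}: {value}" for path, value in flat_state.items())


def build_prompt(dialogue: List[str], flat_state: Optional[Dict[str, str]] = None) -> str:
    """Build a prompt; pass flat_state for textual provision, omit it for the
    parametric route, where the profile is assumed to live in LoRA weights."""
    parts = []
    if flat_state:
        parts.append("[Character state]\n" + serialize_state(flat_state))
    parts.append("[Dialogue]\n" + "\n".join(dialogue))
    return "\n\n".join(parts)
```

Calling `build_prompt(dialogue, state)` yields an inspectable prompt that carries the state, while `build_prompt(dialogue)` yields the dialogue-only prompt, which is where the zero profile-token overhead comes from.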
- 2026-05: Code, models, data, and full evaluation results released on GitHub + Hugging Face.
- 2026-05: Paper submitted to EMNLP 2026.
All results below are on LongEvoRoleBench with Qwen2.5-7B-Instruct as the backbone. Throughout the README we write Ours (textual) and Ours (parametric) to disambiguate the two conditioning paradigms; both refer to PHASE-Tree, distinguished by which baseline block they sit in (paper Table 1 = internal ablation, Table 2 = external comparison with two Ours columns side by side).
| Setting | Result |
|---|---|
| Internal ablation (textual) | PHASE-Tree achieves the best score in 21 of 24 dataset-metric cells. |
| External baselines (long-dialogue macro, textual block) | Ours (textual) improves over the strongest textual-provision baseline for each metric by +0.49 Char (+19.6%) vs PAG, +0.41 Sem (+12.4%) vs RAG, and +0.04 Emb (+15.0%) vs RAG. |
| Short-dialogue token efficiency | Ours (textual) uses 471 prompt tokens, 43% fewer than CFG and <50% of PAG, while leading on Sem. |
| Parametric adaptation | Ours (parametric) leads Sem on both short- and long-dialogue panels (3.748 / 3.434) and ties on long-dialogue Emb (0.283) at zero profile-token overhead. |
| Statistical significance | PT vs. NR and PT vs. ST (internal) and Ours-vs-strongest-baseline-in-block (external) are all significant at paired t-test p < 0.001. |
| Block | Method | Char ↑ | Sem ↑ | Emb ↑ |
|---|---|---|---|---|
| - | Base (no profile) | 2.326 | 3.323 | 0.268 |
| Textual Provision | RAG | 2.405 | 3.289 | 0.273 |
| Textual Provision | PAG | 2.510 | 2.889 | 0.255 |
| Textual Provision | CFG | 2.389 | 2.429 | 0.225 |
| Textual Provision | Ours (textual) | **3.003** | **3.697** | **0.314** |
| Parametric Adaptation | MT-LoRA | 2.269 | 3.428 | **0.283** |
| Parametric Adaptation | Activation Steering | 2.381 | 2.350 | 0.249 |
| Parametric Adaptation | OPPU | 2.376 | 3.141 | **0.283** |
| Parametric Adaptation | P2P | **2.396** | 3.410 | 0.276 |
| Parametric Adaptation | Ours (parametric) | 2.307 | **3.434** | **0.283** |
Bold cells mark the best method within each paradigm block. See § Reproducing the Paper for the full per-dataset tables.
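The headline gains quoted for the textual block can be re-derived from the table above. The snippet below is a small arithmetic check using only the tabulated long-dialogue macro scores; the function and variable names are illustrative.

```python
# Long-dialogue macro scores copied from the table above: (Char, Sem, Emb).
METRICS = ("Char", "Sem", "Emb")
TEXTUAL_BASELINES = {
    "RAG": (2.405, 3.289, 0.273),
    "PAG": (2.510, 2.889, 0.255),
    "CFG": (2.389, 2.429, 0.225),
}
OURS_TEXTUAL = (3.003, 3.697, 0.314)


def gains_over_strongest(ours, baselines):
    """For each metric, compare against the best-scoring baseline on that metric.

    Returns {metric: (baseline_name, absolute_gain, relative_gain_percent)}.
    """
    result = {}
    for i, metric in enumerate(METRICS):
        name, scores = max(baselines.items(), key=lambda item: item[1][i])
        absolute = ours[i] - scores[i]
        relative = 100.0 * absolute / scores[i]
        result[metric] = (name, round(absolute, 2), round(relative, 1))
    return result


gains = gains_over_strongest(OURS_TEXTUAL, TEXTUAL_BASELINES)
for metric, (name, absolute, relative) in gains.items():
    print(f"{metric}: +{absolute:.2f} (+{relative:.1f}%) vs {name}")
```

Note that the strongest baseline differs per metric (PAG for Char, RAG for Sem and Emb), which is why the comparison is made metric by metric rather than against a single method.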
Figure 3. Prompt-token cost (leftmost panel, shorter bar = cheaper) paired with Char / Sem / Emb scores. On short dialogues, Ours (textual) is the cheapest profile-augmented method and still leads on Sem; on long dialogues the prompt grows to accommodate accumulated evolution history but yields the best Sem and Emb of any method in our comparison. All parametric-adaptation methods collapse to context-only cost.
PHASE-Tree/
├── preprocessing/                 # Per-corpus profile extraction + dialogue conversion
├── src/
│   ├── tree_pipeline/             # Build raw → static → dynamic → PHASE-Tree profiles
│   ├── hyper_llm_modulator/       # Hyper-LoRA modulator (encoder, mixer, output heads)
│   ├── scripts/                   # Hypernetwork training entry-points & launchers
│   └── configs/                   # Training YAMLs
├── tasks/                         # Per-split metadata YAMLs (consumed by the SFT trainer)
├── evaluation/                    # predict_* + judge.py + report.py + visualize.py
├── figures/                       # Paper figures
├── .env                           # Placeholder template; fill in your API keys locally
├── requirements.txt               # Core Python dependencies
├── requirements-flash-attn.txt    # Optional FlashAttention-2 (recommended)
└── LICENSE                        # MIT
Warning: three large directories are not tracked in this repository and must be fetched from Hugging Face on first use: phase_tree_data/, phase_tree_models/, and (optionally) results/.
| Type | Repo | Hugging Face | Default local path | Approx. size |
|---|---|---|---|---|
| Dataset | IAAR-Shanghai/phase_tree_data | 🤗 link | phase_tree_data/ | ≈ 8.4 GB |
| Model | IAAR-Shanghai/phase_tree_models | 🤗 link | phase_tree_models/ | ≈ 1.7 GB |
| Results | Mathematics-Yang/phase_tree_results | 🤗 link | results/ | ≈ 4.4 GB |
One-shot download (run from the repo root):
hf download IAAR-Shanghai/phase_tree_data --repo-type=dataset --local-dir phase_tree_data
hf download IAAR-Shanghai/phase_tree_models --local-dir phase_tree_models
# Optional: only if you want to skip re-running predictions/judging:
hf download Mathematics-Yang/phase_tree_results --repo-type=dataset --local-dir results

You also need the two base models on disk (default expected under models/):
hf download Qwen/Qwen2.5-7B-Instruct --local-dir models/Qwen2.5-7B-Instruct
hf download Qwen/Qwen3-Embedding-4B --local-dir models/Qwen3-Embedding-4B

Tested on Python 3.10 + CUDA 12.x (Linux) with a single A100 / H100 for the textual-provision route and a single A100 80 GB for hypernetwork SFT.
git clone https://github.com/MemTensor/PHASE-Tree.git
cd PHASE-Tree
python -m venv .venv && source .venv/bin/activate
# 1) Core stack (torch, transformers, peft, vllm, openai, ...)
pip install -r requirements.txt
# 2) (Optional, recommended) FlashAttention-2 kernels
# MUST come AFTER requirements.txt and use --no-build-isolation
pip install -r requirements-flash-attn.txt --no-build-isolation

If FlashAttention cannot build on your machine, the codebase falls back to attn_implementation="sdpa" automatically; you lose some throughput but training and inference still work.
PyTorch / CUDA mismatch? Use the official PyTorch install selector to install a wheel that matches your local driver before installing requirements.txt.
All API-dependent steps (profile extraction, persona evolution, LLM-as-Judge scoring, embedding similarity) read credentials from a .env at the repo root via python-dotenv. The shipped .env is a placeholder template:
# Option A: fill placeholders in place (do NOT commit your real keys)
$EDITOR .env
# Option B: keep .env as the public template, override locally (recommended)
cp .env .env.local
$EDITOR .env.local # `.env.local` is git-ignored

The three model groups can point to the same OpenAI-compatible endpoint, or to different ones (e.g. a local vLLM server for embeddings):
| Variable group | Used by |
|---|---|
| `LLM_*` | `preprocessing/*.py`, `src/tree_pipeline/*.py` (profile extraction + persona evolution) |
| `JUDGE_*` | `evaluation/judge.py` (LLM-as-Judge Char + Sem scoring) |
| `EMBED_*` | `evaluation/judge.py` (response-vs-reference cosine similarity) |
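To make the three variable groups concrete, the sketch below collects environment variables by prefix. It is an illustration under assumptions: only the `LLM_`, `JUDGE_`, and `EMBED_` prefixes come from the table above, while the suffixes (`API_KEY`, `BASE_URL`, `MODEL`) and the `load_group` helper are hypothetical. In the repository the values are loaded from `.env` via python-dotenv; here the environment is passed in as a plain mapping so the example runs without any file.

```python
import os
from typing import Dict, Mapping, Optional


def load_group(prefix: str, environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return all variables starting with `prefix`, keyed by lower-cased suffix."""
    env = os.environ if environ is None else environ
    return {
        key[len(prefix):].lower(): value
        for key, value in env.items()
        if key.startswith(prefix)
    }


# Hypothetical variable names, for illustration only.
example_env = {
    "LLM_API_KEY": "sk-llm",
    "LLM_BASE_URL": "https://example.invalid/v1",
    "JUDGE_API_KEY": "sk-judge",
    "EMBED_BASE_URL": "http://localhost:8000/v1",
    "PATH": "/usr/bin",
}

llm_cfg = load_group("LLM_", example_env)
judge_cfg = load_group("JUDGE_", example_env)
embed_cfg = load_group("EMBED_", example_env)
```

Keeping the groups separate is what allows, for example, the embedding calls to target a local server while extraction and judging use a hosted endpoint.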
A minimum end-to-end smoke test on one short-dialogue dataset, both routes:
# 0) Fetch data + checkpoint (once)
hf download IAAR-Shanghai/phase_tree_data --repo-type=dataset --local-dir phase_tree_data
hf download IAAR-Shanghai/phase_tree_models --local-dir phase_tree_models
# 1) Textual provision: predict + judge + report on RAIDEN (both splits)
bash evaluation/run_prompt_eval.sh RAIDEN
# 2) Parametric adaptation (hypernetwork → LoRA): predict + judge + report on RAIDEN
bash evaluation/run_phase_tree_eval.sh RAIDEN
# 3) External baseline for context (e.g. RAG)
bash evaluation/run_comparison_eval.sh RAIDEN rag

Each launcher reads phase_tree_data/processed/<DATASET>/<METHOD>/{random_test,ood_test}.json, writes predictions to results/<DATASET>/<paradigm>/main/<METHOD>/<SPLIT>/predictions.jsonl, then chains judge.py → report.py → visualize.py.
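The input and output locations described above can be expressed as two small path helpers. This is a sketch for orientation only: the helper names are made up, and the assumption that `<SPLIT>` takes the same values as the input file stems (`random_test`, `ood_test`) is a guess based on the layout, not something the launchers are confirmed to do.

```python
from pathlib import Path


def input_path(dataset: str, method: str, split: str, root: str = "phase_tree_data") -> Path:
    """phase_tree_data/processed/<DATASET>/<METHOD>/<split>.json"""
    return Path(root) / "processed" / dataset / method / f"{split}.json"


def predictions_path(dataset: str, paradigm: str, method: str, split: str,
                     root: str = "results") -> Path:
    """results/<DATASET>/<paradigm>/main/<METHOD>/<SPLIT>/predictions.jsonl"""
    return Path(root) / dataset / paradigm / "main" / method / split / "predictions.jsonl"


src = input_path("RAIDEN", "m6_phase_tree", "random_test")
dst = predictions_path("RAIDEN", "prompt", "m6_phase_tree", "random_test")
print(src.as_posix())
print(dst.as_posix())
```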
PHASE-Tree is a 4-stage pipeline. Each stage is independently runnable; we ship the outputs of stages 1 and 2 (in phase_tree_data/) and the outputs of stage 3 (in phase_tree_models/) so you can start from any point.
preprocessing/ contains one profile extractor + one dialogue converter per source corpus. Outputs land under phase_tree_data/processed/<Dataset>/intermediate/.
# Friends (long-dialogue example): seed initial profiles from Season 1
python preprocessing/extract_profiles_friends.py
# Convert Season 1-10 transcripts into next-utterance samples + temporal split
python preprocessing/preprocess_dialogues_friends.py

Short-dialogue corpora (RAIDEN, CharacterEval, SimsConv, ChatHaruhi) use a personality-clustering split; long-dialogue corpora (Friends, HPD, StarTrek_TNG, TheOffice) use a chronological train / OOD-temporal split.
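A chronological train / OOD-temporal split of the kind described for the long-dialogue corpora can be sketched as below. The `season` key, the cut-off value, and the function name are assumptions for illustration; the README does not state where the repository places the temporal boundary.

```python
from typing import Dict, List, Tuple


def chronological_split(samples: List[Dict], holdout_from: int) -> Tuple[List[Dict], List[Dict]]:
    """Split samples by time: everything before `holdout_from` is training data,
    everything from `holdout_from` onward is the temporal OOD holdout."""
    train = [s for s in samples if s["season"] < holdout_from]
    ood = [s for s in samples if s["season"] >= holdout_from]
    return train, ood


# Toy samples with a hypothetical 'season' field.
toy_samples = [
    {"season": 1, "utterance": "a"},
    {"season": 2, "utterance": "b"},
    {"season": 9, "utterance": "c"},
    {"season": 10, "utterance": "d"},
]
train_set, ood_set = chronological_split(toy_samples, holdout_from=9)
```

The point of splitting by time rather than at random is that the OOD test then probes later, more evolved character states that never appear in training.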
Six ablation-chain profile variants are produced under phase_tree_data/processed/<Dataset>/:
| Variant | Description | Paper tag |
|---|---|---|
| `m1_context_only` | No profile; pure dialogue context. | Base |
| `m2_raw_profile` | Raw extracted profile text. | RP |
| `m3_naive_rewrite` | LLM-rewritten profile, no structure. | NR |
| `m4_static_tree` | Flattened PHASE-Tree, identity + persona only. | ST |
| `m5_dynamic_tree` | Persona-only evolution across episodes (long-dialogue only). | DT |
| `m6_phase_tree` | Full PHASE-Tree: identity + evolved persona + session + moment. | PT (Ours) |
The long-dialogue evolution orchestrator (pipeline_evolve_full) runs Stage A (evidence accumulation) → Stage B (resistance-gated update) → Stage C (deterministic post-update patches):
# Full run (LLM evolve + all patches)
python -m src.tree_pipeline.pipeline_evolve_full --dataset Friends
# Patches only (skip the slow / expensive LLM evolve step)
python -m src.tree_pipeline.pipeline_evolve_full --dataset Friends --skip_evolve
# Forward args to the inner evolve step (parallelism, single-episode test, ...)
python -m src.tree_pipeline.pipeline_evolve_full --dataset Friends --workers 8 --test_episode S05E10

Individual stages (evolve_persona, decay_stale_romantic, repair_inter_main_reciprocity, align_inverse_pair, forward_fill_continuity, ...) live under src/tree_pipeline/ and are independently runnable; see the docstring at the top of pipeline_evolve_full.py for the canonical order.
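The interplay of evidence accumulation (Stage A) and the resistance-gated update (Stage B) with a cooldown can be pictured with the toy gate below. Everything about it is an assumption made for illustration: the `FieldGate` class, the rule that resistance scales the evidence requirement, and the episode-count cooldown are hypothetical and should not be read as the logic in `src/tree_pipeline/`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldGate:
    """Hypothetical per-field gate: an update passes only with enough
    accumulated evidence and after the cooldown has elapsed."""
    resistance: float          # assumed scale: 0 = easy to change, 1 = hard
    evidence_needed: int       # base number of supporting observations
    cooldown: int              # minimum episodes between accepted updates
    evidence: int = 0
    last_update: Optional[int] = None

    def observe(self) -> None:
        """Stage A (sketch): record one supporting observation."""
        self.evidence += 1

    def try_update(self, episode: int) -> bool:
        """Stage B (sketch): accept the update only if both gates pass."""
        required = self.evidence_needed * (1.0 + self.resistance)
        cooled = self.last_update is None or episode - self.last_update >= self.cooldown
        if self.evidence >= required and cooled:
            self.evidence = 0          # consumed evidence does not carry over
            self.last_update = episode
            return True
        return False
```

With `FieldGate(resistance=0.5, evidence_needed=2, cooldown=3)`, two observations are not enough (three are required), and an update accepted in episode 1 blocks further updates until episode 4 even if new evidence arrives sooner.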
The hyper-LoRA modulator is a profile-to-adapter hypernetwork wrapped around Qwen2.5-7B-Instruct. The shipped launcher does a warm-start SFT on all 8 PHASE-Tree training sets:
# Warm-start from the released pretrained hypermod (default INIT_CKPT)
bash src/scripts/train_phase_tree_qwen_7b.sh
# Train from scratch (no warm-start)
INIT_CKPT="" bash src/scripts/train_phase_tree_qwen_7b.sh
# Override hyperparameters
LR=1e-5 EPOCHS=20000 WARMUP=0.1 bash src/scripts/train_phase_tree_qwen_7b.sh

The default config is src/configs/phase_tree_hyper_lora.yaml (lr 5e-6, warmup 0.05, 40 000 steps, hierarchical batch sampler, sqrt-size mixture). Checkpoints land in phase_tree_models/sft/<run>/{hypermod.pt, args.yaml, adapter_config.json} and are loadable by the same checkpoint reader used at inference time.
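The "sqrt-size mixture" mentioned in the default config is read here as sampling each training set with probability proportional to the square root of its size, a common way to keep large corpora from dominating. That reading is an assumption, and the dataset sizes below are made up for illustration.

```python
import math
from typing import Dict


def sqrt_mixture_weights(sizes: Dict[str, int]) -> Dict[str, float]:
    """Sampling probability per dataset, proportional to sqrt(number of samples)."""
    roots = {name: math.sqrt(count) for name, count in sizes.items()}
    total = sum(roots.values())
    return {name: root / total for name, root in roots.items()}


# Hypothetical sizes: the larger set is 4x bigger but is sampled only 2x as often.
weights = sqrt_mixture_weights({"small_corpus": 100, "large_corpus": 400})
```

Under proportional sampling the larger corpus would receive 80% of the draws; under the square-root rule it receives about 67%.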
| Paradigm | Script | What it does |
|---|---|---|
| Textual Provision: Ours (textual) | `evaluation/predict_prompt.py` | Serializes the PHASE-Tree into the prompt; vLLM or HF backend. |
| Parametric Adaptation: Ours (parametric) | `evaluation/predict_phase_tree.py` | Generates per-character LoRA via the PHASE-Tree SFT hypermod; profile not in prompt. |
| Parametric Adaptation: P2P (baseline) | `evaluation/predict_hypernet.py` | Same architecture, but with the raw-profile P2P hypermod. |
| External baselines: RAG / PAG / CFG / Steering / MT-LoRA / OPPU | `evaluation/predict_{rag,cfg,steering,mt_lora,oppu}.py` (PAG is dispatched through `predict_rag.py` with `--profile_data`) | Reference baselines, three under each paradigm. |
The recommended path is the per-paradigm launcher, which auto-detects short vs long dialogue, distributes tasks across GPUs, and chains predict → judge → report → visualize:
# Textual Provision: ablation chain (m1 + m2 + m3 + m4 + m6 for short; + m5 for long)
bash evaluation/run_prompt_eval.sh <DATASET>
# Parametric Adaptation: PHASE-Tree SFT hypermod, all methods, both splits, multi-GPU
bash evaluation/run_phase_tree_eval.sh # all 8 datasets
bash evaluation/run_phase_tree_eval.sh Friends long-term # single dataset, explicit mode
# Parametric Adaptation: P2P pretrained baseline
bash evaluation/run_hypernet_p2p_eval.sh <DATASET>
# External baselines
bash evaluation/run_comparison_eval.sh <DATASET> rag # also: pag, cfg, steering, mt_lora, oppu

Once predictions.jsonl exists, evaluation/judge.py writes two scoring files (full resume support):
| File | Range | What it measures |
|---|---|---|
| `judge_scores.jsonl` | 1 to 5 | character_score (profile consistency) + semantic_score (contextual coherence). GPT-4.1 LLM-as-Judge with the rubric in evaluation/persona_rubric.md. |
| `embedding_scores.jsonl` | [-1, 1] | Cosine similarity of text-embedding-3-small embeddings of prediction vs reference. |
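For reference, the cosine similarity behind the Emb score is the dot product of two embedding vectors divided by the product of their norms, which is why it is bounded by -1 and 1. The standalone sketch below works on plain lists of floats and is not the code in `evaluation/judge.py`; the function name is illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Identical directions score 1, orthogonal vectors score 0, and opposite directions score -1, independent of vector length.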
Aggregation + figures:
python evaluation/report.py --results_dir results/RAIDEN/prompt/main --baseline m2_raw_profile --per_character
python evaluation/visualize.py --results_dir results/RAIDEN/prompt/main --format pdf
python evaluation/autoreport.py # cross-dataset roll-up

A reference-side ablation (re-judge with m2_raw_profile persona reference instead of the full PHASE-Tree) is in evaluation/run_ablation.sh.
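Conceptually, the aggregation step reduces per-sample judge records to per-method means. The sketch below averages the two judge fields named earlier (`character_score`, `semantic_score`) over JSONL lines; it assumes one JSON object per line with exactly those keys and is not the implementation of `evaluation/report.py`. The `summarize` helper and the sample records are illustrative.

```python
import json
from statistics import mean
from typing import Dict, Iterable


def summarize(jsonl_lines: Iterable[str]) -> Dict[str, float]:
    """Mean Char and Sem over JSONL records, skipping blank lines."""
    rows = [json.loads(line) for line in jsonl_lines if line.strip()]
    if not rows:
        raise ValueError("no judge records to aggregate")
    return {
        "Char": mean(row["character_score"] for row in rows),
        "Sem": mean(row["semantic_score"] for row in rows),
        "n": len(rows),
    }


# Made-up records standing in for the contents of judge_scores.jsonl.
sample_lines = [
    '{"character_score": 3, "semantic_score": 4}',
    "",
    '{"character_score": 4, "semantic_score": 5}',
]
summary = summarize(sample_lines)
```

To run it on real output, pass an open file handle for a `judge_scores.jsonl` file in place of `sample_lines`.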
Eight role-playing corpora standardized into a common next-utterance generation format with paired random and OOD test regimes:
| Dataset | Lang | Pipeline | # Main characters | OOD axis |
|---|---|---|---|---|
| CharacterEval | ZH | short-term | 77 | unseen-character cluster |
| ChatHaruhi | EN+ZH | short-term | 31 | unseen-character cluster |
| RAIDEN | ZH | short-term | 30 | unseen-character cluster |
| SimsConv | EN | short-term | 68 | unseen-character cluster |
| Friends | EN | long-term | 6 | later-season temporal holdout |
| HPD (Harry Potter Dialogue) | EN | long-term | 6 | later-book temporal holdout |
| StarTrek_TNG | EN | long-term | 6 | later-season temporal holdout |
| TheOffice | EN | long-term | 6 | later-season temporal holdout |
Each tasks/<name>/metadata.yaml registers a split with the hypernetwork SFT trainer; the same JSONs are consumed directly by the prediction scripts. Generation and scoring both condition on the same time-t character state, so Character Score measures the current state rather than a frozen profile.
To regenerate every number in the paper:
# 0) Make sure data + the recommended SFT checkpoint are on disk
hf download IAAR-Shanghai/phase_tree_data --repo-type=dataset --local-dir phase_tree_data
hf download IAAR-Shanghai/phase_tree_models --local-dir phase_tree_models
# 1) All textual-provision ablations across the 8 corpora
for D in RAIDEN CharacterEval HPD SimsConv ChatHaruhi Friends StarTrek_TNG TheOffice; do
  bash evaluation/run_prompt_eval.sh "$D"
done
# 2) All parametric-adaptation ablations
bash evaluation/run_phase_tree_eval.sh # PHASE-Tree SFT hypermod → Ours (parametric)
bash evaluation/run_hypernet_p2p_eval.sh # P2P baseline
# 3) External baselines (RAG, PAG, CFG, Steering, MT-LoRA, OPPU)
for D in RAIDEN CharacterEval HPD SimsConv ChatHaruhi Friends StarTrek_TNG TheOffice; do
  for M in rag pag cfg steering mt_lora oppu; do
    bash evaluation/run_comparison_eval.sh "$D" "$M"
  done
done
# 4) Roll everything up into the paper-style summary tables
bash evaluation/run_autoreport.sh

Token-pareto figures (Fig. 3 in the paper) are produced by evaluation/make_token_figures.py once summary.json files exist under each results/<DATASET>/<paradigm>/main/.
If you only want to inspect the numbers we report, download the precomputed results bundle instead:
hf download Mathematics-Yang/phase_tree_results --repo-type=dataset --local-dir results
bash evaluation/run_autoreport.sh

As discussed in the paper (§ Limitations), known caveats include:
- Single backbone, single run, limited corpus coverage. All experiments use Qwen2.5-7B-Instruct under one decoding configuration and seed, without cross-family, cross-scale, multi-seed, or multilingual verification. The long-dialogue suite covers four scripted English-language corpora; broader genres and spontaneous dialogue are untested.
- LLM-based evaluation and extraction. Char and Sem rely on a GPT-4.1 judge with no human-agreement study; state extraction and update judgment are also LLM-driven without gold-annotation auditing. Significance tests are question-level paired on a single run and do not reflect run-to-run variability.
- Hand-tuned gating and parametric adaptation. Resistance, evidence, and cooldown thresholds are implementation defaults rather than swept or learned; the implicit parametric adaptation variant follows a P2P-style paradigm without fully exploiting the multi-timescale state.
The hyper-LoRA modulator architecture builds on the P2P codebase [Tan et al., 2025]. The eight evaluation corpora are derived from prior public releases (RAIDEN, CharacterEval, SimsConv, ChatHaruhi, HPD, the ConvoKit Friends Corpus, and public episode transcripts for The Office and Star Trek: TNG), and we thank the original authors and maintainers for making these resources available. The generation backbone (Qwen2.5-7B-Instruct) and embedding encoder (Qwen3-Embedding-4B) are from the Qwen team.
- Code in this repository is released under the MIT License.
- Released model checkpoints and evaluation results on Hugging Face are released under CC-BY-NC-4.0; see the model / dataset cards on Hugging Face for details.
- The underlying dialogue corpora retain their original source licenses; please consult each source dataset for redistribution terms.