Weekly Experiment Log: Testing Claude Cowork for Automated Contract Summaries
2026-03-07

Serial experiment log: prompts, safety checks, and outcomes using Claude Cowork to automate contract summaries for SMB ops.

Cut the contract chaos: how Claude Cowork turned a week's work into 90 minutes

Pain point: ops teams drown in contracts, context switching, and manual clause extraction. This weekly experiment log shows how we tested Anthropic's Claude Cowork to produce reliable, auditable contract summaries for SMB operations — with the exact setups, prompts, safety checks, and measurable outcomes you can reuse.

Executive summary (what mattered most)

Across six weekly experiments in late 2025–early 2026 we refined a reproducible pipeline that turns raw contracts into: (1) executive one-pagers, (2) clause-level extract tables, and (3) a risk-tier score for triage. The best setup reduced manual review time by ~78% on average while maintaining a verifiable fidelity workflow (human-in-the-loop verification, provenance links, and automatic redaction). Key wins: prompt templates, strict safety checks, and mandatory two-step human validation for high-risk clauses.

  • Document AI matured in 2025: agent-based workspaces and encrypted file agents became mainstream for business document workflows.
  • SMB ops in 2026 are consolidating tool stacks — document agents that can live inside your storage, not replace it, win.
  • Regulatory focus (GDPR, CCPA updates, sector-specific rules) makes provenance and redaction non-negotiable for automated contract processing.
"Backups and restraint are nonnegotiable." — a reminder from early agent deployments in 2025.

Experimental framework

We structured the work as weekly experiments with a repeated pipeline so results are comparable. Each week we logged:

  • Objective (what summary we wanted)
  • Dataset (contract types, volumes)
  • Claude Cowork setup (workspace, permissions)
  • Prompt templates and temperature/settings
  • Safety controls (redaction, retention, human checks)
  • Metrics (time saved, error rate, hallucination incidents)
  • Outcome and actionable adjustments

Week-by-week log: setups, prompts, safety checks, outcomes

Week 1 — Baseline: Extractive executive summaries

Objective: Produce a 1-paragraph executive summary plus party, term, renewal, and termination highlights.

Dataset: 20 vendor contracts (2–12 pages), typical SMB tech services agreements.

Setup: Create a Claude Cowork workspace with read-only access to a secure S3 bucket. Enable activity logging and disable internet browsing in the agent session. Configure output persistence for 7 days.

Prompt (template):

System: You are a document analyst agent. For each provided contract, return (A) a one-paragraph executive summary, (B) extracted fields: parties, effective date, term, automatic renewal clause, termination notice, indemnity summary, data processing clause location (page/paragraph). Use the source text as evidence and return page pointers. Do not invent facts.

Safety checks: Mandatory redaction step prior to upload (PII removed), and human spot-check of 3 summaries for fidelity.

Outcome: Time to first pass dropped from ~45 minutes per contract to 8–12 minutes including human review. Hallucinations: 2/20 (incorrect effective dates). Fix: instruct agent to quote the source line for dates.

Week 2 — Add provenance and citations

Objective: Reduce hallucinations by forcing quote-backed outputs.

Setup: Same workspace. Added a structured JSON output schema requirement and a 'source_excerpt' field for each extracted item.

Prompt change: Require the agent to return the verbatim sentence (up to 250 chars) from the contract that supports each extracted field.

Safety checks: Automated verifier checks that every extracted date/clause has a non-empty source_excerpt, otherwise mark as 'needs review'.
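The verifier described above can be sketched as a small post-processing function. A minimal sketch, assuming a record shape where each extracted field carries a `value` and a `source_excerpt` (these field names are illustrative, not a schema Claude Cowork enforces):

```python
# Week 2 verifier sketch: every extracted item must carry a non-empty
# source_excerpt, otherwise the whole record is flagged for review.
# REQUIRED_FIELDS is an illustrative subset of the fields we extracted.

REQUIRED_FIELDS = ["effective_date", "term", "termination_notice"]

def verify_extraction(record: dict) -> dict:
    """Set record['status'] to 'ok' or 'needs_review' and return it."""
    for field in REQUIRED_FIELDS:
        item = record.get(field)
        # Pass only if the field has both a value and a quote backing it.
        if not item or not item.get("value") or not item.get("source_excerpt", "").strip():
            record["status"] = "needs_review"
            return record
    record["status"] = "ok"
    return record
```

Running this between the agent and the reviewer queue is what let us treat unquoted extractions as automatically suspect.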

Outcome: Hallucinations fell to 0/25 contracts. Slight increase in processing time (+1–2 minutes) but greatly improved trust. This established the quote-backed extraction pattern going forward.

Week 3 — Risk scoring and triage

Objective: Create a 3-tier risk score (Low/Medium/High) for quick triage of contracts that need legal review.

Prompt snippet:

For each contract, score risk from 1–10 across: termination exposure, auto-renewal risk, liability cap, indemnity breadth, data-processing risk. Map 1–3 to Low, 4–7 to Medium, 8–10 to High. Provide bullets with the supporting clause quotes.
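The 1–10 to Low/Medium/High mapping in the snippet is easy to enforce in code rather than in the prompt. A sketch, with one assumption of ours: we take the worst dimension score as the tier driver, so a single severe clause is enough to escalate a contract:

```python
# Map per-dimension 1-10 scores to the triage tier from the prompt:
# 1-3 -> Low, 4-7 -> Medium, 8-10 -> High. Using max() means one bad
# dimension drives the tier (our design choice, not the model's).

def risk_tier(scores: dict) -> str:
    """scores: dimension name -> integer 1-10. Returns the triage tier."""
    worst = max(scores.values())
    if worst <= 3:
        return "Low"
    if worst <= 7:
        return "Medium"
    return "High"
```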

Safety checks: Flag any 'High' items for mandatory human review. Preserve the original clause location (page/paragraph).

Outcome: 6/30 contracts flagged High. Legal confirmed 5/6 as genuine concerns. ROI: legal hours were saved by focusing review time on true high-risk items instead of the full queue.

Week 4 — Multidocument consistency (Master Services Agreements + SOWs)

Objective: Ensure cross-document references are captured: governing law, master agreement references, amendment clauses across paired MSA + SOW files.

Setup: Feed document pairs into Claude Cowork as a single job with explicit pairing metadata. Request a crosswalk table that lists which SOW clauses rely on MSA clauses.

Prompt pattern: Ask the agent to produce a table: Clause (SOW ref), Source doc & page, Linked MSA clause & page, Notes on conflict.

Safety checks: Use an automated diff to check for conflicting dates or terms between paired files; if conflicts detected, escalate.
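The automated diff is conceptually simple: compare the fields that must agree across a paired MSA and SOW, and escalate on any mismatch. A minimal sketch, assuming both documents have already been reduced to flat field dictionaries (the field names are illustrative):

```python
# Week 4 conflict check sketch: fields listed in shared_fields must
# match across the paired MSA and SOW; any disagreement is escalated.

def find_conflicts(msa: dict, sow: dict,
                   shared_fields=("governing_law", "termination_notice")) -> list:
    """Return (field, msa_value, sow_value) tuples for every disagreement."""
    conflicts = []
    for field in shared_fields:
        a, b = msa.get(field), sow.get(field)
        # Only compare fields present in both documents.
        if a is not None and b is not None and a != b:
            conflicts.append((field, a, b))
    return conflicts
```

This is how the three conflicting termination windows surfaced: the extraction was per-document, but the escalation decision was a plain cross-document comparison.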

Outcome: Detected 3 SOWs with conflicting termination windows — all confirmed. This capability shortened contract onboarding from days to hours.

Week 5 — Abstractive summary with style constraints

Objective: Produce customer-facing one-pagers with plain-language summaries.

Prompt constraints: 120–150 word summary, reading grade 8 max, avoid legalese, include three bullets: key dates, customer obligations, and three red flags. Require a 'confidence' score (0–100) for each bullet.

Safety checks: Human readability assessment and a legal pass for disclosures. If confidence <70 for any bullet, mark for human editing.
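The confidence gate is worth automating so low-confidence summaries never reach customers by accident. A sketch of the rule as we applied it, assuming each bullet arrives as a dict with a 0–100 `confidence` score:

```python
# Week 5 gating rule: any bullet under the threshold routes the whole
# summary to human editing before external publication.

CONFIDENCE_THRESHOLD = 70

def needs_human_edit(bullets: list) -> bool:
    """True if any bullet's confidence falls below the threshold."""
    return any(b.get("confidence", 0) < CONFIDENCE_THRESHOLD for b in bullets)
```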

Outcome: Customer-ready summaries reduced support handoffs by 40% but required human polish for low-confidence clauses. We learned to set conservative confidence thresholds for external-facing content.

Week 6 — Stress test: large, legacy PDF bundle

Objective: Test reliability on 200-page legacy vendor bundle (scanned PDFs + OCR noise).

Setup: Preprocess with OCR cleanup (automated regex cleaning for common misreads), chunking into 10–20 page units, and reassembly of outputs with provenance mapping.

Safety checks: OCR confidence threshold: if OCR confidence is <85% for a chunk, mark it for manual OCR correction. Add a random 10% sampling check for hallucinations.
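The chunking and the OCR gate can both be expressed in a few lines. A sketch: the 85% threshold and the sub-20-page chunks come from the log above; the per-page `ocr_confidence` field is our assumed shape for the preprocessed pages:

```python
# Week 6 preprocessing sketch: split the bundle into <=20-page chunks,
# then route each chunk to the agent only if every page clears the
# OCR confidence gate; otherwise send it to manual OCR correction.

MAX_CHUNK_PAGES = 20
OCR_CONFIDENCE_MIN = 0.85

def chunk_pages(pages: list, size: int = MAX_CHUNK_PAGES) -> list:
    """Split a list of page dicts into chunks of at most `size` pages."""
    return [pages[i:i + size] for i in range(0, len(pages), size)]

def route_chunk(chunk: list) -> str:
    """'agent' if all pages clear the OCR gate, else manual correction."""
    if min(p["ocr_confidence"] for p in chunk) < OCR_CONFIDENCE_MIN:
        return "manual_ocr_correction"
    return "agent"
```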

Outcome: Initial hallucination rate rose due to OCR errors. With preprocessing and chunking, we achieved acceptable performance. Key learning: Claude Cowork performs best when upstream OCR quality is ensured and chunk sizes stay under 20 pages.

Prompt engineering patterns that mattered

Success was driven by structured prompts and constraints. Use these proven patterns:

  • System role clarity: Define the agent as a document analyst that must not invent facts.
  • Structured outputs: Require JSON-like outputs with fields (summary, extracted_fields, source_excerpt, page_ref, confidence_score).
  • Quote-backed extraction: Force the model to identify the verbatim supporting text for every claim.
  • Low temperature + instruction hierarchy: Use deterministic settings (low temperature) for extractive tasks and allow higher creativity for plain-language rewrites only.
  • Fail-fast checks: If a required field is missing or has no source excerpt, return a 'needs_review' flag.
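Putting the structured-output and fail-fast patterns together, a record from the pipeline looks roughly like this (the field names match the list above; the concrete values are illustrative):

```python
# Example of the structured-output shape the patterns describe, plus
# the fail-fast check: any missing required field flags the record.

record = {
    "summary": "12-month SaaS agreement with 30-day termination notice.",
    "extracted_fields": {"auto_renewal": True},
    "source_excerpt": "This Agreement shall automatically renew...",
    "page_ref": "p. 4, para. 2",
    "confidence_score": 92,
}

def fail_fast(rec: dict) -> str:
    """Return 'needs_review' if any required field is missing or empty."""
    required = ("summary", "extracted_fields", "source_excerpt",
                "page_ref", "confidence_score")
    if any(not rec.get(k) for k in required):
        return "needs_review"
    return "ok"
```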

Safety, compliance, and governance checklist

Automated contract processing introduces risk. Build these guardrails before scaling:

  1. Access controls: Role-based access for workspaces; ephemeral credentials; SSO + MFA for human reviewers.
  2. Data minimization: Redact PII before upload. Keep raw documents in immutable backups only.
  3. Provenance: Maintain source pointers (file name, page, paragraph) for every extracted claim.
  4. Retention & wipe policies: Auto-delete intermediate agent outputs after a policy window unless flagged for audit.
  5. Audit logs: Store agent activity logs and reviewer decisions for SOC 2 / GDPR readiness.
  6. Human-in-loop: Mandatory review for High-risk flags; random sampling for quality checks on Low risk.
  7. Model behavior tests: Regular red-team prompts to probe for hallucination or disclosure behavior.

Measuring outcomes — metrics that ops teams care about

Report experiments in business terms. Use these KPIs:

  • Time-to-first-pass (minutes per contract) — captures immediate efficiency gains.
  • Human review time (minutes saved) — convert to FTE hours and dollars saved.
  • Accuracy/fidelity — percent of extracted fields that match human baseline.
  • Hallucination incidents — number per 100 contracts; aim for near-zero with provenance requirements.
  • Legal escalation rate — percent of contracts escalated to legal pre-automation vs post-automation.
  • Adoption rate — percent of ops team using agent outputs without rework.

Example ROI calc (conservative): If each contract review originally took 60 minutes and agent+human validation takes 12 minutes, one full-time equivalent (FTE) could process ~5x more contracts. Multiply FTE hours freed by hourly burden to estimate yearly savings — then factor subscription & monitoring costs.
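The same calculation as arithmetic, so you can plug in your own numbers. The 60- and 12-minute figures come from the example above; the hourly burden and tooling cost in the test values are placeholders, not benchmarks:

```python
# Conservative ROI sketch: 60 min manual vs 12 min agent+validation
# gives a 5x throughput multiplier; yearly savings are the FTE hours
# freed times hourly burden, minus subscription and monitoring costs.

MANUAL_MIN = 60
AUTOMATED_MIN = 12

multiplier = MANUAL_MIN / AUTOMATED_MIN  # ~5x contracts per FTE

def yearly_savings(contracts_per_year: int, hourly_burden: float,
                   tooling_cost: float) -> float:
    """Dollars saved per year after subtracting tooling costs."""
    hours_freed = contracts_per_year * (MANUAL_MIN - AUTOMATED_MIN) / 60
    return hours_freed * hourly_burden - tooling_cost
```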

Common failure modes and mitigations

  • OCR noise — preprocess and set OCR confidence thresholds.
  • Ambiguous dates/definitions — force the agent to return the exact source line and normalize date formats with a standard parser.
  • Overconfident summaries — require confidence scores and block external publication if below threshold.
  • Privacy leaks — pre-redaction and strict workspace controls; never store unredacted copies in the agent workspace.
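For the date-normalization mitigation, one standard-library approach is to try a short list of known contract date formats and refuse to guess when none match (the format list here is our assumption about what appears in typical SMB contracts):

```python
# Normalize extracted date strings to ISO form; anything that does not
# match a known format returns None and is routed to human review
# rather than silently guessed.

from datetime import datetime

KNOWN_FORMATS = ("%B %d, %Y", "%d %B %Y", "%Y-%m-%d", "%m/%d/%Y")

def normalize_date(raw: str):
    """Return 'YYYY-MM-DD', or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # ambiguous -> needs human review
```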

Practical templates (copy/paste foundations)

Use these starter templates in Claude Cowork. They served as the baselines in our experiments.

Extractive template (for field extraction)

System: You are a contract analyst. Output JSON with: summary_one_sentence, extracted_fields{partyA, partyB, effective_date, term, auto_renewal(boolean), termination_notice, indemnity_summary}, source_excerpt (for each field with file:page:paragraph). Do not invent facts. If you cannot find a field, set it to null.

Risk scoring template

System: For each contract, evaluate risk dimensions (termination_exposure, indemnity_breadth, data_risk) on 1-10. Return numerical scores, mapped tier (Low/Medium/High), and a 1-2 sentence rationale with the supporting source excerpt and page pointer.

Plain-language customer summary template

System: Write a 120-150 word plain-language summary for non-legal readers. Bullets: (1) key dates and renewal actions, (2) customer obligations, (3) three red flags. Provide a confidence score (0–100) per bullet and each bullet's source citation.

Real-world examples and lessons learned (experience)

We applied these experiments across three SMB use cases:

  • Vendor onboarding: Reduced onboarding time for standard SaaS vendors by 70% using extractive templates and risk scoring.
  • Customer renewals: Generated plain-language summaries for account teams, reducing churn-related disputes by surfacing auto-renewal triggers earlier.
  • Procurement reviews: Automated clause crosswalks caught conflicting SOW obligations in 4% of sampled contracts, saving negotiation cycles.

Future-proofing and advanced strategies (2026+)

As document AI evolves in 2026, plan for:

  • Encrypted in-place agents: Agents that operate inside your storage and never exfiltrate raw text — adopt when available to minimize data movement.
  • Hybrid human-AI workflows: Use AI for triage and humans for final signoffs; measure both sides of the loop.
  • Continuous monitoring: Run periodic drift tests to check for degradation after model updates (late 2025–early 2026 model refreshes changed some tokenization behaviors).
  • Plug-in telemetry: Track which prompt templates produce the fewest low-confidence items and standardize them in playbooks.

Checklist to run your first 2-week pilot

  1. Pick 25 representative contracts (mix of sizes and types).
  2. Preprocess: redact PII, run OCR, and store originals in immutable backup.
  3. Set up Claude Cowork workspace with RBAC and logging.
  4. Run Week 1 extractive template to gather baselines.
  5. Implement quote-backed extraction and provenance (Week 2 pattern).
  6. Measure time-to-first-pass, accuracy, hallucination incidents; decide go/no-go.

Closing notes — trust but verify

Agentic document tools like Claude Cowork are powerful productivity levers for SMB ops — but they require disciplined prompt engineering, robust safety controls, and human governance to deliver consistent ROI. Our serial experiment log shows that the combination of structured prompts, quote-backed extraction, and mandatory human checks produces both speed and trust.

Actionable takeaways

  • Start small: pilot with 25–50 contracts and a strict redaction and provenance policy.
  • Require quote-backed outputs to eliminate hallucinations.
  • Use risk scoring for legal triage and conserve legal hours for true high-risk items.
  • Measure time saved and error rates; convert savings into FTE equivalents to show ROI.

Call to action

Ready to run this in your ops team? Use the provided templates and two-week pilot checklist to get started. Subscribe to our weekly experiment newsletter for prompt packs, safety audit scripts, and a downloadable Claude Cowork contract-summary playbook tuned for SMBs in 2026.
