Skip to content

(skill) update filename match fast path#2162

Open
edknv wants to merge 15 commits into
mainfrom
edwardk/skill-single-source
Open

(skill) update filename match fast path#2162
edknv wants to merge 15 commits into
mainfrom
edwardk/skill-single-source

Conversation

@edknv
Copy link
Copy Markdown
Collaborator

@edknv edknv commented May 28, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

@edknv edknv requested review from a team as code owners May 28, 2026 23:50
@edknv edknv requested a review from jdye64 May 28, 2026 23:50
@edknv edknv requested review from jperez999 and sosahi May 28, 2026 23:51
@edknv
Copy link
Copy Markdown
Collaborator Author

edknv commented May 28, 2026

/nvskills-ci

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 28, 2026

Greptile Summary

This PR tightens the filename fast path in the nemo-retriever skill so that PDF basename matching now requires the .pdf extension (stem-only matching is removed to avoid false positives on common English words), and propagates that constraint into documentation. It also ships several robustness fixes that address previous review feedback.

  • Matching logic (filename_fast_path.py): find_matches now requires the full name.pdf literal in the query; the duplicate sidecar candidate path ({stem}.pdf.pdf_extraction.json{stem}.pdf_extraction.json) is corrected; extract_pages redirects subprocess stdout to DEVNULL and wraps each file in a try/except CalledProcessError so a single bad PDF no longer aborts the whole run; page_records gains a json.JSONDecodeError handler; and os.listdir is wrapped to emit a clean error on missing ./pdfs/.
  • Docs / metadata: references/query.md, SKILL.md, and references/pitfalls.md are updated to match the new fast-path trigger condition and remove the output.json write step; BENCHMARK.md and skill-card.md reflect the updated NVSkills-Eval PASS result.

Confidence Score: 5/5

Safe to merge; all previously flagged issues are addressed and the logic changes are consistent with the updated documentation.

The matching change is narrow and well-motivated, the error-handling additions are defensive improvements with no regression risk, and the documentation is kept in sync. The only open gap is the absence of unit tests for the changed matching logic.

skills/nemo-retriever/scripts/filename_fast_path.py — the find_matches behaviour change has no automated coverage.

Important Files Changed

Filename Overview
skills/nemo-retriever/scripts/filename_fast_path.py Tightens filename matching to require the .pdf extension, fixes the duplicate sidecar candidate path, adds per-file error handling for extraction failures, and redirects subprocess stdout to DEVNULL to fix stdout protocol pollution — all previously-reviewed issues are addressed.
skills/nemo-retriever/references/query.md Updated fast-path trigger condition to document that the .pdf extension is now required, aligning documentation with the code change in filename_fast_path.py.
skills/nemo-retriever/SKILL.md Removes the instruction to write output.json and updates query-turn guidance to synthesize answers in-place; cosmetic improvements to hard-limit descriptions.
skills/nemo-retriever/references/pitfalls.md Trimmed failure-mode list by moving cold-start and low-relevance entries to the CLI reference docs, reducing duplication that Tier 2 had flagged.
skills/nemo-retriever/BENCHMARK.md Reflects updated NVSkills-Eval results: overall verdict flipped to PASS, Tier 2 duplicate finding resolved (0 findings), finding count updated to 21.
skills/nemo-retriever/skill-card.md Updated skill metadata: version bumped, output format description extended, evaluation summary updated to match new BENCHMARK.md, reference list reordered.
skills/nemo-retriever/skill.oms.sig Signature bundle refreshed to cover the updated file set; no code change.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([filename_fast_path.py called with query]) --> B{list ./pdfs/}
    B -- FileNotFoundError/PermissionError --> C[print ERROR to stderr\nexit 1]
    B -- success --> D[find_matches: scan basenames\nrequire name.pdf in query_lower]
    D -- no matches --> E[print NO_MATCH\nexit 0]
    D -- matches found --> F[extract_pages: run retriever\nstdout=DEVNULL per file]
    F -- CalledProcessError --> G[WARN to stderr, skip file\ncontinue loop]
    F -- success --> H[write sidecar JSON\nto /tmp/pdf_text/]
    G & H --> I[rank_pages: load sidecars\nscore by token freq]
    I -- json.JSONDecodeError --> J[ERROR to stderr\nreturn empty list]
    I -- no sidecar found --> K[skip match]
    J & K & I --> L{any scored pages?}
    L -- no --> M[print NO_TEXT\nexit 0]
    L -- yes --> N[print JSON ranking\n+ top page text\nexit 0]
Loading
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
skills/nemo-retriever/scripts/filename_fast_path.py:51-63
**Missing unit tests for changed matching logic**

The `find_matches` function has a material behaviour change — stems are no longer matched without the `.pdf` extension — but there are no unit tests covering the happy path, the extension-required path, or the `MIN_STEM_LEN` guard. The `test-coverage-new-code` rule requires tests for changed business logic. A query like `"Tell me about report_q3"` would have matched before and silently returns `NO_MATCH` now, with no automated safety net to catch regressions or future accidental reversions.

Reviews (6): Last reviewed commit: "lint" | Re-trigger Greptile

Comment thread skills/nemo-retriever/scripts/filename_fast_path.py
Comment thread skills/nemo-retriever/scripts/filename_fast_path.py
Comment thread skills/nemo-retriever/scripts/filename_fast_path.py
Comment thread skills/nemo-retriever/scripts/grep_corpus.py
@edknv
Copy link
Copy Markdown
Collaborator Author

edknv commented May 29, 2026

/nvskills-ci

@edknv edknv changed the title (skill-eval) add filename match fast path (skill) update filename match fast path May 29, 2026
@edknv
Copy link
Copy Markdown
Collaborator Author

edknv commented May 29, 2026

/nvskills-ci

Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants