Facility data hygiene: merge duplicates, drop junk-named facilities

Cleans up the crawl-generated facility table, which was surfacing garbage on /Facilities:
«بیمارستان هستم» ("I am a hospital"), «... از مدجابز» ("... from Medjobs"), bare «کلینیک»
("clinic"), and «سازمان برنامه جنوبی» ("Sazman Barnameh Jonoubi") three times.

- FacilityMatcher.IsJunkName: shared detector for non-names: bare type words, cores
  made only of filler/verb tokens, and leaked crawl-source/placeholder text. Added
  داروخانه (pharmacy) and آسایشگاه (care home) to the generic type words, so bare
  occurrences are caught and deduplication improves.
- HeuristicListingParser.ExtractFacilityName now rejects junk candidates (and emoji), so
  new ingests fall back to the shared placeholder instead of forging a fake facility.
- IngestionService.MergeAndCleanFacilitiesAsync (+ admin button): folds junk facilities
  into the placeholder and merges Persian-fuzzy duplicates into one keeper, repointing
  their shifts/jobs first. Hard guard: only purely crawl-generated, unmanaged facilities
  are removed — employer-owned and verified facilities are never touched.
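
The two rules above (junk-name detection and the removal guard) can be sketched roughly as follows. This is an illustration only, not the code from this commit: `JunkNameSketch`, `FacilitySketch`, `MayRemove`, the token lists, and the flag names are all assumptions made for the example, and the real FacilityMatcher and facility entity may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical facility shape; the real entity's fields are not shown in this commit.
public sealed record FacilitySketch(string Name, bool IsCrawlGenerated, bool IsEmployerOwned, bool IsVerified);

// Hypothetical sketch of the rules described above; not the actual FacilityMatcher.
public static class JunkNameSketch
{
    // Generic facility-type words (hospital, clinic, pharmacy, care home).
    private static readonly HashSet<string> TypeWords = new()
    {
        "بیمارستان", "کلینیک", "داروخانه", "آسایشگاه"
    };

    // Filler/verb tokens; an assumed, deliberately tiny list ("I am", "from").
    private static readonly HashSet<string> FillerTokens = new() { "هستم", "از" };

    // Leaked crawl-source markers; assumed list.
    private static readonly string[] SourceMarkers = { "مدجابز" };

    public static bool IsJunkName(string? name)
    {
        if (string.IsNullOrWhiteSpace(name)) return true;

        // Leaked crawl-source text anywhere in the name marks it as junk.
        if (SourceMarkers.Any(m => name.Contains(m, StringComparison.Ordinal))) return true;

        // Split on any whitespace and drop the generic type words to get the "core".
        var tokens = name.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries);
        var core = tokens.Where(t => !TypeWords.Contains(t)).ToArray();

        // Bare type word(s): nothing is left once the type words are removed.
        if (core.Length == 0) return true;

        // A core made only of filler/verb tokens is not a real name either.
        return core.All(t => FillerTokens.Contains(t));
    }

    // Hard guard: only purely crawl-generated, unmanaged facilities may be removed.
    public static bool MayRemove(FacilitySketch f) =>
        f.IsCrawlGenerated && !f.IsEmployerOwned && !f.IsVerified;
}
```

Under these assumptions, a cleanup pass would fold a facility into the placeholder only when both `IsJunkName(f.Name)` and `MayRemove(f)` hold, after repointing its shifts/jobs.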

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
soroush.asadi
2026-06-21 05:40:29 +03:30
parent 8be275596b
commit 88eca92333
5 changed files with 137 additions and 2 deletions
@@ -145,6 +145,18 @@ public class IndexModel : PageModel
        return RedirectToPage();
    }

    /// <summary>
    /// Clean up the crawl-generated facility table: merge Persian-fuzzy duplicate facilities and fold
    /// junk-named ones («بیمارستان هستم»، «... از مدجابز»، bare «کلینیک») into the shared placeholder,
    /// repointing their listings first. Employer-owned / verified facilities are never touched.
    /// </summary>
    public async Task<IActionResult> OnPostCleanFacilitiesAsync()
    {
        var (merged, cleaned) = await _ingest.MergeAndCleanFacilitiesAsync();
        // English: "Facility cleanup: {merged} duplicate facilities merged and {cleaned} unnamed/invalid
        // facilities removed (their listings were moved to a valid facility or to «Unspecified»).
        // Facilities registered by an employer or verified were left untouched."
        IngestMessage = $"پاک‌سازی مراکز: {merged} مرکزِ تکراری ادغام و {cleaned} مرکزِ بی‌نام/نامعتبر حذف شد (آگهی‌هایشان به مرکزِ معتبر یا «نامشخص» منتقل شد). مراکز ثبت‌شده توسط کارفرما/تأییدشده دست‌نخورده ماند.";
        return RedirectToPage();
    }

    private async Task LoadAsync()
    {
        Queue = await _db.RawListings