Facility data hygiene: merge duplicates, drop junk-named facilities
Cleans up the crawl-generated facility table that surfaced garbage on /Facilities («بیمارستان هستم» ("I am a hospital"), «... از مدجابز» ("... from Medjobs"), bare «کلینیک» ("clinic"), «سازمان برنامه جنوبی» x3):

- FacilityMatcher.IsJunkName: shared detector for non-names: bare type words, cores made only of filler/verb tokens, and leaked crawl-source/placeholder text. Added داروخانه (pharmacy) and آسایشگاه (nursing home) to the generic type words so that bare ones are caught and deduplication improves.
- HeuristicListingParser.ExtractFacilityName now rejects junk candidates (and emoji), so new ingests fall back to the shared placeholder instead of forging a fake facility.
- IngestionService.MergeAndCleanFacilitiesAsync (+ admin button): folds junk facilities into the placeholder and merges Persian-fuzzy duplicates into one keeper, repointing their shifts/jobs first.

Hard guard: only purely crawl-generated, unmanaged facilities are removed; employer-owned and verified facilities are never touched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
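The three junk categories named in the commit message (bare type words, cores made only of filler/verb tokens, leaked crawl-source/placeholder text) can be illustrated with a minimal sketch. This is not the project's FacilityMatcher: the class name JunkNameSketch, the word lists, and the whitespace tokenization are all assumptions made for illustration, and the real detector may differ substantially.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch only; NOT the project's FacilityMatcher.
public static class JunkNameSketch
{
    // Generic facility-type words (assumed list). A name made only of these identifies nothing.
    private static readonly HashSet<string> TypeWords = new()
    {
        "بیمارستان", "کلینیک", "درمانگاه", "داروخانه", "آسایشگاه"
    };

    // Filler/verb tokens (assumed list).
    private static readonly HashSet<string> FillerWords = new()
    {
        "هستم", "هستیم", "است", "از", "در", "و"
    };

    // Fragments that suggest leaked crawl-source or placeholder text (assumed list).
    private static readonly string[] LeakedMarkers = { "مدجابز", "نامشخص" };

    public static bool IsJunkName(string? name)
    {
        if (string.IsNullOrWhiteSpace(name)) return true;

        // Category 3: leaked crawl-source / placeholder text.
        if (LeakedMarkers.Any(m => name.Contains(m, StringComparison.Ordinal))) return true;

        // Split on any whitespace, then strip the generic type words to get the "core".
        var tokens = name.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries);
        var core = tokens.Where(t => !TypeWords.Contains(t)).ToList();

        // Category 1: nothing left, so the name was only bare type word(s).
        if (core.Count == 0) return true;

        // Category 2: the core consists solely of filler/verb tokens.
        return core.All(t => FillerWords.Contains(t));
    }
}
```

Under these assumed lists, a bare «کلینیک» and «بیمارستان هستم» are both flagged, while a name with a distinguishing core such as «بیمارستان میلاد» is not.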
@@ -145,6 +145,18 @@ public class IndexModel : PageModel
         return RedirectToPage();
     }
 
+    /// <summary>
+    /// Clean up the crawl-generated facility table: merge Persian-fuzzy duplicate facilities and fold
+    /// junk-named ones («بیمارستان هستم»، «... از مدجابز»، bare «کلینیک») into the shared placeholder,
+    /// repointing their listings first. Employer-owned / verified facilities are never touched.
+    /// </summary>
+    public async Task<IActionResult> OnPostCleanFacilitiesAsync()
+    {
+        var (merged, cleaned) = await _ingest.MergeAndCleanFacilitiesAsync();
+        IngestMessage = $"پاکسازی مراکز: {merged} مرکزِ تکراری ادغام و {cleaned} مرکزِ بینام/نامعتبر حذف شد (آگهیهایشان به مرکزِ معتبر یا «نامشخص» منتقل شد). مراکز ثبتشده توسط کارفرما/تأییدشده دستنخورده ماند.";
+        return RedirectToPage();
+    }
+
     private async Task LoadAsync()
     {
         Queue = await _db.RawListings
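The handler above only shows the call site, so the following is a rough in-memory sketch of how a merge-and-clean pass with the described hard guard might be structured. Everything here is hypothetical: the Facility and Listing shapes, the flag names, NormalizeKey as a stand-in for the Persian-fuzzy match, and the "protected first, then lowest Id" keeper rule are assumptions, and the real IngestionService.MergeAndCleanFacilitiesAsync presumably works against the database rather than lists.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory model; the real entities are not shown in this commit page.
public sealed record Facility(int Id, string Name, bool IsCrawlGenerated, bool IsEmployerOwned, bool IsVerified);

public sealed class Listing
{
    public int Id { get; init; }
    public int FacilityId { get; set; }
}

public static class FacilityCleanupSketch
{
    // Hard guard: only purely crawl-generated, unmanaged, unverified rows may be removed.
    public static bool IsRemovable(Facility f) =>
        f.IsCrawlGenerated && !f.IsEmployerOwned && !f.IsVerified;

    // Assumed stand-in for Persian-fuzzy matching: unify Arabic/Persian letter
    // variants, drop zero-width non-joiners, collapse whitespace.
    public static string NormalizeKey(string name)
    {
        var s = name.Replace('\u064A', '\u06CC')  // Arabic yeh -> Persian yeh
                    .Replace('\u0643', '\u06A9')  // Arabic kaf -> Persian keheh
                    .Replace("\u200C", "");       // zero-width non-joiner
        return string.Join(' ', s.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries));
    }

    public static (int Merged, int Cleaned) MergeAndClean(
        List<Facility> facilities, List<Listing> listings,
        int placeholderId, Func<string, bool> isJunkName)
    {
        int merged = 0, cleaned = 0;
        var redirect = new Dictionary<int, int>(); // removed facility Id -> target Id

        // 1. Junk-named, removable facilities fold into the shared placeholder.
        var junk = facilities
            .Where(c => c.Id != placeholderId && IsRemovable(c) && isJunkName(c.Name))
            .ToList();
        foreach (var j in junk) { redirect[j.Id] = placeholderId; cleaned++; }

        // 2. Remaining facilities are grouped by normalized name; one keeper per group.
        var groups = facilities
            .Where(c => c.Id != placeholderId && !redirect.ContainsKey(c.Id))
            .GroupBy(c => NormalizeKey(c.Name))
            .ToList();
        foreach (var g in groups)
        {
            // Assumed keeper rule: prefer a protected facility, then the lowest Id.
            var keeper = g.OrderBy(c => IsRemovable(c) ? 1 : 0).ThenBy(c => c.Id).First();
            foreach (var dup in g.Where(c => c.Id != keeper.Id && IsRemovable(c)))
            {
                redirect[dup.Id] = keeper.Id;
                merged++;
            }
        }

        // 3. Repoint listings first, then remove the folded facilities.
        foreach (var l in listings)
            if (redirect.TryGetValue(l.FacilityId, out var target)) l.FacilityId = target;
        facilities.RemoveAll(c => redirect.ContainsKey(c.Id));

        return (merged, cleaned);
    }
}
```

Repointing before removal mirrors the ordering the commit message describes, so no listing is left referencing a facility that no longer exists. Facilities that fail IsRemovable are never added to the redirect map, even when their name is junk or duplicates another row.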