RecorrectDoctorRolesAsync (+ admin button «اصلاح نقش»): re-runs the keyword parser + doctor-role
guard over the stored text of existing aggregated listings currently labeled «پزشک عمومی», and
corrects RoleId + the generic title in place when the text actually names a more specific role
(dentist, «متخصص», lab, …). No AI call, no delete/recreate — IDs and indexed URLs unchanged, only
GP-labeled rows touched. Cleans up the dentist/ENT/«متخصص غدد» mislabels already published.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the per-role (dentist/specialist) patches with one rule: «پزشک عمومی» is the AI fallback,
so whenever the keyword parser already extracted a more specific role from the same text, use that
(dentist, lab, OR tech, mislabeled nurse, …). Falls back to «پزشک متخصص» when the text says
specialist but the parser found nothing more specific. Only ever overrides the weak GP default, so
genuine GP ads are untouched. Applies to new ingests.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Extends the doctor-role guard: the AI defaults unclear (and even clearly non-GP) doctor ads to
«پزشک عمومی», so a dentist ad («دعوت به همکاری دندانپزشک») published as «استخدام پزشک عمومی».
When the chosen role is a generic doctor but the ad text says «دندانپزشک», correct it to
دندانپزشک (specialist correction stays for «متخصص/فوق تخصص/فلوشیپ»). Applies to new ingests.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Recommendations only scored open shifts, but almost all roles — doctors especially — exist as
استخدام (jobs), not dated shifts, and the only shifts are a handful of nurse shifts. So a visitor
who prefers «پزشک» got nurse-shift recommendations (scored by city/freshness) because there were
no doctor shifts to surface.
Now the engine scores BOTH shifts and job openings: role/city/facility/pay/freshness apply to
each, behavioral affinities are derived from shift AND job interest events, and the merged top-N
is returned. Recommendation can now carry a Shift or a JobOpening; the card renders either
(job → /Jobs/Details with employment type + salary; shift → unchanged with hour-bar). Cold start
interleaves the freshest of both.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
FacilityMatcher treated «شبانه روزی»/«خیریه»/«دولتی»/«خصوصی» as part of a name, so a real
facility merged into a generic one when they shared a descriptor — «درمانگاه شبانهروزی اسفند»
collapsed into the existing «پلی کلینیک شبانه روزی», losing «اسفند». Add these descriptors to
the stripped type-words so matching compares the distinctive core («اسفند») instead. Side
benefit: bare descriptor-only names («پلی کلینیک شبانه روزی») now resolve to junk and get
folded into the placeholder by the cleanup, rather than masquerading as a real facility.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(1) Specialist guard: the AI sometimes labels a clearly-specialist ad («پزشک متخصص گوش و
حلق و بینی»، «فلوشیپ»، «فوق تخصص») as «پزشک عمومی», so an ENT post published as
«استخدام پزشک عمومی». When the primary role is GP but the ad text names a specialist, swap
it to «پزشک متخصص» (the subspecialty stays as a tag).
(2) Phone type: the landline regex 0\d{2,3} also matched 09xx MOBILE numbers and labeled them
«تلفن ثابت». Iranian landline area codes are 0[1-8]xx (021/026/…), never 09 — restrict it so
mobiles are no longer mislabeled as landlines.
Both apply to new ingests; existing mislabeled rows correct on turnover/reprocess.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
RunAsync now calls a new RunPostIngestCleanupAsync at the end of each crawl: archive
out-of-scope/duplicate listings, merge duplicate + fold junk facilities, and backfill missing
Tehran coords. All in-place, reversible for listings, guarded for facilities, and pure DB+CPU
(no AI/network) so it is cheap to run every ingest. The cleanup counts are appended to the
run-log detail. This keeps legacy + freshly-arrived junk from accumulating without the admin
having to click the cleanup buttons after each run.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New MedboomListingSource: a WordPress medical-classifieds board crawled like medjobs
(wp-sitemap.xml -> posts-post-N.xml, newest first), filtered to clinical-role slugs and
Tehran-only for launch. medboom skews toward doctors/dentists/pharmacists and carries both
hiring and availability posts, so it directly broadens the role mix the nurse-heavy Divar
content lacks. Iranian-hosted -> no proxy/VPN needed (relevant now that Telegram is off).
Wired like the other sources: AppSetting toggles (MedboomEnabled/MaxAds/UseProxy) + EF
migration, SettingsService persistence, admin Settings UI, DI registration. Off by default.
Validated against live data: Tehran clinical ads at named clinics (pharmacy/dental/etc.).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Keep only ads located in Tehran: pre-drop slugs naming other major cities to save fetches,
then authoritatively keep ads whose text states «تهران» (the og:description reliably says
«شهر تهران»). Pool 5x candidates so the Tehran filter still yields a full batch. Validated
against live data: ~16/18 clinical candidates are Tehran. Nationwide expansion later becomes
a per-source city setting once the engine is proven.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New IranEstekhdamListingSource: reads the site monthly ad sitemaps
(sitemap-ads.xml -> sitemap-ads-YYYY-M.xml), keeps only ad URLs whose Persian slug names a
clinical role (veterinary/non-clinical excluded), then extracts each ad title + description
(+ phone). These are employer ads at NAMED facilities, so they directly improve the
unknown-facility problem the classifieds content has.
Wired in like Medjobs: AppSetting toggles (IranEstekhdamEnabled/MaxAds/UseProxy) + EF
migration, SettingsService persistence, admin Settings UI, and DI registration. Off by
default; the medical-gate validator + AI auditor + junk filters screen results downstream.
Note: e-estekhdam / jobinja / jobvision are JS-rendered SPAs whose ad lists are not in static
HTML, so they need API reverse-engineering (a separate effort), not this static-scrape path.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The junk-removal half of the facility cleanup silently no-op'd because it looked up the
shared placeholder by the exact UnknownFacilityName constant («نامشخص / ثبت نشده»), but
production data uses an older wording («مرکز درمانی (نامشخص)»), so the lookup returned null
and the whole junk pass was skipped (only the duplicate-merge half ran).
Now resolve the placeholder by the «نامشخص» marker and pick the bucket with the most
listings (the real one), and exclude it from the merge pass by id. Re-running the cleanup
will fold «بیمارستان هستم», «... از مدجابز», bare type-word facilities, etc. into it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cleans up the crawl-generated facility table that surfaced garbage on /Facilities
(«بیمارستان هستم», «... از مدجابز», bare «کلینیک», «سازمان برنامه جنوبی» x3):
- FacilityMatcher.IsJunkName: shared detector for non-names — bare type words, cores
made only of filler/verb tokens, and leaked crawl-source/placeholder text. Added
داروخانه/آسایشگاه to the generic type words so bare ones are caught and dedupe better.
- HeuristicListingParser.ExtractFacilityName now rejects junk candidates (and emoji), so
new ingests fall back to the shared placeholder instead of forging a fake facility.
- IngestionService.MergeAndCleanFacilitiesAsync (+ admin button): folds junk facilities
into the placeholder and merges Persian-fuzzy duplicates into one keeper, repointing
their shifts/jobs first. Hard guard: only purely crawl-generated, unmanaged facilities
are removed — employer-owned and verified facilities are never touched.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Per the project archive-not-delete convention, the in-place purge now sets out-of-scope
and duplicate aggregated jobs/shifts to ShiftStatus.Archived instead of hard-deleting:
- The row is retained for analysis and the change is reversible.
- The listing drops out of every public screen and the sitemap (which filter Status == Open).
- Its detail page now returns 410 Gone (the standard permanent-removal signal) so search
engines deindex it cleanly, instead of leaving the off-topic page live at 200 or hard-404ing.
Dedupe of job reposts archives the older copies the same way. Coordinate backfill now also
skips non-Open rows. Valid listings are untouched, so IDs/URLs stay stable.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Jobs now keep the AI-extracted salary (d.PayAmount ?? parsed.PayAmount); they
previously used only the parser figure, so every aggregated opening showed «توافقی».
- Geocoder also scans the ad body, so Tehran ads that name a neighbourhood only in
free text («… در سهروردی») get an approximate map point.
- New BackfillCoordsAsync (+ admin button): fills missing coords on existing aggregated
listings from their stored text, in place — no ID/URL churn, SEO-safe.
- New PurgeInvalidAggregatedAsync + DedupeJobsAsync (+ admin button): in-place removal of
out-of-scope (domestic/promo/spam) aggregated jobs/shifts and duplicate job reposts,
keeping valid listings' IDs.
- Jobs detail page always renders the location card (matches Shifts) instead of hiding it
when coords are missing.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Re-check of live applicants found two gaps:
- «کمک بهیار آماده به کار» — the availability phrase glued onto the role. StripRoleModifiers
now removes «آماده به کار / آماده همکاری / جویای کار / جهت همکاری» phrases before
token-stripping, so the role collapses to «کمک بهیار».
- «خانم امورسبک منزل» — light-housework domestic helpers (not کادر درمان). Validator
now discards ads with «امور منزل / نظافت منزل / خدمتکار / مستخدم …» markers.
Both take effect for existing data on the next applicant reprocess.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mark up the result list as a schema.org ItemList (ordered listing URLs) so Google
reads the landing/list pages as a curated collection. Emitted alongside the
breadcrumb JSON-LD when there are results.
Improvement 6 of the backlog (SEO).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add SeoJsonLd.Breadcrumb + Crumb record + _Breadcrumbs partial, and wire a trail
into the Jobs/Shifts list (landing) and detail pages: خانه › استخدام/شیفت › {نقش}
› {شهر|عنوان}. The role crumb links to the role landing page (more internal
links), and Google can show the breadcrumb path in results. Detail pages emit it
alongside the existing JobPosting JSON-LD.
Improvement 5 of the backlog (SEO).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
JobPosting requires a valid hiringOrganization; emitting «نامشخص / ثبت نشده» (the
placeholder for aggregated ads with no named center) makes Google reject the
posting and can flag invalid structured data across the site. Add
SeoJsonLd.HasRealEmployer and gate the JobPosting/ShiftPosting <script> on it, so
only listings with a genuine employer get marked up (those are the Jobs-eligible
ones anyway).
Improvement 3 of the backlog (SEO).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Exact ContentHash dedup misses the same ad reposted with slightly different text
(e.g. the ~18 repeated «کمکیار آقا»). DedupeTalentAsync collapses open aggregated
applicants by two high-precision signals — identical phone, or identical
(role, city, normalized description core with digits/«… پیش» time-phrases
stripped) — keeping the newest of each group. Runs at the end of both RunAsync
and ReprocessAsync; removed count surfaces in the run log.
Improvement 1 of the data-quality/SEO backlog.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The server clock is correct (UTC); the app rendered UTC wall-clock directly, so
the run log showed ~3.5h behind Tehran. Add JalaliDate.ToTehran (flat UTC+3:30 —
Iran dropped DST in 2022) + DateTimeLabel, and convert the UTC-stored timestamp
displays (ingestion run log, RawListing FetchedAt, report CreatedAt). Shift
start/end inputs are TimeOnly, left as-is.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reprocess deletes+rebuilds aggregated listings, which changes their IDs. Shift/Job
detail pages are indexed and in the sitemap, so churning them would 404 ranked
URLs. «آماده به کار» pages are NoIndex + Disallow, so rebuilding them has zero SEO
impact — and that's where all the duplicate/sprawl problems were.
ReprocessAsync(talentOnly: true) now only deletes/rebuilds TalentListings and
skips non-talent raws (leaving shift/job listings + their RawListing links
untouched). Admin button relabelled «پردازش مجددِ آماده به کارها (امن برای SEO)».
Shifts/jobs self-clean via normal ingestion turnover.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Where deterministic geocoding gives up (neighborhood not in the TehranGeo table),
fall back to the registered AI model: the auditor now also returns approximate
lat/lng for a recognized Tehran neighborhood (folded into the existing single
audit call — no extra requests), and Publish uses it only after the source ad and
the local table, and only when it falls inside greater Tehran (InTehran bbox
guard rejects hallucinated points). Coords order: Divar point → TehranGeo → AI.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Re-checked live data and found cases the first pass missed:
- Gender baked into roles («پرستار آقا», «کمک بهیار آقا») → StripRoleModifiers
removes آقا/خانم/مرد/زن/کارآموز/ارشد… from role names (none of the real roles
contain these), collapsing the sprawl; gender still lives on the Gender field.
- «کمکیار» vs «کمک بهیار» forking → alias maps them to one role.
- Personality words («خوشاخلاق», «دلسوز», «منظم»…) added to the tag stop-list.
- Prompt: gender goes to the gender field, not the role.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Many Medjobs/Telegram ads name a Tehran neighborhood («ونک», «تهرانپارس»…) but
carry no coordinates. New TehranGeo geocoder maps ~45 neighborhood names to a
rough center; Publish falls back to it (from the resolved district / AI district
/ area note) when the source ad has no point. Shown via the existing «محدودهٔ
تقریبی» circle + disclaimer — never a precise pin. Tehran-only; extends the
existing approx-coords feature so non-Divar listings can show a map too.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
We captured Divar's privacy-fuzzed coords on RawListing but discarded them for
the listings that need them: unnamed-facility shifts/jobs dropped them (to avoid
piling on the shared placeholder) and applicants had no coordinate field at all.
- Add Lat/Lng to Shift, JobOpening, TalentListing (migration ListingApproxCoords).
- Publish stores the source ad's approx coords on each aggregated listing.
- Detail pages render the map from the listing's own coords (fallback: facility),
and aggregated coords show as a shaded «محدودهٔ تقریبی» circle (not a precise
pin) via _NeshanMap data-approx, with a disclaimer. Applicants get a map card
(they had none) + the page now loads the Neshan key.
Only Divar provides coords; the map needs NeshanMapKey set in admin settings.
Existing rows get coords once reprocessed (RawListing already has them).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Qualified live applicants and found three problems, all fixed:
- Duplicate cards: one ad fanned out into «پرستار» + «پرستار کودک» (same person).
Applicants now publish ONE listing (no role fan-out); secondary roles → tags.
- Role sprawl: modifiers became roles. Prompt now returns the BASE profession
and pushes age-group/ward/seniority to tags; new roles only for a genuinely
new base profession (تکنسین داروخانه ✓, پرستار کودک ✗).
- Tag/category noise: categories pinned to the 5 fixed groups (+سایر, never
invented); BuildTags drops pay/contact/location/fragment words.
Reprocess action: IngestionService.ReprocessAsync re-runs the current pipeline
over every stored RawListing WITHOUT re-fetching (keeps the raw text, so nothing
is lost to sources only exposing recent posts), deleting the old aggregated
posts and republishing cleanly. Admin dashboard button «پردازش مجددِ آیتمهای
ذخیرهشده» runs it on a background scope; result lands in the run-log.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Google Search Console shows all top queries are «استخدام [نقش] [شهر]», but the
filtered index pages all shared the generic title «موقعیتهای استخدامی» and
weren't in the sitemap, so nothing ranked for those exact searches.
- Jobs/Shifts/Talent index pages now set a dynamic <title>/<h1>/meta from the
active role+city (e.g. «استخدام پزشک عمومی در تهران»).
- Pretty SEO routes /استخدام/{role}/{city?} and /شیفت/{role}/{city?} (via
AddPageRoute) resolve slugs → filters; unknown slug → 404. The layout already
derives the canonical from the path, so each pretty URL is its own canonical
and the query-string forms canonicalize to /Jobs (no duplicate content).
- sitemap.xml now lists role-only and role×city landing URLs for every combo
with live listings (URL-encoded), so Google discovers them.
- New SeoSlug helper (Persian-tolerant: ي/ك, ZWNJ, hyphen/space).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phone fix: shifts/jobs showed Facility.Phone, but unnamed ads all share one
placeholder facility, so every such listing displayed the same stale number
while the ad's real phone sat unused in the description. ContactMethod is now
attachable to a Shift/JobOpening (not just talent); ingestion stores the ad's
own number(s) on each listing and the detail pages render them (new
_ContactList partial), falling back to the facility phone only when the ad had
none. Migration ShiftJobContacts (nullable owner FKs) — auto-applies on deploy.
Stale applicants: skip «آماده به کار» posts older than 7 days at ingest, by the
source's real timestamp (Telegram <time>, Bale date) or a Persian time-ago
phrase in the text (Divar «۲ هفته پیش»). Recorded as Discarded; shifts/jobs
are not aged out.
Admin: Review page now shows a «مشاهده آگهی در منبع» link (RawListing.SourceUrl)
so the source post can be checked before publishing.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Unknown roles from the AI are now resolved-or-CREATED (Persian-normalized dedupe) instead of dropped/fallback; new role gets the AI's category, assigned to the applicant.
- AI output gains category + tags; AI-detected skills/requirements (ICU, MMT, پروانهدار…) now fold into the applicant's searchable Tags.
- System prompt is hardcoded in AppSetting.DefaultPrompt and used directly by the auditor; admin sees it read-only (cannot edit/break it).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Test-AI button called AuditAsync, which caught every exception and returned
null, and used EnsureSuccessStatusCode() (discarding the response body). So a
failing AI service only ever produced a generic 'no response' message with no
detail — impossible to diagnose.
- Add IAiAuditor.TestAsync: runs the real call and returns a detailed Persian
diagnostic — HTTP status + response body on non-2xx, raw body when the shape
isn't OpenAI-compatible, and network/proxy/timeout specifics on exceptions.
- AuditAsync now logs the actual HTTP status + response body (and proxy state)
instead of a bare warning, so server logs show why a call failed.
- ExtractContent / ParseVerdict no longer throw on unexpected JSON; they return
null so the caller can show the raw body.
- Settings 'Test AI' button uses TestAsync; result box renders multi-line and
switches to alert-error styling when the test fails.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- SearchHighlight.Snippet: extracts a ±70-char window around the first
matching term and marks it (with ellipses) — the ES "highlight" fragment.
- Result cards (shift/job/talent) now show that snippet from the matched
description/tags when a query is present, so you SEE where the term hit
(e.g. «…دارای مدرک <mark>mmt</mark>…») instead of just the role.
- Typeahead suggestions gain a highlighted "sub" line (talent→tags,
shift→city·specialty, job→facility·city) so matches show in the dropdown too.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Tags: parser extracts cert/skill keywords (mmt, ICU/CCU, دیالیز, اتاق عمل,
اورژانس, مسئول فنی, پروانهدار…) + role + city into TalentListing.Tags
(+ migration); shown as chips on cards.
- Deep search on /Talent: «جستجوی عمیق» box does Postgres ILIKE across
tags, description, person, area, role, city (every term must match);
matches are highlighted with <mark> via SearchHighlight.
- Never delete: ShiftStatus.Archived + the admin «بایگانی گروهی» action now
ARCHIVES aggregated posts (hidden from site, kept in DB) and leaves the
raw crawl rows intact — a permanent archive for future analytics.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- ContactMethod entity (Type + Value + SortOrder) 1→N on TalentListing (+ migration).
- Parser extracts ALL contacts: multiple phones + landlines, email, and
socials (Instagram/Telegram/Bale/WhatsApp/website) via URLs and Persian
keyword cues; primary Phone kept for cards.
- ContactInfo helper: per-type label/icon/clickable href (tel:/mailto:/t.me/…).
- Ingestion attaches contacts to each (fanned-out) talent listing; manual
Review re-parses to attach them + the admin-typed phone.
- Talent details renders the full contact list as buttons; falls back to the
single phone, then the Divar source link.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
An ad like «استخدام پرستار سالمند و کودک و همراه بیمار» names several roles;
we kept only the first. Now:
- Parser collects ALL roles (ParsedListing.RoleNames): exact taxonomy
matches (substring-deduped so پرستار⊂پرستار سالمندان) plus synonyms
(سالمند→پرستار سالمندان, کودک/همراه بیمار→پرستار, اتاق عمل→تکنسین اتاق عمل…),
capped at 4.
- Ingestion publishes one Shift/Job/Talent per resolved role (AI role +
parser roles, distinct, capped), so each role is independently
browsable and filterable. RawListing dedupe is unchanged (one raw → N posts).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A Divar applicant whose number is behind the login-gated reveal should
still publish — the detail page already links back to Divar for the phone.
Talent now scores role(40)+medical(10)=50, so role+medical alone passes
without a phone; phone just adds confidence.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Medical-flavored ads like «کارگاه بوتاکس و فیلر… ویژه پزشکان ۱۰٪» passed the
medical gate and got misclassified as a پزشک عمومی shift with a bogus 10%
share. Now: if a course/event/product marker is present and there's no
staffing intent (hiring/shift/availability), the item is auto-discarded.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds a «شبکههای اجتماعی» admin section + scheduler that publishes a daily
«کادر آمادهبهکار امروز» digest:
- AppSetting: social toggles, posts-per-day, editable header/footer,
per-channel bot token + chat id (Telegram, Bale), Instagram enable +
extra hashtags, proxy toggle, last-posted timestamp (+ migration).
- SocialPostService: builds today's talent digest as text, posts to
Telegram and Bale via their bot sendMessage APIs (proxy-aware), and
produces an Instagram caption + auto hashtags (role/city based).
- SocialPostWorker: posts N times/day, evenly spaced, self-paced; reads
settings live so it's togglable without redeploy.
- /Admin/Social: credentials + header/footer + posts/day, live preview of
today's message, «ارسال اکنون» button, and an Instagram caption pack
with copy button (semi-automatic — you post the image manually).
- Nav link added.
Telegram/Bale post as TEXT (per request). The Vazirmatn image card for
Instagram is phase 2 (needs SkiaSharp+HarfBuzz + a TTF font).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Parser now reads the currency: ریال amounts (incl. «میلیون ریال» and
numbers with no تومان unit but ≥200M) are converted to تومان (÷10), so
«۴۰۰٬۰۰۰٬۰۰۰ ریال» shows as ۴۰٬۰۰۰٬۰۰۰ تومان instead of 400M.
- Aggregated facility fallback name no longer embeds the source
(«مرکز درمانی (از مدجابز)» → «مرکز درمانی (نامشخص)»).
- Talent details only ever names Divar as a fallback source (when the
number couldn't be extracted); medjobs/telegram are never shown publicly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Talent «آماده به کار» now has its own freshness window (21 days, vs 30
for jobs) since availability goes stale fast; archiver, browse, and home
use TalentCutoffUtc.
- Expired/filled job openings and past/filled shifts now emit
robots noindex so Google drops dead listings instead of keeping
soft-404 pages. (Talent details were already noindex.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- HarvestPhones was run over the whole page, so Medjobs' own header/footer
number (09101016110) was appended to every ad. Now harvest only the ad's
description region in Medjobs + Website sources; the protected number
still comes from the reveal call. No more duplicate number across ads.
- The amount extractor read phone digits as a Toman price
(۹,۱۰۱,۰۱۶,۱۱۰ تومان). The parser now strips «شماره تماس…» lines and
mobile/landline numbers before extracting money, and only accepts 6–10
digit numbers with no leading zero (phones/ids start with 0 or are 11+).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Harvest now keeps each post's token, so we build a real post URL
(divar.ir/v/{token}) instead of a generic link.
- For each post we fetch the detail JSON (posts-v2/web/{token}) and
harvest any contact number from it — covering the very common case
where the poster writes the phone into the ad description. Divar's
click-to-reveal is login-gated, so this gets the in-text numbers
without auth; fails soft (blocking/errors → skip).
- HarvestPhones hardened with digit-boundary guards so it can't grab a
slice of a longer numeric id/timestamp inside JSON.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The contact phone on medjobs.ir is loaded by JS only after clicking
«تماس با این آگهی» — it isn't in the page HTML, so scanning the markup
found nothing. We now replay that exact reveal request server-side:
- POST https://medjobs.ir/wp-admin/admin-ajax.php with
action=isatis_protect_contact & id=<listingId> (no nonce needed),
then harvest the tel: numbers from the returned HTML table.
- Listing id is pulled from the page via the WP shortlink (?p=ID),
postid-/data-id, or the visible «کد آگهی» as a fallback.
- Numbers are appended to the ad text so the parser/AI capture them and
they reach the published listing. Wrapped in try/catch so a failed
reveal never breaks ingestion; uses the same (proxy-aware, brotli-
decompressing) client as the page fetch.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
AI (when enabled, now that the server proxy is up):
- AiStructured gains phone, personName, yearsExperience, isLicensed.
- The auditor appends an authoritative output-schema to the admin prompt
so classification stays correct even with an older stored prompt — it
now classifies kind as shift|job|talent and extracts the contact phone
and talent details.
- Ingestion publish prefers the AI's tags (kind/role/city/facility/phone +
talent fields) over the heuristic parser when present.
- Default prompt updated to describe the three kinds + new fields.
Phone extraction from websites (Medjobs / generic sites), where the
number sits behind a "تماس با این آگهی" reveal:
- HtmlUtil.HarvestPhones scans the full markup for tel: links, JSON-LD
"telephone", data-*phone* attributes, and inline Iranian mobile/landline
numbers (Persian digits folded), normalized (mobiles 09…, landlines 0…).
- Medjobs + Website sources append harvested numbers to the ad text so the
parser/AI capture them; manual review then prefills the phone too.
- Parser phone extraction now also captures a landline as a fallback.
Note: if a site loads the number purely via XHR (not in HTML), a
per-source reveal endpoint would be a follow-up.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds a third listing kind alongside Shift/Job for healthcare staff who
advertise their own availability (very common in Iranian medical
channels, e.g. "دندانپزشک آماده همکاری… ۵۰٪ تسویه"). These have no
facility; the contact phone is the key field.
- Model: TalentListing (role, person name, years, licensed, city/district,
area note, availability, gender, comp, phone) + ListingKind.Talent +
RawListing.LinkedTalentId + DbSet/relations/indexes + EF migration.
- Parser: detect آمادهبهکار/جویای کار → Kind=Talent; extract person name,
years of experience, licensed flag, area («منطقه ۱»), phone. Facility
name extraction now skipped for talent.
- Validator: talent path scores role + phone + medical (no facility/pay
required).
- Ingestion auto-publish: creates a TalentListing for talent kind.
- Review (manual publish): Talent option + talent fields; publishes a
TalentListing without a facility. Shift/Job facility now falls back to a
shared «نامشخص / ثبت نشده» record when the ad names none — publishing
never fails on a missing facility.
- Browse /Talent (indexable, filters: city/district/role/gender),
details /Talent/Details (noindex — personal contact, tel: call button),
_TalentCard, badge-talent, nav link, home section.
- Sitemap includes /Talent; robots disallows /Talent/Details. Archiver
expires stale talent listings.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
When publishing a scraped listing we now look for a facility we already
have that is exactly or closely the same, and only create a new one when
there is no match — avoiding duplicates like «بیمارستان میلاد» vs «میلاد».
- ListingParser: extract a facility name (keyword + distinctive words) from
the post and surface it in the parser notes.
- FacilityMatcher: Persian-aware normalization (ي/ك, ZWNJ, punctuation),
type-word stripping for a "core" name, contains + Levenshtein similarity,
and FindBest (same-city exact → any-city exact → same-city fuzzy → fuzzy).
- Review (manual publish): auto-select a matching facility or prefill the
new-facility name; resolve-or-create uses fuzzy match; dropdown preselects.
- IngestionService (auto-publish): reuse FacilityMatcher against a run-wide
facility list (grows as new ones are created) instead of exact-name only.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Each ingestion run now records an IngestionRun row (found/queued/published/flagged/spam/duplicates + a per-source detail string). Admin → صف آگهیها shows a «تاریخچه جمعآوری» table of the last 15 runs (hover a row for the per-source breakdown), so admins can see how much each source found vs added over time. IngestionSummary gains TotalFetched. Migration: IngestionRuns table.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add «تست اتصال VPN/پروکسی» (reaches a filtered site through the proxy and reports connected/latency) and «تست هوش مصنوعی» (sends a sample post through the configured model and shows the verdict + extracted fields) to admin Settings. Fix OpenAiCompatibleAuditor.ParseVerdict: TryGetInt32/64 threw on null/string JSON values (the model commonly returns payAmount/sharePercent as null), which silently failed every audit — now guarded on ValueKind==Number. Verified the real OpenAI key extracts perfectly (approve / role=پرستار / city=تهران / shift=night).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add AiUseProxy setting + a toggle in the AI settings section. ScrapeHttpClients.ForAi(settings) returns a proxied HttpClient (reusing IngestProxyUrl, 100s timeout) when AiUseProxy is on, otherwise direct; AI-cache keys are protected from the scrape-client cleanup. OpenAiCompatibleAuditor now uses it, so the AI auditor (e.g. api.openai.com) is reachable through the same Xray sidecar that serves Telegram. Migration adds the column.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Analyzed live Divar (POST search) and Medjobs (ad_listing sitemaps) data — both are free Persian text. Tighten the medical-relevance gate (drop generic «استخدام»/«شیفت» that match retail/restaurant ads; add clinical terms: بهیار/اتاق عمل/بیهوشی/رادیولوژی/آزمایشگاه/دیالیز/فوریت/تریاژ/… ) so off-topic Divar jobs get flagged, not treated as medical. Add clinical role synonyms in the heuristic parser (بهیار/کمکپرستار/سالمند→پرستار, اتاق عمل→تکنسین اتاق عمل, فوریت→فوریتهای پزشکی, آزمایشگاه→کارشناس آزمایشگاه, فوقتخصص→پزشک متخصص…). Result on live data: Medjobs now yields ~9/30 queue-ready healthcare listings; Divar correctly flags ~72/75 noise for manual review.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>