# Hooks & Custom Code

Crawl4AI supports a **hook** system that lets you run your own Python code at specific points in the crawling pipeline. By injecting logic into these hooks, you can automate tasks like:

- **Authentication** (log in before navigating)
- **Content manipulation** (modify HTML, inject scripts, etc.)
- **Session or browser configuration** (e.g., adjusting user agents, local storage)
- **Custom data collection** (scrape extra details or track state at each stage)

In this tutorial, you’ll learn about:

1. What hooks are available
2. How to attach code to each hook
3. Practical examples (auth flows, user agent changes, content manipulation, etc.)

> **Prerequisites**
> - Familiar with [AsyncWebCrawler Basics](./async-webcrawler-basics.md).
> - Comfortable with Python async/await.

---
## 1. Overview of Available Hooks

| Hook Name | Called When / Purpose | Context / Objects Provided |
|-----------|-----------------------|----------------------------|
| **`on_browser_created`** | Immediately after the browser is launched, but **before** any page or context is created. | **Browser** object only (no `page` yet). Use it for broad browser-level config. |
| **`on_page_context_created`** | Right after a new page context is created. Perfect for setting default timeouts, injecting scripts, etc. | Typically provides `page` and `context`. |
| **`on_user_agent_updated`** | Whenever the user agent changes. For advanced user agent logic or additional header updates. | Typically provides `page` and the updated user agent string. |
| **`on_execution_started`** | Right before your main crawling logic runs (before rendering the page). Good for one-time setup or variable initialization. | Typically provides `page`, possibly `context`. |
| **`before_goto`** | Right before navigating to the URL (i.e., `page.goto(...)`). Great for setting cookies, altering the URL, or hooking in authentication steps. | Typically provides `page`, `context`, and `goto_params`. |
| **`after_goto`** | Immediately after navigation completes, but before scraping. For post-login checks or initial content adjustments. | Typically provides `page`, `context`, `response`. |
| **`before_retrieve_html`** | Right before retrieving or finalizing the page’s HTML content. Good for in-page manipulation (e.g., removing ads or disclaimers). | Typically provides `page` or the final HTML reference. |
| **`before_return_html`** | Just before the HTML is returned to the crawler pipeline. Last chance to alter or sanitize content. | Typically provides the final HTML or a `page`. |

### A Note on `on_browser_created` (the browser-only hook)

- **No `page`** object is available because no page context exists yet. You can, however, set up browser-wide properties.
- For example, you might control Chrome DevTools Protocol (CDP) sessions or advanced browser flags here.

---

## 2. Registering Hooks

You can attach hooks by calling:

```python
crawler.crawler_strategy.set_hook("hook_name", your_hook_function)
```

or by passing a `hooks` dictionary to `AsyncWebCrawler` or your strategy constructor:

```python
hooks = {
    "before_goto": my_before_goto_hook,
    "after_goto": my_after_goto_hook,
    # ... etc.
}

async with AsyncWebCrawler(hooks=hooks) as crawler:
    ...
```
### Hook Signature

Each hook is a function (async or sync, depending on your usage) that receives the parameters relevant to its stage, most often `page`, `context`, or custom arguments for that step. The library then awaits or calls your hook before continuing.
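
For illustration, here is a minimal sketch of a hook in the async style. The hook name is one of the real hooks listed above, but the exact keyword arguments passed at each stage can vary, so accepting `**kwargs` (as all the examples below do) is a safe pattern rather than a guaranteed call signature:

```python
# A minimal async hook: take the objects you need and accept **kwargs
# so extra arguments passed by the strategy don't break your function.
async def my_before_goto_hook(page, context, goto_params=None, **kwargs):
    url = goto_params.get("url") if goto_params else "(unknown)"
    print(f"[HOOK] About to navigate to: {url}")

# Register it with either style shown above, e.g.:
# crawler.crawler_strategy.set_hook("before_goto", my_before_goto_hook)
```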
---

## 3. Real-Life Examples

Below are concrete scenarios where hooks come in handy.

---
### 3.1 Authentication Before Navigation

One of the most frequent tasks is logging in or applying authentication **before** the crawler navigates to a URL (so that the user is recognized immediately).

#### Using `before_goto`

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def before_goto_auth_hook(page, context, goto_params, **kwargs):
    """
    Example: Set cookies or localStorage to simulate login.
    This hook runs right before page.goto() is called.
    """
    # Example: Insert cookie-based auth or local storage data
    # (You could also do more complex actions, like fill forms, if you already have a 'page' open.)
    print("[HOOK] Setting auth data before goto.")
    await context.add_cookies([
        {
            "name": "session",
            "value": "abcd1234",
            "domain": "example.com",
            "path": "/"
        }
    ])
    # Optionally manipulate goto_params if needed:
    # goto_params["url"] = goto_params["url"] + "?debug=1"

async def main():
    hooks = {
        "before_goto": before_goto_auth_hook
    }

    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig()

    async with AsyncWebCrawler(config=browser_cfg, hooks=hooks) as crawler:
        result = await crawler.arun(url="https://example.com/protected", config=crawler_cfg)
        if result.success:
            print("[OK] Logged in and fetched protected page.")
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
**Key Points**

- `before_goto` receives `page`, `context`, and `goto_params`, so you can add cookies, localStorage, or even change the URL itself.
- If you need to run a real login flow (filling and submitting forms), `on_page_context_created` is a better fit, since a `page` is already available there and the login runs once at the start; see the sketch below.
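
As a rough illustration, a form-based login in `on_page_context_created` could look like this. The login URL, selectors, and credentials are placeholders, not anything defined by Crawl4AI:

```python
async def on_page_context_created_login_hook(page, context, **kwargs):
    # Hypothetical login page and selectors -- replace with your site's own.
    await page.goto("https://example.com/login")
    await page.fill("#username", "my_user")
    await page.fill("#password", "my_pass")
    await page.click("button[type=submit]")
    await page.wait_for_load_state("networkidle")
    # The session cookie now lives in 'context', so the crawler's
    # subsequent navigation is already authenticated.

hooks = {"on_page_context_created": on_page_context_created_login_hook}
```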
---

### 3.2 Setting Up the Browser in `on_browser_created`

If you need to do advanced browser-level configuration (e.g., hooking into the Chrome DevTools Protocol, adjusting command-line flags, etc.), you’ll use `on_browser_created`. No `page` is available yet, but you can set up the **browser** instance itself.

```python
async def on_browser_created_hook(browser, **kwargs):
    """
    Runs immediately after the browser is created, before any pages.
    'browser' here is a Playwright Browser object.
    """
    print("[HOOK] Browser created. Setting up custom stuff.")
    # Possibly connect to DevTools or create an extra (incognito) context, e.g.:
    # context = await browser.new_context()

# Usage:
async with AsyncWebCrawler(hooks={"on_browser_created": on_browser_created_hook}) as crawler:
    ...
```
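
If you do want to talk to the Chrome DevTools Protocol from this hook, a minimal sketch might look like the following. It assumes the crawler is driving Playwright’s Chromium browser, since browser-level CDP sessions are a Chromium-only Playwright feature:

```python
async def on_browser_created_cdp_hook(browser, **kwargs):
    # Open a browser-level CDP session (Chromium only) and query the version.
    cdp = await browser.new_browser_cdp_session()
    version = await cdp.send("Browser.getVersion")
    print("[HOOK] Running on:", version.get("product"))
    await cdp.detach()
```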
---

### 3.3 Adjusting Page or Context in `on_page_context_created`

If you’d like to set default timeouts or inject scripts right after a page context is spun up:

```python
async def on_page_context_created_hook(page, context, **kwargs):
    print("[HOOK] Page context created. Setting default timeouts or scripts.")
    page.set_default_timeout(20000)  # 20 seconds (synchronous setter, no await needed)
    # Possibly inject a script or set user locale

# Usage:
hooks = {
    "on_page_context_created": on_page_context_created_hook
}
```
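
The “inject a script” comment above could, for example, use Playwright’s `add_init_script`, which runs a snippet in every new document before the page’s own scripts. The marker variable below is purely illustrative:

```python
async def on_page_context_created_inject_hook(page, context, **kwargs):
    # Tag every document so your own in-page tooling can detect the crawl.
    await page.add_init_script("window.__my_crawl_marker = true;")  # hypothetical marker name
```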
---

### 3.4 Dynamically Updating User Agents

`on_user_agent_updated` is fired whenever the strategy updates the user agent. For instance, you might want to set certain cookies or console-log changes for debugging:

```python
async def on_user_agent_updated_hook(page, context, new_ua, **kwargs):
    print(f"[HOOK] User agent updated to {new_ua}")
    # Maybe add a custom header based on the new UA
    await context.set_extra_http_headers({"X-UA-Source": new_ua})

hooks = {
    "on_user_agent_updated": on_user_agent_updated_hook
}
```
---

### 3.5 Initializing Stuff with `on_execution_started`

`on_execution_started` runs before your main crawling logic. It’s a good place for short, one-time setup tasks (like clearing old caches or storing a timestamp).

```python
async def on_execution_started_hook(page, context, **kwargs):
    print("[HOOK] Execution started. Setting a start timestamp or logging.")
    context.set_default_navigation_timeout(45000)  # 45s if your site is slow

hooks = {
    "on_execution_started": on_execution_started_hook
}
```
---

### 3.6 Post-Processing with `after_goto`

After the crawler finishes navigating (i.e., the page has presumably loaded), you can do additional checks or manipulations, like verifying you’re on the right page or removing interstitials:

```python
async def after_goto_hook(page, context, response, **kwargs):
    """
    Called right after page.goto() finishes, but before the crawler extracts HTML.
    """
    if response and response.ok:
        print("[HOOK] After goto. Status:", response.status)
        # Maybe remove popups or check if we landed on a login failure page.
        await page.evaluate("""() => {
            const popup = document.querySelector(".annoying-popup");
            if (popup) popup.remove();
        }""")
    else:
        print("[HOOK] Navigation might have failed, status not ok or no response.")

hooks = {
    "after_goto": after_goto_hook
}
```
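
The “login failure” check mentioned in the comment can be as simple as inspecting the final URL. The `/login` path below is just an illustrative failure indicator, not something Crawl4AI defines:

```python
async def after_goto_login_check_hook(page, context, response, **kwargs):
    # If the site bounced us back to a login page, the auth cookie did not work.
    if "/login" in page.url:  # hypothetical failure indicator
        print("[HOOK] Warning: landed on the login page; authentication may have failed.")
```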
---

### 3.7 Last-Minute Modifications in `before_retrieve_html` or `before_return_html`

Sometimes you need to tweak the page or raw HTML right before it’s captured.

```python
async def before_retrieve_html_hook(page, context, **kwargs):
    """
    Modify the DOM just before the crawler finalizes the HTML.
    """
    print("[HOOK] Removing adverts before capturing HTML.")
    await page.evaluate("""() => {
        const ads = document.querySelectorAll(".ad-banner");
        ads.forEach(ad => ad.remove());
    }""")

async def before_return_html_hook(page, context, html, **kwargs):
    """
    'html' is the near-finished HTML string. Return an updated string if you like.
    """
    # For example, remove personal data or certain tags from the final text
    print("[HOOK] Sanitizing final HTML.")
    sanitized_html = html.replace("PersonalInfo:", "[REDACTED]")
    return sanitized_html

hooks = {
    "before_retrieve_html": before_retrieve_html_hook,
    "before_return_html": before_return_html_hook
}
```

**Note**: If you want to make last-second changes in `before_return_html`, you can manipulate the `html` string directly. Return a new string if you want to override.

---

## 4. Putting It All Together

You can combine multiple hooks in a single run. For instance:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def on_browser_created_hook(browser, **kwargs):
    print("[HOOK] Browser is up, no page yet. Good for broad config.")

async def before_goto_auth_hook(page, context, goto_params, **kwargs):
    print("[HOOK] Adding cookies for auth.")
    await context.add_cookies([{"name": "session", "value": "abcd1234", "domain": "example.com", "path": "/"}])

async def after_goto_log_hook(page, context, response, **kwargs):
    if response:
        print("[HOOK] after_goto: Status code:", response.status)

async def main():
    hooks = {
        "on_browser_created": on_browser_created_hook,
        "before_goto": before_goto_auth_hook,
        "after_goto": after_goto_log_hook
    }

    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig(verbose=True)

    async with AsyncWebCrawler(config=browser_cfg, hooks=hooks) as crawler:
        result = await crawler.arun("https://example.com/protected", config=crawler_cfg)
        if result.success:
            print("[OK] Protected page length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

This example:

1. **`on_browser_created`** sets up the brand-new browser instance.
2. **`before_goto`** ensures you inject an auth cookie before accessing the page.
3. **`after_goto`** logs the resulting HTTP status code.

---

## 5. Common Pitfalls & Best Practices

1. **Hook Order**: If multiple hooks do overlapping tasks (e.g., two `before_goto` hooks), be mindful of conflicts or repeated logic.
2. **Async vs Sync**: Some hooks might be used in a synchronous or asynchronous style. Confirm your function signature. If the crawler expects `async`, define `async def`.
3. **Mutating `goto_params`**: `goto_params` is a dict that eventually goes to Playwright’s `page.goto()`. Changing the `url` or adding extra fields can be powerful but can also lead to confusion. Document your changes carefully (see the sketch after this list).
4. **Browser vs Page vs Context**: Not all hooks have both `page` and `context`. For example, `on_browser_created` only has access to **`browser`**.
5. **Avoid Overdoing It**: Hooks are powerful but can lead to complexity. If you find yourself writing massive code inside a hook, consider whether a separate “how-to” function with a simpler approach might suffice.
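
To make pitfall 3 concrete, here is a minimal sketch of a `before_goto` hook that rewrites the target URL through `goto_params`. The `debug=1` query parameter is purely illustrative:

```python
async def before_goto_rewrite_hook(page, context, goto_params, **kwargs):
    # Append a (hypothetical) debug flag to every URL the crawler visits.
    url = goto_params.get("url", "")
    if url and "debug=1" not in url:
        sep = "&" if "?" in url else "?"
        goto_params["url"] = f"{url}{sep}debug=1"
        print("[HOOK] Rewrote URL to:", goto_params["url"])
```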

---

## Conclusion & Next Steps

**Hooks** let you bend Crawl4AI to your will:

- **Authentication** (cookies, localStorage) with `before_goto`
- **Browser-level config** with `on_browser_created`
- **Page or context config** with `on_page_context_created`
- **Content modifications** before capturing HTML (`before_retrieve_html` or `before_return_html`)

**Where to go next**:

- **[Identity-Based Crawling & Anti-Bot](./identity-anti-bot.md)**: Combine hooks with advanced user simulation to avoid bot detection.
- **[Reference → AsyncPlaywrightCrawlerStrategy](../../reference/browser-strategies.md)**: Learn more about how hooks are implemented under the hood.
- **[How-To Guides](../../how-to/)**: Check short, specific recipes for tasks like scraping multiple pages with repeated “Load More” clicks.

With the hook system, you have near-complete control over the browser’s lifecycle, whether that means setting up environment variables, customizing user agents, or manipulating the HTML. Enjoy the freedom to create sophisticated, fully customized crawling pipelines!

**Last Updated**: 2024-XX-XX