### Session Management

Session management in Crawl4AI is a powerful feature that lets you maintain state across multiple requests, making it well suited to complex multi-step crawling tasks. It lets you reuse the same browser tab (or page object) across sequential actions and crawls, which is useful for:

- **Performing JavaScript actions before and after crawling.**
- **Executing multiple sequential crawls faster** without reopening tabs or reallocating memory.

**Note:** This feature is designed for sequential workflows and is not suitable for parallel operations.

---

#### Basic Session Usage

Use `BrowserConfig` and `CrawlerRunConfig` to maintain state with a `session_id`:

```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async with AsyncWebCrawler() as crawler:
    session_id = "my_session"

    # Define configurations that share the same session
    config1 = CrawlerRunConfig(url="https://example.com/page1", session_id=session_id)
    config2 = CrawlerRunConfig(url="https://example.com/page2", session_id=session_id)

    # First request
    result1 = await crawler.arun(config=config1)

    # Subsequent request using the same session
    result2 = await crawler.arun(config=config2)

    # Clean up when done
    await crawler.crawler_strategy.kill_session(session_id)
```
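
Because both requests carry the same `session_id`, they execute in a single browser tab: cookies, `localStorage`, and any page state produced by the first request are still in place when the second one runs.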

---

#### Dynamic Content with Sessions

Here's an example of crawling GitHub commits across multiple pages while preserving session state:

```python
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.cache_context import CacheMode

async def crawl_dynamic_content():
    async with AsyncWebCrawler() as crawler:
        session_id = "github_commits_session"
        url = "https://github.com/microsoft/TypeScript/commits/main"
        all_commits = []

        # Define extraction schema
        schema = {
            "name": "Commit Extractor",
            "baseSelector": "li.Box-sc-g0xbh4-0",
            "fields": [{"name": "title", "selector": "h4.markdown-title", "type": "text"}],
        }
        extraction_strategy = JsonCssExtractionStrategy(schema)

        # JavaScript and wait configurations
        js_next_page = """document.querySelector('a[data-testid="pagination-next-button"]').click();"""
        wait_for = """() => document.querySelectorAll('li.Box-sc-g0xbh4-0').length > 0"""

        # Crawl multiple pages
        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                extraction_strategy=extraction_strategy,
                js_code=js_next_page if page > 0 else None,  # click "Next" after the first page
                wait_for=wait_for if page > 0 else None,     # wait for commits to render
                js_only=page > 0,                            # reuse the open tab instead of reloading
                cache_mode=CacheMode.BYPASS,
            )
            result = await crawler.arun(config=config)
            if result.success:
                commits = json.loads(result.extracted_content)
                all_commits.extend(commits)
                print(f"Page {page + 1}: Found {len(commits)} commits")

        # Clean up session
        await crawler.crawler_strategy.kill_session(session_id)
        return all_commits
```
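
To run the coroutine from a script, hand it to `asyncio.run`:

```python
import asyncio

commits = asyncio.run(crawl_dynamic_content())
print(f"Total commits collected: {len(commits)}")
```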

---

#### Session Best Practices

1. **Descriptive Session IDs**:
   Use meaningful names for session IDs to organize workflows:

   ```python
   session_id = "login_flow_session"
   session_id = "product_catalog_session"
   ```
2. **Resource Management**:
   Always ensure sessions are cleaned up to free resources:

   ```python
   try:
       # Your crawling code here
       pass
   finally:
       await crawler.crawler_strategy.kill_session(session_id)
   ```
3. **State Maintenance**:
   Reuse the session for subsequent actions within the same workflow:

   ```python
   # Step 1: Login
   login_config = CrawlerRunConfig(
       url="https://example.com/login",
       session_id=session_id,
       js_code="document.querySelector('form').submit();"
   )
   await crawler.arun(config=login_config)

   # Step 2: Verify login success
   dashboard_config = CrawlerRunConfig(
       url="https://example.com/dashboard",
       session_id=session_id,
       wait_for="css:.user-profile"  # Wait for authenticated content
   )
   result = await crawler.arun(config=dashboard_config)
   ```

---

#### Common Use Cases for Sessions

1. **Authentication Flows**: Log in and interact with secured pages.
2. **Pagination Handling**: Navigate through multiple pages.
3. **Form Submissions**: Fill forms, submit, and process results (see the sketch below).
4. **Multi-step Processes**: Complete workflows that span multiple actions.
5. **Dynamic Content Navigation**: Handle JavaScript-rendered or event-triggered content.
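
As a concrete illustration of use case 3, here is a minimal sketch of a form-submission flow. The URL and the selectors (`input[name="q"]`, `.results`) are hypothetical placeholders; the pattern mirrors the login flow above: submit with `js_code`, gate on `wait_for`, and set `js_only=True` so the already-open tab is reused instead of re-navigated.

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def search_and_scrape():
    async with AsyncWebCrawler() as crawler:
        session_id = "form_submission_session"
        try:
            # Step 1: load the page hosting the (hypothetical) search form
            await crawler.arun(config=CrawlerRunConfig(
                url="https://example.com/search",
                session_id=session_id,
            ))

            # Step 2: fill and submit the form in the same tab, then wait
            # for the (hypothetical) results container to render
            submit_config = CrawlerRunConfig(
                url="https://example.com/search",
                session_id=session_id,
                js_code="""
                    document.querySelector('input[name="q"]').value = 'crawl4ai';
                    document.querySelector('form').submit();
                """,
                wait_for="css:.results",
                js_only=True,  # run JS in the open tab; no fresh navigation
            )
            result = await crawler.arun(config=submit_config)
            print(result.markdown)
        finally:
            await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(search_and_scrape())
```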