Reverse Engineering YC's Job Directory, aka Work at a Startup
This tutorial demonstrates advanced web scraping techniques including CSRF token management and session handling. Use responsibly and respect rate limits.
If you have ever explored startup job opportunities, you might have come across Work at a Startup, a platform that aggregates job postings from startups. While working on a side project, I wanted to automate the process of extracting structured data from this platform to analyze job trends and hiring patterns.
This project required understanding both front-end data structures and back-end API patterns to build a robust scraping solution.
In this blog, I document my journey of reverse engineering the platform to extract job data using Python and JSON.
Step 1: Exploring the Network Requests
The first step was to identify how the data is fetched and displayed on the platform. By inspecting the network requests in Chrome DevTools, I discovered that the job data is loaded dynamically via JSON files. This was a great starting point for building my script.
- Open Chrome DevTools and navigate to the Network tab
- Filter by 'Fetch/XHR' to see API calls
- Look for JSON responses that contain job data
- Identify patterns in request headers and parameters
Pro Tip: The "Fetch/XHR" filter hides static assets like images and stylesheets, which makes the API calls that return job data much easier to spot.
Step 2: Understanding the Algolia Query
While analyzing the network requests, I discovered that the platform uses Algolia, a popular search infrastructure platform, to manage its job search index. The JSON payload sent to Algolia specifies attributes such as roles, locations, and job types. Here is an example of the query:
{
  "requests": [
    {
      "indexName": "WaaSPublicCompanyJob_created_at_desc_production",
      "params": "query=&page=0&filters=&attributesToRetrieve=%5B%22company_id%22%5D&attributesToHighlight=%5B%5D&attributesToSnippet=%5B%5D&hitsPerPage=10&clickAnalytics=true&distinct=true"
    }
  ]
}
This structured query allows for efficient filtering and retrieval of job data.
Step 3: Writing Our Own Queries
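To interact with the Algolia API, the query captured above can be replayed directly from Python. Below is a minimal sketch; the application ID and search-only API key are assumptions, copied from the X-Algolia-Application-Id and X-Algolia-API-Key headers of the request visible in DevTools (the values shown are placeholders, not real credentials):
import json
import requests

# Placeholders: copy these from the X-Algolia-Application-Id and
# X-Algolia-API-Key headers of the request shown in DevTools.
ALGOLIA_APP_ID = "YOUR_APP_ID"
ALGOLIA_API_KEY = "YOUR_SEARCH_ONLY_KEY"

payload = {
    "requests": [
        {
            "indexName": "WaaSPublicCompanyJob_created_at_desc_production",
            "params": "query=&page=0&hitsPerPage=10&distinct=true",
        }
    ]
}

# Algolia's standard multi-query REST endpoint
url = f"https://{ALGOLIA_APP_ID}-dsn.algolia.net/1/indexes/*/queries"
resp = requests.post(
    url,
    headers={
        "X-Algolia-Application-Id": ALGOLIA_APP_ID,
        "X-Algolia-API-Key": ALGOLIA_API_KEY,
    },
    data=json.dumps(payload),
)
resp.raise_for_status()

# Each entry in "results" corresponds to one request in the payload
hits = resp.json()["results"][0]["hits"]
print(f"Fetched {len(hits)} jobs from the first page")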
I saved the query responses to a JSON file and wrote a script to parse and print the relevant job details:
import json

# Load the saved JSON file
with open('waas_jobs_2025-07-09.json', 'r') as file:
    jobs = json.load(file)

# Extract and print job details
for job in jobs:
    print(f"Company: {job['company']}")
    print(f"Title: {job['title']}")
    print(f"Location: {job['location']}")
    print(f"Salary: {job['salary']}")
    print(f"Hiring Manager: {job['hiring_manager']}")
    print(f"Apply URL: {job['apply_url']}")
    print("---")
This script reads the JSON file, iterates through the job postings, and prints the relevant details.
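Since the end goal was analyzing job trends, it also helps to push the same records into a table. Here is a small sketch using pandas (part of the stack listed at the end), assuming the flattened field names shown above:
import json
import pandas as pd

# Load the same saved job data
with open('waas_jobs_2025-07-09.json', 'r') as file:
    jobs = json.load(file)

# Build a DataFrame from the job records (field names assumed from the snippet above)
df = pd.DataFrame(jobs)

# Export to CSV for later analysis of trends and hiring patterns
df.to_csv('waas_jobs_2025-07-09.csv', index=False)
print(df[['company', 'title', 'location', 'salary']].head())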
Step 4: CSRF Token Management and Session Handling
One of the biggest challenges when scraping Work at a Startup is handling authentication and CSRF (Cross-Site Request Forgery) protection. The platform implements robust security measures that require proper session management.
Security Note: CSRF tokens are security measures designed to protect against malicious requests. Always respect website terms of service and implement responsible scraping practices.
The Challenge
Initially, I was using hardcoded CSRF tokens and session cookies, which would expire after some time, causing the scraper to fail. This approach was fragile and required manual updates whenever tokens expired.
# Old approach - fragile and manual
COMPANY_HEADERS = {
    "x-csrf-token": "hardcoded-token-that-expires",
    # ... other headers
}
The Solution: Dynamic CSRF Management
To solve this, I implemented a WaaSClient class that automatically handles CSRF token extraction and session management, inspired by modern web scraping patterns:
import requests
from bs4 import BeautifulSoup

class WaaSClient:
    def __init__(self):
        self.csrf_token = ""
        self.session_cookies = {}
        self.user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"

    def init(self):
        """Extract CSRF token from homepage"""
        resp = requests.get("https://www.workatastartup.com",
                            headers={"User-Agent": self.user_agent})
        # Parse HTML to extract the CSRF token from the <meta name="csrf-token"> tag
        soup = BeautifulSoup(resp.text, 'html.parser')
        csrf_meta = soup.find('meta', attrs={'name': 'csrf-token'})
        self.csrf_token = csrf_meta['content']
        # Save session cookies for subsequent requests
        for cookie_name, cookie_value in resp.cookies.items():
            self.session_cookies[cookie_name] = cookie_value

    def refresh_csrf(self):
        """Refresh expired CSRF token"""
        resp = requests.post("https://www.workatastartup.com/verify-session",
                             headers={"X-CSRF-Token": self.csrf_token,
                                      "User-Agent": self.user_agent},
                             cookies=self.session_cookies)
        if resp.status_code == 422:
            # A 422 response carries a fresh token in its body
            self.csrf_token = resp.text.strip()

    def fetch(self, path, method="GET", **kwargs):
        """Make request with automatic CSRF handling"""
        headers = {"X-CSRF-Token": self.csrf_token, "User-Agent": self.user_agent}
        # First attempt
        resp = requests.request(method, f"https://www.workatastartup.com{path}",
                                headers=headers, cookies=self.session_cookies, **kwargs)
        # If the token has expired (422), refresh it and retry once
        if resp.status_code == 422:
            self.refresh_csrf()
            headers["X-CSRF-Token"] = self.csrf_token
            resp = requests.request(method, f"https://www.workatastartup.com{path}",
                                    headers=headers, cookies=self.session_cookies, **kwargs)
        return resp
Key Benefits
- Automatic Token Extraction: Dynamically fetches CSRF tokens from the homepage
- Session Persistence: Maintains session cookies across requests
- Auto-Recovery: Automatically refreshes expired tokens and retries failed requests
- Fallback Support: Gracefully falls back to manual methods if initialization fails
This approach makes the scraper much more robust and eliminates the need for manual token management.
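Using the client then comes down to initializing it once and routing every call through fetch. A quick usage sketch (the path below is a placeholder for illustration, not a confirmed endpoint):
client = WaaSClient()
client.init()  # pulls the CSRF token and session cookies from the homepage

# Any subsequent request reuses the session and retries automatically on a 422
resp = client.fetch("/companies")  # placeholder path
print(resp.status_code)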
Step 5: Getting LinkedIn Profiles
Many job postings include information about the founders and hiring managers, along with their LinkedIn profiles. By extracting this data, I was able to create a list of profiles for networking and further analysis. Here is an example of the extracted data:
{
  "name": "Dorothea Koh",
  "linkedin": "https://www.linkedin.com/in/dotkoh/",
  "bio": "Moon walker. Passionate about impacting healthcare in large emerging markets. ",
  "past_schools": "Bioengineering at Stanford University; Biomedical Engineering, Economics at Northwestern University; BS, Biomedical Engineering, Economics at Northwestern University"
}
This information can be used to understand the background of key individuals at startups and build meaningful connections.
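The extraction itself depends on the page markup, so the following is only a rough sketch of how it could work: fetch a company page through the WaaSClient above and collect any anchors that point at a LinkedIn profile (the company path is a placeholder; real slugs would come from the job data collected earlier):
from bs4 import BeautifulSoup

# Placeholder company path for illustration
resp = client.fetch("/companies/example-company")
soup = BeautifulSoup(resp.text, 'html.parser')

# Collect any links that point at a LinkedIn profile
profiles = []
for a in soup.find_all('a', href=True):
    if "linkedin.com/in/" in a['href']:
        profiles.append(a['href'])

print(profiles)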
Closing Reflections
This project was a fascinating dive into reverse engineering and data analysis. By automating the extraction of job data with robust CSRF token management, I was able to build a reliable scraper that can handle long-running data collection tasks.
Key Learnings
- Security Matters: Modern web platforms implement sophisticated protection mechanisms
- Dynamic Extraction: Hardcoded tokens are fragile; dynamic extraction is more reliable
- Error Recovery: Automatic retry mechanisms make scrapers more robust
- Session Management: Proper cookie handling is crucial for sustained scraping
Next Steps
Future improvements could include:
- Building a web interface to visualize the job data and trends
- Implementing automated daily updates to track new job postings
- Adding rate limiting and respectful scraping practices
- Expanding the analysis to include salary trends and hiring patterns
- Creating alerts for specific job criteria or companies
Technical Stack
The final implementation uses:
- Python for core logic and data processing
- curl-cffi for browser-like HTTP requests with TLS fingerprint impersonation (a minimal sketch follows this list)
- BeautifulSoup for HTML parsing and CSRF token extraction
- pandas for data manipulation and CSV export
- JSON for structured data storage
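For the curl-cffi piece, the browser-like behaviour comes from its impersonate option, which mimics a real browser's TLS fingerprint. A minimal sketch (depending on the installed curl_cffi release, the value may need to be a specific version string such as "chrome110"):
from curl_cffi import requests as curl_requests

# Fetch the homepage with a Chrome-like TLS fingerprint
resp = curl_requests.get("https://www.workatastartup.com", impersonate="chrome")
print(resp.status_code)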
Happy scraping! 🚀
If you found this exploration helpful or have suggestions for improvements, feel free to reach out! The complete code demonstrates how to build production-ready web scrapers with proper session management.