Research Crawling Engineer

<p style="min-height:1.5em"><strong>Who We Are:</strong></p><p style="min-height:1.5em">We build infrastructure that delivers massive amounts of web data to the companies training the world’s most powerful AI models.</p><p style="min-height:1.5em">We're the team that helps to power and support Grass, a bandwidth-sharing network that lets us operate a massive distributed crawler, giving us unique access to high-quality public web data at global scale. On top of that, we’ve built pipelines for ingesting, segmenting, and annotating billions of videos, transcripts, and audio files, powering dataset creation for frontier labs.</p><p style="min-height:1.5em">We’re lean, technical, and move fast. No red tape, no slow decision-making; just a team of builders pushing to expand what’s possible for open web data and AI.</p><p style="min-height:1.5em"><strong><u>Overview:</u></strong><br>As a Research Crawling Engineer, you will design and operate large-scale web data acquisition systems for research and model development. 
Your work will span distributed systems, scraping infrastructure, and data pipelines.<br><br><strong><u>Responsibilities:</u></strong></p><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Build and maintain large-scale web crawlers across diverse domains</p></li><li><p style="min-height:1.5em">Design high-throughput, fault-tolerant systems for data collection (millions to billions of URLs/day)</p></li><li><p style="min-height:1.5em">Handle anti-bot systems, rate limits, and dynamic/JS-heavy sites</p></li><li><p style="min-height:1.5em">Develop pipelines for cleaning, deduplication, filtering, and normalization</p></li><li><p style="min-height:1.5em">Construct and maintain datasets for research and model training</p></li><li><p style="min-height:1.5em">Monitor crawl performance, coverage, and data quality; iterate quickly</p></li><li><p style="min-height:1.5em">Collaborate with research teams to align data collection with modeling needs</p></li><li><p style="min-height:1.5em">Optimize infrastructure for cost, latency, and reliability</p></li></ul><p style="min-height:1.5em"><br><br><strong><u>Requirements:</u></strong></p><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Strong programming experience in one or more of: Go, Rust, Python, Java, or C++</p></li><li><p style="min-height:1.5em">Experience building web crawlers or large-scale data pipelines</p></li><li><p style="min-height:1.5em">Solid understanding of HTTP, networking, and browser behavior</p></li><li><p style="min-height:1.5em">Familiarity with distributed systems and parallel processing</p></li><li><p style="min-height:1.5em">Experience working with large datasets (TB–PB scale preferred)</p></li><li><p style="min-height:1.5em">Ability to debug unstable or adversarial environments<br></p></li></ul><p style="min-height:1.5em"><strong><u>Preferred / Bonus:</u></strong></p><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Experience with NLP pipelines or dataset curation for 
ML</p></li><li><p style="min-height:1.5em">Familiarity with LLM pretraining data or retrieval systems</p></li><li><p style="min-height:1.5em">Experience with headless browsers (e.g., Chrome DevTools Protocol, Playwright, Puppeteer)</p></li><li><p style="min-height:1.5em">Knowledge of proxy systems, IP rotation, and large-scale request orchestration</p></li><li><p style="min-height:1.5em">Background in data quality evaluation or benchmarking</p></li><li><p style="min-height:1.5em">Experience running workloads on cloud or bare-metal infrastructure<br></p></li></ul><p style="min-height:1.5em"><strong><u>What This Role Involves:</u></strong></p><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Operating at the boundary of scale and reliability</p></li><li><p style="min-height:1.5em">Adapting to constantly changing web environments</p></li><li><p style="min-height:1.5em">Balancing throughput, coverage, and data quality</p></li><li><p style="min-height:1.5em">Owning end-to-end data acquisition pipelines<br></p></li></ul><p style="min-height:1.5em"><strong><u>Evaluation Criteria:</u></strong></p><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Ability to design systems that scale without degrading quality</p></li><li><p style="min-height:1.5em">Practical problem-solving under real-world constraints</p></li><li><p style="min-height:1.5em">Speed of iteration and ownership</p></li><li><p style="min-height:1.5em">Measurable improvements in data coverage, quality, or efficiency<br></p></li></ul><p style="min-height:1.5em"><strong><u>Compensation:</u></strong></p><p style="min-height:1.5em">Based on experience and demonstrated ability to operate at scale<br><br><strong><u>Example Projects:</u></strong></p><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Build a distributed crawler for a continuously updated, high-quality web project</p></li><li><p style="min-height:1.5em">Design a system to classify and filter billions of pages for 
pretraining</p></li><li><p style="min-height:1.5em">Extract structured data from dynamic, JS-heavy sites at scale</p></li><li><p style="min-height:1.5em">Improve deduplication and quality scoring across multimodal datasets</p></li></ul><p style="min-height:1.5em"><strong>Why Work With Us:</strong></p><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Opportunity. We are at the forefront of developing a web-scale crawler and knowledge graph that improves access to public web data and extends the value of AI to the people.</p></li><li><p style="min-height:1.5em">Culture. We're a lean team with a high bar. We come to work not to be comfortable, but to find out what we're capable of and to do work that matters. We're not calling for people who keep things moving. We're calling for people who make everyone around them better. <br>We prioritize low ego and high output. This is a fully remote team.</p></li><li><p style="min-height:1.5em">Compensation. You’ll receive a competitive salary, benefits and equity package.</p></li></ul>
