As mentioned at euler.stephan-brumme.com, these tend to be harder, because they have their own judge system that actually runs the submitted programs. It can therefore test multiple inputs against a reference implementation, rather than just checking the hardcoded result for a single input.
As of 2025, it only goes up to Project Euler problem 254, which was published much earlier, in 2009, so presumably they've stopped there.
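The difference between checking a single answer and judging a program over many inputs can be sketched roughly as follows. This is only an illustrative toy, not how the actual judge is implemented: the function names `reference_solution`, `candidate_solution`, `hardcoded_solution` and `judge` are hypothetical, and a generalized version of Project Euler problem 1 (sum of the multiples of 3 or 5 below `n`) is assumed as the example problem.

```python
from typing import Callable, Iterable


def reference_solution(n: int) -> int:
    """Closed-form sum of the multiples of 3 or 5 below n (assumes n >= 1)."""
    def sum_multiples(k: int) -> int:
        m = (n - 1) // k  # how many positive multiples of k are below n
        return k * m * (m + 1) // 2
    # Inclusion-exclusion: multiples of 15 were counted twice.
    return sum_multiples(3) + sum_multiples(5) - sum_multiples(15)


def candidate_solution(n: int) -> int:
    """Brute-force submission that works for any n."""
    return sum(i for i in range(n) if i % 3 == 0 or i % 5 == 0)


def hardcoded_solution(n: int) -> int:
    """Submission that only knows the answer for the original input n = 1000."""
    return 233168


def judge(candidate: Callable[[int], int],
          reference: Callable[[int], int],
          test_inputs: Iterable[int]) -> bool:
    """Accept the candidate only if it matches the reference on every input."""
    return all(candidate(n) == reference(n) for n in test_inputs)


if __name__ == "__main__":
    inputs = [1, 10, 100, 1000]
    print(judge(candidate_solution, reference_solution, inputs))
    print(judge(hardcoded_solution, reference_solution, inputs))
```

A hardcoded answer passes a single-input check for `n = 1000`, but fails as soon as the judge also tries other inputs, which is one reason such a system can be stricter.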