Scale or Fail? The QA Leader’s Checklist for Putting AI to the Test

Scale testing is more than just validating functionality. It can make or break the longevity, quality, and security of your AI agents and applications.
For Quality Assurance (QA) leaders, the stakes are high. Whether you're preparing for a large-scale product launch or a seasonal traffic surge, downtime can mean performance slowdowns, system outages, lost revenue, and unhappy customers.
That's why we've put together this scale testing guide. Here's how you can ensure your AI agents and apps are built to scale.
Step 0: Identify hotspots
A powerful performance test strategy begins with your production data. Understanding your hotspots (the slowest and most-used components) ensures your tests prioritize the areas most likely to fail under load, maximizing the return on your testing investment and proactively eliminating critical production risks.
Scale Test’s Test Plan Creation feature gives you this insight by analyzing live production data to pinpoint your slowest and most-used pages, APIs, and Lightning components. To uncover performance bottlenecks, run this report during peak hours.
Instead of testing isolated pages or APIs, the insights from Test Plan Creation enable you to build realistic, end-to-end user journeys. For example, a journey could start on the Home Page, navigate to Account View, and then perform an Opportunity Update. This approach ensures your tests reflect real-world workflows and focus on the most critical parts of your application.
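As a sketch of the idea behind hotspot analysis, you can rank components by how much total time they consume. The log data and the `rank_hotspots` helper below are made up for illustration; they are not a Scale Test API:

```python
from collections import defaultdict

# Hypothetical production log entries: (page, response_ms).
LOG = [
    ("HomePage", 310), ("HomePage", 290), ("AccountView", 1200),
    ("AccountView", 1450), ("OpportunityUpdate", 800), ("HomePage", 305),
]

def rank_hotspots(log):
    """Rank components by total time served (hit count x average latency)."""
    totals = defaultdict(lambda: [0, 0])  # page -> [hits, total_ms]
    for page, ms in log:
        totals[page][0] += 1
        totals[page][1] += ms
    # Sorting by total time surfaces both slow-but-rare and fast-but-hot pages.
    return sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)

for page, (hits, total_ms) in rank_hotspots(LOG):
    print(f"{page}: {hits} hits, {total_ms / hits:.0f} ms avg, {total_ms} ms total")
```

Ranking by hits times latency, rather than either alone, keeps a rarely used but very slow page from hiding behind a fast but popular one.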
Step 1: Test securely in a sandbox
By creating a copy of your production environment, teams can build, test, and refine new features, code, and integrations — all without affecting live data and operations. A full copy sandbox gives admins and developers a secure environment to experiment and innovate.
How you set up your testing environment matters. Follow these guidelines when configuring your sandbox for quality testing.
- Trial Runs for Right-Sizing: Start with low-user baseline runs (20-30 users) to validate scripts and compare against production peaks using Scale Test’s Trial Accuracy Checker.
- Align NFRs with SLAs: Define measurable targets, like “Page load ≤ 2s” or “Error rate ≤ 0.5%”. This establishes clear pass/fail criteria.
- Mitigate Integration Risks: Even if third-party integrations are out of scope, simulate them using network mocking.
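The pass/fail criteria from "Align NFRs with SLAs" can be encoded as a simple quality gate. This is an illustrative sketch with assumed threshold values, not part of any Salesforce tooling:

```python
# Hypothetical SLA targets, matching "Page load <= 2s" and "Error rate <= 0.5%".
SLAS = {"page_load_s": 2.0, "error_rate": 0.005}

def check_nfrs(results, slas=SLAS):
    """Return metric -> True/False (True means the SLA target is met)."""
    return {metric: results[metric] <= limit for metric, limit in slas.items()}

run = {"page_load_s": 1.7, "error_rate": 0.008}  # example run: error rate over budget
verdict = check_nfrs(run)
print(verdict)                # {'page_load_s': True, 'error_rate': False}
print(all(verdict.values()))  # False: the run fails the overall gate
```

Encoding targets this way turns "did the test pass?" from a judgment call into a repeatable check you can run after every load test.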
Your Salesforce Sandbox Awaits
Learn how and when to use different Salesforce developer environments on Trailhead, Salesforce’s free online learning platform.
Step 3: Realistic scale tests = real results
The best tests are life-like. “Workload modeling” is the process of creating a realistic simulation of how real users interact with an agent, app, or system. By replicating the actual load and usage patterns, you can be confident that your results are accurate.
Here are key strategies for better workload modeling:
- Ditch static user counts. Instead, drive your load tests using requests per second (RPS) or transactions per second (TPS). This approach creates a more consistent, production-like simulation.
- Act like a real user. Don’t make your virtual users robots. Give them realistic behaviors by using random distributions for things like think times, pacing, and ramp-up rates.
- Get precise with your data. Make sure you handle all the dynamic values like session IDs and tokens. This ensures each virtual user session is unique and realistic, just like in the real world.
- Plan your ramp-up. A controlled ramp-up (and ramp-down) is crucial for uncovering issues without overwhelming your system from the start.
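For example, "random distributions for think times" might look like this. The log-normal distribution is an assumption for illustration; any skewed, non-negative distribution works:

```python
import math
import random

def think_time(mean_s=5.0, sigma=0.5):
    """Sample a think time (seconds) from a log-normal distribution:
    mostly short pauses, an occasional long one, never negative."""
    mu = math.log(mean_s) - sigma ** 2 / 2  # so the distribution's mean is mean_s
    return random.lognormvariate(mu, sigma)

random.seed(1)  # reproducible for illustration
samples = [think_time() for _ in range(10_000)]
print(f"mean pause: {sum(samples) / len(samples):.2f}s")  # close to 5s
```

A fixed `sleep(5)` between every request makes virtual users hit the system in lockstep; sampling from a distribution breaks up that artificial rhythm.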
Sample ramp plan for 5,000 users:
| Time window | Action |
| --- | --- |
| 0 – 15 minutes | Ramp from 0 to 1,000 users → hold for stability |
| 30 – 45 minutes | Ramp to 2,500 users → hold again |
| 60 – 90 minutes | Ramp to 5,000 users |
| 90 – 150 minutes | Hold steady at 5,000 users (collect peak metrics) |
| 150 – 180 minutes | Gradually ramp down to zero |
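One way to sanity-check a ramp plan like this is to express it as a piecewise schedule. The `users_at` helper below is illustrative, not part of Scale Test:

```python
# The ramp plan above as (minute, target_users) breakpoints.
# Users ramp linearly between breakpoints and hold where targets repeat.
STAGES = [(0, 0), (15, 1000), (30, 1000), (45, 2500), (60, 2500),
          (90, 5000), (150, 5000), (180, 0)]

def users_at(minute):
    """Linearly interpolate the target virtual-user count at a given minute."""
    for (t0, u0), (t1, u1) in zip(STAGES, STAGES[1:]):
        if t0 <= minute <= t1:
            return u0 + (u1 - u0) * (minute - t0) / (t1 - t0)
    return 0  # outside the test window

for m in (10, 40, 75, 120, 170):
    print(f"minute {m}: {users_at(m):.0f} users")
```

Most load tools (JMeter, Locust, k6, and similar) accept staged ramps in roughly this shape, so the same breakpoints can drive the real test.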
To uncover systemic risks, go beyond single-hour loads. Consider "endurance testing": running sustained, production-like load tests for 4+ hours to surface long-term issues such as memory leaks. Throughout the testing process, document ownership and escalation paths in case SLAs are breached.
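A rough way to turn endurance-test output into a leak signal is a least-squares trend over memory samples. The samples and the 5 MB/hour alert threshold here are made up for illustration:

```python
# Hypothetical hourly heap samples (MB) from a 6-hour endurance run.
samples = [512, 518, 530, 541, 555, 566]

def trend_mb_per_hour(ys):
    """Least-squares slope over hourly samples: a steady upward trend
    suggests a leak; healthy GC behavior oscillates around a flat line."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

slope = trend_mb_per_hour(samples)
print(f"heap growth: {slope:.1f} MB/hour")
if slope > 5:  # hypothetical alert threshold
    print("possible memory leak; investigate before go-live")
```

A fitted slope is more robust than eyeballing the first and last samples, because a single garbage-collection dip or spike won't dominate the verdict.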
Step 4: Gather the data
Once your testing environment and conditions are set, you're ready to start collecting data. Consistent, repeatable test results are the sign that you've achieved high-quality test data.
Here’s how you can yield helpful, relevant results:
- Automated Data Creation: Move quickly by seeding realistic (not real) data into your sandbox. Use Salesforce CLI, Apex scripts, or Salesforce’s Data Seed to programmatically create thousands of users or records.
- Automated Cleanup: Pair creation with cleanup scripts to reset the environment to an idle state after each run.
- Data Security: By using data masking tools, you can protect personally identifiable information (PII) by transforming it into realistic, non-sensitive values that behave like the real thing but keep private details hidden. Explore tools like Salesforce’s Data Mask, or anonymization techniques when using production-like datasets.
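To illustrate the idea behind masking (deterministic, realistic-looking, irreversible), here is a hand-rolled sketch; in practice, use a tool like Data Mask rather than rolling your own:

```python
import hashlib

def mask_email(email, domain="example.com"):
    """Replace a real address with a deterministic, realistic-looking pseudonym.
    The same input always maps to the same output, so relationships in seeded
    test data stay intact, while the original PII is unrecoverable."""
    digest = hashlib.sha256(email.lower().encode()).hexdigest()[:10]
    return f"user_{digest}@{domain}"

print(mask_email("jane.doe@acme.com"))
print(mask_email("JANE.DOE@acme.com"))  # same pseudonym, so joins still work
```

Determinism is the key property: two records referencing the same person still match after masking, which keeps workflows and reports behaving like production.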
Innovation or security? Yes.
Accelerate your release cycles without sacrificing security with Data Mask & Seed.
Step 5: Results are in, insights are out
The real action begins once you have the data. To translate test results into performance improvements, you need the right tools and focus areas for analysis. Here's how to surface actionable metrics:
- Front-End Visibility with LEX Insight: Track Experience Page Time (EPT) and percentile metrics (p75, p90) to uncover UI bottlenecks.
- Live test monitoring with Scale Test: View runtime metrics such as DMLs, platform events, RPS graphs, and Apex hotspots. Visit ApexGuru for optimization opportunities in your Apex code, such as unused classes, expensive methods, test cases, and more.
- Focus on Percentiles: Prioritize P90/P95 response times over averages to better represent real-world user experiences.
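To see why percentiles beat averages, compare the two on a skewed set of response times. The numbers below are made up to exaggerate the effect:

```python
import statistics

# Hypothetical response times (ms): 90% fast requests, 10% slow ones.
times = [120] * 90 + [2500] * 10

avg = statistics.mean(times)
cuts = statistics.quantiles(times, n=100)  # 99 percentile cut points
p90, p95 = cuts[89], cuts[94]

print(f"avg={avg:.0f}ms  p90={p90:.0f}ms  p95={p95:.0f}ms")
# The average looks healthy, but the percentiles reveal that roughly
# 1 in 10 users waits about 2.5 seconds.
```

An average of well under half a second would sail past a 2-second SLA, yet the P95 shows a tenth of your users having a poor experience, exactly the failure mode percentile targets exist to catch.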
Your next steps are in the data. Link test outcomes directly to business risks and functional requirements so stakeholders see what has been validated.
Time to scale test
Enterprise-scale QA requires more than just simulating load: it demands a disciplined, production-informed approach. By embedding these practices into your QA process and using tools like Scale Test, you can be confident that you're deploying only the highest-quality, most secure agents and apps for your business.
Build, Secure, Deploy, and Repeat
Make sure your agents and apps make it out of the sandbox with agent and application lifecycle management (ALM).











