Troubleshooting Frontier CI Build Failure Related To WRITE_BASIC_CONFIG_VERSION_FILE
Hey everyone! We've been wrestling with a tricky build failure in our Frontier CI setup, and I wanted to share the details, troubleshooting steps, and potential solutions we've explored. This issue concerns the StanfordLegion/legion repository. Let's dive in!
The Problem: WRITE_BASIC_CONFIG_VERSION_FILE() Error
The core of the issue is a CMake error during the configure step. The error message reports that no VERSION was specified for the WRITE_BASIC_CONFIG_VERSION_FILE() function. Here's the exact error we're seeing:
CMake Error at /usr/share/cmake/Modules/WriteBasicConfigVersionFile.cmake:43 (message):
No VERSION specified for WRITE_BASIC_CONFIG_VERSION_FILE()
Call Stack (most recent call first):
/usr/share/cmake/Modules/CMakePackageConfigHelpers.cmake:239 (write_basic_config_version_file)
realm/cmake/RealmPackaging.cmake:101 (write_basic_package_version_file)
realm/CMakeLists.txt:1029 (include)
-- Configuring incomplete, errors occurred!
This error halts the configuration process, preventing the build from proceeding. You can see this in action in this Frontier CI job log: https://code.olcf.ornl.gov/ci/ums036/dev/legion/-/jobs/112856.
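For context, CMake's write_basic_package_version_file() falls back to PROJECT_VERSION when its VERSION argument is missing or empty, and it raises this error only when both are empty. As a minimal sketch, the helper below writes a tiny stand-alone project that should hit the same message. The names write_repro_project, version_repro, and REPRO_VERSION are hypothetical and are not taken from the Legion sources.

```shell
#!/bin/sh
# Write a minimal CMake project that calls write_basic_package_version_file()
# with a VERSION that expands to nothing unless REPRO_VERSION is defined.
# "version_repro" and "REPRO_VERSION" are made-up names for this sketch.
write_repro_project() {
    dir=$1
    mkdir -p "$dir" || return 1
    # The quoted 'EOF' keeps ${...} literal so CMake, not the shell, expands it.
    cat > "$dir/CMakeLists.txt" <<'EOF'
cmake_minimum_required(VERSION 3.13)
# No VERSION here, so PROJECT_VERSION stays empty as well.
project(version_repro LANGUAGES NONE)
include(CMakePackageConfigHelpers)
write_basic_package_version_file(
  "${CMAKE_CURRENT_BINARY_DIR}/version_reproConfigVersion.cmake"
  VERSION "${REPRO_VERSION}"
  COMPATIBILITY SameMajorVersion)
EOF
}
```

After `write_repro_project repro`, running `cmake -S repro -B repro/build` should stop with the "No VERSION specified" message, while adding `-DREPRO_VERSION=1.2.3` should let configuration finish. If realm/cmake/RealmPackaging.cmake follows the same pattern, the value it passes as VERSION is probably empty in the CI tree, but that is an inference that still needs to be checked against the actual file.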
Replicating the Issue: A Step-by-Step Guide
Interestingly, this error isn't immediately apparent in all environments. I was able to reproduce it consistently only by grabbing the source repository directly from the Frontier CI environment and configuring it. This suggests that there might be some subtle differences in the environment or source state that trigger the problem.
To replicate the issue, follow these steps:
- Download the problematic source: You can download the specific source tarball that triggers the error from this location: https://sapling2.stanford.edu/~eslaught/63519_112856.tar.gz
- Extract the archive: Use the following command to extract the downloaded tarball:
    tar xf 63519_112856.tar.gz
- Create a build directory: Create a separate directory for the build process to keep things organized:
    mkdir test_build
    cd test_build
- Run CMake: Now, attempt to configure the build using CMake, pointing it to the extracted source directory:
    cmake ../63519_112856
If you encounter the issue, you'll see the CMake Error described earlier.
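The steps above can be wrapped in a small helper so the configure log is kept for later inspection. This is only a sketch: the function name reproduce_configure and the CMAKE override variable are made up here, and it assumes the archive unpacks into a directory named after the tarball, as the commands above imply.

```shell
#!/bin/sh
# Extract the CI tarball and configure it out-of-tree, keeping the CMake log.
# Set CMAKE to use a specific cmake binary; it defaults to "cmake" on PATH.
reproduce_configure() {
    tarball=$1    # e.g. 63519_112856.tar.gz
    workdir=$2    # scratch directory to extract and build in
    # Assumption: the archive unpacks into a directory named after the tarball.
    srcname=$(basename "$tarball" .tar.gz)
    mkdir -p "$workdir/test_build" || return 1
    tar xf "$tarball" -C "$workdir" || return 1
    status=0
    # Configure from the build directory, as in the manual steps.
    (cd "$workdir/test_build" && "${CMAKE:-cmake}" "../$srcname") \
        > "$workdir/configure.log" 2>&1 || status=$?
    # Print any CMake error together with a few lines of call-stack context.
    grep -A 6 'CMake Error' "$workdir/configure.log" || true
    return "$status"
}
```

For example, after downloading the tarball, `reproduce_configure 63519_112856.tar.gz work` leaves the full output in work/configure.log and returns CMake's exit status.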
Comparing Source Repositories: Hunting for Discrepancies
To understand what might be causing this, I compared the problematic source repository against a fresh clone from our GitLab repository. This involves cloning the repository, checking out a specific commit, and then using the diff command to identify any differences.
Here's how you can perform the comparison:
- Clone the repository: Clone the StanfordLegion/legion repository from GitLab:
    git clone git@gitlab.com:StanfordLegion/legion.git legion-2025-07-25
- Check out a specific commit: Navigate to the cloned repository and check out the commit 5feb6ba91865523f4eb05300b6ebdcf5c5f37892:
    cd legion-2025-07-25
    git checkout 5feb6ba91865523f4eb05300b6ebdcf5c5f37892
    cd ..
- Compare directories: Use the diff command to compare the fresh clone with the extracted source from the tarball:
    diff -ru legion-2025-07-25 63519_112856
The output of this diff command highlights the differences between the two source trees, which might provide clues about the root cause of the build failure. The diff output is also attached as source_tarball_contents.patch (https://github.com/user-attachments/files/21435928/source_tarball_contents.patch) for easier analysis.
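A full diff -ru of two large trees can be hard to scan, so a per-file summary alongside the patch may help. The sketch below is one way to get both. The function name compare_trees is hypothetical, and excluding the .git directory with -x is my own assumption; drop that option if you also want to see whether Git metadata differs between the trees.

```shell
#!/bin/sh
# Compare a clean checkout against the CI tree, skipping Git metadata, and
# keep both the full patch and a one-line-per-file summary.
compare_trees() {
    clean=$1; ci_tree=$2; patch_out=$3
    status=0
    diff -ru -x .git "$clean" "$ci_tree" > "$patch_out" || status=$?
    # diff exits 0 for identical trees and 1 when they differ;
    # anything higher is a real error (for example, an unreadable directory).
    [ "$status" -le 1 ] || return "$status"
    # -q prints only which files differ or exist on one side only.
    diff -rq -x .git "$clean" "$ci_tree" || true
}
```

With the directories from the steps above, this would be called as `compare_trees legion-2025-07-25 63519_112856 tree.patch`.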
Analysis of the Differences: What Did We Find?
So, we've replicated the issue and compared the source. Now, let's talk about what the diff actually shows. The diff output reveals several differences between the clean clone and the tarball extracted from the Frontier CI environment. These differences likely contribute to the CMake error. Pinpointing the exact root cause requires careful examination, but here's a breakdown of the types of changes we're seeing:
- File Modifications: Some files have been directly modified in the tarball compared to the clean clone. This is the most concerning type of difference, as it suggests that changes are being made to the source code within the CI environment, potentially due to patching, sed commands, or other build-time transformations. We need to understand why these modifications are happening and whether they are intentional.
- Timestamp Differences: Timestamps on some files differ. While this alone might not cause a build failure, it can be indicative of file modifications or other operations performed on the files. It's a signal to investigate further.
- File Permissions: There might be differences in file permissions. Again, this might not be the direct cause of the CMake error, but it's another piece of the puzzle that needs to be considered. Incorrect permissions can sometimes interfere with build processes.
- Missing Files: It's possible that some files are present in the clean clone but missing from the tarball, or vice-versa. This would definitely cause issues, as CMake might be expecting certain files to be present.
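A plain diff -ru doesn't report mode changes, so permission differences need a separate check. As a rough sketch, the helper below lists files whose owner-execute bit is set in only one of the two trees. The function names are made up for this example, and a file that is missing entirely from one tree will also show up in the output.

```shell
#!/bin/sh
# Print relative paths of regular files with the owner-execute bit set,
# skipping anything under a .git directory.
list_executables() {
    (cd "$1" && find . -name .git -prune -o -type f -perm -100 -print) | LC_ALL=C sort
}

# Report files that are executable in only one of the two trees.
compare_exec_bits() {
    a_list=$(mktemp) || return 1
    b_list=$(mktemp) || return 1
    list_executables "$1" > "$a_list"
    list_executables "$2" > "$b_list"
    # comm -3 hides lines present in both lists: unindented lines are
    # executable only in the first tree, tab-indented ones only in the second.
    LC_ALL=C comm -3 "$a_list" "$b_list"
    rm -f "$a_list" "$b_list"
}
```

For the two trees used here, that would be `compare_exec_bits legion-2025-07-25 63519_112856`.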
Key Areas to Investigate:
- Build Scripts: The first place to look is in the build scripts used by Frontier CI. These scripts are responsible for setting up the build environment, applying patches, and running CMake. We need to carefully review these scripts to see if any of them are inadvertently modifying the source code or causing files to be missed.
- CMake Configuration: The CMake configuration files themselves (CMakeLists.txt and related files) should be examined. We need to ensure that the WRITE_BASIC_CONFIG_VERSION_FILE() function is being called correctly and that all required parameters, including VERSION, are being passed.
- Environment Variables: Environment variables can influence the behavior of CMake and other build tools. We need to check the environment variables set in the Frontier CI environment and compare them to a working environment to see if there are any discrepancies.
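For the environment comparison, one simple approach is to save a sorted snapshot in each environment and diff the two files. The helpers below are a sketch of that idea; the function and file names are hypothetical.

```shell
#!/bin/sh
# Save the current environment, one sorted NAME=value pair per line.
snapshot_env() {
    env | LC_ALL=C sort > "$1"
}

# Show a unified diff of two snapshots. Exit status 1 from diff only means
# "they differ", so just real errors (2 and above) count as failure.
diff_env() {
    status=0
    diff -u "$1" "$2" || status=$?
    [ "$status" -le 1 ]
}
```

Run `snapshot_env ci.env` inside the CI job and `snapshot_env local.env` on a machine where configuration works, then compare them with `diff_env local.env ci.env`. Values that span several lines will sort oddly, so treat the result as a starting point.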
Potential Solutions and Next Steps
Based on our analysis so far, here are some potential solutions and next steps we can take to address this build failure:
- Trace the Source Modifications: The most critical step is to trace exactly how and why the source files are being modified in the Frontier CI environment. We need to identify the script or process that's making these changes and determine if it's intentional or a bug. Tools like git bisect (though challenging in a CI environment) could potentially help pinpoint the change that introduced the issue.
- Correct WRITE_BASIC_CONFIG_VERSION_FILE() Usage: If the source modifications are unintentional, we need to fix the build scripts to prevent them. If the modifications are intentional, we need to ensure that the WRITE_BASIC_CONFIG_VERSION_FILE() function is being called correctly with the VERSION parameter. This might involve adding a VERSION definition in the CMakeLists.txt file or passing it as a variable.
- Standardize Build Environment: Discrepancies in the build environment can lead to unexpected behavior. We should strive to standardize the build environment across different systems (local development, CI, etc.) to minimize the risk of environment-specific issues. This includes ensuring consistent versions of CMake, compilers, and other build tools.
- Improve CI Debugging: We need to improve our CI debugging capabilities. This might involve adding more logging, using more verbose build commands, or setting up a way to easily SSH into the CI environment to inspect the state of the build. Being able to directly examine the CI environment would significantly speed up troubleshooting.
- Consider CMake Version: CMake version incompatibilities can sometimes cause issues. While less likely, we should ensure that the CMake version used in Frontier CI is compatible with our project's CMake configuration. Upgrading or downgrading CMake might be necessary.
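As one concrete form of "more verbose build commands", CMake's trace mode can show what the packaging call receives after variable expansion. The sketch below assumes CMake 3.13 or newer (for -S and -B) and uses the --trace-expand and --trace-source options. The function name trace_cmake_file and the CMAKE override variable are made up for this example.

```shell
#!/bin/sh
# Configure with tracing limited to one CMake file and save the output.
# --trace-expand prints each command with its variables expanded, and
# --trace-source restricts the trace to the named file.
trace_cmake_file() {
    src=$1; build=$2; traced_file=$3; log=$4
    status=0
    "${CMAKE:-cmake}" -S "$src" -B "$build" \
        --trace-expand "--trace-source=$traced_file" > "$log" 2>&1 || status=$?
    # Show the packaging call as CMake saw it; an empty VERSION would be
    # visible on this line.
    grep -n -i 'write_basic_package_version_file' "$log" || true
    return "$status"
}
```

For the failing tree, this could be run as `trace_cmake_file 63519_112856 trace_build realm/cmake/RealmPackaging.cmake trace.log`, using the file path from the call stack in the error message.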
Let's Collaborate! (and ask for help)
This is a complex issue, and I'm sure there are other factors we haven't considered yet. I'm sharing this detailed breakdown so we can collaborate and brainstorm potential solutions. If you have any ideas, suggestions, or insights, please chime in! Let's work together to get these Frontier CI builds back on track. I will update as new information comes in!
Specifically, I'm wondering if anyone has encountered similar issues with CMake and the WRITE_BASIC_CONFIG_VERSION_FILE() function before. Any tips or tricks for debugging these types of problems would be greatly appreciated!
I hope this helps in understanding the issue and figuring out a solution. Let me know if you have any questions or want to explore specific areas in more detail.