Regex-Redux is a benchmark from the Computer Language Benchmarks Game (CLBG) that measures how well different programming languages handle regular expressions and string processing. It performs pattern matching and substitution on DNA sequences.
Official CLBG Regex-Redux Page
- Input Reading: Read the DNA sequence from a FASTA format file and record the initial length.
- Sequence Cleaning: Remove descriptions and line breaks using regular expressions and record the cleaned sequence length.
- Pattern Matching: Count occurrences of DNA 8-mer patterns and their reverse complements.
- Pattern Substitution: Sequentially apply a series of substitution patterns.
- Output Results: Print the pattern counts and the sequence lengths.
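The input reading and sequence cleaning steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the helper name `clean_fasta` and the in-memory sample are assumptions made for the example, and a real run would read the FASTA text from a file or standard input instead.

```python
import re

def clean_fasta(raw):
    """Return (initial_length, cleaned_length, cleaned_sequence).

    `clean_fasta` is a hypothetical helper name used only for this sketch.
    """
    initial_length = len(raw)
    # Remove header lines (">" up to and including the newline)
    # and every remaining line break.
    cleaned = re.sub(r">.*\n|\n", "", raw)
    return initial_length, len(cleaned), cleaned

# Tiny made-up FASTA fragment: one description line and two sequence lines.
sample = ">ONE Homo sapiens alu\nagggtaaa\ntttaccct\n"
initial, cleaned_len, seq = clean_fasta(sample)
print(initial, cleaned_len, seq)
```

Both lengths are recorded because the benchmark reports the size of the raw input as well as the size of the cleaned sequence.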
- agggtaaa|tttaccct
- [cgt]gggtaaa|tttaccc[acg]
- a[act]ggtaaa|tttacc[agt]t
- ag[act]gtaaa|tttac[agt]ct
- agg[act]taaa|ttta[agt]cct
- aggg[acg]aaa|ttt[cgt]ccct
- agggt[cgt]aa|tt[acg]accct
- agggta[cgt]a|t[acg]taccct
- agggtaa[cgt]|[acg]ttaccct
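Each pattern above pairs an 8-mer variant with its reverse complement. The following sketch checks that relationship mechanically; the function name `reverse_complement` and the token-splitting approach are illustrative assumptions, not part of the benchmark itself.

```python
import re

# The nine variant patterns listed above.
VARIANTS = [
    "agggtaaa|tttaccct",
    "[cgt]gggtaaa|tttaccc[acg]",
    "a[act]ggtaaa|tttacc[agt]t",
    "ag[act]gtaaa|tttac[agt]ct",
    "agg[act]taaa|ttta[agt]cct",
    "aggg[acg]aaa|ttt[cgt]ccct",
    "agggt[cgt]aa|tt[acg]accct",
    "agggta[cgt]a|t[acg]taccct",
    "agggtaa[cgt]|[acg]ttaccct",
]

# Base pairing: a<->t, c<->g.
COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(pattern):
    """Reverse-complement a pattern made of bases and bracketed classes."""
    # A token is either a class such as "[cgt]" or a single base.
    tokens = re.findall(r"\[[acgt]+\]|[acgt]", pattern)
    out = []
    for token in reversed(tokens):
        if token.startswith("["):
            # Complement the bases inside the class and sort them,
            # matching the alphabetical order used in the list above.
            bases = sorted(token[1:-1].translate(COMPLEMENT))
            out.append("[" + "".join(bases) + "]")
        else:
            out.append(token.translate(COMPLEMENT))
    return "".join(out)

# True when every right-hand alternative is the reverse complement
# of the left-hand one.
all_paired = all(
    reverse_complement(forward) == reverse
    for forward, reverse in (v.split("|") for v in VARIANTS)
)
print(all_paired)
```

Because each pattern contains both orientations, a single count covers matches on either strand of the DNA.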
- Replace tHa[Nt] with <4>
- Replace aND|caN|Ha[DS]|WaS with <3>
- Replace a[NSt]|BY with <2>
- Replace <[^>]*> with |
- Replace \|[^|][^|]*\| with -
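The substitutions are applied in order, so each one operates on the output of the previous one. A minimal Python sketch of that chain is shown below; the name `apply_substitutions` and the short sample string are assumptions chosen for illustration.

```python
import re

# Substitution rules in the order listed above: (pattern, replacement).
SUBSTITUTIONS = [
    (r"tHa[Nt]", "<4>"),
    (r"aND|caN|Ha[DS]|WaS", "<3>"),
    (r"a[NSt]|BY", "<2>"),
    (r"<[^>]*>", "|"),
    (r"\|[^|][^|]*\|", "-"),
]

def apply_substitutions(sequence):
    """Apply every rule in turn, feeding each result into the next rule."""
    for pattern, replacement in SUBSTITUTIONS:
        sequence = re.sub(pattern, replacement, sequence)
    return sequence

# Made-up sample containing text that the first two rules match.
sample = "tHatggcaNtHaN"
result = apply_substitutions(sample)
print(result, len(result))
```

The order matters: the fourth rule only finds `<...>` markers because the first three rules inserted them, and the fifth rule only finds `|` characters produced by the fourth.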
- Time Complexity: Dependent on the regular expression engine. Simple patterns can be processed in O(n) time, while complex patterns can take superlinear time (O(n²) or worse) in engines that rely on backtracking.
- Space Complexity: O(n), where n is the size of the input DNA sequence.
function regex_redux(file_path):
    # Step 1: Read input
    sequence = read_file(file_path)
    initial_length = length(sequence)

    # Step 2: Remove descriptions and line breaks
    sequence = regex_replace(">.*\n|\n", "", sequence)
    cleaned_length = length(sequence)

    # Step 3: Count pattern occurrences
    patterns = [
        'agggtaaa|tttaccct',
        '[cgt]gggtaaa|tttaccc[acg]',
        'a[act]ggtaaa|tttacc[agt]t',
        'ag[act]gtaaa|tttac[agt]ct',
        'agg[act]taaa|ttta[agt]cct',
        'aggg[acg]aaa|ttt[cgt]ccct',
        'agggt[cgt]aa|tt[acg]accct',
        'agggta[cgt]a|t[acg]taccct',
        'agggtaa[cgt]|[acg]ttaccct'
    ]
    for pattern in patterns:
        count = regex_count(pattern, sequence)
        print(pattern, count)

    # Step 4: Apply substitutions
    substitutions = [
        ('tHa[Nt]', '<4>'),
        ('aND|caN|Ha[DS]|WaS', '<3>'),
        ('a[NSt]|BY', '<2>'),
        ('<[^>]*>', '|'),
        ('\\|[^|][^|]*\\|', '-')
    ]
    for pattern, replacement in substitutions:
        sequence = regex_replace(pattern, replacement, sequence)

    # Step 5: Output final lengths
    substituted_length = length(sequence)
    print(initial_length)
    print(cleaned_length)
    print(substituted_length)

- Input Handling: Efficiently read the entire FASTA file.
- Regex Operations: Use language-specific regex libraries for counting and substitution.
- Optimized Looping: Process patterns sequentially.
- Output Handling: Store and print results in the required format.
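The pseudocode and implementation notes above can be turned into a runnable Python sketch using the standard `re` module. This is one possible translation under stated assumptions, not an official or tuned entry: it takes the FASTA text as a string rather than a file path, returns the output lines instead of printing them directly, and uses `len(re.findall(...))` as the counting primitive.

```python
import re

VARIANTS = (
    "agggtaaa|tttaccct",
    "[cgt]gggtaaa|tttaccc[acg]",
    "a[act]ggtaaa|tttacc[agt]t",
    "ag[act]gtaaa|tttac[agt]ct",
    "agg[act]taaa|ttta[agt]cct",
    "aggg[acg]aaa|ttt[cgt]ccct",
    "agggt[cgt]aa|tt[acg]accct",
    "agggta[cgt]a|t[acg]taccct",
    "agggtaa[cgt]|[acg]ttaccct",
)

SUBSTITUTIONS = (
    (r"tHa[Nt]", "<4>"),
    (r"aND|caN|Ha[DS]|WaS", "<3>"),
    (r"a[NSt]|BY", "<2>"),
    (r"<[^>]*>", "|"),
    (r"\|[^|][^|]*\|", "-"),
)

def regex_redux(raw):
    """Return the report lines for one FASTA input given as a string."""
    # Steps 1 and 2: record the raw length, then strip headers and newlines.
    initial_length = len(raw)
    sequence = re.sub(r">.*\n|\n", "", raw)
    cleaned_length = len(sequence)

    # Step 3: count non-overlapping matches of each variant pattern.
    lines = [f"{p} {len(re.findall(p, sequence))}" for p in VARIANTS]

    # Step 4: apply the substitutions sequentially.
    for pattern, replacement in SUBSTITUTIONS:
        sequence = re.sub(pattern, replacement, sequence)

    # Step 5: append the three lengths, in the order used by the pseudocode.
    lines += [str(initial_length), str(cleaned_length), str(len(sequence))]
    return lines

# Small made-up input; a real run would pass the full FASTA file contents.
sample = ">THREE\nagggtaaa\ntttaccct\n"
report = regex_redux(sample)
print("\n".join(report))
```

Returning the lines rather than printing inside the function keeps the regex work separate from output handling, which makes the sketch easy to check on small inputs before running it on a large file.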
- Regex-Redux Benchmark Description (CLBG)
- Regular Expression Documentation (Python): https://docs.python.org/3/library/re.html
- Computer Language Benchmarks Game (CLBG): https://benchmarksgame-team.pages.debian.net