using regex gives empty file #674

Open
dewshr opened this issue Jan 22, 2025 · 3 comments
Comments

dewshr commented Jan 22, 2025

I am trying to extract the UMI from the 3' end, but I get an empty output file. I tried this with the example dataset provided in the UMI-tools documentation. The first sequence in the example file is CAGGTTCAATCTCGGTGGGACCTC, and I want to extract the UMI from every read whose last six bases are a G followed by any 5 bases. I used the following command:

umi_tools extract --extract-method=regex --bc-pattern="(?P<umi_1>G.{5}$)" --stdin=example.fastq.gz --log=processed2.log --stdout=processed2.fastq.gz

Member

IanSudbery commented Jan 22, 2025 via email

Author

dewshr commented Jan 22, 2025

# UMI-tools version: 1.1.5
# output generated by extract --extract-method=regex --bc-pattern=(?P<umi_1>G.{5}$) --stdin=example.fastq.gz --log=processed2.log --stdout=processed2.fastq.gz
# job started at Wed Jan 22 13:24:25 2025 on noderome106 -- 18aded11-e49c-442c-8888-66a894e9b794
# pid: 4025370, system: Linux 4.18.0-477.58.1.el8_8.x86_64 #1 SMP Wed May 22 13:46:53 EDT 2024 x86_64
# blacklist                               : None
# compresslevel                           : 6
# correct_umi_threshold                   : 0
# either_read                             : False
# either_read_resolve                     : discard
# error_correct_cell                      : False
# extract_method                          : regex
# filter_cell_barcode                     : None
# filter_cell_barcodes                    : False
# filter_umi                              : None
# filtered_out                            : None
# filtered_out2                           : None
# ignore_suffix                           : False
# log2stderr                              : False
# loglevel                                : 1
# pattern                                 : (?P<umi_1>G.{5}$)
# pattern2                                : None
# prime3                                  : None
# quality_encoding                        : None
# quality_filter_mask                     : None
# quality_filter_threshold                : None
# random_seed                             : None
# read2_in                                : None
# read2_out                               : False
# read2_stdout                            : False
# reads_subset                            : None
# reconcile                               : False
# retain_umi                              : None
# short_help                              : None
# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
# stdin                                   : <_io.TextIOWrapper name='example.fastq.gz' encoding='ascii'>
# stdlog                                  : <_io.TextIOWrapper name='processed2.log' mode='a' encoding='UTF-8'>
# stdout                                  : <_io.TextIOWrapper name='processed2.fastq.gz' encoding='ascii'>
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# tmpdir                                  : None
# umi_correct_log                         : None
# umi_separator                           : _
# umi_whitelist                           : None
# umi_whitelist_paired                    : None
# whitelist                               : None
2025-01-22 13:24:25,609 INFO Starting barcode extraction
2025-01-22 13:24:26,075 INFO Parsed 100000 reads
2025-01-22 13:24:26,472 INFO Parsed 200000 reads
2025-01-22 13:24:26,869 INFO Parsed 300000 reads
2025-01-22 13:24:27,268 INFO Parsed 400000 reads
2025-01-22 13:24:27,668 INFO Parsed 500000 reads
2025-01-22 13:24:28,067 INFO Parsed 600000 reads
2025-01-22 13:24:28,464 INFO Parsed 700000 reads
2025-01-22 13:24:28,865 INFO Parsed 800000 reads
2025-01-22 13:24:29,266 INFO Parsed 900000 reads
2025-01-22 13:24:29,666 INFO Parsed 1000000 reads
2025-01-22 13:24:30,067 INFO Parsed 1100000 reads
2025-01-22 13:24:30,463 INFO Parsed 1200000 reads
2025-01-22 13:24:30,859 INFO Parsed 1300000 reads
2025-01-22 13:24:31,259 INFO Parsed 1400000 reads
2025-01-22 13:24:31,660 INFO Parsed 1500000 reads
2025-01-22 13:24:32,058 INFO Parsed 1600000 reads
2025-01-22 13:24:32,456 INFO Parsed 1700000 reads
2025-01-22 13:24:32,854 INFO Parsed 1800000 reads
2025-01-22 13:24:33,254 INFO Parsed 1900000 reads
2025-01-22 13:24:33,654 INFO Parsed 2000000 reads
2025-01-22 13:24:34,054 INFO Parsed 2100000 reads
2025-01-22 13:24:34,454 INFO Parsed 2200000 reads
2025-01-22 13:24:34,854 INFO Parsed 2300000 reads
2025-01-22 13:24:35,253 INFO Parsed 2400000 reads
2025-01-22 13:24:35,651 INFO Parsed 2500000 reads
2025-01-22 13:24:36,052 INFO Parsed 2600000 reads
2025-01-22 13:24:36,454 INFO Parsed 2700000 reads
2025-01-22 13:24:36,852 INFO Parsed 2800000 reads
2025-01-22 13:24:37,252 INFO Parsed 2900000 reads
2025-01-22 13:24:37,651 INFO Parsed 3000000 reads
2025-01-22 13:24:38,050 INFO Parsed 3100000 reads
2025-01-22 13:24:38,452 INFO Parsed 3200000 reads
2025-01-22 13:24:38,695 INFO Input Reads: 3260437
2025-01-22 13:24:38,695 INFO regex does not match read1: 3260437
# job finished in 13 seconds at Wed Jan 22 13:24:38 2025 -- 13.65  0.34  0.00  0.00 -- 18aded11-e49c-442c-8888-66a894e9b794

@IanSudbery
Member

Ah! The pattern matching uses match rather than search, so the pattern has to match from the start of the read. Your pattern therefore needs to be .+(?P<umi_1>G.{5}$).

Please leave this issue open as a reminder to me to update the documentation.
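
For anyone landing here, the match-versus-search difference can be reproduced outside UMI-tools. The sketch below uses Python's standard `re` module as a stand-in (an assumption: the UMI-tools internals are not shown in this thread) and the first read quoted in the question. The variable names are only illustrative.

```python
import re

# First read of the example file, as quoted in the question.
read = "CAGGTTCAATCTCGGTGGGACCTC"

original = r"(?P<umi_1>G.{5}$)"    # pattern from the question
fixed = r".+(?P<umi_1>G.{5}$)"     # pattern suggested in the reply

# match() only tries to match at the start of the string. The read starts
# with C, so the original pattern cannot match there.
no_match = re.match(original, read)

# search() scans the whole string, so the original pattern finds the 3' end.
found_by_search = re.search(original, read)

# With a leading .+ the pattern consumes the start of the read, so match()
# succeeds and the named group captures the last six bases.
found_by_match = re.match(fixed, read)

print(no_match)                          # None
print(found_by_search.group("umi_1"))    # GACCTC
print(found_by_match.group("umi_1"))     # GACCTC
```

Note that the leading .+ requires at least one base before the UMI, and a read only matches if its sixth base from the end is a G.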
