Discussion about this post

User's avatar
henjin's avatar

I recently found new evidence that ZC45 and ZXC21 may have a dubious origin. The NCBI's sequence read archive has an "Analysis" tab which shows a breakdown of what percentage of reads are estimated to come from different organisms. It was implemented using a software called STAT which searches for matches to 32-base k-mers. The database which contains the results of the STAT k-mer analysis is about 500 GB in size, but it can be searched through BigQuery at the Google Cloud Platform: https://www.ncbi.nlm.nih.gov/sra/docs/sra-cloud-based-taxonomy-analysis-table/.

I used BigQuery to search for runs which had over a thousand k-mer matches to sarbecoviruses: ```select * from `nih-sra-datastore.sra.metadata` as m, `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` as tax where m.acc=tax.acc and tax_id=2697049 and total_count>1000 order by releasedate```. I found 384 runs from 74 unique studies that were older than the first SARS 2 runs from January 2020: https://cdn.discordapp.com/attachments/1093243194231246934/1108414826352484383/bigquery-sarbecovirus-1k.json.

The most interesting result was a metagenomic sequencing run which matched ZXC21 and which seems to have been overlooked earlier by COVID researchers: https://www.ncbi.nlm.nih.gov/sra/?term=SRR5351760. It was titled "metagenomic sequencing of bat metagenome:zhoushan intestine", it was submitted by "Research Institute for Medicine of Nanjing Command", and the samples are supposed to have been collected between July and August 2015, which matches the collection date of ZC45 and ZXC21 at GenBank. And the study of the run was by the same authors who published the paper about ZC45 and ZXC21.

When I aligned the reads in the SRA run against a FASTA file of sarbecoviruses, I got 78% coverage for ZXC21 with an average error rate of 0.8%, I got 16% coverage for ZC45 with an average error rate of 1.6%, and I got below 3% coverage for all other viruses (so I'm not sure if the sample contained both ZXC21 and ZC45 or only ZXC21, since some reads from ZXC21 probably also matched ZC45):

curl https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh|sh

esearch -db nuccore -query '(sarbecovirus "complete genome" NOT txid2697049) OR NC_045512.2'|efetch -format fasta>sarbe.fa # `NOT txid2697049` excludes SARS 2 and `OR NC_045512.2` includes Wuhan-Hu-1

brew install seqkit bowtie2 samtools gnu-sed

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR535/000/SRR5351760/SRR5351760_{1,2}.fastq.gz # download from ENA because it's faster than downloading from NCBI

seqkit fx2tab sarbe.fa|grep -Ev 'RfGB02|Rs7907'|sed $'s/A*\t$//'|seqkit tab2fx>sarbe2.fa # remove poly(A) tails, omit Rs7907 which has a segment of mammalian rRNA at the beginning, and omit RfGB02 which has an assembly error at the end

bowtie2-build --threads 3 sarbe2.fa{,}

bowtie2 -p3 -x sarbe2.fa -1 SRR5351760_1.fastq.gz -2 SRR5351760_2.fastq.gz --no-unal|samtools sort -@2 ->SRR5351760.bam

samtools coverage SRR5351760.bam|awk \$4|cut -f1,3-6|(gsed -u '1s/$/\terr%\tname/;q';sort -rnk4|awk -F\\t -v OFS=\\t 'NR==FNR{a[$1]=$2;next}{print$0,sprintf("%.2f",a[$1])}' <(samtools view SRR5351760.bam|awk -F\\t '{x=$3;n[x]++;len[x]+=length($10);sub(/.*NM:i:/,"");mis[x]+=$1}END{for(i in n)print i"\t"100*mis[i]/len[i]}') -|awk -F\\t 'NR==FNR{a[$1]=$2;next}{print$0"\t"a[$1]}' <(seqkit seq -n sarbe2.fa|gsed 's/ /\t/;s/, .*//') -)|column -ts$'\t'

There's also a little-known public API endpoint which shows the number of K-mers for an SRA run which match each species and each taxonomical node: https://trace.ncbi.nlm.nih.gov/Traces/sra-db-be/. It shows that for the run SRR5351760 which matched ZXC21, there's a total of 415,912 assigned K-mers, but only 14,488 of them were assigned to bacteria and only 74 to the order of Enterobacterales, and there's zero k-mers assigned to Escherichia coli or to the genus Escherichia:

curl 'https://trace.ncbi.nlm.nih.gov/Traces/sra-db-be/run_taxonomy?acc=SRR5351760&cluster_name=public' >SRR5351760.stat

jq -r '.[]|.tax_table|sort_by(-.total_count)[]|(.total_count|tostring)+";"+.org+";"+(.tax_id|tostring)+";"+(.parent|tostring)' SRR5351760.stat

The metagenomic sequencing run for RaTG13 is also supposed to represent a "fecal swab", but it's inconsistent with how the sample contains very few reads which match intestinal bacteria. See this preprint by Steve Massey: https://arxiv.org/abs/2111.09469.

You can compare the results to the fecal metagenomic sample of RmYN02: https://trace.ncbi.nlm.nih.gov/Traces/sra-db-be/run_taxonomy?acc=SRR12464727&cluster_name=public. There's 26,684,391 identified reads out of which 18,526,075 are classified under bacteria, so it's a total of about 69% as opposed to about 3% for the ZXC21 sample or about 0.6% for the RaTG13 sample. And there's 368,425 k-mers assigned to Enterobacterales and 5,886 to Escherichia.

Expand full comment

No posts