SV样本信息提取

感冒的梵高 • 2天前 • 杂文

提取样本信息

bcftools query -l germline.vcf >list

#######提取样本基因型

import sys
import csv

def parse_vcf(vcf_file):
    """解析VCF文件，提取样本基因型"""
    samples = []
    data = []
    
    with open(vcf_file, 'r') as f:
        for line in f:
            # 提取样本名称行
            if line.startswith('#CHROM'):
                fields = line.strip().split('/t')
                samples = fields[9:]  # 第10列及以后为样本名称
                continue
            # 跳过注释行
            if line.startswith('#'):
                continue
                
            # 解析变异行
            fields = line.strip().split('/t')
            chrom = fields[0]
            pos = fields[1]
            ref = fields[3]
            alt = fields[4]
            
            # 提取基因型信息
            genotypes = []
            for sample_data in fields[9:]:
                # 简单提取基因型（假设格式为GT:其他信息）
                gt_parts = sample_data.split(':')
                gt = gt_parts[0] if len(gt_parts) > 0 else '.'
                genotypes.append(gt)
            
            data.append([chrom, pos, ref, alt] + genotypes)
    
    return samples, data

def main():
    if len(sys.argv) != 3:
        print(f"用法: {sys.argv[0]}  ")
        sys.exit(1)
    
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    
    try:
        # 解析VCF文件
        samples, data = parse_vcf(input_file)
        
        # 写入CSV文件
        with open(output_file, 'w', newline='') as csvfile:
            writer = csv.writer(csvfile)
            # 写入表头
            header = ['CHROM', 'POS', 'REF', 'ALT'] + samples
            writer.writerow(header)
            # 写入数据
            writer.writerows(data)
        
        print(f"成功将{len(data)}条记录保存到{output_file}")
        
    except Exception as e:
        print(f"处理文件时出错: {e}")
        sys.exit(1)

if __name__ == "__main__":
    main()

0/0、0/1、1/1基因型的意义:

在遗传学和基因测序领域，0/0、0/1、1/1 是常见的基因型表示方法，通常出现在 VCF（Variant Call Format）文件中。这些符号表示个体在某个特定位点上的等位基因组合，具体含义如下：

基本概念
等位基因（Allele）：基因的不同变体。在二倍体生物（如人类）中，每个基因位点有两个等位基因，分别来自父母。
参考等位基因（Reference Allele）：基因组参考序列中该位点的碱基，通常用 0 表示。
替代等位基因（Alternative Allele）：与参考序列不同的变异碱基，通常用 1、2 等表示（如果有多个变异）。
基因型符号解析
0/0
含义：纯合参考基因型（Homozygous reference）。
解释：该位点的两个等位基因均与参考序列一致。例如，参考序列为 A，则个体的基因型为 AA。
0/1
含义：杂合基因型（Heterozygous）。
解释：一个等位基因与参考序列一致（0），另一个为变异等位基因（1）。例如，参考序列为 A，变异为 G，则个体的基因型为 AG。
1/1
含义：纯合变异基因型（Homozygous alternative）。
解释：该位点的两个等位基因均为变异等位基因。例如，参考序列为 A，变异为 G，则个体的基因型为 GG。

将csv文件中的基因型进行替换

import pandas as pd
import sys

def recode_genotypes(input_file, output_file):
    """将CSV文件中的基因型编码(0/0,0/1,1/1)替换为数值编码(0,2,1)"""
    # 定义替换映射
    genotype_map = {
        '0/0': '0',
        '0/1': '2',
        '1/1': '1',
        './.': '.'  # 缺失值保持不变
    }
    
    try:
        # 读取CSV文件
        df = pd.read_csv(input_file)
        
        # 对所有列应用替换
        for col in df.columns:
            df[col] = df[col].replace(genotype_map)
        
        # 保存结果
        df.to_csv(output_file, index=False)
        print(f"成功处理文件并保存到 {output_file}")
        
    except Exception as e:
        print(f"处理文件时出错: {e}")
        sys.exit(1)

def main():
    if len(sys.argv) != 3:
        print(f"用法: {sys.argv[0]}  ")
        sys.exit(1)
    
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    
    recode_genotypes(input_file, output_file)

if __name__ == "__main__":
    main()

版权声明：
作者：感冒的梵高
链接：https://www.techfm.club/p/214292.html
来源：TechFM
文章版权归作者所有，未经允许请勿转载。

THE END

二维码

《椰岛海风起—-写给三平》（46）

< <上一篇

写作|挑战码字失败后，重新打卡第五天。

下一篇>>

搜索内容

SV样本信息提取

将csv文件中的基因型进行替换

取消回复

共有 0 条评论

Ads