I recently finished writing a high-volume spec-data reduction program that will replace one written in LabVIEW. However, as I compare my results against LabVIEW's, I'm finding 0.01%–1% discrepancies in very basic calculations, which has me scratching my head.
Both programs start the analysis by reading in the data file and importing the raw data for each mass. Each batch of raw data is then fit with a linear regression to get its value at t = 0. We also calculate uncertainties, averages, standard deviations, and so on.
Here is an example of a raw data file. Note that the raw intensity data carry seven significant figures.
He 4420 CB-UV 2023-09-22T10:40:49 time_sec 2.02 3.02 4.00 5.25 40.02
23.295 9.387086E-11 1.674618E-13 -3.206630E-16 3.373704E-14 2.288376E-14
28.895 9.180904E-11 1.690122E-13 1.364953E-15 3.260850E-14 3.299508E-14
2.789 3.386823E-10 1.332874E-10 5.501391E-14 3.523762E-14 1.053051E-11
8.289 3.412776E-10 1.342065E-10 5.657902E-14 3.267960E-14 1.117998E-11
13.789 3.447253E-10 1.349348E-10 6.633267E-14 3.418557E-14 1.199392E-11
19.189 3.458286E-10 1.354857E-10 6.325153E-14 3.507785E-14 1.242906E-11
24.689 3.471647E-10 1.357906E-10 6.458323E-14 3.259249E-14 1.279006E-11
30.289 3.500519E-10 1.367072E-10 6.307386E-14 3.412402E-14 1.334153E-11
35.689 3.489635E-10 1.364054E-10 6.470872E-14 3.464260E-14 1.394222E-11
41.189 3.507102E-10 1.371554E-10 6.159706E-14 3.401083E-14 1.416036E-11
46.689 3.517030E-10 1.372364E-10 6.440496E-14 3.328290E-14 1.531714E-11
52.089 3.526178E-10 1.376700E-10 6.662408E-14 3.173344E-14 1.542191E-11
57.589 3.547731E-10 1.381437E-10 6.997605E-14 3.255263E-14 1.615525E-11
63.089 3.583457E-10 1.384707E-10 7.329222E-14 3.186776E-14 1.576015E-11
68.589 3.553523E-10 1.388127E-10 7.325879E-14 3.174203E-14 1.623024E-11
73.989 3.553024E-10 1.394373E-10 6.629945E-14 3.500163E-14 1.659189E-11
79.489 3.567244E-10 1.395758E-10 6.466660E-14 3.414220E-14 1.709554E-11
85.000 3.558431E-10 1.396115E-10 7.296217E-14 3.356015E-14 1.707882E-11
90.400 3.584943E-10 1.399037E-10 7.224490E-14 3.036364E-14 1.758635E-11
95.900 3.618750E-10 1.404177E-10 7.380499E-14 3.356473E-14 1.764968E-11
He 4420 CB-UV 2023-09-22T10:50:29
The first column is time_sec, the second column is the intensity at 2.02 amu, the third the intensity at 3.02 amu, and so on. The first two rows are pre-analysis baseline measurements; the rows after that are the actual analysis data.
If you want to run these numbers yourself, exclude the first two rows.
The data is imported and the 4/3 ratio calculated in the function parse_raw():
```python
import pandas as pd

def parse_raw(file_path):
    # import the data file as df
    df = pd.read_csv(file_path, sep='\t', header=1)
    # check if the data file is empty
    if df.shape[0] <= 3:
        raise ValueError(f"File {file_path} contains no data.")
    # grab unique analysis data from non-header 'headers'
    helium_number = int(df.columns[0].split()[1])
    analysis_label = str(df.columns[1])
    timestamp = pd.to_datetime(df.columns[2], format='%Y-%m-%dT%H:%M:%S')
    # grab the pre-analysis baseline and raw analysis data
    PAB_data = df.iloc[0:2].dropna(axis=1).apply(pd.to_numeric)
    raw_data = df.iloc[3:-1].dropna(axis=1).apply(pd.to_numeric)
    # these data files are a mess, jesus, just rename the headers manually
    data_cols = ['time_sec', '2 amu', '3 amu', '4 amu', '5 amu', '40 amu']
    raw_data.columns = data_cols
    PAB_data.columns = data_cols
    # get timestamps from time_sec
    time_sec = raw_data['time_sec']
    # re-organize raw_data
    org_data = {
        "2 amu": raw_data['2 amu'],
        "3 amu": raw_data['3 amu'],
        "4 amu": raw_data['4 amu'],
        "5 amu": raw_data['5 amu'],
        "40 amu": raw_data['40 amu'],
        "4/3 Ratio": pd.DataFrame(columns=['4/3 Ratio'])
    }
    # calculate 4/3 Ratio * 1000
    org_data['4/3 Ratio'] = raw_data['4 amu'] / raw_data['3 amu'] * 1000
    return helium_number, analysis_label, timestamp, time_sec, PAB_data, org_data
```
When it comes time to run the calculations, the code is as follows. Note the filters [active_data==1] and [m3_active==1] / [m4_active==1]; for simplicity, though, I'm not flagging any outliers in either LabVIEW or Python.
```python
import numpy as np
from scipy.stats import linregress

# Calculates statistics and stores them in the 'filtered_data[analysis]' object
# only works on one analysis at a time
def calculate_stats(analysis):
    # calculate average of pre-analysis baseline data
    for mass in ['2 amu', '3 amu', '4 amu', '5 amu', '40 amu']:
        analysis.PAB_mean[mass] = np.mean(analysis.PAB_data[mass])
    # calculate t0 intercepts for masses 2, 3, 4, and 40
    for mass in ['2 amu', '3 amu', '4 amu', '40 amu']:
        # exclude any outlier data
        active_data = analysis.data_status[mass].to_numpy().flatten()
        x = analysis.time_sec[active_data == 1]
        y = analysis.raw_data[mass][active_data == 1]
        # Linear regression with linregress
        result = linregress(x, y)
        analysis.t0_intercept[mass] = result.intercept
        analysis.t0_uncertainty[mass] = result.intercept_stderr
        analysis.t0_pctuncert[mass] = result.intercept_stderr / result.intercept * 100
        analysis.t0_slope[mass] = result.slope
    # ratio 4/3 intercept, uncertainty, percent uncertainty, and slope
    analysis.ratio43_t0int = analysis.t0_intercept['4 amu'] / analysis.t0_intercept['3 amu'] * 1000
    analysis.ratio43_uncert = analysis.ratio43_t0int * np.sqrt(
        (analysis.t0_uncertainty['3 amu'] / analysis.t0_intercept['3 amu'])**2 +
        (analysis.t0_uncertainty['4 amu'] / analysis.t0_intercept['4 amu'])**2)
    analysis.ratio43_puncert = analysis.ratio43_uncert / analysis.ratio43_t0int * 100
    # 4/3 slope calculation feels asinine but here we go
    # todo: filter by active indices
    fit_43_ratio_data = linregress(analysis.time_sec,
                                   analysis.raw_data['4 amu'] / analysis.raw_data['3 amu'])
    analysis.ratio43_slope = fit_43_ratio_data.slope
    # ratio 4/3 average, SD, %SD
    m4_active = analysis.data_status['4 amu'].to_numpy().flatten()
    m3_active = analysis.data_status['3 amu'].to_numpy().flatten()
    analysis.averages['4/3 Ratio avg'] = np.mean(analysis.raw_data['4 amu'][m4_active == 1] / analysis.raw_data['3 amu'][m3_active == 1])
    analysis.averages['4/3 Ratio std'] = np.std(analysis.raw_data['4 amu'][m4_active == 1] / analysis.raw_data['3 amu'][m3_active == 1])
    analysis.averages['4/3 Ratio %SD'] = analysis.averages['4/3 Ratio std'] / analysis.averages['4/3 Ratio avg'] * 100
    # calculate averages for masses 2, 3, 4, 5, and 40
    for mass in ['2 amu', '3 amu', '4 amu', '5 amu', '40 amu']:
        active_data = analysis.data_status[mass].to_numpy().flatten()
        analysis.averages[f'{mass} avg'] = np.mean(analysis.raw_data[mass][active_data == 1])
        analysis.averages[f'{mass} std'] = np.std(analysis.raw_data[mass][active_data == 1])
        analysis.averages[f'{mass} %SD'] = (analysis.averages[f'{mass} std'] / analysis.averages[f'{mass} avg']) * 100
```
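For what it's worth, the intercept uncertainty above relies on linregress's intercept_stderr field. As a sanity check of what that field reports, here is a minimal sketch (synthetic, made-up data, not the real file) comparing it against the textbook OLS formula s·sqrt(1/n + x̄²/Sxx):

```python
import numpy as np
from scipy.stats import linregress

# synthetic data shaped roughly like one mass channel (made-up numbers)
rng = np.random.default_rng(42)
x = np.linspace(2.789, 95.9, 20)
y = 1.3e-10 + 6.0e-13 * x + rng.normal(0.0, 1e-12, x.size)

res = linregress(x, y)

# textbook OLS intercept standard error: s * sqrt(1/n + xbar^2 / Sxx)
n = x.size
resid = y - (res.intercept + res.slope * x)
s2 = resid @ resid / (n - 2)             # residual variance
sxx = np.sum((x - x.mean()) ** 2)
se_intercept = np.sqrt(s2 * (1.0 / n + x.mean() ** 2 / sxx))
# se_intercept agrees with res.intercept_stderr
```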
And that's about it. The script above is the only place in the entire program where any calculation happens; that is, there is no secret script running thousands of arithmetic operations whose rounding errors could be adding up.
Still, when I compare my results against LabVIEW code that should be doing exactly the same thing, I find discrepancies. Now, there are plenty of ways to compute the error on a line's intercept, so I'm not surprised we arrive at different uncertainties. But different averages?? Different standard deviations? Those mismatches tell me something is very wrong.
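One convention difference worth ruling out on the standard-deviation side: np.std defaults to the population formula (divide by n, i.e. ddof=0), whereas Excel's STDEV/STDEV.S uses the sample formula (divide by n−1); LabVIEW's statistics VIs can follow either convention depending on their weighting input. With only ~20 points, that choice alone shifts the SD by a couple of percent:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
pop_sd = np.std(x)            # population SD (NumPy default, ddof=0) -> sqrt(2)
samp_sd = np.std(x, ddof=1)   # sample SD (Excel STDEV convention)    -> sqrt(2.5)
```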
For example, the average of the mass-2 data (the second column, excluding the first two rows) comes out to:
Python:  3.5234E-10
LabVIEW: 3.5160E-10
Excel:   3.5158E-10
The Python and LabVIEW averages differ by about 0.2%. I tried double-checking the math in Excel and got a third value, which suggests my Python math is the one that's off.
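As a cross-check that sidesteps file parsing entirely, here is the mean of the 18 analysis values from the mass-2 column above, typed in by hand:

```python
import numpy as np

# mass-2 (2.02 amu) intensities, analysis rows only (the two baseline rows excluded)
mass2 = np.array([
    3.386823e-10, 3.412776e-10, 3.447253e-10, 3.458286e-10, 3.471647e-10,
    3.500519e-10, 3.489635e-10, 3.507102e-10, 3.517030e-10, 3.526178e-10,
    3.547731e-10, 3.583457e-10, 3.553523e-10, 3.553024e-10, 3.567244e-10,
    3.558431e-10, 3.584943e-10, 3.618750e-10,
])
mean2 = mass2.mean()  # ≈ 3.5158E-10, matching the Excel and LabVIEW values
```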
As for the t = 0 intercepts, I get the following:
         mass 3      mass 4
Python:  1.341E-10   5.980E-14
LabVIEW: 1.339E-10   5.869E-14
Excel:   1.339E-10   5.869E-14
Given that LabVIEW and Excel agree here, it seems clear that Python is the culprit behind these discrepancies.
I've kept the math as simple and direct as I can, so why am I getting inaccurate results, and what can I do to fix them?