目标

Merge the CO2_sensor_measurements.csv, temperature_humidity.csv, and sensors_metadata_updated.csv, into a single dataframe.

The merged dataframe contains:
- index: the time instance timestamp of the measurements
- columns: the location of the site LocationName, the sensor ID SensorUnit_ID, the CO2 measurement CO2, the temperature, the humidity, the zone, the altitude, the longitude lon and the latitude lat.

timestamp	LocationName	SensorUnit_ID	CO2	temperature	humidity	zone	altitude	lon	lat
…	…	…	…	…	…	…	…	…	…

For each measurement (CO2, humidity, temperature), take the average over an interval of 30 min.
If there are missing measurements, interpolate them linearly from measurements that are close by in time.

读数据

DATA1 = '../data/carbosense-raw/CO2_sensor_measurements.csv'
DATA2 = '../data/carbosense-raw/sensors_metadata_updated.csv'
DATA3 = '../data/carbosense-raw/temperature_humidity.csv'

df1 = pd.read_csv(DATA1, parse_dates=[0], sep='\t')
df2 = pd.read_csv(DATA2)
df3 = pd.read_csv(DATA3, parse_dates=[0], sep='\t')

发现df1的timestamp是15分钟间隔，而df3是10分钟间隔，需要先各自取平均，再合并表格

处理df1 (resample,interpolate)

# 每个传感器，每30min取均值
# 这里要先用groupby，再resample，否则会把所有传感器的数据合到一起
# 得到的结果中有多余的SensorUnit_ID列，好像也取了个均值，变成了float，这里直接drop了
df1 = df1.groupby(['LocationName','SensorUnit_ID']).resample('30min', on='timestamp').mean().drop(columns=['SensorUnit_ID'])
# 按30min为一个间隔重组后，出现一些nan，插值填上
df1 = df1.interpolate(method='linear')
# 把multiindex拆开
df1 = df1.reset_index()

处理df2

# drop掉不需要的列
df2 = df2.drop(columns = ['Unnamed: 0', 'X', 'Y'])
# 根据结果的需要重命名
df2 = df2.rename(columns={"LON":"lon", "LAT":"lat"})

处理df3 (melt,pivot,split,zip)

# 根据结果的需要重命名
df3 = df3.rename(columns={"Timestamp":"timestamp"})
# 同df1，按30min取均值，填补缺失数据，重设index
df3 = df3.resample('30min', on='timestamp').mean()
df3 = df3.interpolate(method='linear')
df3 = df3.reset_index()

# 此时每一个timestamp的所有数据都还在一行中，要把它们拆开，扁宽的表变成细长的
df3 = df3.melt(id_vars=['timestamp'])

pandas.DataFrame.melt后的结果：

# 此时sensor id和temperature/humidity作为值存在variable中
# 先把字符串拆开，然后分别存到两个新的列中，最后删除旧的variable列
df3['SensorUnit_ID'], df3['type'] = zip(*df3['variable'].str.split('.'))
df3.drop('variable', inplace=True, axis=1)
# 将object类型的id转为int类型，方便后面merge
df3['SensorUnit_ID'] = df3['SensorUnit_ID'].astype(str).astype(int)

# 此时temperature和humidity作为值存在type中，要分别变成列
# 用pivot函数，先把id和timestamp作为multiindex，type拆开
# Note：只用timestamp无法唯一标识，会报错Index contains duplicate entries, cannot reshape
df3 = df3.pivot(index=['SensorUnit_ID','timestamp'], columns=['type'], values='value')

pandas.DataFrame.pivot后：

1 2	# 重设index，方便后面merge df3 = df3.reset_index()

最终：

合并

# 根据SensorUnit_ID和timestamp合并df1和df3
df_merged = pd.merge(df1, df3, on=['SensorUnit_ID','timestamp'], how='outer')
# 再根据LocationName合并df2
df_merged = pd.merge(df_merged, df2, on=['LocationName'], how='outer')

# 根据要求将列重新排序
df_merged = df_merged[['timestamp','LocationName','SensorUnit_ID','CO2','temperature','humidity','zone','altitude','lon','lat']]

# 最后排个序
df_merged = df_merged.sort_values(by=['SensorUnit_ID', 'timestamp'])
# 根据要求将timestamp设为index
df_merged = df_merged.set_index('timestamp')

感觉有点奇怪，timestamp作index不也有duplicate吗？但是比较了前后的shape，是一样的。选了几个record也都没问题。🤔