处理time series笔记
coconutnut

目标

Merge the CO2_sensor_measurements.csv, temperature_humidity.csv, and sensors_metadata_updated.csv, into a single dataframe.

  • The merged dataframe contains:
    • index: the time instance timestamp of the measurements
    • columns: the location of the site LocationName, the sensor ID SensorUnit_ID, the CO2 measurement CO2, the temperature, the humidity, the zone, the altitude, the longitude lon and the latitude lat.
timestamp LocationName SensorUnit_ID CO2 temperature humidity zone altitude lon lat
  • For each measurement (CO2, humidity, temperature), take the average over an interval of 30 min.
  • If there are missing measurements, interpolate them linearly from measurements that are close by in time.

读数据

1
2
3
4
5
6
7
DATA1 = '../data/carbosense-raw/CO2_sensor_measurements.csv'
DATA2 = '../data/carbosense-raw/sensors_metadata_updated.csv'
DATA3 = '../data/carbosense-raw/temperature_humidity.csv'

df1 = pd.read_csv(DATA1, parse_dates=[0], sep='\t')
df2 = pd.read_csv(DATA2)
df3 = pd.read_csv(DATA3, parse_dates=[0], sep='\t')

发现df1的timestamp是15分钟间隔,而df3是10分钟间隔,需要先各自取平均,再合并表格

处理df1 (resample,interpolate)

1
2
3
4
5
6
7
8
# 每个传感器,每30min取均值
# 这里要先用groupby,再resample,否则会把所有传感器的数据合到一起
# 得到的结果中有多余的SensorUnit_ID列,好像也取了个均值,变成了float,这里直接drop了
df1 = df1.groupby(['LocationName','SensorUnit_ID']).resample('30min', on='timestamp').mean().drop(columns=['SensorUnit_ID'])
# 按30min为一个间隔重组后,出现一些nan,插值填上
df1 = df1.interpolate(method='linear')
# 把multiindex拆开
df1 = df1.reset_index()

处理df2

1
2
3
4
# drop掉不需要的列
df2 = df2.drop(columns = ['Unnamed: 0', 'X', 'Y'])
# 根据结果的需要重命名
df2 = df2.rename(columns={"LON":"lon", "LAT":"lat"})

处理df3 (melt,pivot,split,zip)

1
2
3
4
5
6
7
8
9
# 根据结果的需要重命名
df3 = df3.rename(columns={"Timestamp":"timestamp"})
# 同df1,按30min取均值,填补缺失数据,重设index
df3 = df3.resample('30min', on='timestamp').mean()
df3 = df3.interpolate(method='linear')
df3 = df3.reset_index()

# 此时每一个timestamp的所有数据都还在一行中,要把它们拆开,扁宽的表变成细长的
df3 = df3.melt(id_vars=['timestamp'])

pandas.DataFrame.melt后的结果:

1
2
3
4
5
6
# 此时sensor id和temperature/humidity作为值存在variable中
# 先把字符串拆开,然后分别存到两个新的列中,最后删除旧的variable列
df3['SensorUnit_ID'], df3['type'] = zip(*df3['variable'].str.split('.'))
df3.drop('variable', inplace=True, axis=1)
# 将object类型的id转为int类型,方便后面merge
df3['SensorUnit_ID'] = df3['SensorUnit_ID'].astype(str).astype(int)
1
2
3
4
# 此时temperature和humidity作为值存在type中,要分别变成列
# 用pivot函数,先把id和timestamp作为multiindex,type拆开
# Note:只用timestamp无法唯一标识,会报错Index contains duplicate entries, cannot reshape
df3 = df3.pivot(index=['SensorUnit_ID','timestamp'], columns=['type'], values='value')

pandas.DataFrame.pivot后:

1
2
# 重设index,方便后面merge
df3 = df3.reset_index()

最终:

合并

1
2
3
4
5
6
7
8
9
10
11
12
# 根据SensorUnit_ID和timestamp合并df1和df3
df_merged = pd.merge(df1, df3, on=['SensorUnit_ID','timestamp'], how='outer')
# 再根据LocationName合并df2
df_merged = pd.merge(df_merged, df2, on=['LocationName'], how='outer')

# 根据要求将列重新排序
df_merged = df_merged[['timestamp','LocationName','SensorUnit_ID','CO2','temperature','humidity','zone','altitude','lon','lat']]

# 最后排个序
df_merged = df_merged.sort_values(by=['SensorUnit_ID', 'timestamp'])
# 根据要求将timestamp设为index
df_merged = df_merged.set_index('timestamp')

感觉有点奇怪,timestamp作index不也有duplicate吗?但是比较了前后的shape,是一样的。选了几个record也都没问题。🤔