Summary and Review
Consolidating basic Python syntax: a complete review from variables to loops
Chapter Summary
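1. Variables and Data Types
Core concepts:
- Variable: a name bound to a value; no type declaration is needed (dynamic typing)
- Five basic data types:
  - int: integers (age, population, year)
  - float: floating-point numbers (income, GDP, interest rate)
  - str: strings (name, region, free text)
  - bool: Boolean values (True/False, e.g. whether someone is employed)
  - None: the null value (e.g. a missing value)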
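As a quick check of the five types listed above, the sketch below binds one hypothetical survey value to each type and prints what `type()` reports. The variable names and values are illustrative assumptions, not data from this chapter.

```python
# Hypothetical survey values, one per basic type.
age = 25              # int
income = 50000.0      # float
region = "East"       # str
is_employed = True    # bool
spouse_income = None  # NoneType: a missing value

for value in (age, income, region, is_employed, spouse_income):
    print(repr(value), type(value).__name__)

# Dynamic typing: the same name can later be bound to a value of another type.
age = "25"
print(type(age).__name__)     # str

# Test for a missing value with `is None` rather than `== None`.
print(spouse_income is None)  # True
```

Because names carry no declared type, checking `type()` (or using `isinstance`) is often the first debugging step when a calculation on survey data fails unexpectedly.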
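Naming conventions:
# Good names
student_age = 25
avg_income = 50000
is_employed = True
# Poor names
a = 25  # too short to be meaningful
StudentAge = 25  # not Python style (use snake_case for variables)
# 2020_data = 100  # SyntaxError: a name cannot start with a digit
Type conversion:
age = int("25")  # str → int
income = float("50000")  # str → float
text = str(123)  # int → str
2. Operators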
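Arithmetic operators:
+ # addition
- # subtraction
* # multiplication
/ # division (always returns a float)
// # floor division (rounds down; an int when both operands are ints)
% # remainder
** # exponentiation
Comparison operators:
== # equal to
!= # not equal to
> # greater than
< # less than
>= # greater than or equal to
<= # less than or equal to
Logical operators:
and # true only if both conditions are true
or # true if at least one condition is true
not # negation
Operator precedence (highest to lowest):
- Exponentiation: **
- Multiplication and division: *, /, //, %
- Addition and subtraction: +, -
- Comparison: ==, !=, >, <, >=, <=
- not
- and
- or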
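The precedence order above is easiest to remember from a few concrete expressions. The following is a minimal sketch; the unary-minus case is not in the list above but follows from Python's rule that `**` binds tighter than a unary minus on its left.

```python
# Exponentiation binds tighter than unary minus and multiplication.
print(-2 ** 2)         # -4, parsed as -(2 ** 2)
print(2 * 3 ** 2)      # 18, parsed as 2 * (3 ** 2)

# Arithmetic is evaluated before comparison.
print(1 + 2 * 3 == 7)  # True

# not binds tighter than and, which binds tighter than or.
print(True or False and False)    # True, parsed as True or (False and False)
print(not True or True)           # True, parsed as (not True) or True
print((True or False) and False)  # False: parentheses override precedence
```

When a condition mixes `and` and `or`, adding explicit parentheses is usually clearer than relying on the reader to recall the order.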
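3. Conditional Statements
Basic syntax:
if condition:
    ...  # runs when condition is true
elif another_condition:
    ...  # runs when the earlier conditions are false and this one is true
else:
    ...  # runs when every condition above is false
Practical use:
# Income grouping
if income < 30000:
    income_group = "low income"
elif income < 80000:
    income_group = "middle income"
else:
    income_group = "high income"
# Conditional expression (ternary operator)
status = "pass" if score >= 60 else "fail"
Multiple conditions:
# Using and
if age >= 18 and income > 0:
    print("valid sample")
# Using or
if gender == "Male" or gender == "Female":
    print("valid gender")
# Using in (more elegant)
if gender in ["Male", "Female", "Other"]:
    print("valid gender")
4. Loops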
for loops (iterate over a sequence):
# Iterate over a list
ages = [25, 30, 35, 40]
for age in ages:
    print(age)
# Iterate over a range
for i in range(5):  # 0, 1, 2, 3, 4
    print(i)
# Iterate with an index
for index, age in enumerate(ages):
    print(f"index {index}: {age}")
while loops (condition-based):
count = 0
while count < 5:
    print(count)
    count += 1
Loop control:
# break: exit the loop
for i in range(10):
    if i == 5:
        break  # stop as soon as i reaches 5
    print(i)
# continue: skip the current iteration
for i in range(5):
    if i == 2:
        continue  # skip 2
    print(i)  # prints 0, 1, 3, 4
# else: runs when the loop finishes without hitting break
for i in range(3):
    print(i)
else:
    print("loop finished normally")
List comprehensions (concise loops):
# Traditional loop
squares = []
for x in range(5):
    squares.append(x ** 2)
# List comprehension (more concise)
squares = [x ** 2 for x in range(5)]
# List comprehension with a condition
evens = [x for x in range(10) if x % 2 == 0]
Quick Reference
Python vs Stata vs R
| Operation | Python | Stata | R |
|---|---|---|---|
| Create a variable | age = 25 | gen age = 25 | age <- 25 |
| Conditional | if age > 18: | if age > 18 { | if (age > 18) { |
| Numeric loop (1 to 10) | for i in range(1, 11): | forvalues i = 1/10 { | for (i in 1:10) { |
| Loop over a list | for x in list: | foreach x in list { | for (x in list) { |
| Logical AND | and | & | & |
| Logical OR | or | \| | \| |
| Floor division | 10 // 3 | floor(10/3) | 10 %/% 3 |
| Remainder | 10 % 3 | mod(10, 3) | 10 %% 3 |
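The floor-division and remainder rows of the table behave the same for positive numbers in all three languages, but negative operands are where surprises usually appear. The sketch below shows Python's behaviour only, with values chosen for illustration.

```python
# Floor division rounds toward negative infinity (a true floor).
print(10 // 3)    # 3
print(-7 // 2)    # -4, not -3

# The remainder takes the sign of the divisor.
print(10 % 3)     # 1
print(-7 % 2)     # 1

# divmod returns both results; quotient * divisor + remainder == dividend.
q, r = divmod(-7, 2)
print(q, r)       # -4 1

# range(10) yields 0..9, whereas range(1, 11) yields 1..10.
print(list(range(1, 11)))
```

Note that `range(10)` starts at 0 and stops at 9, so a loop that should run from 1 to 10 needs `range(1, 11)`.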
Common Patterns
# Pattern 1: data validation
if 18 <= age <= 100 and income > 0:
    print("valid data")
# Pattern 2: grouped counts
income_groups = {"low": 0, "middle": 0, "high": 0}
for income in incomes:
    if income < 30000:
        income_groups["low"] += 1
    elif income < 80000:
        income_groups["middle"] += 1
    else:
        income_groups["high"] += 1
# Pattern 3: filtering a list
valid_ages = [age for age in ages if 18 <= age <= 100]
# Pattern 4: running total
total = 0
for income in incomes:
    total += income
average = total / len(incomes)  # assumes incomes is not empty
# Pattern 5: conditional count
count = sum(1 for age in ages if age > 30)
Common Pitfalls and Best Practices
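Pitfall 1: indentation errors
# Wrong (inconsistent indentation)
if age > 18:
    print("adult")
      print("can vote")  # IndentationError: unexpected indent
# Right (use 4 spaces per level)
if age > 18:
    print("adult")
    print("can vote")
Pitfall 2: == vs =
# Wrong (assignment instead of comparison)
if age = 18:  # SyntaxError
    print("18 years old")
# Right (comparison operator)
if age == 18:
    print("18 years old")
Pitfall 3: floor division vs true division
# In Python 3, / always returns a float
print(10 / 3)  # 3.3333...
print(10 // 3)  # 3 (floor division)
# Stata and R also give 3.333... for 10/3; for floor division use floor(10/3) in Stata or 10 %/% 3 in R
Pitfall 4: range() excludes the end value
# Misconception: expecting 5 to be included
for i in range(1, 5):
    print(i)  # prints 1, 2, 3, 4 (5 is not included!)
# Correct
for i in range(1, 6):  # to include 5, write 6
    print(i)  # prints 1, 2, 3, 4, 5
Pitfall 5: modifying a list while looping over it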
# Wrong (removing items during iteration can skip elements)
ages = [15, 25, 35, 45]
for age in ages:
    if age < 18:
        ages.remove(age)  # dangerous!
# Right (use a list comprehension)
ages = [age for age in ages if age >= 18]
# Or build a new list
valid_ages = []
for age in ages:
    if age >= 18:
        valid_ages.append(age)
Best practice 1: avoid deep nesting
# Poor (nested too deeply)
if age > 18:
    if income > 0:
        if gender in ["Male", "Female"]:
            if education >= 12:
                print("valid sample")
# Better (combine the conditions with and)
if age > 18 and income > 0 and gender in ["Male", "Female"] and education >= 12:
    print("valid sample")
# Or use a function with early returns
def is_valid_sample(age, income, gender, education):
    if age <= 18:
        return False
    if income <= 0:
        return False
    if gender not in ["Male", "Female"]:
        return False
    if education < 12:
        return False
    return True
Best practice 2: use meaningful variable names
# Poor
total = 0
for i in data:
    if i > 0:
        total += i
# Better
total_income = 0
for income in incomes:
    if income > 0:
        total_income += income
Best practice 3: make good use of the in operator
# Less elegant
if gender == "Male" or gender == "Female" or gender == "Other":
    print("valid")
# More elegant
if gender in ["Male", "Female", "Other"]:
    print("valid")
# Faster membership test (use a set)
VALID_GENDERS = {"Male", "Female", "Other"}
if gender in VALID_GENDERS:
    print("valid")
Exercises
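Basic exercises (1-3)
Exercise 1: Income tax calculator
Task: Write a program that computes the tax due from annual income. The rate of the bracket applies to the entire income:
- Income ≤ 30,000: tax-free
- 30,000 < income ≤ 80,000: 10%
- 80,000 < income ≤ 150,000: 20%
- Income > 150,000: 30%
Requirements:
- Define the function calculate_tax(income)
- Return the tax due (a float)
- Handle negative income (return 0)
Input/output examples:
calculate_tax(25000)  # returns 0
calculate_tax(50000)  # returns 5000.0
calculate_tax(100000)  # returns 20000.0
calculate_tax(-1000)  # returns 0
Hint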
Use an if-elif-else structure:
def calculate_tax(income):
    if income <= 30000:
        return 0
    elif income <= 80000:
        return income * 0.1
    # continue from here...
Reference answer
def calculate_tax(income):
    """
    Compute the tax due on annual income.
    Parameters:
        income (float): annual income
    Returns:
        float: tax due
    """
    # Zero or negative income owes no tax
    if income <= 0:
        return 0
    # Apply the bracket rate to the whole income
    if income <= 30000:
        tax = 0
    elif income <= 80000:
        tax = income * 0.1
    elif income <= 150000:
        tax = income * 0.2
    else:
        tax = income * 0.3
    return tax
# Tests
print(calculate_tax(25000))  # 0
print(calculate_tax(50000))  # 5000.0
print(calculate_tax(100000))  # 20000.0
print(calculate_tax(200000))  # 60000.0
print(calculate_tax(-1000))  # 0
Exercise 2: Data cleaning - outlier detection
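Task: You have survey data (a list of ages) and need to remove invalid values.
Requirements:
- Drop samples with age < 18 or age > 100
- Drop missing values (None)
- Return the cleaned list and the number of samples removed
Input/output example:
ages = [25, 150, 30, None, 15, 35, -5, 40, 200, 28]
clean_ages, removed_count = clean_age_data(ages)
print(clean_ages)  # [25, 30, 35, 40, 28]
print(removed_count)  # 5
Hint
Use a list comprehension with a condition:
clean_ages = [age for age in ages if age is not None and 18 <= age <= 100]
Reference answer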
def clean_age_data(ages):
    """
    Clean age data by removing outliers and missing values.
    Parameters:
        ages (list): ages (may contain None and outliers)
    Returns:
        tuple: (cleaned list, number of samples removed)
    """
    # Method 1: list comprehension
    clean_ages = [age for age in ages
                  if age is not None and 18 <= age <= 100]
    removed_count = len(ages) - len(clean_ages)
    return clean_ages, removed_count
# Method 2: explicit loop (more verbose)
def clean_age_data_v2(ages):
    clean_ages = []
    removed_count = 0
    for age in ages:
        # Skip missing values
        if age is None:
            removed_count += 1
            continue
        # Check the valid range
        if 18 <= age <= 100:
            clean_ages.append(age)
        else:
            removed_count += 1
    return clean_ages, removed_count
# Tests
ages = [25, 150, 30, None, 15, 35, -5, 40, 200, 28]
clean, removed = clean_age_data(ages)
print(f"Cleaned: {clean}")
print(f"Removed {removed} samples")
Exercise 3: Converting scores to letter grades
Task: Convert numeric scores to letter grades (A/B/C/D/F).
Rules:
- A: 90-100
- B: 80-89
- C: 70-79
- D: 60-69
- F: 0-59
- Invalid scores (< 0 or > 100) return "Invalid"
Requirements:
- Write the function score_to_grade(score)
- Convert a list of scores in batch with batch_convert(scores)
Input/output examples:
score_to_grade(95)  # "A"
score_to_grade(75)  # "C"
score_to_grade(55)  # "F"
score_to_grade(105)  # "Invalid"
scores = [95, 85, 75, 65, 55, 105, -10]
grades = batch_convert(scores)
print(grades)  # ['A', 'B', 'C', 'D', 'F', 'Invalid', 'Invalid']
Reference answer
def score_to_grade(score):
    """
    Convert a numeric score to a letter grade.
    Parameters:
        score (int/float): score (0-100)
    Returns:
        str: grade (A/B/C/D/F or Invalid)
    """
    # Reject scores outside the valid range
    if score < 0 or score > 100:
        return "Invalid"
    # Assign the grade
    if score >= 90:
        return "A"
    elif score >= 80:
        return "B"
    elif score >= 70:
        return "C"
    elif score >= 60:
        return "D"
    else:
        return "F"
def batch_convert(scores):
    """Convert a list of scores to grades."""
    return [score_to_grade(score) for score in scores]
# Tests
print(score_to_grade(95))  # A
print(score_to_grade(75))  # C
print(score_to_grade(55))  # F
print(score_to_grade(105))  # Invalid
scores = [95, 85, 75, 65, 55, 105, -10]
grades = batch_convert(scores)
print(grades)
Applied exercises (4-7)
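Exercise 4: Income group statistics
Task: Group a set of incomes and compute the number of people and the average income in each group.
Income groups:
- Low income (低收入): < 30,000
- Middle income (中等收入): 30,000 - 80,000 (both ends inclusive)
- High income (高收入): > 80,000
Requirements:
- Return a dictionary with the count and average income of each group
- Handle an empty list
Input/output example:
incomes = [25000, 50000, 85000, 30000, 75000, 120000, 20000, 90000]
result = income_statistics(incomes)
print(result)
# Averages are shown rounded to two decimals:
# {
#     '低收入': {'count': 2, 'average': 22500.0},
#     '中等收入': {'count': 3, 'average': 51666.67},
#     '高收入': {'count': 3, 'average': 98333.33}
# }
Hint
- First loop over the data and collect the values of each group
- Then compute the statistics for each group
Reference answer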
def income_statistics(incomes):
    """
    Compute grouped statistics for income data.
    Parameters:
        incomes (list): incomes
    Returns:
        dict: statistics for each group
    """
    # Handle an empty list
    if not incomes:
        return {}
    # Initialize the groups
    groups = {
        '低收入': [],
        '中等收入': [],
        '高收入': []
    }
    # Assign each income to its group
    for income in incomes:
        if income < 30000:
            groups['低收入'].append(income)
        elif income <= 80000:
            groups['中等收入'].append(income)
        else:
            groups['高收入'].append(income)
    # Compute the statistics
    result = {}
    for group_name, group_incomes in groups.items():
        if group_incomes:  # skip empty groups
            result[group_name] = {
                'count': len(group_incomes),
                'average': sum(group_incomes) / len(group_incomes)
            }
    return result
# Tests
incomes = [25000, 50000, 85000, 30000, 75000, 120000, 20000, 90000]
result = income_statistics(incomes)
for group, stats in result.items():
    print(f"{group}: {stats['count']} people, average ${stats['average']:,.2f}")
Exercise 5: Testing and generating primes
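Task: Write a program that tests whether a number is prime and generates all primes in a given range.
Requirements:
- Function is_prime(n) tests whether n is prime
- Function generate_primes(start, end) returns all primes in the range (end inclusive)
Input/output examples:
is_prime(7)  # True
is_prime(10)  # False
primes = generate_primes(1, 20)
print(primes)  # [2, 3, 5, 7, 11, 13, 17, 19]
Hint
A prime is a natural number greater than 1 that is divisible only by 1 and itself. It is enough to test divisors up to √n, because any factor larger than √n is paired with one smaller than √n.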
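To see why testing up to √n is enough, it can help to list the divisors of a number in pairs. The helper below, `divisor_pairs`, is a hypothetical name introduced only for this illustration; it is not part of the exercise.

```python
import math

def divisor_pairs(n):
    """Return every (d, n // d) pair with d <= sqrt(n)."""
    pairs = []
    for d in range(1, math.isqrt(n) + 1):
        if n % d == 0:
            pairs.append((d, n // d))
    return pairs

print(divisor_pairs(36))  # [(1, 36), (2, 18), (3, 12), (4, 9), (6, 6)]
print(divisor_pairs(37))  # [(1, 37)]: only the trivial pair, so 37 is prime
```

Every divisor above √n appears as the second element of a pair whose first element is at most √n, so a search that stops at √n cannot miss a factor.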
Reference answer
def is_prime(n):
    """
    Test whether a number is prime.
    Parameters:
        n (int): number to test
    Returns:
        bool: True if n is prime
    """
    # Numbers below 2 are not prime
    if n < 2:
        return False
    # 2 is prime
    if n == 2:
        return True
    # Other even numbers are not prime
    if n % 2 == 0:
        return False
    # Test odd divisors up to √n
    for i in range(3, int(n ** 0.5) + 1, 2):
        if n % i == 0:
            return False
    return True
def generate_primes(start, end):
    """
    Generate all primes in a range.
    Parameters:
        start (int): first value
        end (int): last value (inclusive)
    Returns:
        list: primes in the range
    """
    # Method 1: list comprehension
    return [n for n in range(start, end + 1) if is_prime(n)]
    # Method 2: explicit loop
    # primes = []
    # for n in range(start, end + 1):
    #     if is_prime(n):
    #         primes.append(n)
    # return primes
# Tests
print(is_prime(2))  # True
print(is_prime(7))  # True
print(is_prime(10))  # False
print(is_prime(17))  # True
primes = generate_primes(1, 50)
print(f"Primes between 1 and 50: {primes}")
print(f"{len(primes)} in total")
Exercise 6: Survey response encoder
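Task: Convert text answers from a survey into numeric codes.
Coding rules:
- Education level: 小学 (primary school) = 1, 初中 (junior high school) = 2, 高中 (senior high school) = 3, 大学 (university) = 4, 研究生 (graduate school) = 5
- Income level: 很低 (very low) = 1, 较低 (fairly low) = 2, 中等 (medium) = 3, 较高 (fairly high) = 4, 很高 (very high) = 5
Requirements:
- Match answers case-insensitively
- Code unrecognized answers as -1
- Process multiple answers in batch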
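The case-insensitivity requirement only matters for labels written in a script that has letter case. As a minimal sketch, assuming a hypothetical mapping with lowercase English labels (`EDUCATION_CODES` and `encode_label` are illustrative names, not part of the exercise), the lookup key can be normalised before the dictionary lookup:

```python
# Hypothetical mapping with English labels, used only to illustrate case handling.
# The keys must already be lowercase for the normalised lookup to work.
EDUCATION_CODES = {"primary": 1, "middle school": 2, "high school": 3,
                   "university": 4, "graduate": 5}

def encode_label(label, mapping):
    """Look up a label after trimming whitespace and folding case; -1 if unknown."""
    key = label.strip().casefold()
    return mapping.get(key, -1)

print(encode_label("  University ", EDUCATION_CODES))  # 4
print(encode_label("HIGH SCHOOL", EDUCATION_CODES))    # 3
print(encode_label("unknown", EDUCATION_CODES))        # -1
```

For Chinese labels such as those in this exercise, `casefold()` leaves the text unchanged, so trimming whitespace is the only normalisation that has an effect.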
Input/output example:
education_responses = ["大学", "高中", "研究生", "大学", "未知学历"]
codes = encode_education(education_responses)
print(codes)  # [4, 3, 5, 4, -1]
Reference answer
def encode_education(responses):
    """
    Convert education-level text to numeric codes.
    Parameters:
        responses (list): education-level text answers
    Returns:
        list: numeric codes
    """
    # Code mapping
    mapping = {
        "小学": 1,
        "初中": 2,
        "高中": 3,
        "大学": 4,
        "研究生": 5
    }
    # Encode each answer
    codes = []
    for response in responses:
        # Trim surrounding whitespace (Chinese labels have no letter case;
        # for Latin-script labels, also lowercase the answer and the mapping keys)
        response = response.strip()
        code = mapping.get(response, -1)  # -1 when the answer is not recognized
        codes.append(code)
    return codes
def encode_income_level(responses):
    """
    Convert income-level text to numeric codes.
    Parameters:
        responses (list): income-level text answers
    Returns:
        list: numeric codes
    """
    mapping = {
        "很低": 1,
        "较低": 2,
        "中等": 3,
        "较高": 4,
        "很高": 5
    }
    return [mapping.get(r.strip(), -1) for r in responses]
# Generic encoder
def encode_responses(responses, mapping):
    """
    Generic encoder.
    Parameters:
        responses (list): text answers
        mapping (dict): code mapping
    Returns:
        list: numeric codes
    """
    return [mapping.get(r.strip(), -1) for r in responses]
# Tests
education_responses = ["大学", "高中", "研究生", "大学", "未知学历"]
education_codes = encode_education(education_responses)
print(f"Education codes: {education_codes}")
income_responses = ["中等", "较高", "很低", "中等", "不知道"]
income_codes = encode_income_level(income_responses)
print(f"Income codes: {income_codes}")
# Using the generic encoder
education_mapping = {
    "小学": 1, "初中": 2, "高中": 3, "大学": 4, "研究生": 5
}
codes = encode_responses(education_responses, education_mapping)
print(f"Generic encoder: {codes}")
Exercise 7: Data validator
Task: Write a data validation system that checks whether survey records are valid.
Validation rules:
- Age: 18-100
- Income: > 0
- Gender: "Male", "Female", "Other"
- Years of education: 0-25
Requirements:
- Return the validation result (True/False) for every record
- Report the number and share of invalid records
- Report the number of invalid values per field
Input/output example:
data = [
    {"age": 25, "income": 50000, "gender": "Male", "education": 16},
    {"age": 150, "income": 60000, "gender": "Female", "education": 18},
    {"age": 30, "income": -1000, "gender": "Male", "education": 14},
    {"age": 35, "income": 70000, "gender": "Unknown", "education": 12},
]
results = validate_survey_data(data)
print(results)
# {
#     'valid_count': 1,
#     'invalid_count': 3,
#     'invalid_rate': 0.75,
#     'field_errors': {
#         'age': 1,
#         'income': 1,
#         'gender': 1,
#         'education': 0
#     },
#     'valid_records': [True, False, False, False]
# }
Reference answer
def validate_record(record):
    """
    Validate a single record.
    Parameters:
        record (dict): dictionary with age, income, gender, education
    Returns:
        tuple: (is valid, list of invalid fields)
    """
    errors = []
    # Validate age
    if not (18 <= record['age'] <= 100):
        errors.append('age')
    # Validate income
    if record['income'] <= 0:
        errors.append('income')
    # Validate gender
    if record['gender'] not in ["Male", "Female", "Other"]:
        errors.append('gender')
    # Validate years of education
    if not (0 <= record['education'] <= 25):
        errors.append('education')
    is_valid = len(errors) == 0
    return is_valid, errors
def validate_survey_data(data):
    """
    Validate survey records in batch.
    Parameters:
        data (list): survey records
    Returns:
        dict: validation summary
    """
    valid_count = 0
    invalid_count = 0
    field_errors = {'age': 0, 'income': 0, 'gender': 0, 'education': 0}
    valid_records = []
    # Validate each record
    for record in data:
        is_valid, errors = validate_record(record)
        if is_valid:
            valid_count += 1
        else:
            invalid_count += 1
            # Count errors per field
            for field in errors:
                field_errors[field] += 1
        valid_records.append(is_valid)
    # Share of invalid records
    total = len(data)
    invalid_rate = invalid_count / total if total > 0 else 0
    return {
        'valid_count': valid_count,
        'invalid_count': invalid_count,
        'invalid_rate': invalid_rate,
        'field_errors': field_errors,
        'valid_records': valid_records
    }
# Tests
data = [
    {"age": 25, "income": 50000, "gender": "Male", "education": 16},
    {"age": 150, "income": 60000, "gender": "Female", "education": 18},
    {"age": 30, "income": -1000, "gender": "Male", "education": 14},
    {"age": 35, "income": 70000, "gender": "Unknown", "education": 12},
]
results = validate_survey_data(data)
print(f"Valid records: {results['valid_count']}/{len(data)}")
print(f"Invalid rate: {results['invalid_rate']:.1%}")
print(f"Errors per field: {results['field_errors']}")
print(f"Result per record: {results['valid_records']}")
Challenge exercises (8-10)
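Exercise 8: Gini coefficient calculator
Task: Compute the Gini coefficient of a set of incomes, a measure of income inequality.
Gini coefficient formula (simplified):
Gini = (2 * Σ(i * income_i)) / (n * Σ(income_i)) - (n + 1) / n
where the incomes are sorted in ascending order, i starts at 1, and n is the number of incomes.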
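As a worked check of the formula, take the five incomes 30000, 50000, 50000, 70000, 150000 (already sorted, so $n = 5$), writing $x_i$ for income_i:

```latex
\begin{align*}
\sum_{i=1}^{5} i\,x_i &= 1(30000) + 2(50000) + 3(50000) + 4(70000) + 5(150000) = 1\,310\,000 \\
\sum_{i=1}^{5} x_i &= 30000 + 50000 + 50000 + 70000 + 150000 = 350\,000 \\
\text{Gini} &= \frac{2 \times 1\,310\,000}{5 \times 350\,000} - \frac{5 + 1}{5}
            = 1.4971 - 1.2 \approx 0.297
\end{align*}
```

A perfectly equal distribution gives 0 under this formula, which is a convenient sanity check for any implementation.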
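Requirements:
- Implement the Gini coefficient calculation
- Handle negative and zero incomes (filter them out)
- Return the Gini coefficient (between 0 and 1; larger means more unequal)
Input/output example:
incomes = [30000, 50000, 50000, 70000, 150000]
gini = calculate_gini(incomes)
print(f"Gini coefficient: {gini:.3f}")  # 0.297
Hint
- First filter out negative values and zeros
- Sort the incomes
- Apply the formula
Reference answer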
def calculate_gini(incomes):
    """
    Compute the Gini coefficient.
    Parameters:
        incomes (list): incomes
    Returns:
        float: Gini coefficient (0-1)
    """
    # Filter out negative values and zeros
    valid_incomes = [inc for inc in incomes if inc > 0]
    # An empty list or a single value has no inequality
    if len(valid_incomes) <= 1:
        return 0.0
    # Sort ascending
    sorted_incomes = sorted(valid_incomes)
    n = len(sorted_incomes)
    # Apply the formula
    numerator = 0
    for i, income in enumerate(sorted_incomes, start=1):
        numerator += i * income
    denominator = n * sum(sorted_incomes)
    gini = (2 * numerator) / denominator - (n + 1) / n
    return gini
# A more detailed version (with summary statistics)
def income_inequality_analysis(incomes):
    """
    Full income inequality analysis.
    Returns:
        dict: Gini coefficient and other statistics (None if there is no valid data)
    """
    # Keep valid data only
    valid_incomes = [inc for inc in incomes if inc > 0]
    if len(valid_incomes) == 0:
        return None
    # Sort
    sorted_incomes = sorted(valid_incomes)
    n = len(sorted_incomes)
    # Gini coefficient
    gini = calculate_gini(valid_incomes)
    # Other statistics
    total_income = sum(sorted_incomes)
    mean_income = total_income / n
    # Median: middle value, or the mean of the two middle values when n is even
    mid = n // 2
    if n % 2 == 1:
        median_income = sorted_incomes[mid]
    else:
        median_income = (sorted_incomes[mid - 1] + sorted_incomes[mid]) / 2
    # Income percentiles (simple index-based approximation)
    p10 = sorted_incomes[int(n * 0.1)]
    p50 = median_income
    p90 = sorted_incomes[int(n * 0.9)]
    # Income share of the top 10%
    top10_start = int(n * 0.9)
    top10_income = sum(sorted_incomes[top10_start:])
    top10_share = top10_income / total_income
    return {
        'gini': gini,
        'mean': mean_income,
        'median': median_income,
        'p10': p10,
        'p50': p50,
        'p90': p90,
        'p90_p10_ratio': p90 / p10 if p10 > 0 else None,
        'top10_share': top10_share,
        'sample_size': n
    }
# Tests
incomes = [30000, 50000, 50000, 70000, 150000, 40000, 60000, 80000]
gini = calculate_gini(incomes)
print(f"Gini coefficient: {gini:.3f}")
# Full analysis
analysis = income_inequality_analysis(incomes)
print("\nIncome inequality analysis:")
print(f"  Gini coefficient: {analysis['gini']:.3f}")
print(f"  Mean income: ${analysis['mean']:,.0f}")
print(f"  Median income: ${analysis['median']:,.0f}")
print(f"  P90/P10 ratio: {analysis['p90_p10_ratio']:.2f}")
print(f"  Top 10% income share: {analysis['top10_share']:.1%}")
Exercise 9: Survey skip-logic validator
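Task: Surveys often contain skip logic (for example, "if you are married, answer the questions about your spouse"). Write a program that checks whether the skip logic was respected.
Logic rules:
- If is_married == False, then spouse_age and spouse_income must be None
- If is_employed == False, then occupation and work_years must be None
- If has_children == False, then num_children must be 0 or None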
Requirements:
- Detect logical contradictions
- Return detailed error messages
- Offer suggested fixes
Input/output example:
record = {
    "id": 1001,
    "is_married": False,
    "spouse_age": 30,  # logic error!
    "spouse_income": 50000,
    "is_employed": True,
    "occupation": "Teacher",
    "work_years": 5,
    "has_children": False,
    "num_children": 0  # this is allowed
}
errors = validate_logic(record)
# [
#     "记录 1001: 未婚但填写了配偶年龄",
#     "记录 1001: 未婚但填写了配偶收入"
# ]
Reference answer
def validate_logic(record):
    """
    Validate survey skip logic.
    Parameters:
        record (dict): survey record
    Returns:
        list: error messages (an empty list means no errors)
    """
    errors = []
    record_id = record.get('id', 'Unknown')
    # Rule 1: marital status
    if not record.get('is_married', False):
        if record.get('spouse_age') is not None:
            errors.append(f"记录 {record_id}: 未婚但填写了配偶年龄")
        if record.get('spouse_income') is not None:
            errors.append(f"记录 {record_id}: 未婚但填写了配偶收入")
    # Rule 2: employment status
    if not record.get('is_employed', False):
        if record.get('occupation') is not None:
            errors.append(f"记录 {record_id}: 未就业但填写了职业")
        if record.get('work_years') is not None:
            errors.append(f"记录 {record_id}: 未就业但填写了工作年限")
    # Rule 3: children
    if not record.get('has_children', False):
        num_children = record.get('num_children')
        if num_children is not None and num_children > 0:
            errors.append(f"记录 {record_id}: 无子女但子女数量 > 0")
    return errors
def fix_logic_errors(record):
    """
    Automatically fix logic errors.
    Parameters:
        record (dict): survey record
    Returns:
        dict: fixed copy of the record
    """
    fixed_record = record.copy()
    # Fix marital logic
    if not fixed_record.get('is_married', False):
        fixed_record['spouse_age'] = None
        fixed_record['spouse_income'] = None
    # Fix employment logic
    if not fixed_record.get('is_employed', False):
        fixed_record['occupation'] = None
        fixed_record['work_years'] = None
    # Fix children logic
    if not fixed_record.get('has_children', False):
        fixed_record['num_children'] = 0
    return fixed_record
def batch_validate_and_fix(records):
    """
    Validate and fix records in batch.
    Parameters:
        records (list): survey records
    Returns:
        dict: validation and fix results
    """
    all_errors = []
    fixed_records = []
    for record in records:
        # Validate
        errors = validate_logic(record)
        if errors:
            all_errors.extend(errors)
        # Fix
        fixed_record = fix_logic_errors(record)
        fixed_records.append(fixed_record)
    return {
        'error_count': len(all_errors),
        'errors': all_errors,
        'fixed_records': fixed_records
    }
# Tests
records = [
    {
        "id": 1001,
        "is_married": False,
        "spouse_age": 30,
        "spouse_income": 50000,
        "is_employed": True,
        "occupation": "Teacher",
        "work_years": 5,
        "has_children": False,
        "num_children": 0
    },
    {
        "id": 1002,
        "is_married": True,
        "spouse_age": 35,
        "spouse_income": 60000,
        "is_employed": False,
        "occupation": "Engineer",  # error!
        "work_years": None,
        "has_children": True,
        "num_children": 2
    }
]
# Validate
for record in records:
    errors = validate_logic(record)
    if errors:
        print(f"\nErrors in record {record['id']}:")
        for error in errors:
            print(f"  - {error}")
    else:
        print(f"\nRecord {record['id']} has no errors")
# Batch processing
results = batch_validate_and_fix(records)
print(f"\nFound {results['error_count']} logic errors in total")
print("All records have been fixed automatically")
Exercise 10: Income mobility matrix
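Task: Analyze two years of income data and compute the income mobility matrix (transition matrix), which shows the share of people moving from one income group to another.
Income groups:
- Low income (低收入): < 30,000
- Middle income (中等收入): 30,000 - 80,000
- High income (高收入): > 80,000
Requirements:
- Compute the 3x3 mobility matrix
- Each cell is the share of people moving from group X to group Y
- Compute the overall mobility rate (the share of people who changed group)
- Display the result as an ASCII table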
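Input/output example:
year1_incomes = [25000, 50000, 85000, 30000, 75000]
year2_incomes = [30000, 55000, 90000, 75000, 80000]
matrix, mobility_rate = calculate_mobility_matrix(year1_incomes, year2_incomes)
# Resulting matrix (rows = year-1 group, columns = year-2 group):
#              低收入   中等收入   高收入
# 低收入        0.0%    100.0%     0.0%
# 中等收入      0.0%    100.0%     0.0%
# 高收入        0.0%      0.0%   100.0%
#
# Overall mobility rate: 20.0%
Hint
- First assign each person to a group in each of the two years
- Count the number of people moving from each group to each group
- Convert the counts to proportions (each row sums to 100%)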
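The three steps in the hint can be sketched compactly with `collections.Counter`. This is one possible approach rather than the reference solution; `group_of` is a hypothetical helper, the English group labels are placeholders, and treating 30,000 and 80,000 as part of the middle group is an assumption about how the boundaries are meant.

```python
from collections import Counter

def group_of(income):
    """Hypothetical helper: assign an income to one of three groups."""
    if income < 30000:
        return "low"
    if income <= 80000:  # assumption: both boundaries belong to the middle group
        return "middle"
    return "high"

year1 = [25000, 50000, 85000, 30000, 75000]
year2 = [30000, 55000, 90000, 75000, 80000]

# Step 1 and 2: count each (year-1 group, year-2 group) pair.
transitions = Counter((group_of(a), group_of(b)) for a, b in zip(year1, year2))
# Row totals: how many people started in each group.
row_totals = Counter(group_of(a) for a in year1)

# Step 3: row-normalise so that each starting group's shares sum to 1.
shares = {pair: count / row_totals[pair[0]] for pair, count in transitions.items()}
print(shares)

# Share of people whose group changed between the two years.
moved = sum(count for (g1, g2), count in transitions.items() if g1 != g2)
print(moved / len(year1))  # 0.2
```

Only transitions that actually occur appear in `shares`; a full 3x3 table would fill the missing cells with 0.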
Reference answer
def classify_income(income):
    """Assign an income to its group."""
    if income < 30000:
        return "低收入"
    elif income <= 80000:
        return "中等收入"
    else:
        return "高收入"
def calculate_mobility_matrix(year1_incomes, year2_incomes):
    """
    Compute the income mobility matrix.
    Parameters:
        year1_incomes (list): incomes in year 1
        year2_incomes (list): incomes in year 2
    Returns:
        tuple: (mobility matrix, overall mobility rate)
    """
    # Both years must cover the same people
    if len(year1_incomes) != len(year2_incomes):
        raise ValueError("两年数据长度必须一致")
    n = len(year1_incomes)
    # Groups
    groups = ["低收入", "中等收入", "高收入"]
    # Initialize the count matrix
    # counts[g1][g2] is the number of people moving from group g1 to group g2
    counts = {g1: {g2: 0 for g2 in groups} for g1 in groups}
    group_totals = {g: 0 for g in groups}
    # Count the transitions
    for inc1, inc2 in zip(year1_incomes, year2_incomes):
        group1 = classify_income(inc1)
        group2 = classify_income(inc2)
        counts[group1][group2] += 1
        group_totals[group1] += 1
    # Convert counts to row proportions
    matrix = {}
    for g1 in groups:
        matrix[g1] = {}
        total = group_totals[g1]
        for g2 in groups:
            if total > 0:
                matrix[g1][g2] = counts[g1][g2] / total
            else:
                matrix[g1][g2] = 0.0
    # Overall mobility rate (share of people who changed group)
    moved_count = 0
    for inc1, inc2 in zip(year1_incomes, year2_incomes):
        if classify_income(inc1) != classify_income(inc2):
            moved_count += 1
    mobility_rate = moved_count / n if n > 0 else 0.0
    return matrix, mobility_rate
def print_mobility_matrix(matrix, mobility_rate):
    """Pretty-print the mobility matrix."""
    groups = ["低收入", "中等收入", "高收入"]
    print("\n" + "=" * 60)
    print("Income mobility matrix (rows = year 1, columns = year 2)")
    print("=" * 60)
    # Header row
    print(f"{'':12}", end="")
    for g in groups:
        print(f"{g:>12}", end="")
    print()
    print("-" * 60)
    # Data rows
    for g1 in groups:
        print(f"{g1:12}", end="")
        for g2 in groups:
            pct = matrix[g1][g2] * 100
            print(f"{pct:11.1f}%", end="")
        print()
    print("-" * 60)
    print(f"\nOverall mobility rate: {mobility_rate:.1%}")
    print(f"({mobility_rate:.1%} of people changed income group)")
    print("=" * 60)
def analyze_mobility_patterns(matrix):
    """Summarize the mobility patterns."""
    groups = ["低收入", "中等收入", "高收入"]
    print("\nMobility analysis:")
    # Upward mobility
    print("\nUpward mobility:")
    low_up = matrix["低收入"]["中等收入"] + matrix["低收入"]["高收入"]
    print(f"  Low to middle/high: {low_up:.1%}")
    print(f"  Middle to high: {matrix['中等收入']['高收入']:.1%}")
    # Downward mobility
    print("\nDownward mobility:")
    print(f"  High to middle/low: {matrix['高收入']['中等收入'] + matrix['高收入']['低收入']:.1%}")
    print(f"  Middle to low: {matrix['中等收入']['低收入']:.1%}")
    # No change
    print("\nStayed in the same group:")
    for g in groups:
        print(f"  {g}: {matrix[g][g]:.1%}")
# Tests
year1_incomes = [25000, 50000, 85000, 30000, 75000, 120000, 20000, 60000]
year2_incomes = [30000, 55000, 90000, 75000, 80000, 110000, 25000, 65000]
matrix, mobility_rate = calculate_mobility_matrix(year1_incomes, year2_incomes)
print_mobility_matrix(matrix, mobility_rate)
analyze_mobility_patterns(matrix)
# Extra: a shorter implementation with Pandas
def calculate_mobility_matrix_pandas(year1_incomes, year2_incomes):
    """Pandas implementation (requires pandas to be installed)."""
    import pandas as pd
    df = pd.DataFrame({
        'year1': year1_incomes,
        'year2': year2_incomes
    })
    df['group1'] = df['year1'].apply(classify_income)
    df['group2'] = df['year2'].apply(classify_income)
    # Cross-tabulation, normalized by row
    crosstab = pd.crosstab(df['group1'], df['group2'], normalize='index')
    # Mobility rate
    mobility_rate = (df['group1'] != df['group2']).mean()
    return crosstab, mobility_rate
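Next Steps
Congratulations on completing Module 3! You have now:
- Learned the basic syntax of Python (variables, operators, conditionals, loops)
- Worked through 10 exercises that reinforce the core concepts
- Compared Python syntax with Stata and R
Suggestions:
- Review the pitfalls: pay particular attention to indentation, operator precedence, and range()
- Practice more: complete the 10 exercises above, especially the challenge exercises
- Apply it: practice data cleaning and validation on a real dataset
Module 4 covers Python's data structures (lists, dictionaries, tuples, sets), the foundation for working with complex data.
Keep it up!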