Google DS New Grad Interview Experience

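As one of the world's leading technology companies, Google has an enormous user base and correspondingly generates hundreds of millions of data records every day. To work with data at this scale, Google has dedicated data science teams that mine it for valuable insights and provide strategic support to the company. Google DS interviews consistently revolve around five core dimensions: statistical fundamentals, data sense, SQL/programming skills, business interpretation, and Googliness.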

Technical Coding

Question 1

Find the contiguous subarray within a one-dimensional array that has the largest sum.

Key Points

Optimal Strategy: Use a single-pass Dynamic Programming approach (Kadane’s).
Decision Logic: At each index, decide whether to “restart” the subarray or “extend” it.
Efficiency: O(n) Time, O(1) Space.

# Key thinking: at each element, compare starting fresh against extending the running sum.
def max_subarray_sum(nums):
    mx = nums[0]   # Global maximum
    cur = nums[0]  # Local maximum ending at the current index

    for i in range(1, len(nums)):
        v = nums[i]
        # Core logic: restart if the current value alone beats extending
        cur = max(v, cur + v)
        # Update the global max if the local max improves
        mx = max(mx, cur)
    return mx
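The question asks for the subarray itself, while the snippet above returns only its sum. A minimal sketch of the same single-pass idea that also tracks the window boundaries follows; the function name `max_subarray` and the `(sum, start, end)` return shape are assumptions made for illustration.

```python
def max_subarray(nums):
    """Return (best_sum, start, end) where nums[start:end + 1] is a maximum-sum subarray."""
    if not nums:
        raise ValueError("nums must be non-empty")

    best_sum = cur_sum = nums[0]
    best_start = best_end = cur_start = 0

    for i in range(1, len(nums)):
        # Restart at i when the running sum is negative and would only drag nums[i] down.
        if cur_sum < 0:
            cur_sum = nums[i]
            cur_start = i
        else:
            cur_sum += nums[i]
        # Record the window whenever the local best beats the global best.
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i

    return best_sum, best_start, best_end


data = [-2, 1, -3, 4, -1, 2, 1, -5, 4]
total, start, end = max_subarray(data)
print(total, data[start:end + 1])  # 6 [4, -1, 2, 1]
```

Restarting when the running sum is negative is equivalent to taking `max(v, cur + v)`; the explicit branch just makes it easy to remember where the current window began.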

Question 2

Count missing Price values and fill them with the median price of the given item based on its Description.

Key Points

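Counting: use .isnull().sum() for an efficient vectorized count.
Grouping: use groupby('Description') so each missing price is filled from its own item's prices (e.g., an alarm clock gets the alarm-clock median).
Imputation: use transform('median') to broadcast each group's value back onto the original index.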

# Part 1: Count
cnt = df['Price'].isnull().sum()

# Part 2: Fill
# Key thinking: use transform to align group stats with the original DataFrame's length
meds = df.groupby('Description')['Price'].transform('median')
df['Price'] = df['Price'].fillna(meds)
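A small worked example may help show what the two steps do. The data below is made up; only the column names come from the question. It also shows one edge case: an item with no observed price has a NaN group median, so it stays missing. The fallback to the overall median is an assumption, not something the question specifies.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
df = pd.DataFrame({
    "Description": ["ALARM CLOCK", "ALARM CLOCK", "ALARM CLOCK", "MUG", "MUG", "LAMP"],
    "Price": [3.0, None, 5.0, 2.0, None, None],
})

missing_before = int(df["Price"].isnull().sum())

# Per-item median, broadcast back to every row of that item.
group_median = df.groupby("Description")["Price"].transform("median")
df["Price"] = df["Price"].fillna(group_median)

# "LAMP" has no observed price, so its group median is NaN and the gap remains.
missing_after_group_fill = int(df["Price"].isnull().sum())

# Assumed fallback: fill whatever is still missing with the overall median.
df["Price"] = df["Price"].fillna(df["Price"].median())
print(missing_before, missing_after_group_fill, df["Price"].tolist())
```

In an interview it is worth stating the fallback choice out loud, since dropping the rows or leaving them missing are equally defensible depending on the downstream use.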

Question 3

Calculate the number of flights, unique airlines, and average price for each unique route (combination of from and to).

Key Points

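Aggregation: use groupby(['from', 'to']).
Metrics: count for the number of flights, nunique for distinct airlines, mean for the average price. Note that count skips rows where the column is null.
Structure: use named aggregation for clean, readable output columns.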

# Key thinking: perform multiple distinct operations in a single grouped pass
res = df.groupby(['from', 'to']).agg(
    num_flights=('price', 'count'),
    num_airlines=('airline', 'nunique'),
    avg_price=('price', 'mean')
).reset_index()
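One detail worth checking on real data: `count` on `price` ignores rows with a null price, so it can undercount flights. The sketch below, on invented data, puts `size` (all rows) next to `count` (non-null only); the `num_priced` column is a hypothetical addition for comparison.

```python
import pandas as pd

# Hypothetical flights table; column names mirror the snippet above, values are made up.
flights = pd.DataFrame({
    "from": ["SFO", "SFO", "SFO", "JFK", "JFK"],
    "to": ["JFK", "JFK", "JFK", "SFO", "LAX"],
    "airline": ["UA", "DL", "UA", "UA", "AA"],
    "price": [300.0, 350.0, None, 280.0, 320.0],
})

routes = flights.groupby(["from", "to"]).agg(
    num_flights=("airline", "size"),      # size counts every row, including null prices
    num_priced=("price", "count"),        # count skips null prices
    num_airlines=("airline", "nunique"),  # distinct carriers on the route
    avg_price=("price", "mean"),          # mean also ignores nulls
).reset_index()

print(routes)
```

On the SFO to JFK route the two counts differ (3 rows, 2 with a price), which is the kind of discrepancy an interviewer may ask about.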

Business Case

Question 1

A PE client is choosing between Target 1 and Target 2. How would you determine which company is more likely to win “greenfield” (non-customer) prospects?

Key Points

Methodology: Build a Look-alike Model using third-party firmographic (FTE size, Location) and technographic (Software stack) data.
Logic: If non-customers share a profile with Target 1’s existing base, they are statistically more likely to convert to Target 1.

Detailed Breakdown:

First, I would perform feature engineering on the “Software stack” to identify if a company is a “Microsoft shop” or a “Google shop.” Second, I would train a classification model on the 10k records to identify the distinct “signatures” of Target 1 vs. Target 2 customers. Third, I would apply this model to the “Neither” pool to predict the probability of acquisition for each prospect.
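The three steps above can be sketched roughly as follows. Everything concrete here is an assumption for illustration: the column names, the vendor keyword lists, the eight invented customer rows, and the choice of scikit-learn's `GradientBoostingClassifier` as the classifier.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical records: column names, vendor keywords, and values are illustrative only.
customers = pd.DataFrame({
    "fte": [1200, 900, 1500, 80, 40, 60, 1100, 55],
    "software_stack": [
        "Azure, Office 365", "Office 365, Teams", "Azure, Dynamics",
        "Google Workspace, GCP", "GCP, BigQuery", "Google Workspace",
        "Teams, Azure", "BigQuery, Google Workspace",
    ],
    "customer_of": ["target_1", "target_1", "target_1", "target_2",
                    "target_2", "target_2", "target_1", "target_2"],
})
prospects = pd.DataFrame({  # the "Neither" pool
    "fte": [1000, 70],
    "software_stack": ["Azure, Teams", "GCP, Google Workspace"],
})

MICROSOFT = ("azure", "office 365", "teams", "dynamics")
GOOGLE = ("google workspace", "gcp", "bigquery")


def featurize(frame):
    """Step 1: turn the free-text stack into 'Microsoft shop' / 'Google shop' flags."""
    stack = frame["software_stack"].str.lower()
    return pd.DataFrame({
        "fte": frame["fte"],
        "microsoft_shop": stack.apply(lambda s: any(k in s for k in MICROSOFT)).astype(int),
        "google_shop": stack.apply(lambda s: any(k in s for k in GOOGLE)).astype(int),
    })


# Step 2: learn what distinguishes Target 1 customers from Target 2 customers.
model = GradientBoostingClassifier(random_state=0)
model.fit(featurize(customers), customers["customer_of"])

# Step 3: score each prospect by how closely it resembles Target 1's base.
target_1_col = list(model.classes_).index("target_1")
prospects["p_target_1"] = model.predict_proba(featurize(prospects))[:, target_1_col]
print(prospects)
```

Comparing the summed or averaged scores across the whole prospect pool would then give a rough view of which target has the larger look-alike market.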

Why GBDT?

Question: Why choose Gradient Boosted Decision Trees (XGBoost/LightGBM) over simpler models?

Key Points

Non-linearity: Captures thresholds (e.g., “Companies > 500 FTEs behave differently”).
Robustness: Handles missing values natively and needs only light encoding for categorical features, without heavy preprocessing.
Explainability: Provides Feature Importance which is crucial for the PE client’s strategy.

Detailed Breakdown:

First, tree-based models naturally handle interactions between features, such as how “Location” might matter more for certain “Industries.” Second, because splits depend only on the ordering of feature values, they are robust to outliers in the inputs, which are common in company size data (e.g., small startups vs. massive enterprises). Third, they allow us to explain the “Why” to the client using feature rankings.
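As a rough illustration of the threshold and feature-ranking points, the sketch below fits a gradient boosted model on synthetic data where the label follows a hypothetical "more than 500 FTEs" rule. The data, the unrelated `noise` feature, and the use of scikit-learn's `GradientBoostingClassifier` in place of XGBoost/LightGBM are all assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 400
fte = rng.integers(10, 2000, size=n)   # company size with a wide range
noise = rng.normal(size=n)             # a feature with no relationship to the label
label = (fte > 500).astype(int)        # hypothetical threshold rule

X = np.column_stack([fte, noise])
model = GradientBoostingClassifier(random_state=0).fit(X, label)

# Feature importances give the ranking that can be shown to the client.
importance = dict(zip(["fte", "noise"], model.feature_importances_))

# Points well on either side of the threshold fall into different classes.
preds = model.predict(np.array([[300, 0.0], [800, 0.0]]))
print(importance, preds)
```

A linear model would need a hand-built indicator feature to express the same step, which is the practical argument for trees here.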

Handling Class Imbalance (900 vs. 100)

Question: If Target 1 has 900 customers and Target 2 only has 100, how do you prevent model bias?

Key Points

Resampling: Use SMOTE to synthesize Target 2 examples.
Weighting: Adjust scale_pos_weight to penalize Target 2 errors more heavily.
Metrics: Focus on F1-Score and Precision-Recall instead of raw Accuracy.
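The case against raw accuracy is easy to show with arithmetic. The sketch below scores a degenerate model that always predicts Target 1 on the 900 vs. 100 split; the helper `precision_recall_f1` is a hypothetical name written out by hand so the example has no dependencies.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for one class, returning 0.0 where undefined."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = ["target_1"] * 900 + ["target_2"] * 100
always_target_1 = ["target_1"] * 1000  # a model that ignores the minority class

accuracy = sum(t == p for t, p in zip(y_true, always_target_1)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, always_target_1, positive="target_2")
print(accuracy, precision, recall, f1)  # 0.9 0.0 0.0 0.0
```

The model looks 90% accurate while finding none of Target 2's customers, which is exactly the failure the F1 and precision-recall view exposes.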

Detailed Breakdown:

First, I would use cost-sensitive learning to tell the model that missing a Target 2 customer is 9x more “expensive.” Second, I would use Stratified K-Fold validation to ensure the minority class is represented in every training split. Third, I would evaluate success based on how well the model identifies the specific “niche” of Target 2, regardless of its lower volume.
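The 9x figure and the stratified split can be checked with a short sketch. It assumes Target 2 is encoded as the positive class and uses placeholder features, since only the labels matter for the split; the ratio of negatives to positives is the commonly used starting value for XGBoost's `scale_pos_weight`.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 0 = Target 1 customer (majority), 1 = Target 2 customer (minority, treated as positive).
y = np.array([0] * 900 + [1] * 100)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter for splitting

# Ratio of negatives to positives: 900 / 100 = 9.
scale_pos_weight = (y == 0).sum() / (y == 1).sum()

# Stratification keeps the 9:1 ratio inside every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
minority_per_fold = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print(scale_pos_weight, minority_per_fold)  # 9.0 [20, 20, 20, 20, 20]
```

With a plain unstratified split the minority count per fold would vary, which makes fold-to-fold metrics on Target 2 noisier than they need to be.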

Managing Intuition vs. Data

Question: If a Partner disagrees with a specific prediction (e.g., Company 2), how do you respond?

Key Points

Transparency: Use SHAP values to show the specific data points (e.g., “Google Cloud use”) driving the result.
Collaboration: Identify if the Partner has “unseen” info (e.g., a private board connection) not in the dataset.
Stress Testing: Perform “What-if” analysis to see what it would take for the model to flip its decision.

Detailed Breakdown:

First, I would validate the partner’s expertise while using SHAP force plots to show the statistical “tug-of-war” occurring within the data. Second, I would treat their disagreement as a chance for feature discovery—perhaps we are missing a critical variable they know from experience. Third, I would frame the model as a probabilistic tool that identifies statistical similarity, not a definitive certainty.
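The "what-if" step can be sketched without any explanation library: change one input for the disputed company and compare the predicted probabilities. This does not compute SHAP values; it covers only the stress-testing idea. The training data, feature names, the `what_if` helper, and the use of logistic regression are all assumptions chosen to keep the example small.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: two binary flags and which target each customer belongs to.
train = pd.DataFrame({
    "uses_google_cloud": [1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    "over_500_fte":      [0, 0, 1, 0, 1, 1, 0, 1, 0, 1],
    "is_target_2":       [1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
})
features = ["uses_google_cloud", "over_500_fte"]
model = LogisticRegression().fit(train[features], train["is_target_2"])


def what_if(model, row, feature, new_value):
    """Return (original, counterfactual) P(Target 2) after changing one feature."""
    changed = row.copy()
    changed[feature] = new_value
    frame = pd.DataFrame([row, changed])[features]
    original, counterfactual = model.predict_proba(frame)[:, 1]
    return original, counterfactual


# The disputed prospect: what happens if it did not use Google Cloud?
company_2 = pd.Series({"uses_google_cloud": 1, "over_500_fte": 1})
before, after = what_if(model, company_2, "uses_google_cloud", 0)
print(round(before, 3), round(after, 3))
```

Showing the Partner how far the probability moves when a single input changes gives a concrete basis for discussing whether the model is missing something they know.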

Personal Takeaways

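Preparation tips:
Review the fundamentals: focus on SQL window functions, core statistics, and the classic A/B testing and metrics questions from product cases.
Practice cases: work through product-related data analysis cases to understand how data turns into product insight.
Mock interviews: ask friends to run mock interviews and give feedback; structured, complete answers matter a lot.
Stay confident during the interview: speak with confidence and demonstrate your problem-solving ability and product thinking.
Show enthusiasm: let the interviewer see your passion for data and products, and how you use data to drive product development.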
