
Tiling data to create a pandas dataframe

I am new to python pandas. Just a quick and simple question. Suppose I have two columns, namely "weeks" and "machine":

weeks = [1,3,5]
machine = ['M1', 'M1', 'M2', 'M2']

My plan is to put these lists in a DataFrame, but I get "ValueError: arrays must all be same length". The output I am looking for is:

final_weeks = [1,2,3,4,5,1,2,3,4,5]
final_machine = ['M1', 'M1', 'M1', 'M1', 'M1', 'M2', 'M2', 'M2', 'M2', 'M2']

tempDict = {'weeks': final_weeks, 'machine': final_machine}

I can build both lists, but not the DataFrame. Why am I getting the ValueError? Here is what I have done so far:

maxWeek = df["weeks"].max()
uniqueMachine = set(df.machine)

unionWeekList = list(range(1, maxWeek + 1))
# Output = [1, 2, 3, 4, 5]

final_weeks = unionWeekList * len(uniqueMachine)
# [1,2,3,4,5,1,2,3,4,5]

machines = [[item]* maxWeek for item in uniqueMachine]
# Output: [[M1,M1,M1,M1,M1], [M2,M2,M2,M2,M2]]

final_machines = list(itertools.chain.from_iterable(machines))
# Flattened list = [M1,M1,M1,M1,M1,M2,M2,M2,M2,M2]

tmpDict = {'week': final_weeks, 'machine': final_machines}

# new dataframe
newdf = pd.DataFrame.from_records(tmpDict)

# ValueError: arrays must all be same length
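
The error can be reproduced in isolation by passing columns of different lengths to the DataFrame constructor. The sketch below uses the raw `weeks` and `machine` lists from the top of the question, since the actual lengths of the intermediate lists are not shown; the `columns`, `lengths` and `error_message` names are illustrative only. Checking the lengths before constructing the frame is one way to find the column that is off.

```python
import pandas as pd

weeks = [1, 3, 5]                    # 3 items
machine = ["M1", "M1", "M2", "M2"]   # 4 items

columns = {"week": weeks, "machine": machine}

# Inspect the length of every column before building the frame
lengths = {name: len(values) for name, values in columns.items()}
print(lengths)

# Columns of unequal length make the constructor raise ValueError
try:
    pd.DataFrame(columns)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```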

3 Answers

    Best answer
  1. Try this; I think it gives what you need. (PS: for the exact output you want, please follow cᴏʟᴅsᴘᴇᴇᴅ's answer.)

    weeks = [1,3,5]
    machine = ['M1', 'M1', 'M2', 'M2']
    newdf = pd.DataFrame(machine)
    newdf.groupby(0).apply(lambda x : (x.reindex(range(1,max(weeks)+1)).ffill().bfill()))
    Out[364]: 
           0
    0       
    M1 1  M1
       2  M1
       3  M1
       4  M1
       5  M1
    M2 1  M2
       2  M2
       3  M2
       4  M2
       5  M2
    
  2. Answer 2
  3. One option, using np.repeat and df.unstack:

    import numpy as np
    import pandas as pd

    weeks = [1, 3, 5]
    machine = ['M1', 'M1', 'M2', 'M2']
    
    uniq_machine = sorted(set(machine))
    
    df = pd.DataFrame(np.repeat(np.array(uniq_machine)\
                              .reshape(1, len(uniq_machine)), max(weeks), 0), 
                      index=range(1, max(weeks) + 1))
    
    out = df.unstack().reset_index(level=0, drop=True)
    print(out)
    
    1    M1
    2    M1
    3    M1
    4    M1
    5    M1
    1    M2
    2    M2
    3    M2
    4    M2
    5    M2
    dtype: object
    

    This is a pd.Series object, but you can call .reset_index to get 2 columns:

    out = out.reset_index()
    out.columns = ['week', 'machine']
    print(out)
    
       week machine
    0     1      M1
    1     2      M1
    2     3      M1
    3     4      M1
    4     5      M1
    5     1      M2
    6     2      M2
    7     3      M2
    8     4      M2
    9     5      M2
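
To see why the unstack step yields one block of weeks per machine, it can help to look at the intermediate objects. The following is a minimal sketch of the same steps with the values hard-coded; the `row`, `grid`, `wide` and `long` names are illustrative and not part of the answer.

```python
import numpy as np
import pandas as pd

uniq_machine = ["M1", "M2"]
max_week = 5

# One row holding every machine name: shape (1, 2)
row = np.array(uniq_machine).reshape(1, len(uniq_machine))

# Repeat that row once per week along axis 0: shape (5, 2)
grid = np.repeat(row, max_week, axis=0)

# One column per machine, indexed by week number
wide = pd.DataFrame(grid, index=range(1, max_week + 1))

# unstack walks the frame column by column, so all weeks of the
# first machine come first, then all weeks of the second
long = wide.unstack()
print(long)
```

The outer index level of `long` is the column position (0 or 1), which is why the answer drops it with `reset_index(level=0, drop=True)`.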
    
  4. Answer 3
  5. You can use the DataFrame constructor, repeating each machine name with numpy.repeat and cycling the week range with numpy.tile:

    # unique machine names (np.unique already returns them sorted)
    uniq = np.unique(np.array(machine))
    # every week from the first to the last observed one
    rng = np.arange(min(weeks), max(weeks)+1)
    
    df = pd.DataFrame({'machine': np.repeat(uniq, len(rng)),
                       'week':np.tile(rng, len(uniq))}, columns=['week','machine'])
    
    print (df)
       week machine
    0     1      M1
    1     2      M1
    2     3      M1
    3     4      M1
    4     5      M1
    5     1      M2
    6     2      M2
    7     3      M2
    8     4      M2
    9     5      M2
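
The two NumPy calls differ only in how they duplicate their input: `np.repeat` duplicates each element in place, while `np.tile` duplicates the whole array end to end. A minimal sketch with small hard-coded inputs (the `names`, `weeks_range`, `repeated` and `tiled` names are illustrative):

```python
import numpy as np

names = np.array(["M1", "M2"])
weeks_range = np.arange(1, 4)  # [1, 2, 3]

# Each element duplicated in place: M1 M1 M1 M2 M2 M2
repeated = np.repeat(names, len(weeks_range))

# The whole array duplicated end to end: 1 2 3 1 2 3
tiled = np.tile(weeks_range, len(names))

print(repeated)
print(tiled)
```

Pairing the two arrays position by position therefore gives every (week, machine) combination exactly once.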
    

    Timing comparison with cᴏʟᴅsᴘᴇᴇᴅ's solution, using 50 machines and weeks running from 1 to 26:

    weeks = [1, 3, 5, 8, 13, 15, 17, 23, 24, 26]
    machine = ['M{}'.format(x) for x in range(1, 51)]
    print (machine)
    
    In [29]: %%timeit
        ...: uniq = np.sort(np.unique(np.array(machine)))
        ...: #repeated range
        ...: rng = np.arange(min(weeks), max(weeks)+1)
        ...: 
        ...: df = pd.DataFrame({'machine': np.repeat(uniq, len(rng)),
        ...:                    'week':np.tile(rng, len(uniq))}, columns=['week','machine'])
        ...: 
    1000 loops, best of 3: 636 µs per loop
    
    In [30]: %%timeit
        ...: uniq_machine = sorted(set(machine))
        ...: df = pd.DataFrame(np.repeat(np.array(uniq_machine)\
        ...:                           .reshape(1, len(uniq_machine)), max(weeks), 0), 
        ...:                   index=range(1, max(weeks) + 1))
        ...: 
        ...: out = df.unstack().reset_index(level=0, drop=True)
        ...: out = out.reset_index()
        ...: out.columns = ['week', 'machine']
        ...: 
    1000 loops, best of 3: 1.46 ms per loop
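
Outside IPython, a comparison along these lines could be reproduced with the standard `timeit` module. The sketch below wraps both approaches in functions (the names `via_tile` and `via_unstack` are hypothetical) and first checks that they produce the same rows; absolute timings depend on the machine and library versions, so none are shown here.

```python
import timeit

import numpy as np
import pandas as pd

weeks = [1, 3, 5, 8, 13, 15, 17, 23, 24, 26]
machine = ["M{}".format(x) for x in range(1, 51)]


def via_tile(weeks, machine):
    """np.repeat / np.tile approach."""
    uniq = np.unique(np.array(machine))
    rng = np.arange(min(weeks), max(weeks) + 1)
    return pd.DataFrame(
        {"machine": np.repeat(uniq, len(rng)), "week": np.tile(rng, len(uniq))},
        columns=["week", "machine"],
    )


def via_unstack(weeks, machine):
    """np.repeat / unstack approach (weeks always start at 1)."""
    uniq = sorted(set(machine))
    wide = pd.DataFrame(
        np.repeat(np.array(uniq).reshape(1, len(uniq)), max(weeks), axis=0),
        index=range(1, max(weeks) + 1),
    )
    out = wide.unstack().reset_index(level=0, drop=True).reset_index()
    out.columns = ["week", "machine"]
    return out


a = via_tile(weeks, machine)
b = via_unstack(weeks, machine)

# Both approaches should yield the same rows in the same order
same_rows = (
    a["week"].tolist() == b["week"].tolist()
    and a["machine"].tolist() == b["machine"].tolist()
)
print("same rows:", same_rows)

# Time 20 calls of each function
t_tile = timeit.timeit(lambda: via_tile(weeks, machine), number=20)
t_unstack = timeit.timeit(lambda: via_unstack(weeks, machine), number=20)
print("tile: {:.4f}s  unstack: {:.4f}s".format(t_tile, t_unstack))
```

The two functions agree here only because the smallest week is 1; `via_tile` starts its range at `min(weeks)`, whereas `via_unstack` always starts at 1.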