JavaEar 专注于收集分享传播有价值的技术资料

Importing single record using read_json in pandas

I am trying to import a json file using the function:

sku = pandas.read_json('https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111')

However, i keep getting the following error

ValueError: arrays must all be same length

What should I do to import it correctly into a dataframe?

this is the structure of the json:

{
"id": "5",
"sku": "JOSH:BECO-BRN",
"last_updated": "2013-06-10 15:46:22",

...

"propertyType1": [
    "manufacturer_colour"
],
"category": [
    {
        "category_id": "10",
        "category_name": "All Products"
    },

    ...

    {
        "category_id": "238",
        "category_name": "All Sofas"
    }
],
"root_categories": [
    "516"
],
"url": "/p/Beco Suede Sofa Bed?product_id=5",
"item": [
    "2"
],
"image_names": "[\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/L\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/P\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/SP\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/SS\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/ST\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/WP\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/L\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/P\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/SP\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk \\/images\\/products\\/SS\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/ST\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/WP\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\"]"

}

2个回答

    最佳答案
  1. The pandas.read_json function takes multiple formats.

    Since you did not specify which format your json file is in (orient= attribute), pandas will default to believing your data is columnar. The different formats pandas expects are discussed below.

    The data that you are trying to parse from https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111 Does not seem to conform to any of the supported formats as it seems to be only a single "record". Pandas expects some kind of collection.

    You probably should try to collect multiple entries into a single file, then parse it with the read_json function.

    EDIT:

    Simple way of getting multiple rows and parsing it with the pandas.read_json function:

    import urllib2
    import pandas as pd
    
    
    url_base = "https://cws01.worldstores.co.uk/api/product.php?product_sku={}"
    products = ["125T:FT0111", "125T:FT0111", "125T:FT0111"]
    
    raw_data_list = []
    
    for sku in products:
        url = url_base.format(sku)
        raw_data_list.append(urllib2.urlopen(url).read())
    
    data = "[" + (",".join(raw_data_list)) + "]"
    data = pd.read_json(data, orient='records')
    data
    

    /EDIT

    My take on the pandas.read_json function formats.

    The pandas.read_json function is yet another shining example of pandas trying to jam as much functionality as possible into a single function. This leads of course to a very very complicated function.

    Series

    If your data is a Series, pandas.read_json(orient=) defaults to 'index'

    The values allowed for orient while parsing a Series are: {'split','records','index'}

    Note that the Series index must be unique for orient='index'.

    DataFrame

    If your data is a DataFrame, pandas.read_json(orient=) defaults to 'columns'

    The values allowed for orient while parsing a DataFrame are: {'split','records','index','columns','values'}

    Note that the Series index must be unique for orient='index' and orient='columns', and the DataFrame columns must be unique for orient='index', orient='columns', and orient='records'.

    Format

    No matter if your data is a DataFrame or a Series, the orient= will expect data in the same format:

    Split

    Expects a string representation of a dict like what the DataFrame constructor takes:

    {"index":[1,2,3,4], "columns":["col1","col2"], "data":[[8,7,6,5], [5,6,7,8]]}
    

    Records

    Expects a string representation of a list of dicts like:

    [{"col1":8,"col2":5},{"col1":7,"col2":6},{"col1":6,"col2":7},{"col1":5,"col2":8}]
    

    Note there is no index set here.

    Index

    Expects a string representation of a nested dict dict like:

    {"1":{"col1":8,"col2":5},"2":{"col1":7,"col2":6},"3":{"col1":6,"col2":7},"4":{"col1":5,"col2":8}}
    

    Good to note is that it won't accept indicies of other types than strings. May be fixed in later versions.

    Columns

    Expects a string representation of a nested dict like:

    {"col1":{"1":8,"2":7,"3":6,"4":5},"col2":{"1":5,"2":6,"3":7,"4":8}}
    

    Values

    Expects a string representation of a list like:

    [[8, 5],[7, 6],[6, 7],[5, 8]]
    

    Resulting dataframe

    In most cases, the dataframe you get will look like this, with the json strings above:

       col1  col2
    1     8     5
    2     7     6
    3     6     7
    4     5     8
    
  2. 参考答案2
  3. Maybe this is not the most elegant solution however gives me back what I want, or at least I believe so, feel free to warn if something is wrong

    url = "https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111"
    data = urllib2.urlopen(url).read()
    data = json.loads(data)
    data = pd.DataFrame(data.items())
    data = data.transpose()