PySpark: how to parse nested JSON

Question 1

I have a JSON file with the following schema:

root
 |-- count: long (nullable = true)
 |-- results: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- address: string (nullable = true)
 |    |    |-- auto_task_assignment: boolean (nullable = true)
 |    |    |-- deleted_at: string (nullable = true)
 |    |    |-- has_issues: boolean (nullable = true)
 |    |    |-- has_timetable: boolean (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- opening_hours: string (nullable = true)
 |    |    |-- phone_number: string (nullable = true)
 |    |    |-- position_id: long (nullable = true)
 |    |    |-- show_technical_time: boolean (nullable = true)
 |    |    |-- structure_id: long (nullable = true)
 |    |    |-- subcontract_number: string (nullable = true)
 |    |    |-- task_modification: boolean (nullable = true)
 |    |    |-- updated_at: string (nullable = true)

I want to parse the results array into a DataFrame with all the columns listed in the schema. When I try to use a select statement, I get an error.

df.select("results.*").show()

Error message:

AnalysisException: Can only star expand struct data types. Attribute: `ArrayBuffer(results)`

Could you please help me work out how to filter this JSON?

Sample data:

{'count': 11, 'next': None, 'previous': None, 'results': [{'id': 1, 'name': 'Samodzielny Publiczny Szpital Kliniczny Nr 1 PUM', 'external_id': None, 'structure_id': 1, 'address': '71-252 Szczecin, Ul. Unii Lubelskiej 1 ', 'phone_number': '+48123456789', 'opening_hours': 'pn-pt: 9:00-17:00', 'deleted_at': '2021-05-27T13:02:12.026410+02:00', 'updated_at': '2021-05-27T13:02:12.026417+02:00', 'position_id': None, 'has_timetable': True, 'auto_task_assignment': True, 'task_modification': False, 'has_issues': False, 'show_technical_time': False, 'subcontract_number': None}, {'id': 2, 'name': 'Szpital polowy we wrocławiu', 'external_id': None, 'structure_id': 2, 'address': 'North Montytown, 0861 Greenholt Crescent', 'phone_number': '+48505505505', 'opening_hours': '', 'deleted_at': None, 'updated_at': '2021-11-18T16:15:06.608476+01:00', 'position_id': 49, 'has_timetable': True, 'auto_task_assignment': False, 'task_modification': True, 'has_issues': True, 'show_technical_time': True, 'subcontract_number': '191919919; 191919191991; 19991919919; 1919919 191919919; 191919191991; 19991919919; 1919919....191919919; 191919191991; 19991919919; 1919919 191919919; 191919191991; 19991919919; 1919919191919919; 191919191991; 19991919919; 1919919 191919919; 1919191-255c'}, {'id': 3, 'name': 'Test', 'external_id': None, 'structure_id': 17, 'address': 'ul. Śliczna', 'phone_number': '+48500100107', 'opening_hours': '', 'deleted_at': None, 'updated_at': '2021-11-04T14:22:04.712607+01:00', 'position_id': 33, 'has_timetable': True, 'auto_task_assignment': True, 'task_modification': True, 'has_issues': True, 'show_technical_time': True, 'subcontract_number': '07001234'}]}

I have found a solution using a pandas DataFrame, but my goal is to do this with Spark:

import pandas as pd

# Build one single-row DataFrame per element of df['results'], then stack them.
frames = [pd.DataFrame(row, index=[0]) for row in df['results']]
df2 = pd.concat(frames, ignore_index=True)

The expected output keeps the count column, repeating the same value on every row, and extracts all the columns from the results struct. Expected schema below:

root
 |-- count: long (nullable = true)
 |-- address: string (nullable = true)
 |-- auto_task_assignment: boolean (nullable = true)
 |-- deleted_at: string (nullable = true)
 |-- has_issues: boolean (nullable = true)
 |-- has_timetable: boolean (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- opening_hours: string (nullable = true)
 |-- phone_number: string (nullable = true)
 |-- position_id: long (nullable = true)
 |-- show_technical_time: boolean (nullable = true)
 |-- structure_id: long (nullable = true)
 |-- subcontract_number: string (nullable = true)
 |-- task_modification: boolean (nullable = true)
 |-- updated_at: string (nullable = true)

Nithish · Answer 1 · 2021-11-23T21:40:16

You will need to explode the results array before un-nesting the struct fields.

from pyspark.sql import functions as F
df.withColumn("results", F.explode(F.col("results"))).select("results.*").show()
